Source author record

Todd Mullen

Todd Mullen appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Information Retrieval math.CO

Catalog footprint

What is connected

2 works

2 topics

4 close collaborators

Preprint · 2024 · arXiv

A statistical significance testing approach for measuring term burstiness with applications to domain-specific terminology extraction

A term in a corpus is said to be "bursty" (or overdispersed) when its occurrences are concentrated in few out of many documents. In this paper, we propose Residual Inverse Collection Frequency (RICF), a heuristic inspired by statistical significance testing for quantifying term burstiness. The chi-squared test is, to our knowledge, the sole test of statistical significance among existing term burstiness measures. Chi-squared test term burstiness scores are computed from the collection frequency statistic (i.e., the proportion that a specified term constitutes in relation to all terms within a corpus). However, the document frequency of a term (i.e., the proportion of documents within a corpus in which a specific term occurs) is exploited by certain other widely used term burstiness measures. RICF addresses this shortcoming of the chi-squared test by systematically incorporating both the collection frequency and document frequency statistics into its term burstiness scores. We evaluate the RICF measure on a domain-specific technical terminology extraction task using the GENIA Term corpus benchmark, which comprises 2,000 annotated biomedical article abstracts. RICF generally outperformed the chi-squared test in terms of precision at k (P@k), with percent improvements of 0.00% (P@10), 6.38% (P@50), 6.38% (P@100), 2.27% (P@500), 2.61% (P@1000), and 1.90% (P@5000). Furthermore, RICF performance was competitive with that of other well-established measures of term burstiness. Based on these findings, we consider our contributions in this paper to be a promising starting point for future exploration in leveraging statistical significance testing in text analysis.
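The two statistics the abstract defines, and the precision at k metric used in the evaluation, can be illustrated with a minimal Python sketch. The RICF formula itself is not given in the abstract and is not reproduced here. The function names, the toy corpus, and the choice to represent each document as a list of tokens are assumptions made for illustration only.

```python
# Illustrative sketch only: names and the toy corpus are hypothetical,
# and this does NOT implement RICF (its formula is not stated in the abstract).

def collection_frequency(docs, term):
    """Proportion of all token occurrences in the corpus that are `term`."""
    total = sum(len(doc) for doc in docs)
    count = sum(doc.count(term) for doc in docs)
    return count / total if total else 0.0


def document_frequency(docs, term):
    """Proportion of documents in which `term` occurs at least once."""
    if not docs:
        return 0.0
    return sum(1 for doc in docs if term in doc) / len(docs)


def precision_at_k(ranked_terms, gold_terms, k):
    """Fraction of the top-k ranked terms that appear in the gold set."""
    top_k = ranked_terms[:k]
    return sum(1 for term in top_k if term in gold_terms) / k


# Toy corpus: "cell" is concentrated in one document, "the" is spread evenly.
docs = [
    ["cell", "cell", "cell", "the"],
    ["the", "gene"],
    ["the", "protein"],
    ["the", "of"],
]

cf_cell = collection_frequency(docs, "cell")  # 3 of 10 tokens
df_cell = document_frequency(docs, "cell")    # 1 of 4 documents
cf_the = collection_frequency(docs, "the")    # 4 of 10 tokens
df_the = document_frequency(docs, "the")      # 4 of 4 documents
```

In this toy corpus the two terms have similar collection frequencies but very different document frequencies, which is the kind of contrast a burstiness measure that uses both statistics can pick up.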

Preprint · 2020 · arXiv

Diffusion: Quiescence and Perturbation

Originally proposed by Duffy et al., Diffusion is a variant of chip-firing in which chips flow from places of high concentration to places of low concentration. In the variant Perturbation Diffusion, the first step involves a "perturbation" in which some number of vertices send chips to each of their respective neighbours, even though the rules of Diffusion only permit chips to be sent from richer vertices to poorer vertices. Perturbation Diffusion allows us to expand our study of Diffusion by asking new questions such as "Given an initial configuration, which vertices, when perturbed, will return the initial configuration after some number of steps in Diffusion?" We give some results in this paper that begin to answer this question in the specific case of every vertex initially having 0 chips. We characterize some of the ways a graph can reach such a state in Perturbation Diffusion before focusing on paths in particular with more specific results.
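The process described in this abstract can be sketched in Python under some stated assumptions. The abstract does not spell out the firing rule, so the sketch assumes the usual simultaneous rule for Diffusion: at each step every vertex sends one chip to each neighbour that has strictly fewer chips, and chip counts may go negative. The function names, the adjacency-dictionary representation, and the `max_steps` cutoff are all hypothetical.

```python
# Illustrative sketch only. Assumes the simultaneous rule "send one chip to
# each strictly poorer neighbour" and allows negative chip counts.

def diffusion_step(adj, chips):
    """Apply one simultaneous Diffusion step and return the new configuration."""
    new = dict(chips)
    for v, neighbours in adj.items():
        for u in neighbours:
            if chips[v] > chips[u]:  # compare against the OLD configuration
                new[v] -= 1
                new[u] += 1
    return new


def perturb(adj, chips, perturbed):
    """Each perturbed vertex sends one chip to every neighbour, rich or poor."""
    new = dict(chips)
    for v in perturbed:
        for u in adj[v]:
            new[v] -= 1
            new[u] += 1
    return new


def steps_to_return(adj, chips, perturbed, max_steps=20):
    """Number of Diffusion steps after the perturbation until the initial
    configuration reappears, or None if it does not within max_steps."""
    state = perturb(adj, chips, perturbed)
    for step in range(1, max_steps + 1):
        state = diffusion_step(adj, state)
        if state == chips:
            return step
    return None


# Path on three vertices 0 - 1 - 2, every vertex starting with 0 chips.
path = {0: [1], 1: [0, 2], 2: [1]}
zero = {0: 0, 1: 0, 2: 0}

middle = steps_to_return(path, zero, perturbed=[1])  # -> 1
end = steps_to_return(path, zero, perturbed=[0])     # -> None
```

On this small path, perturbing the middle vertex gives the configuration (1, -2, 1), which returns to all zeros after one Diffusion step. Perturbing an end vertex instead gives (-1, 1, 0), which alternates with (0, -1, 1) and never returns to the all-zero configuration.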