Source author record

Paul Sheridan

Paul Sheridan appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Topics: Applications; physics.data-an; Computation; cond-mat.stat-mech; Digital Libraries; Information Retrieval; physics.soc-ph; Social and Information Networks

Catalog footprint

5 works, 8 topics, 4 close collaborators

Preprint, 2024, arXiv

A statistical significance testing approach for measuring term burstiness with applications to domain-specific terminology extraction

A term in a corpus is said to be "bursty" (or overdispersed) when its occurrences are concentrated in few out of many documents. In this paper, we propose Residual Inverse Collection Frequency (RICF), a heuristic inspired by statistical significance testing, for quantifying term burstiness. The chi-squared test is, to our knowledge, the sole test of statistical significance among existing term burstiness measures. Chi-squared test term burstiness scores are computed from the collection frequency statistic (i.e., the proportion that a specified term constitutes in relation to all terms within a corpus). However, the document frequency of a term (i.e., the proportion of documents within a corpus in which a specific term occurs) is exploited by certain other widely used term burstiness measures. RICF addresses this shortcoming of the chi-squared test: its term burstiness scores systematically incorporate both the collection frequency and document frequency statistics. We evaluate the RICF measure on a domain-specific technical terminology extraction task using the GENIA Term corpus benchmark, which comprises 2,000 annotated biomedical article abstracts. RICF generally outperformed the chi-squared test in terms of precision at k (P@k), with percent improvements of 0.00% (P@10), 6.38% (P@50), 6.38% (P@100), 2.27% (P@500), 2.61% (P@1000), and 1.90% (P@5000). Furthermore, RICF performance was competitive with that of other well-established measures of term burstiness. Based on these findings, we consider our contributions in this paper a promising starting point for future exploration of statistical significance testing in text analysis.
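The two statistics that the abstract defines can be illustrated with a minimal Python sketch. It computes only the collection frequency and the document frequency of each term. It does not implement RICF or the chi-squared score, whose formulas are not given on this page. The function name `term_statistics` and the toy corpus are hypothetical and chosen purely for illustration.

```python
from collections import Counter


def term_statistics(docs):
    """Map each term to (collection_frequency, document_frequency).

    Both values are proportions, following the definitions in the abstract:
    collection frequency = term occurrences / all term occurrences in the corpus,
    document frequency   = documents containing the term / all documents.
    """
    total_tokens = sum(len(doc) for doc in docs)
    token_counts = Counter(tok for doc in docs for tok in doc)
    # set(doc) ensures a term is counted at most once per document.
    doc_counts = Counter(tok for doc in docs for tok in set(doc))
    return {
        term: (token_counts[term] / total_tokens, doc_counts[term] / len(docs))
        for term in token_counts
    }


# Hypothetical toy corpus: each document is a list of tokens.
toy_corpus = [
    ["gene", "gene", "gene", "the", "cell"],
    ["the", "cell", "of", "the"],
    ["the", "of", "of", "cell"],
    ["the", "protein", "of", "the"],
]

stats = term_statistics(toy_corpus)
for term in ("gene", "cell"):
    cf, df = stats[term]
    print(f"{term}: collection frequency={cf:.3f}, document frequency={df:.2f}")
```

In this toy corpus, "gene" and "cell" each occur three times, so their collection frequencies are identical, but "gene" is confined to one document while "cell" is spread over three. A score built from collection frequency alone cannot separate the two, which is the shortcoming the abstract attributes to the chi-squared test.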

Preprint, 2018, arXiv

A Preferential Attachment Paradox: How Preferential Attachment Combines with Growth to Produce Networks with Log-normal In-degree Distributions

Every network scientist knows that preferential attachment combines with growth to produce networks with power-law in-degree distributions. How, then, is it possible for the network of American Physical Society journal collection citations to enjoy a log-normal citation distribution when it was found to have grown in accordance with preferential attachment? This anomalous result, which we exalt as the preferential attachment paradox, has remained unexplained since the physicist Sidney Redner first made light of it over a decade ago. Here we propose a resolution. The chief source of the mischief, we contend, lies in Redner having relied on a measurement procedure bereft of the accuracy required to distinguish preferential attachment from another form of attachment that is consistent with a log-normal in-degree distribution. There was a high-accuracy measurement procedure in use at the time, but it would have been difficult to use it to shed light on the paradox, because of a design flaw that induced systematic error. In recent years the design flaw has been recognised and corrected. We show that bringing the newly corrected measurement procedure to bear on the data leads to a resolution of the paradox.

Preprint, 2018, arXiv

PAFit: an R Package for the Non-Parametric Estimation of Preferential Attachment and Node Fitness in Temporal Complex Networks

Many real-world systems are profitably described as complex networks that grow over time. Preferential attachment and node fitness are two simple growth mechanisms that not only explain certain structural properties commonly observed in real-world systems, but are also tied to a number of applications in modeling and inference. While there are statistical packages for estimating various parametric forms of the preferential attachment function, there is no such package implementing non-parametric estimation procedures. The non-parametric approach to the estimation of the preferential attachment function allows for comparatively finer-grained investigations of the 'rich-get-richer' phenomenon that could lead to novel insights in the search to explain certain nonstandard structural properties observed in real-world networks. This paper introduces the R package PAFit, which implements non-parametric procedures for estimating the preferential attachment function and node fitnesses in a growing network, as well as a number of functions for generating complex networks from these two mechanisms. The main computational part of the package is implemented in C++ with OpenMP to ensure scalability to large-scale networks. We first introduce the main functionalities of PAFit through simulated examples, and then use the package to analyze a collaboration network between scientists in the field of complex networks. The results indicate the joint presence of 'rich-get-richer' and 'fit-get-richer' phenomena in the collaboration network. The estimated attachment function is observed to be near-linear, which we interpret as meaning that the chance an author gets a new collaborator is proportional to their current number of collaborators. Furthermore, the estimated author fitnesses reveal a host of familiar faces from the complex networks community among the field's topmost fittest network scientists.

Preprint, 2018, arXiv

Theme Enrichment Analysis: A Statistical Test for Identifying Significantly Enriched Themes in a List of Stories with an Application to the Star Trek Television Franchise

In this paper, we describe how the hypergeometric test can be used to determine whether a given theme of interest occurs in a storyset at a frequency more than would be expected by chance. By a storyset we mean simply a list of stories defined according to a common attribute (e.g., author, movement, period). The test works roughly as follows: Given a background storyset and a sub-storyset of interest, the test determines whether a given theme is over-represented in the sub-storyset, based on comparing the proportions of stories in the sub-storyset and background storyset featuring the theme. A storyset is said to be "enriched" for a theme with respect to a particular background storyset, when the theme is identified as being significantly over-represented by the test. Furthermore, we introduce here a toy dataset consisting of 280 manually themed Star Trek television franchise episodes. As a proof of concept, we use the hypergeometric test to analyze the Star Trek stories for enriched themes. The hypergeometric testing approach to theme enrichment analysis is implemented for the Star Trek thematic dataset in the R package stoRy. A related R Shiny web application can be found at https://github.com/theme-ontology/shiny-apps.
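The over-representation test described above can be sketched as a one-sided hypergeometric tail probability. The following Python sketch uses only the standard library and is not the stoRy package's implementation. The function name `enrichment_p_value` and the theme counts in the usage example are hypothetical. Only the background size of 280 episodes comes from the abstract.

```python
from math import comb


def enrichment_p_value(background_size, background_with_theme,
                       subset_size, subset_with_theme):
    """Upper-tail hypergeometric probability P(X >= subset_with_theme).

    X is the number of themed stories among `subset_size` stories drawn
    without replacement from a background of `background_size` stories,
    of which `background_with_theme` feature the theme.
    """
    N, K, n, k = (background_size, background_with_theme,
                  subset_size, subset_with_theme)
    if not (0 <= K <= N and 0 <= n <= N and 0 <= k <= min(K, n)):
        raise ValueError("counts are inconsistent with a sub-storyset of the background")
    # Sum the probability mass over every outcome at least as extreme as k.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return tail / comb(N, n)


# Hypothetical counts: 280 background stories, 20 featuring the theme;
# a sub-storyset of 30 stories, 8 of which feature the theme.
p = enrichment_p_value(280, 20, 30, 8)
print(f"p-value: {p:.6f}")
```

A small p-value here would mark the sub-storyset as enriched for the theme relative to the background. In practice, testing many themes at once would call for a multiple-testing correction, which this sketch omits.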

Preprint, 2008, arXiv

A preferential attachment model with Poisson growth for scale-free networks

We propose a scale-free network model with a tunable power-law exponent. The Poisson growth model, as we call it, is an offshoot of the celebrated model of Barabási and Albert where a network is generated iteratively from a small seed network; at each step a node is added together with a number of incident edges preferentially attached to nodes already in the network. A key feature of our model is that the number of edges added at each step is a random variable with Poisson distribution, and, unlike the Barabási-Albert model where this quantity is fixed, it can generate any network. Our model is motivated by an application in Bayesian inference implemented as Markov chain Monte Carlo to estimate a network; for this purpose, we also give a formula for the probability of a network under our model.
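The growth process described in this abstract can be sketched as a short simulation. This is a rough Python illustration under stated assumptions, not the authors' model specification: the seed network is a single edge, targets are drawn in proportion to degree with replacement (so multi-edges can occur), and a node that arrives with zero edges stays isolated. The function names `sample_poisson` and `poisson_growth_network` are hypothetical.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw from Poisson(lam) using Knuth's multiplication method (fine for small lam)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def poisson_growth_network(num_steps, lam, seed=None):
    """Grow a network for `num_steps` steps; return (num_nodes, edges).

    Each step adds one node with a Poisson(lam) number of edges, each
    attached to an existing node chosen with probability proportional
    to that node's current degree.
    """
    rng = random.Random(seed)
    edges = [(1, 0)]      # assumed seed network: two nodes joined by one edge
    endpoints = [1, 0]    # each node appears once per incident edge
    num_nodes = 2
    for _ in range(num_steps):
        new_node = num_nodes
        num_nodes += 1
        num_new_edges = sample_poisson(lam, rng)
        # Choose all targets before updating degrees, so the new node
        # cannot attach to itself within the same step.
        targets = [rng.choice(endpoints) for _ in range(num_new_edges)]
        for target in targets:
            edges.append((new_node, target))
            endpoints.extend((new_node, target))
    return num_nodes, edges


num_nodes, edges = poisson_growth_network(num_steps=50, lam=2.0, seed=1)
print(num_nodes, len(edges))
```

Sampling uniformly from the `endpoints` list is a common way to realise degree-proportional selection, since a node of degree d appears d times in the list.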
