Source author record

Ian Barnett

Ian Barnett appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Catalog footprint

What is connected

7 works
7 topics
4 close collaborators

Published work

7 published items

Preprint, 2022, arXiv

Autoregressive Mixture Models for Serial Correlation Clustering of Time Series Data

Clustering time series into similar groups can improve models by combining information across like time series. While there is a well-developed body of literature on clustering time series, these approaches tend to generate clusters independently of model training, which can lead to poor model fit. We propose a novel distributed approach that simultaneously clusters and fits autoregression models for groups of similar individuals. We apply a Wishart mixture model so as to cluster individuals while modeling the corresponding autocovariance matrices at the same time. The fitted Wishart scale matrices map to cluster-level autoregressive coefficients through the Yule-Walker equations, fitting robust parsimonious autoregressive mixture models. This approach is able to discern differences in underlying autocorrelation variation of time series in settings with large heterogeneous datasets. We prove consistency of our cluster membership estimator, derive asymptotic distributions of coefficients, and compare our approach against competing methods through simulation as well as by fitting a COVID-19 forecast model.
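The abstract above mentions mapping autocovariances to autoregressive coefficients through the Yule-Walker equations. The following is a minimal sketch of that single step only, assuming a known autocovariance sequence and using the Levinson-Durbin recursion; it is not the authors' Wishart mixture procedure, and the function names `sample_autocovariance` and `yule_walker` are hypothetical, chosen for illustration.

```python
def sample_autocovariance(series, max_lag):
    """Biased sample autocovariances gamma(0), ..., gamma(max_lag) of one series."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    return [
        sum(centered[t] * centered[t + lag] for t in range(n - lag)) / n
        for lag in range(max_lag + 1)
    ]


def yule_walker(acov, order):
    """Solve the Yule-Walker equations with the Levinson-Durbin recursion.

    acov  : autocovariances gamma(0), ..., gamma(order)
    order : autoregressive order p
    Returns (AR coefficients phi_1..phi_p, innovation variance).
    """
    phi = []
    err = acov[0]  # prediction error variance of the order-0 model
    for k in range(1, order + 1):
        # Partial autocorrelation (reflection coefficient) at lag k
        acc = acov[k] - sum(phi[j] * acov[k - 1 - j] for j in range(k - 1))
        kappa = acc / err
        # Update the lower-order coefficients, then append the new one
        phi = [phi[j] - kappa * phi[k - 2 - j] for j in range(k - 1)] + [kappa]
        err *= 1.0 - kappa ** 2
    return phi, err


# Theoretical autocorrelations of an AR(2) process with coefficients (0.5, 0.2):
# rho(1) = 0.5 / (1 - 0.2) = 0.625 and rho(2) = 0.5 * 0.625 + 0.2 = 0.5125
acov = [1.0, 0.625, 0.5125]
coeffs, noise_var = yule_walker(acov, order=2)
print(coeffs, noise_var)  # coefficients come out close to 0.5 and 0.2
```

In practice the autocovariances would be estimated from data, for example with `sample_autocovariance`, and a cluster-level version would replace `acov` with whatever the fitted cluster-level autocovariance structure provides.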

Preprint, 2022, arXiv

Confidence Intervals for the Number of Components in Factor Analysis and Principal Components Analysis via Subsampling

Factor analysis (FA) and principal component analysis (PCA) are popular statistical methods for summarizing and explaining the variability in multivariate datasets. By default, FA and PCA assume the number of components or factors to be known a priori. In practice, however, users first estimate the number of factors or components and then perform FA and PCA using that point estimate, thereby ignoring any uncertainty in it. For datasets where the uncertainty in the point estimate is not ignorable, it is prudent to perform FA and PCA analyses for the range of positive integer values in the confidence intervals for the number of factors or components. We address this problem by proposing a subsampling-based data-intensive approach for estimating confidence intervals for the number of components in FA and PCA. We study the coverage probability of the proposed confidence intervals and provide non-asymptotic theoretical guarantees concerning the accuracy of the confidence intervals. As a byproduct, we derive the first-order Edgeworth expansion for spiked eigenvalues of the sample covariance matrix when the data matrix is generated under a factor model. We also demonstrate the usefulness of our approach through numerical simulations and by applying it to estimate confidence intervals for the number of factors of the genotyping dataset of the Human Genome Diversity Project.

Preprint, 2022, arXiv

Multiple Hypothesis Testing To Estimate The Number of Communities in Sparse Stochastic Block Models

Network-based clustering methods frequently require the number of communities to be specified a priori. Moreover, most of the existing methods for estimating the number of communities assume the number of communities to be fixed and not scale with the network size $n$. The few methods that assume the number of communities to increase with the network size $n$ are only valid when the average degree $d$ of a network grows at least as fast as $O(n)$ (i.e., the dense case) or lies within a narrow range. This presents a challenge in clustering large-scale network data, particularly when the average degree $d$ of a network grows slower than the rate of $O(n)$ (i.e., the sparse case). To address this problem, we propose a new sequential procedure utilizing multiple hypothesis tests and the spectral properties of Erdős-Rényi graphs for estimating the number of communities in sparse stochastic block models (SBMs). We prove the consistency of our method for sparse SBMs for a broad range of the sparsity parameter. As a consequence, we discover that our method can estimate the number of communities $K^{(n)}_{\star}$ with $K^{(n)}_{\star}$ increasing at a rate as high as $O(n^{(1 - 3\gamma)/(4 - 3\gamma)})$, where $d = O(n^{1 - \gamma})$. Moreover, we show that our method can be adapted as a stopping rule in estimating the number of communities in binary tree stochastic block models. We benchmark the performance of our method against other competing methods on six reference single-cell RNA sequencing datasets. Finally, we demonstrate the usefulness of our method through numerical simulations and by using it for clustering real single-cell RNA sequencing datasets.

Preprint, 2020, arXiv

The Asymptotic Distribution of Modularity in Weighted Signed Networks

Modularity is a popular metric for quantifying the degree of community structure within a network. The distribution of the largest eigenvalue of a network's edge weight or adjacency matrix is well studied and is frequently used as a substitute for modularity when performing statistical inference. However, we show that the largest eigenvalue and modularity are asymptotically uncorrelated, which suggests the need for inference directly on modularity itself when the network size is large. To this end, we derive the asymptotic distributions of modularity in the case where the network's edge weight matrix belongs to the Gaussian Orthogonal Ensemble, and study the statistical power of the corresponding test for community structure under some alternative model. We empirically explore universality extensions of the limiting distribution and demonstrate the accuracy of these asymptotic distributions through type I error simulations. We also compare the empirical powers of the modularity-based tests with some existing methods. Our method is then used to test for the presence of community structure in two real data applications.
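For readers unfamiliar with the quantity discussed above, here is a minimal sketch of the standard Newman-Girvan modularity of a fixed partition on an undirected graph with non-negative weights. This is an assumption made for illustration: the paper concerns weighted signed networks and may define modularity differently, and the function name `modularity` and the toy graph are hypothetical.

```python
def modularity(adj, labels):
    """Newman-Girvan modularity of a partition.

    adj    : symmetric adjacency (or non-negative weight) matrix as nested lists
    labels : labels[i] is the community of node i
    Computes Q = (1 / 2m) * sum_ij (A_ij - k_i * k_j / 2m) * [c_i == c_j].
    """
    n = len(adj)
    degrees = [sum(row) for row in adj]
    two_m = sum(degrees)  # twice the total edge weight
    q = 0.0
    for i in range(n):
        for j in range(n):
            if labels[i] == labels[j]:
                q += adj[i][j] - degrees[i] * degrees[j] / two_m
    return q / two_m


# Toy graph (illustrative only): two triangles joined by a single bridge edge
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
adj = [[0] * 6 for _ in range(6)]
for u, v in edges:
    adj[u][v] = adj[v][u] = 1

q_split = modularity(adj, [0, 0, 0, 1, 1, 1])  # one community per triangle
q_single = modularity(adj, [0] * 6)            # everything in one community
print(q_split, q_single)
```

Splitting the graph along the bridge gives a clearly positive value, while placing all nodes in one community gives zero, which is the sense in which modularity quantifies community structure.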

Preprint, 2016, arXiv

Feature-Based Classification of Networks

Network representations of systems from various scientific and societal domains are neither completely random nor fully regular, but instead appear to contain recurring structural building blocks. These features tend to be shared by networks belonging to the same broad class, such as the class of social networks or the class of biological networks. At a finer scale of classification within each such class, networks describing more similar systems tend to have more similar features. This occurs presumably because networks representing similar purposes or constructions would be expected to be generated by a shared set of domain-specific mechanisms, and it should therefore be possible to classify these networks into categories based on their features at various structural levels. Here we describe and demonstrate a new, hybrid approach that combines manual selection of features of potential interest with existing automated classification methods. In particular, by selecting well-known and well-studied features that have been used throughout social network analysis and network science, and then classifying with methods such as random forests that are of special utility in the presence of feature collinearity, we find that we achieve higher accuracy, in shorter computation time, with greater interpretability of the network classification results.

Preprint, 2016, arXiv

Social and Spatial Clustering of People at Humanity's Largest Gathering

Macroscopic behavior of scientific and societal systems results from the aggregation of microscopic behaviors of their constituent elements, but connecting the macroscopic with the microscopic in human behavior has traditionally been difficult. Manifestations of homophily, the notion that individuals tend to interact with others who resemble them, have been observed in many small and intermediate size settings. However, whether this behavior translates to truly macroscopic levels, and what its consequences may be, remains unknown. Here, we use call detail records (CDRs) to examine the population dynamics and manifestations of social and spatial homophily at a macroscopic level among the residents of 23 states of India at the Kumbh Mela, a 3-month-long Hindu festival. We estimate that the festival was attended by 61 million people, making it the largest gathering in the history of humanity. While we find strong overall evidence for both types of homophily for residents of different states, participants from low-representation states show considerably stronger propensity for both social and spatial homophily than those from high-representation states. These manifestations of homophily are amplified on crowded days, such as the peak day of the festival, which we estimate was attended by 25 million people. Our findings confirm that homophily, which here likely arises from social influence, permeates all scales of human behavior.

Preprint, 2015, arXiv

Change Point Detection in Correlation Networks

Many systems of interacting elements can be conceptualized as networks, where network nodes represent the elements and network ties represent interactions between the elements. In systems where the underlying network evolves in time, it is useful to determine the points in time where the network structure changes significantly as these may correspond also to functional change points. We propose a method for detecting these change points in correlation networks that, unlike previous change point detection methods designed for time series data, requires no distributional assumptions. We investigate the difficulty of change point detection near the boundaries of data in correlation networks and demonstrate the power of our method and a competing method through simulation. We also show the generalizable nature of our method by applying it to stock price data as well as fMRI data.