Source author record

Eric D. Kolaczyk

Eric D. Kolaczyk appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Topics: Methodology; Applications; math.ST; Statistics Theory; Molecular Networks; Networking and Internet Architecture; Neurons and Cognition; physics.soc-ph; cond-mat.stat-mech; math.PR; Quantitative Methods; Social and Information Networks

Catalog footprint

What is connected

19 works

12 topics

4 close collaborators

preprint, 2022, arXiv

Causal Inference under Network Interference with Noise

Increasingly, there is a marked interest in estimating causal effects under network interference due to the fact that interference manifests naturally in networked experiments. However, network information generally is available only up to some level of error. We study the propagation of such errors to estimators of average causal effects under network interference. Specifically, assuming a four-level exposure model and Bernoulli random assignment of treatment, we characterize the impact of network noise on the bias and variance of standard estimators in homogeneous and inhomogeneous networks. In addition, we propose method-of-moments estimators for bias reduction where a minimal number of network replicates are available. We show our estimators are asymptotically normal and provide confidence intervals for quantifying the uncertainty in these estimates. We illustrate the practical performance of our estimators through simulation studies in British secondary school contact networks.
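To make the role of network noise concrete, the following is a minimal sketch of a four-level exposure mapping. It assumes the four levels are defined by a unit's own treatment crossed with whether at least one neighbor is treated; the abstract does not spell out the definition, so this reading, the `Exposure` labels, and the toy networks are all illustrative assumptions rather than the authors' code.

```python
from enum import Enum


class Exposure(Enum):
    # Hypothetical shorthand: first digit = own treatment,
    # second digit = whether any neighbor is treated.
    DIRECT_AND_INDIRECT = "c11"
    DIRECT_ONLY = "c10"
    INDIRECT_ONLY = "c01"
    NO_EXPOSURE = "c00"


def exposure_levels(neighbors, treatment):
    """Map each node to one of four exposure levels.

    neighbors: dict node -> set of adjacent nodes (undirected network).
    treatment: dict node -> 0 or 1 (for example, from Bernoulli assignment).
    """
    levels = {}
    for node, z in treatment.items():
        has_treated_neighbor = any(
            treatment[v] == 1 for v in neighbors.get(node, ())
        )
        if z == 1:
            levels[node] = (Exposure.DIRECT_AND_INDIRECT if has_treated_neighbor
                            else Exposure.DIRECT_ONLY)
        else:
            levels[node] = (Exposure.INDIRECT_ONLY if has_treated_neighbor
                            else Exposure.NO_EXPOSURE)
    return levels


# Toy example (assumed data): the observed network misses the edge 0-1.
treatment = {0: 1, 1: 0, 2: 0, 3: 1}
true_net = {0: {1}, 1: {0, 2}, 2: {1}, 3: set()}
noisy_net = {0: set(), 1: {2}, 2: {1}, 3: set()}

true_levels = exposure_levels(true_net, treatment)
noisy_levels = exposure_levels(noisy_net, treatment)

# Nodes whose exposure level changes because of the network error.
misclassified = [n for n in treatment if true_levels[n] != noisy_levels[n]]
```

In this toy case a single missed edge moves node 1 from indirect exposure to no exposure, which illustrates how edge errors can propagate into the groups that standard estimators average over.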

preprint, 2022, arXiv

Disentangling positive and negative partisanship in social media interactions using a coevolving latent space network with attractors model

We develop a broadly applicable class of coevolving latent space network with attractors (CLSNA) models, where nodes represent individual social actors assumed to lie in an unknown latent space, edges represent the presence of a specified interaction between actors, and attractors are added in the latent level to capture the notion of attractive and repulsive forces. We apply the CLSNA models to understand the dynamics of partisan polarization on social media, where we expect Republicans and Democrats to increasingly interact with their own party and disengage with the opposing party. Using longitudinal social networks from the social media platforms Twitter and Reddit, we investigate the relative contributions of positive (attractive) and negative (repulsive) forces among political elites and the public, respectively. Our goals are to disentangle the positive and negative forces within and between parties and explore if and how they change over time. Our analysis confirms the existence of partisan polarization in social media interactions among both political elites and the public. Moreover, while positive partisanship is the driving force of interactions across the full periods of study for both the public and Democratic elites, negative partisanship has come to dominate Republican elites' interactions since the run-up to the 2016 presidential election.

preprint, 2021, arXiv

Bayesian classification, anomaly detection, and survival analysis using network inputs with application to the microbiome

While the study of a single network is well-established, technological advances now allow for the collection of multiple networks with relative ease. Increasingly, anywhere from several to thousands of networks can be created from brain imaging, gene co-expression data, or microbiome measurements. And these networks, in turn, are being looked to as potentially powerful features to be used in modeling. However, with networks being non-Euclidean in nature, how best to incorporate them into standard modeling tasks is not obvious. In this paper, we propose a Bayesian modeling framework that provides a unified approach to binary classification, anomaly detection, and survival analysis with network inputs. We encode the networks in the kernel of a Gaussian process prior via their pairwise differences and we discuss several choices of provably positive definite kernel that can be plugged into our models. Although our methods are widely applicable, we are motivated here in particular by microbiome research (where network analysis is emerging as the standard approach for capturing the interconnectedness of microbial taxa across both time and space) and its potential for reducing preterm delivery and improving personalization of prenatal care.

preprint, 2021, arXiv

Inferring the Type of Phase Transitions Undergone in Epileptic Seizures Using Random Graph Hidden Markov Models for Percolation in Noisy Dynamic Networks

In clinical neuroscience, epileptic seizures have been associated with the sudden emergence of coupled activity across the brain. The resulting functional networks - in which edges indicate strong enough coupling between brain regions - are consistent with the notion of percolation, which is a phenomenon in complex networks corresponding to the sudden emergence of a giant connected component. Traditionally, work has concentrated on noise-free percolation with a monotonic process of network growth, but real-world networks are more complex. We develop a class of random graph hidden Markov models (RG-HMMs) for characterizing percolation regimes in noisy, dynamically evolving networks in the presence of edge birth and edge death, as well as noise. This class is used to understand the type of phase transitions undergone in a seizure and, in particular, to distinguish between different percolation regimes in epileptic seizures. We develop a hypothesis testing framework for inferring putative percolation mechanisms. As a necessary precursor, we present an EM algorithm for estimating parameters from a sequence of noisy networks only observed at a longitudinal subsampling of time points. Our results suggest that different types of percolation can occur in human seizures. The type inferred may suggest tailored treatment strategies and provide new insights into the fundamental science of epilepsy.

preprint, 2021, arXiv

Network Recovery from Unlabeled Noisy Samples

There is a growing literature on the statistical analysis of multiple networks in which the network is the fundamental data object. However, most of this work requires networks on a shared set of labeled vertices. In this work, we consider the question of recovering a parent network based on noisy unlabeled samples. We identify a specific regime in the noisy network literature for recovery that is asymptotically unbiased and computationally tractable based on a three-stage recovery procedure: first, we align the networks via a sequential pairwise graph matching procedure; next, we compute the sample average of the aligned networks; finally, we obtain an estimate of the parent by thresholding the sample average. Previous work on multiple unlabeled networks is only possible for trivial networks due to the complexity of brute-force computations.
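The last two stages of the three-stage procedure (averaging, then thresholding) can be sketched as follows. This is a minimal illustration that assumes stage one, the graph matching, has already been done, so every sample shares a vertex ordering. The function name `recover_parent`, the 0.5 threshold, and the toy samples are assumptions for illustration, not taken from the paper.

```python
def recover_parent(aligned_samples, threshold=0.5):
    """Estimate a parent adjacency matrix from aligned noisy samples.

    aligned_samples: list of square 0/1 adjacency matrices (lists of lists)
    that are assumed to be already aligned to a common vertex labeling.
    """
    n_samples = len(aligned_samples)
    n = len(aligned_samples[0])
    estimate = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Stage 2: sample average of the aligned networks, entry by entry.
            mean_ij = sum(sample[i][j] for sample in aligned_samples) / n_samples
            # Stage 3: threshold the average to get a 0/1 estimate.
            estimate[i][j] = 1 if mean_ij > threshold else 0
    return estimate


# Assumed toy data: a 3-node path and three noisy copies of it.
parent = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
samples = [
    [[0, 1, 1], [1, 0, 1], [1, 1, 0]],  # spurious edge 0-2
    [[0, 0, 0], [0, 0, 1], [0, 1, 0]],  # missing edge 0-1
    [[0, 1, 0], [1, 0, 1], [0, 1, 0]],  # no errors
]
recovered = recover_parent(samples)
```

With independent errors in different positions, the entrywise majority recovers the path in this toy case; the alignment step, which the sketch skips, is where the unlabeled setting becomes hard.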

preprint, 2020, arXiv

Estimation of subgraph density in noisy networks

While it is common practice in applied network analysis to report various standard network summary statistics, these numbers are rarely accompanied by uncertainty quantification. Yet any error inherent in the measurements underlying the construction of the network, or in the network construction procedure itself, necessarily must propagate to any summary statistics reported. Here we study the problem of estimating the density of an arbitrary subgraph, given a noisy version of some underlying network as data. Under a simple model of network error, we show that consistent estimation of such densities is impossible when the rates of error are unknown and only a single network is observed. Accordingly, we develop method-of-moment estimators of network subgraph densities and error rates for the case where a minimal number of network replicates are available. These estimators are shown to be asymptotically normal as the number of vertices increases to infinity. We also provide confidence intervals for quantifying the uncertainty in these estimates based on the asymptotic normality. To construct the confidence intervals, a new and non-standard bootstrap method is proposed to compute asymptotic variances, which is infeasible otherwise. We illustrate the proposed methods in the context of gene coexpression networks.

preprint, 2016, arXiv

Detection of multiple perturbations in multi-omics biological networks

Cellular mechanism-of-action is of fundamental concern in many biological studies. It is of particular interest for identifying the cause of disease and learning the way in which treatments act against disease. However, pinpointing such mechanisms is difficult, due to the fact that small perturbations to the cell can have wide-ranging downstream effects. Given a snapshot of cellular activity, it can be challenging to tell where a disturbance originated. The presence of an ever-greater variety of high-throughput biological data offers an opportunity to examine cellular behavior from multiple angles, but also presents the statistical challenge of how to effectively analyze data from multiple sources. In this setting, we propose a method for mechanism-of-action inference by extending network filtering to multi-attribute data. We first estimate a joint Gaussian graphical model across multiple data types using penalized regression and filter for network effects. We then apply a set of likelihood ratio tests to identify the most likely site of the original perturbation. In addition, we propose a conditional testing procedure to allow for detection of multiple perturbations. We demonstrate this methodology on paired gene expression and methylation data from The Cancer Genome Atlas (TCGA).

preprint, 2016, arXiv

On the Propagation of Low-Rate Measurement Error to Subgraph Counts in Large Networks

Our work in this paper is inspired by a statistical observation that is both elementary and broadly relevant to network analysis in practice -- that the uncertainty in approximating some true network graph $G=(V,E)$ by some estimated graph $\hat{G}=(V,\hat{E})$ manifests as errors in the status of (non)edges that must necessarily propagate to any estimates of network summaries $η(G)$ we seek. Motivated by the common practice of using plug-in estimates $η(\hat{G})$ as proxies for $η(G)$, our focus is on the problem of characterizing the distribution of the discrepancy $D=η(\hat{G}) - η(G)$, in the case where $η(\cdot)$ is a subgraph count. Specifically, we study the fundamental case where the statistic of interest is $|E|$, the number of edges in $G$. Our primary contribution in this paper is to show that in the empirically relevant setting of large graphs with low-rate measurement errors, the distribution of $D_E=|\hat{E}| - |E|$ is well-characterized by a Skellam distribution, when the errors are independent or weakly dependent. Under an assumption of independent errors, we are able to further show conditions under which this characterization is strictly better than that of an appropriate normal distribution. These results derive from our formulation of a general result, quantifying the accuracy with which the difference of two sums of dependent Bernoulli random variables may be approximated by the difference of two independent Poisson random variables, i.e., by a Skellam distribution. This general result is developed through the use of Stein's method, and may be of some general interest. We finish with a discussion of possible extension of our work to subgraph counts $η(G)$ of higher order.
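A small numerical check of the Skellam characterization, under simplifying assumptions that go beyond the abstract: errors are independent, every true non-edge is falsely declared an edge with the same probability, and every true edge is missed with the same probability. The edge-count discrepancy is then a difference of two independent binomial counts, which the sketch compares with a difference of two independent Poisson counts. All numerical values and function names are illustrative.

```python
from math import comb, exp, factorial


def binomial_pmf(n, p, k_max):
    """P(X = k) for k = 0..k_max with X ~ Binomial(n, p)."""
    return [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_max + 1)]


def poisson_pmf(lam, k_max):
    """P(X = k) for k = 0..k_max with X ~ Poisson(lam)."""
    return [exp(-lam) * lam**k / factorial(k) for k in range(k_max + 1)]


def difference_pmf(px, py):
    """PMF of X - Y for independent X, Y given truncated PMFs of equal length."""
    k_max = len(px) - 1
    pmf = {}
    for d in range(-k_max, k_max + 1):
        lo, hi = max(0, d), min(k_max, k_max + d)
        pmf[d] = sum(px[k] * py[k - d] for k in range(lo, hi + 1))
    return pmf


# Assumed setup: 2000 non-edges with false-positive rate 0.001,
# 300 edges with false-negative rate 0.01.
n_nonedges, alpha = 2000, 0.001
n_edges, beta = 300, 0.01
K = 40  # truncation point; tail mass beyond it is negligible here

exact = difference_pmf(binomial_pmf(n_nonedges, alpha, K),
                       binomial_pmf(n_edges, beta, K))
skellam = difference_pmf(poisson_pmf(n_nonedges * alpha, K),
                         poisson_pmf(n_edges * beta, K))

skellam_mean = sum(d * p for d, p in skellam.items())
total_variation = 0.5 * sum(abs(exact[d] - skellam[d]) for d in exact)
```

Here the Skellam parameters are the expected numbers of false positives and false negatives, and `total_variation` measures how close the two distributions are in this low-error-rate regime. The paper's general result also covers dependent errors, which this sketch does not attempt.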

preprint, 2015, arXiv

Estimating network degree distributions under sampling: An inverse problem, with applications to monitoring social media networks

Networks are a popular tool for representing elements in a system and their interconnectedness. Many observed networks can be viewed as only samples of some true underlying network. Such is frequently the case, for example, in the monitoring and study of massive, online social networks. We study the problem of how to estimate the degree distribution - an object of fundamental interest - of a true underlying network from its sampled network. In particular, we show that this problem can be formulated as an inverse problem. Playing a key role in this formulation is a matrix relating the expectation of our sampled degree distribution to the true underlying degree distribution. Under many network sampling designs, this matrix can be defined entirely in terms of the design and is found to be ill-conditioned. As a result, our inverse problem frequently is ill-posed. Accordingly, we offer a constrained, penalized weighted least-squares approach to solving this problem. A Monte Carlo variant of Stein's unbiased risk estimation (SURE) is used to select the penalization parameter. We explore the behavior of our resulting estimator of network degree distribution in simulation, using a variety of combinations of network models and sampling regimes. In addition, we demonstrate the ability of our method to accurately reconstruct the degree distributions of various sub-communities within online social networks corresponding to Friendster, Orkut and LiveJournal. Overall, our results show that the true degree distributions from both homogeneous and inhomogeneous networks can be recovered with substantially greater accuracy than reflected in the empirical degree distribution resulting from the original sampling.
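The forward side of the inverse problem can be illustrated with a deliberately simple design, which is an assumption made here and not necessarily one of the designs studied in the paper: every vertex is kept, and each of its edges is observed independently with probability `p`. Under that assumption the matrix relating expected sampled degree counts to true degree counts has binomial entries.

```python
from math import comb


def thinning_matrix(max_degree, p):
    """Entry [i][j] = P(observed degree i | true degree j) under the
    assumed design where each edge is kept independently with probability p."""
    size = max_degree + 1
    return [
        [comb(j, i) * p**i * (1 - p) ** (j - i) if i <= j else 0.0
         for j in range(size)]
        for i in range(size)
    ]


def expected_sampled_counts(matrix, true_counts):
    """Matrix-vector product: expected number of vertices at each observed degree."""
    return [
        sum(row[j] * true_counts[j] for j in range(len(true_counts)))
        for row in matrix
    ]


# Assumed toy data: number of vertices with true degree 0, 1, 2, 3, 4.
true_counts = [0, 10, 20, 30, 40]
P = thinning_matrix(max_degree=4, p=0.5)
expected = expected_sampled_counts(P, true_counts)

# Total observed degree is halved in expectation when p = 0.5.
expected_total_degree = sum(i * e for i, e in enumerate(expected))
```

Estimating `true_counts` from an observed version of `expected` means inverting a matrix of this kind. The abstract reports that such matrices are ill-conditioned under many designs, which is the motivation it gives for a constrained, penalized least-squares estimator over direct inversion.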

preprint, 2015, arXiv

On the Question of Effective Sample Size in Network Modeling: An Asymptotic Inquiry

The modeling and analysis of networks and network data has seen an explosion of interest in recent years and represents an exciting direction for potential growth in statistics. Despite the already substantial amount of work done in this area to date by researchers from various disciplines, however, there remain many questions of a decidedly foundational nature - natural analogues of standard questions already posed and addressed in more classical areas of statistics - that have yet to even be posed, much less addressed. Here we raise and consider one such question in connection with network modeling. Specifically, we ask, "Given an observed network, what is the sample size?" Using simple, illustrative examples from the class of exponential random graph models, we show that the answer to this question can very much depend on basic properties of the networks expected under the model, as the number of vertices $n_V$ in the network grows. In particular, adopting the (asymptotic) scaling of the variance of the maximum likelihood parameter estimates as a notion of effective sample size ($n_{\mathrm{eff}}$), we show that when modeling the overall propensity to have ties and the propensity to reciprocate ties, whether the networks are sparse or not under the model (i.e., having a constant or an increasing number of ties per vertex, respectively) is sufficient to yield an order of magnitude difference in $n_{\mathrm{eff}}$, from $O(n_V)$ to $O(n^2_V)$. In addition, we report simulation study results that suggest similar properties for models for triadic (friend-of-a-friend) effects. We then explore some practical implications of this result, using both simulation and data on food-sharing from Lamalera, Indonesia.

preprint, 2014, arXiv

Percolation under Noise: Detecting Explosive Percolation Using the Second Largest Component

We consider the problem of distinguishing classical (Erdős-Rényi) percolation from explosive (Achlioptas) percolation, under noise. A statistical model of percolation is constructed allowing for the birth and death of edges as well as the presence of noise in the observations. This graph-valued stochastic process is composed of a latent and an observed non-stationary process, where the observed graph process is corrupted by Type I and Type II errors. This produces a hidden Markov graph model. We show that for certain choices of parameters controlling the noise, the classical (ER) percolation is visually indistinguishable from the explosive (Achlioptas) percolation model. In this setting, we compare two different criteria for discriminating between these two percolation models, based on a quantile difference (QD) of the first component's size and on the maximal size of the second largest component. We show through data simulations that this second criterion outperforms the QD of the first component's size, in terms of discriminatory power. The maximal size of the second component therefore provides a useful statistic for distinguishing between the ER and Achlioptas models of percolation, under physically motivated conditions for the birth and death of edges, and under noise. The potential application of the proposed criteria for percolation detection in clinical neuroscience is also discussed.
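The second criterion can be sketched in a stripped-down setting. The code below grows a graph edge by edge under either the classical rule (one uniformly random edge per step) or a product-rule Achlioptas process (two candidate edges, keep the one joining the smaller pair of components), and records the maximal size reached by the second largest component. It omits edge death and observation noise, both of which are central to the paper's model, and all names and parameter values are illustrative assumptions.

```python
import random


class DisjointSet:
    """Union-find structure that tracks component sizes."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]

    def product(self, a, b):
        """Product of the sizes of the components containing a and b."""
        return self.size[self.find(a)] * self.size[self.find(b)]

    def top_two(self):
        """Sizes of the largest and second largest components."""
        sizes = sorted(
            (self.size[r] for r in range(len(self.parent)) if self.parent[r] == r),
            reverse=True,
        )
        return (sizes + [0])[:2]


def simulate(n, n_edges, rule, seed):
    """Return (final largest component size, max second-largest size over time)."""
    rng = random.Random(seed)
    ds = DisjointSet(n)
    max_second = 0
    for _ in range(n_edges):
        edge = tuple(rng.sample(range(n), 2))
        if rule == "product":
            other = tuple(rng.sample(range(n), 2))
            if ds.product(*other) < ds.product(*edge):
                edge = other  # keep the candidate with the smaller product
        ds.union(*edge)
        max_second = max(max_second, ds.top_two()[1])
    return ds.top_two()[0], max_second


er_result = simulate(300, 300, "er", seed=0)
achlioptas_result = simulate(300, 300, "product", seed=0)
```

Comparing the second entry of `er_result` and `achlioptas_result` over many seeds gives a rough, noise-free analogue of the statistic the abstract describes; a single run at this size should not be read as evidence either way.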

preprint, 2014, arXiv

Perturbation Detection Through Modeling of Gene Expression on a Latent Biological Pathway Network: A Bayesian hierarchical approach

Cellular response to a perturbation is the result of a dynamic system of biological variables linked in a complex network. A major challenge in drug and disease studies is identifying the key factors of a biological network that are essential in determining the cell's fate. Here our goal is the identification of perturbed pathways from high-throughput gene expression data. We develop a three-level hierarchical model, where (i) the first level captures the relationship between gene expression and biological pathways using confirmatory factor analysis, (ii) the second level models the behavior within an underlying network of pathways induced by an unknown perturbation using a conditional autoregressive model, and (iii) the third level is a spike-and-slab prior on the perturbations. We then identify perturbations through posterior-based variable selection. We illustrate our approach using gene transcription drug perturbation profiles from the DREAM7 drug sensitivity prediction challenge data set. Our proposed method identified regulatory pathways that are known to play a causative role and that were not readily resolved using gene set enrichment analysis or exploratory factor models. Simulation results are presented assessing the performance of this model relative to a network-free variant and its robustness to inaccuracies in biological databases.

preprint, 2013, arXiv

Exponential-type Inequalities Involving Ratios of the Modified Bessel Function of the First Kind and their Applications

The modified Bessel function of the first kind, $I_ν(x)$, arises in numerous areas of study, such as physics, signal processing, probability, statistics, etc. As such, there has been much interest in recent years in deducing properties of functionals involving $I_ν(x)$, in particular, of the ratio ${I_{ν+1}(x)}/{I_ν(x)}$, when $ν,x\geq 0$. In this paper we establish sharp upper and lower bounds on $H(ν,x)=\sum_{k=1}^{\infty} {I_{ν+k}(x)}/{I_ν(x)}$ for $ν,x\geq 0$ that appears as the complementary cumulative hazard function for a Skellam$(λ,λ)$ probability distribution in the statistical analysis of networks. Our technique relies on bounding existing estimates of ${I_{ν+1}(x)}/{I_ν(x)}$ from above and below by quantities with nicer algebraic properties, namely exponentials, to better evaluate the sum, while optimizing their rates in the regime when $ν+1\leq x$ in order to maintain their precision. We demonstrate the relevance of our results through applications, providing an improvement for the well-known asymptotic $\exp(-x)I_ν(x)\sim {1}/{\sqrt{2πx}}$ as $x\rightarrow \infty$, upper and lower bounding $\mathbb{P}\left[W=ν\right]$ for $W\sim Skellam(λ_1,λ_2)$, and deriving a novel concentration inequality on the $Skellam(λ,λ)$ probability distribution from above and below.

preprint, 2012, arXiv

A Compressed PCA Subspace Method for Anomaly Detection in High-Dimensional Data

Random projection is widely used as a method of dimension reduction. In recent years, its combination with standard techniques of regression and classification has been explored. Here we examine its use with principal component analysis (PCA) and subspace detection methods. Specifically, we show that, under appropriate conditions, with high probability the magnitude of the residuals of a PCA analysis of randomly projected data behaves comparably to that of the residuals of a similar PCA analysis of the original data. Our results indicate the feasibility of applying subspace-based anomaly detection algorithms to randomly projected data, when the data are high-dimensional but have a covariance of an appropriately compressed nature. We illustrate in the context of computer network traffic anomaly detection.

preprint, 2012, arXiv

Inference and Characterization of Multi-Attribute Networks with Application to Computational Biology

Our work is motivated by and illustrated with application of association networks in computational biology, specifically in the context of gene/protein regulatory networks. Association networks represent systems of interacting elements, where a link between two different elements indicates a sufficient level of similarity between element attributes. While in reality relational ties between elements can be expected to be based on similarity across multiple attributes, the vast majority of work to date on association networks involves ties defined with respect to only a single attribute. We propose an approach for the inference of multi-attribute association networks from measurements on continuous attribute variables, using canonical correlation and a hypothesis-testing strategy. Within this context, we then study the impact of partial information on multi-attribute network inference and characterization, when only a subset of attributes is available. We consider in detail the case of two attributes, wherein we examine through a combination of analytical and numerical techniques the implications of the choice and number of node attributes on the ability to detect network links and, more generally, to estimate higher-level network summary statistics, such as node degree, clustering coefficients, and measures of centrality. Illustration and applications throughout the paper are developed using gene and protein expression measurements on human cancer cell lines from the NCI-60 database.

preprint, 2010, arXiv

Target Detection via Network Filtering

A method of `network filtering' has been proposed recently to detect the effects of certain external perturbations on the interacting members in a network. However, with large networks, the goal of detection seems a priori difficult to achieve, especially since the number of observations available often is much smaller than the number of variables describing the effects of the underlying network. Under the assumption that the network possesses a certain sparsity property, we provide a formal characterization of the accuracy with which the external effects can be detected, using a network filtering system that combines Lasso regression in a sparse simultaneous equation model with simple residual analysis. We explore the implications of the technical conditions underlying our characterization, in the context of various network topologies, and we illustrate our method using simulated data.

preprint, 2009, arXiv

Network inference - with confidence - from multivariate time series

Networks - collections of interacting elements or nodes - abound in the natural and manmade worlds. For many networks, complex spatiotemporal dynamics stem from patterns of physical interactions unknown to us. To infer these interactions, it is common to include edges between those nodes whose time series exhibit sufficient functional connectivity, typically defined as a measure of coupling exceeding a pre-determined threshold. However, when uncertainty exists in the original network measurements, uncertainty in the inferred network is likely, and hence a statistical propagation-of-error is needed. In this manuscript, we describe a principled and systematic procedure for the inference of functional connectivity networks from multivariate time series data. Our procedure yields as output both the inferred network and a quantification of uncertainty of the most fundamental interest: uncertainty in the number of edges. To illustrate this approach, we apply our procedure to simulated data and electrocorticogram data recorded from a human subject during an epileptic seizure. We demonstrate that the procedure is accurate and robust in both the determination of edges and the reporting of uncertainty associated with that determination.
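The common practice that the abstract takes as its starting point, drawing an edge wherever a coupling measure exceeds a fixed threshold, can be sketched as below. The choice of absolute Pearson correlation as the coupling measure, the 0.8 threshold, and the toy series are assumptions for illustration. This is the simple baseline, not the paper's procedure, which additionally quantifies uncertainty in the resulting edge count.

```python
from math import sqrt


def pearson(x, y):
    """Sample Pearson correlation between two equal-length series."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def threshold_network(series, threshold):
    """Return the set of node pairs (i, j), i < j, whose absolute
    correlation exceeds the threshold."""
    edges = set()
    for i in range(len(series)):
        for j in range(i + 1, len(series)):
            if abs(pearson(series[i], series[j])) > threshold:
                edges.add((i, j))
    return edges


# Assumed toy data: series 0 and 1 move together, series 2 does not.
series = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.0, 4.0, 6.0, 8.0, 10.5],
    [5.0, 1.0, 4.0, 2.0, 3.0],
]
edges = threshold_network(series, threshold=0.8)
n_edges = len(edges)
```

Because each correlation is itself an estimate, `n_edges` inherits measurement error, which is the quantity whose uncertainty the paper sets out to report.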

preprint, 2005, arXiv

Network Inference from TraceRoute Measurements: Internet Topology `Species'

Internet mapping projects generally consist in sampling the network from a limited set of sources by using traceroute probes. This methodology, akin to the merging of spanning trees from the different sources to a set of destinations, leads necessarily to a partial, incomplete map of the Internet. Accordingly, determination of Internet topology characteristics from such sampled maps is in part a problem of statistical inference. Our contribution begins with the observation that the inference of many of the most basic topological quantities -- including network size and degree characteristics -- from traceroute measurements is in fact a version of the so-called `species problem' in statistics. This observation has important implications, since species problems are often quite challenging. We focus here on the most fundamental example of a traceroute internet species: the number of nodes in a network. Specifically, we characterize the difficulty of estimating this quantity through a set of analytical arguments, we use statistical subsampling principles to derive two proposed estimators, and we illustrate the performance of these estimators on networks with various topological characteristics.

preprint, 2004, arXiv

A Statistical Framework for Efficient Monitoring of End-to-End Network Properties

Network service providers and customers are often concerned with aggregate performance measures that span multiple network paths. Unfortunately, forming such network-wide measures can be difficult, due to the issues of scale involved. In particular, the number of paths grows too rapidly with the number of endpoints to make exhaustive measurement practical. As a result, there is interest in the feasibility of methods that dramatically reduce the number of paths measured in such situations while maintaining acceptable accuracy. In previous work we proposed a statistical framework to efficiently address this problem, in the context of additive metrics such as delay and loss rate, for which the per-path metric is a sum of (possibly transformed) per-link measures. The key to our method lies in the observation and exploitation of significant redundancy in network paths (sharing of common links). In this paper we make three contributions: (1) we generalize the framework to make it more immediately applicable to network measurements encountered in practice; (2) we demonstrate that the observed path redundancy upon which our method is based is robust to variation in key network conditions and characteristics, including link failures; and (3) we show how the framework may be applied to address three practical problems of interest to network providers and customers, using data from an operating network. In particular, we show how appropriate selection of small sets of path measurements can be used to accurately estimate network-wide averages of path delays, to reliably detect network anomalies, and to effectively make a choice between alternative sub-networks, as a customer choosing between two providers or two ingress points into a provider network.
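The path redundancy that the framework exploits can be seen in a toy case. For an additive metric, each path value is a sum of per-link values, so the vector of path metrics equals a 0/1 routing matrix times the vector of link metrics; when that matrix has rank lower than the number of paths, some path metrics are linear combinations of others. The star topology and helper names below are assumptions for illustration, and the sketch only computes the rank, not the paper's path-selection method.

```python
from fractions import Fraction
from itertools import combinations


def matrix_rank(rows):
    """Rank via Gaussian elimination in exact rational arithmetic."""
    m = [[Fraction(v) for v in row] for row in rows]
    rank = 0
    n_cols = len(m[0]) if m else 0
    for col in range(n_cols):
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, len(m)):
            if m[r][col] != 0:
                factor = m[r][col] / m[rank][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


# Assumed topology: 4 endpoints, each attached to one hub by its own link.
# The path between endpoints a and b uses exactly link a and link b.
n_endpoints = 4
paths = list(combinations(range(n_endpoints), 2))
routing = [[1 if link in path else 0 for link in range(n_endpoints)]
           for path in paths]

n_paths = len(routing)
rank = matrix_rank(routing)
```

Here there are six paths but the routing matrix has rank four, so in this toy network a well-chosen set of four path measurements determines all six path metrics exactly. The gap between path count and rank widens as endpoints are added, since paths grow quadratically while links grow linearly.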
