Source author record

Carter T. Butts

Carter T. Butts appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Social and Information Networks, Methodology, physics.soc-ph, physics.data-an, Applications, Computation, Networking and Internet Architecture, math.ST, Statistics Theory, cond-mat.stat-mech, cs.CY, Data Structures and Algorithms, Machine Learning, Populations and Evolution, stat.OT

Catalog footprint

What is connected

20 works

15 topics

4 close collaborators

preprint, 2022, arXiv

A Unified Prediction Framework for Signal Maps

Signal maps are essential for the planning and operation of cellular networks. However, the measurements needed to create such maps are expensive, often biased, do not always reflect the metrics of interest, and pose privacy risks. In this paper, we develop a unified framework for predicting cellular signal maps from limited measurements. Our framework builds on a state-of-the-art random-forest predictor, or any other base predictor. We propose and combine three mechanisms that deal with the fact that not all measurements are equally important for a particular prediction task. First, we design quality-of-service functions ($Q$), including signal strength (RSRP) but also other metrics of interest to operators, i.e., coverage and call drop probability. By implicitly altering the loss function employed in learning, quality functions can also improve prediction for RSRP itself where it matters (e.g., MSE reduction up to 27% in the low signal strength regime, where errors are critical). Second, we introduce weight functions ($W$) to specify the relative importance of prediction at different locations and other parts of the feature space. We propose re-weighting based on importance sampling to obtain unbiased estimators when the sampling and target distributions are different. This yields improvements up to 20% for targets based on spatially uniform loss or losses based on user population density. Third, we apply the Data Shapley framework for the first time in this context, to assign values ($\phi$) to individual measurement points, which capture the importance of their contribution to the prediction task. This improves prediction (e.g., from 64% to 94% in recall for coverage loss) by removing points with negative values, and can also enable data minimization. We evaluate our methods and demonstrate significant improvement in prediction performance, using several real-world datasets.
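The re-weighting idea in this abstract can be illustrated with a generic importance-sampling sketch. This is not the paper's implementation: the one-dimensional sampling density `p(x) = (2/3)(1 + x)`, the uniform target, the toy loss `x ** 2`, and the helper name `importance_weighted_mean` are all assumptions chosen so the answer can be checked by hand.

```python
import numpy as np

def importance_weighted_mean(losses, sampling_density, target_density):
    """Estimate the target-distribution mean of `losses` from points drawn
    under `sampling_density`, using weights w = target / sampling."""
    weights = target_density / sampling_density
    return float(np.mean(weights * losses))

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical biased measurement process on [0, 1]: p(x) = (2/3)(1 + x),
# drawn by inverting its CDF F(x) = (2/3)(x + x^2 / 2).
u = rng.random(n)
x = -1.0 + np.sqrt(1.0 + 3.0 * u)

p = (2.0 / 3.0) * (1.0 + x)   # density the points were actually sampled from
q = np.ones_like(x)           # target: spatially uniform density on [0, 1]
losses = x ** 2               # stand-in for a per-location prediction loss

naive = float(np.mean(losses))                      # biased toward large x
weighted = importance_weighted_mean(losses, p, q)   # corrected for the sampling bias
print(naive, weighted)
```

Under these assumptions the uniform-target mean of `x ** 2` is 1/3, while the unweighted average converges to 7/18 because the sampling density over-represents large `x`.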

preprint, 2022, arXiv

Modeling Complex Interactions in a Disrupted Environment: Relational Events in the WTC Response

When subjected to a sudden, unanticipated threat, human groups characteristically self-organize to identify the threat, determine potential responses, and act to reduce its impact. Central to this process is the challenge of coordinating information sharing and response activity within a disrupted environment. In this paper, we consider coordination in the context of responses to the 2001 World Trade Center disaster. Using records of communications among 17 organizational units, we examine the mechanisms driving communication dynamics, with an emphasis on the emergence of coordinating roles. We employ relational event models (REMs) to identify the mechanisms shaping communications in each unit, finding a consistent pattern of behavior across units with very different characteristics. Using a simulation-based "knock-out" study, we also probe the importance of different mechanisms for hub formation. Our results suggest that, while preferential attachment and pre-disaster role structure generally contribute to the emergence of hub structure, temporally local conversational norms play a much larger role. We discuss broader implications for the role of microdynamics in driving macroscopic outcomes, and for the emergence of coordination in other settings.

preprint, 2020, arXiv

Finite Mixtures of ERGMs for Modeling Ensembles of Networks

Ensembles of networks arise in many scientific fields, but there are few statistical tools for inferring their generative processes, particularly in the presence of both dyadic dependence and cross-graph heterogeneity. To fill in this gap, we propose characterizing network ensembles via finite mixtures of exponential family random graph models, a framework for parametric statistical modeling of graphs that has been successful in explicitly modeling the complex stochastic processes that govern the structure of edges in a network. Our proposed modeling framework can also be used for applications such as model-based clustering of ensembles of networks and density estimation for complex graph distributions. We develop a Metropolis-within-Gibbs algorithm to conduct fully Bayesian inference and adapt a version of deviance information criterion for missing data models to choose the number of latent heterogeneous generative mechanisms. Simulation studies show that the proposed procedure can recover the true number of latent heterogeneous generative processes and corresponding parameters. We demonstrate the utility of the proposed approach using an ensemble of political co-voting networks among U.S. Senators.

preprint, 2020, arXiv

Kernel-based Approximate Bayesian Inference for Exponential Family Random Graph Models

Bayesian inference for exponential family random graph models (ERGMs) is a doubly-intractable problem because of the intractability of both the likelihood and posterior normalizing factor. Auxiliary variable based Markov chain Monte Carlo (MCMC) methods for this problem are asymptotically exact but computationally demanding, and are difficult to extend to modified ERGM families. In this work, we propose a kernel-based approximate Bayesian computation algorithm for fitting ERGMs. By employing an adaptive importance sampling technique, we greatly improve the efficiency of the sampling step. Though approximate, our easily parallelizable approach yields comparable accuracy to state-of-the-art methods with substantial improvements in compute time on multi-core hardware. Our approach also flexibly accommodates both algorithmic enhancements (including improved learning algorithms for estimating conditional expectations) and extensions to non-standard cases such as inference from non-sufficient statistics. We demonstrate the performance of this approach on two well-known network data sets, comparing its accuracy and efficiency with results obtained using the approximate exchange algorithm. Our tests show a wall-clock time advantage of up to 50% with five cores, and the ability to fit models in 1/5th the time at 30 cores; further speed enhancements are possible when more cores are available.

preprint, 2020, arXiv

Phase Transitions in the Edge/Concurrent Vertex Model

Although it is well-known that some exponential family random graph model (ERGM) families exhibit phase transitions (in which small parameter changes lead to qualitative changes in graph structure), the behavior of other models is still poorly understood. Recently, Krivitsky and Morris have reported a previously unobserved phase transition in the edge/concurrent vertex family (a simple starting point for models of sexual contact networks). Here, we examine this phase transition, showing it to be a first order transition with respect to an order parameter associated with the fraction of concurrent vertices. This transition stems from weak cooperativity in the recruitment of vertices to the concurrent phase, which may not be a desirable property in some applications.

preprint, 2020, arXiv

Spatial Heterogeneity Can Lead to Substantial Local Variations in COVID-19 Timing and Severity

Standard epidemiological models for COVID-19 employ variants of compartment (SIR) models at local scales, implicitly assuming spatially uniform local mixing. Here, we examine the effect of employing more geographically detailed diffusion models based on known spatial features of interpersonal networks, most particularly the presence of a long-tailed but monotone decline in the probability of interaction with distance, on disease diffusion. Based on simulations of unrestricted COVID-19 diffusion in 19 U.S. cities, we conclude that heterogeneity in population distribution can have large impacts on local pandemic timing and severity, even when aggregate behavior at larger scales mirrors a classic SIR-like pattern. Impacts observed include severe local outbreaks with long lag time relative to the aggregate infection curve, and the presence of numerous areas whose disease trajectories correlate poorly with those of neighboring areas. A simple catchment model for hospital demand illustrates potential implications for health care utilization, with substantial disparities in the timing and extremity of impacts even without distancing interventions. Likewise, analysis of social exposure to others who are morbid or deceased shows considerable variation in how the epidemic can appear to individuals on the ground, potentially affecting risk assessment and compliance with mitigation measures. These results demonstrate the potential for spatial network structure to generate highly non-uniform diffusion behavior even at the scale of cities, and suggest the importance of incorporating such structure when designing models to inform healthcare planning, predict community outcomes, or identify potential disparities.

preprint, 2019, arXiv

A Dynamic Process Reference Model for Sparse Networks with Reciprocity

Many social and other networks exhibit stable size scaling relationships, such that features such as mean degree or reciprocation rates change slowly or are approximately constant as the number of vertices increases. Statistical network models built on top of simple Bernoulli baseline (or reference) measures often behave unrealistically in this respect, leading to the development of sparse reference models that preserve features such as mean degree scaling. In this paper, we generalize recent work on the micro-foundations of such reference models to the case of sparse directed graphs with non-vanishing reciprocity, providing a dynamic process interpretation of the emergence of stable macroscopic behavior.

preprint, 2018, arXiv

A Dynamic Process Interpretation of the Sparse ERGM Reference Model

Exponential family random graph models (ERGMs) can be understood in terms of a set of structural biases that act on an underlying reference distribution. This distribution determines many aspects of the behavior and interpretation of the ERGM families incorporating it. One important innovation in this area has been the development of an ERGM reference model that produces realistic behavior when generalized to sparse networks of varying size. Here, we show that this model can be derived from a latent dynamic process in which tie formation takes place within small local settings between which individuals move. This derivation provides one possible micro-process interpretation of the sparse ERGM reference model, and sheds light on the conditions under which constant mean degree scaling can emerge.
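For readers unfamiliar with the terminology, the general ERGM form that this abstract alludes to can be written as follows. The notation is an assumption, not taken from the paper: $h$ is the reference measure, $t$ the vector of sufficient statistics (the "structural biases"), $\theta$ the parameter vector, and $\mathcal{G}$ the set of possible graphs.

```latex
\Pr(G = g \mid \theta)
  = \frac{h(g)\,\exp\!\big(\theta^{\top} t(g)\big)}
         {\sum_{g' \in \mathcal{G}} h(g')\,\exp\!\big(\theta^{\top} t(g')\big)},
  \qquad g \in \mathcal{G}.
```

In this notation, a constant $h$ corresponds to the simple Bernoulli-style baseline, and a sparse reference model amounts to a different choice of $h$ whose form depends on network size.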

preprint, 2017, arXiv

A Perfect Sampling Method for Exponential Family Random Graph Models

Generation of deviates from random graph models with non-trivial edge dependence is an increasingly important problem. Here, we introduce a method which allows perfect sampling from random graph models in exponential family form ("exponential family random graph" models), using a variant of Coupling From The Past. We illustrate the use of the method via an application to the Markov graphs, a family that has been the subject of considerable research. We also show how the method can be applied to a variant of the biased net models, which are not exponentially parameterized.

preprint, 2016, arXiv

Are you going to the party: depends, who else is coming? [Learning hidden group dynamics via conditional latent tree models]

Scalable probabilistic modeling and prediction in high dimensional multivariate time-series is a challenging problem, particularly for systems with hidden sources of dependence and/or homogeneity. Examples of such problems include dynamic social networks with co-evolving nodes and edges and dynamic student learning in online courses. Here, we address these problems through the discovery of hierarchical latent groups. We introduce a family of Conditional Latent Tree Models (CLTM), in which tree-structured latent variables incorporate the unknown groups. The latent tree itself is conditioned on observed covariates such as seasonality, historical activity, and node attributes. We propose a statistically efficient framework for learning both the hierarchical tree structure and the parameters of the CLTM. We demonstrate competitive performance in multiple real world datasets from different domains. These include a dataset on students' attempts at answering questions in a psychology MOOC, Twitter users participating in an emergency management discussion and interacting with one another, and windsurfers interacting on a beach in Southern California. In addition, our modeling framework provides valuable and interpretable information about the hidden group structures and their effect on the evolution of the time series.

preprint, 2015, arXiv

Estimating Subgraph Frequencies with or without Attributes from Egocentrically Sampled Data

In this paper we show how to efficiently produce unbiased estimates of subgraph frequencies from a probability sample of egocentric networks (i.e., focal nodes, their neighbors, and the induced subgraphs of ties among their neighbors). A key feature of our proposed method that differentiates it from prior methods is the use of egocentric data. Because of this, our method is suitable for estimation in large unknown graphs, is easily parallelizable, handles privacy sensitive network data (e.g. egonets with no neighbor labels), and supports counting of large subgraphs (e.g. maximal clique of size 205 in Section 6) by building on top of existing exact subgraph counting algorithms that may not support sampling. It gracefully handles a variety of sampling designs such as uniform or weighted independence or random walk sampling. Our method can be used for subgraphs that are: (i) undirected or directed; (ii) induced or non-induced; (iii) maximal or non-maximal; and (iv) potentially annotated with attributes. We compare our estimators on a variety of real-world graphs and sampling methods and provide suggestions for their use. Simulation shows that our method outperforms the state-of-the-art approach for relative subgraph frequencies by up to an order of magnitude for the same sample size. Finally, we apply our methodology to a rare sample of Facebook users across the social graph to estimate and interpret the clique size distribution and gender composition of cliques.

preprint, 2014, arXiv

ergm.graphlets: A Package for ERG Modeling Based on Graphlet Statistics

Exponential-family random graph models (ERGMs) are probabilistic network models that are parametrized by sufficient statistics based on structural (i.e., graph-theoretic) properties. The ergm package for the R statistical computing system is a collection of tools for the analysis of network data within an ERGM framework. Many different network properties can be employed as sufficient statistics for ERGMs by using the model terms defined in the ergm package; this functionality can be expanded by the creation of packages that code for additional network statistics. Here, our focus is on the addition of statistics based on graphlets. Graphlets are small, connected, and non-isomorphic induced subgraphs that describe the topological structure of a network. We introduce an R package called ergm.graphlets that enables the use of graphlet properties of a network within the ergm package of R. The ergm.graphlets package provides a complete list of model terms that allow statistics of any 2-, 3-, 4- and 5-node graphlet to be incorporated into ERGMs. The new model terms of the ergm.graphlets package enable both ERG modelling of global structural properties and investigation of relationships between nodal attributes (i.e., covariates) and local topologies around nodes.

preprint, 2013, arXiv

Estimating Clique Composition and Size Distributions from Sampled Network Data

Cliques are defined as complete graphs or subgraphs; they are the strongest form of cohesive subgroup, and are of interest in both social science and engineering contexts. In this paper we show how to efficiently estimate the distribution of clique sizes from a probability sample of nodes obtained from a graph (e.g., by independence or link-trace sampling). We introduce two types of unbiased estimators, one of which exploits labeling of sampled nodes' neighbors and one of which does not require this information. We compare the estimators on a variety of real-world graphs and provide suggestions for their use. We generalize our estimators to cases in which cliques are distinguished not only by size but also by node attributes, allowing us to estimate clique composition by size. Finally, we apply our methodology to a sample of Facebook users to estimate the clique size distribution by gender over the social graph.

preprint, 2012, arXiv

Graph Size Estimation

Many online networks are not fully known and are often studied via sampling. Random Walk (RW) based techniques are the current state-of-the-art for estimating nodal attributes and local graph properties, but estimating global properties remains a challenge. In this paper, we are interested in a fundamental property of this type: the graph size N, i.e., the number of its nodes. Existing methods for estimating N are (i) inefficient and (ii) not easily used with RW sampling due to dependence between successive samples. In this paper, we address both problems. First, we propose IE (Induced Edges), an efficient technique for estimating N from an independence sample of the graph's nodes. IE exploits the edges induced on the sampled nodes. Second, we introduce SafetyMargin, a method that corrects estimators for dependence in RW samples. Finally, we combine these two stand-alone techniques to obtain an RW-based graph size estimator. We evaluate our approach in simulations on a wide range of real-life topologies, and on several samples of Facebook. IE with SafetyMargin typically requires at least 10 times fewer samples than the state-of-the-art techniques (over 100 times in the case of Facebook) for the same estimation error.
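To give a feel for how edges induced on a node sample carry information about N, here is a simple moment-matching sketch. It is not the paper's IE estimator, whose exact form is not given in the abstract. It assumes a uniform sample of n nodes drawn without replacement, with each sampled node's degree observable. Under that assumption the expected number of induced edges is roughly (mean degree) * n * (n - 1) / (2 * (N - 1)), which can be solved for N. The function name `estimate_graph_size` and the synthetic graph are hypothetical.

```python
import random

def estimate_graph_size(adjacency, sample):
    """Moment-matching size estimate from a uniform node sample (illustrative only)."""
    sampled = set(sample)
    n = len(sampled)
    degree_sum = sum(len(adjacency[v]) for v in sampled)
    # Each induced edge is counted once from each endpoint, so halve the count.
    induced_edges = sum(1 for v in sampled for w in adjacency[v] if w in sampled) // 2
    if induced_edges == 0:
        raise ValueError("no induced edges observed; a larger sample is needed")
    return 1.0 + degree_sum * (n - 1) / (2.0 * induced_edges)

# Synthetic test graph (an assumption for the demo): 2000 nodes, about 20000 random edges.
rng = random.Random(1)
true_size = 2000
adjacency = {v: set() for v in range(true_size)}
for _ in range(20_000):
    a, b = rng.sample(range(true_size), 2)
    adjacency[a].add(b)
    adjacency[b].add(a)

sample = rng.sample(range(true_size), 400)
estimate = estimate_graph_size(adjacency, sample)
print(round(estimate))
```

As a sanity check on the formula, sampling every node gives a degree sum of twice the edge count and recovers N exactly. The sketch does not address dependence between random-walk samples, which is the problem the abstract attributes to SafetyMargin.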

preprint, 2012, arXiv

Hierarchical Models for Relational Event Sequences

Interaction within small groups can often be represented as a sequence of events, where each event involves a sender and a recipient. Recent methods for modeling network data in continuous time model the rate at which individuals interact conditioned on the previous history of events as well as actor covariates. We present a hierarchical extension for modeling multiple such sequences, facilitating inferences about event-level dynamics and their variation across sequences. The hierarchical approach allows one to share information across sequences in a principled manner; we illustrate the efficacy of such sharing through a set of prediction experiments. After discussing methods for adequacy checking and model selection for this class of models, the method is illustrated with an analysis of high school classroom dynamics.
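A common way to write the kind of model described here, offered as a sketch in assumed notation rather than the paper's exact specification: $\lambda_{ij}(t)$ is the rate of events from sender $i$ to recipient $j$, $s(i, j, H_t)$ is a vector of statistics computed from the event history $H_t$ and actor covariates, and $\theta_m$ is the parameter vector for sequence $m$. The Gaussian population distribution on $\theta_m$ is one illustrative choice of hierarchy, and the middle expression assumes rates stay constant between events.

```latex
\lambda_{ij}(t) = \exp\!\big(\theta_m^{\top} s(i, j, H_t)\big),
\qquad
\Pr\big(\text{next event is } (i, j) \mid H_t\big)
  = \frac{\lambda_{ij}(t)}{\sum_{(k, l)} \lambda_{kl}(t)},
\qquad
\theta_m \sim \mathcal{N}(\mu, \Sigma), \quad m = 1, \dots, M.
```

Sharing $\mu$ and $\Sigma$ across the $M$ sequences is what lets short sequences borrow strength from the others.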

preprint, 2011, arXiv

A Walk in Facebook: Uniform Sampling of Users in Online Social Networks

Our goal in this paper is to develop a practical framework for obtaining a uniform sample of users in an online social network (OSN) by crawling its social graph. Such a sample allows one to estimate any user property and some topological properties as well. To this end, first, we consider and compare several candidate crawling techniques. Two approaches that can produce approximately uniform samples are the Metropolis-Hastings random walk (MHRW) and a re-weighted random walk (RWRW). Both have pros and cons, which we demonstrate through a comparison to each other as well as to the "ground truth." In contrast, using Breadth-First-Search (BFS) or an unadjusted Random Walk (RW) leads to substantially biased results. Second, and in addition to offline performance assessment, we introduce online formal convergence diagnostics to assess sample quality during the data collection process. We show how these diagnostics can be used to effectively determine when a random walk sample is of adequate size and quality. Third, as a case study, we apply the above methods to Facebook and we collect the first, to the best of our knowledge, representative sample of Facebook users. We make it publicly available and employ it to characterize several key properties of Facebook.
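The two corrections named in this abstract can be sketched on a toy graph. This is a textbook-style illustration, not the authors' crawler: the ten-node graph (a hub attached to a nine-node ring), the function names, and the step counts are assumptions. MHRW accepts a proposed move from `v` to neighbor `w` with probability min(1, deg(v) / deg(w)), which makes the walk's stationary distribution uniform. RWRW runs a plain random walk and re-weights each visit by 1 / degree.

```python
import random

def metropolis_hastings_walk(adjacency, start, steps, rng):
    """Random walk with a uniform stationary distribution on a connected undirected graph."""
    current, visited = start, []
    for _ in range(steps):
        candidate = rng.choice(adjacency[current])
        # Accept with probability min(1, deg(current) / deg(candidate)); otherwise stay.
        if rng.random() < len(adjacency[current]) / len(adjacency[candidate]):
            current = candidate
        visited.append(current)
    return visited

def simple_walk(adjacency, start, steps, rng):
    """Unadjusted random walk; visits nodes in proportion to their degree."""
    current, visited = start, []
    for _ in range(steps):
        current = rng.choice(adjacency[current])
        visited.append(current)
    return visited

def reweighted_mean(values, degrees):
    """Degree-corrected ratio estimate of a node-level mean from simple-walk visits."""
    weights = [1.0 / d for d in degrees]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

# Toy graph (hypothetical): hub 0 is tied to nodes 1..9, which also form a ring.
adjacency = {0: list(range(1, 10))}
for v in range(1, 10):
    adjacency[v] = [0, 9 if v == 1 else v - 1, 1 if v == 9 else v + 1]

rng = random.Random(42)
steps = 200_000
mh_visits = metropolis_hastings_walk(adjacency, 1, steps, rng)
rw_visits = simple_walk(adjacency, 1, steps, rng)

# Target quantity: the share of nodes that are the hub, which is 1/10 by construction.
mhrw_hub_share = mh_visits.count(0) / steps
raw_rw_hub_share = rw_visits.count(0) / steps
rwrw_hub_share = reweighted_mean(
    [1.0 if v == 0 else 0.0 for v in rw_visits],
    [len(adjacency[v]) for v in rw_visits],
)
print(mhrw_hub_share, raw_rw_hub_share, rwrw_hub_share)
```

On this graph the unadjusted walk spends about a quarter of its time at the hub (9 of 36 edge ends), while both corrected estimates should land near the true share of one tenth.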

preprint, 2011, arXiv

Coarse-Grained Topology Estimation via Graph Sampling

Many online networks are measured and studied via sampling techniques, which typically collect a relatively small fraction of nodes and their associated edges. Past work in this area has primarily focused on obtaining a representative sample of nodes and on efficient estimation of local graph properties (such as node degree distribution or any node attribute) based on that sample. However, less is known about estimating the global topology of the underlying graph. In this paper, we show how to efficiently estimate the coarse-grained topology of a graph from a probability sample of nodes. In particular, we consider that nodes are partitioned into categories (e.g., countries or work/study places in OSNs), which naturally defines a weighted category graph. We are interested in estimating (i) the size of categories and (ii) the probability that nodes from two different categories are connected. For each of the above, we develop a family of estimators for design-based inference under uniform or non-uniform sampling, employing either of two measurement strategies: induced subgraph sampling, which relies only on information about the sampled nodes; and star sampling, which also exploits category information about the neighbors of sampled nodes. We prove consistency of these estimators and evaluate their efficiency via simulation on fully known graphs. We also apply our methodology to a sample of Facebook users to obtain a number of category graphs, such as the college friendship graph and the country friendship graph; we share and visualize the resulting data at www.geosocialmap.com.

preprint, 2011, arXiv

Contending Parties: A Logistic Choice Analysis of Inter- and Intra-group Blog Citation Dynamics in the 2004 US Presidential Election

The 2004 US Presidential Election cycle marked the debut of Internet-based media such as blogs and social networking websites as institutionally recognized features of the American political landscape. Using a longitudinal sample of all DNC/RNC-designated blog-citation networks we are able to test the influence of various strategic, institutional, and balance-theoretic mechanisms and exogenous factors such as seasonality and political events on the propensity of blogs to cite one another over time. Capitalizing on the temporal resolution of our data, we utilize an autoregressive network regression framework to carry out inference for a logistic choice process. Using a combination of deviance-based model selection criteria and simulation-based model adequacy tests, we identify the combination of processes that best characterizes the choice behavior of the contending blogs.

preprint, 2011, arXiv

Logistic Network Regression for Scalable Analysis of Networks with Joint Edge/Vertex Dynamics

Network dynamics may be viewed as a process of change in the edge structure of a network, in the vertex set on which edges are defined, or in both simultaneously. Though early studies of such processes were primarily descriptive, recent work on this topic has increasingly turned to formal statistical models. While showing great promise, many of these modern dynamic models are computationally intensive and scale very poorly in the size of the network under study and/or the number of time points considered. Likewise, currently employed models focus on edge dynamics, with little support for endogenously changing vertex sets. Here, we show how an existing approach based on logistic network regression can be extended to serve as a highly scalable framework for modeling large networks with dynamic vertex sets. We place this approach within a general dynamic exponential family (ERGM) context, clarifying the assumptions underlying the framework (and providing a clear path for extensions), and show how model assessment methods for cross-sectional networks can be extended to the dynamic case. Finally, we illustrate this approach on a classic data set involving interactions among windsurfers on a California beach.

preprint, 2011, arXiv

Multigraph Sampling of Online Social Networks

State-of-the-art techniques for probability sampling of users of online social networks (OSNs) are based on random walks on a single social relation (typically friendship). While powerful, these methods rely on the social graph being fully connected. Furthermore, the mixing time of the sampling process strongly depends on the characteristics of this graph. In this paper, we observe that there often exist other relations between OSN users, such as membership in the same group or participation in the same event. We propose to exploit the graphs these relations induce, by performing a random walk on their union multigraph. We design a computationally efficient way to perform multigraph sampling by randomly selecting the graph on which to walk at each iteration. We demonstrate the benefits of our approach through (i) simulation in synthetic graphs, and (ii) measurements of Last.fm - an Internet website for music with social networking features. More specifically, we show that multigraph sampling can obtain a representative sample and faster convergence, even when the individual graphs fail, i.e., are disconnected or highly clustered.
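The idea of walking on a union multigraph by picking a relation at each step can be sketched as follows. The abstract does not say how the relation is selected. This sketch assumes it is chosen with probability proportional to the current node's degree in that relation, followed by a uniform neighbor draw within it, which is equivalent to picking uniformly among all of the node's edges in the union multigraph. The two toy relations and the function name `multigraph_walk` are hypothetical.

```python
import random

def multigraph_walk(relations, start, steps, rng):
    """Random walk on the union multigraph without materializing the union."""
    names = list(relations)
    current, visited = start, []
    for _ in range(steps):
        # Choose a relation with probability proportional to the node's degree in it ...
        degrees = [len(relations[name][current]) for name in names]
        chosen = rng.choices(names, weights=degrees, k=1)[0]
        # ... then move to a uniformly chosen neighbor within that relation.
        current = rng.choice(relations[chosen][current])
        visited.append(current)
    return visited

# Each relation alone is disconnected; their union is connected through the edge 2-3.
relations = {
    "friendship": {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]},
    "same_group": {0: [], 1: [], 2: [3], 3: [2], 4: [], 5: []},
}

rng = random.Random(7)
steps = 50_000
visits = multigraph_walk(relations, 0, steps, rng)
reached = set(visits)
share_node_2 = visits.count(2) / steps
print(sorted(reached), share_node_2)
```

A walk confined to "friendship" alone could never leave the triangle it starts in. On the union, visit frequencies are proportional to total degree across relations (3 of 14 edge ends for node 2 here), so degree-based re-weighting is still needed to obtain uniform estimates.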
