Source author record

Aaron Clauset

Aaron Clauset appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher, unclaimed source record

physics.data-an, physics.soc-ph, Social and Information Networks, Machine Learning, Applications, cs.CY, Quantitative Methods, cond-mat.dis-nn, Methodology, Molecular Networks, Populations and Evolution, Human-Computer Interaction, nlin.AO, Biological Physics, Digital Libraries, Genomics

Catalog footprint

What is connected

33 works

16 topics

4 close collaborators

preprint 2022 arXiv

An Open-Source Cultural Consensus Approach to Name-Based Gender Classification

Name-based gender classification has enabled hundreds of otherwise infeasible scientific studies of gender. Yet, the lack of standardization, proliferation of ad hoc methods, reliance on paid services, understudied limitations, and conceptual debates cast a shadow over many applications. To address these problems we develop and evaluate an ensemble-based open-source method built on publicly available data of empirical name-gender associations. Our method integrates 36 distinct sources, spanning over 150 countries and more than a century, via a meta-learning algorithm inspired by Cultural Consensus Theory (CCT). We also construct a taxonomy with which names themselves can be classified. We find that our method's performance is competitive with paid services and that our method, and others, approach the upper limits of performance; we show that conditioning estimates on additional metadata (e.g. cultural context), further combining methods, or collecting additional name-gender association data is unlikely to meaningfully improve performance. This work definitively shows that name-based gender classification can be a reliable part of scientific research and provides a pair of tools, a classification method and a taxonomy of names, that realize this potential.

preprint 2022 arXiv

Labor advantages drive the greater productivity of faculty at elite universities

Faculty at prestigious institutions dominate scientific discourse, with the small proportion of researchers at elite universities producing a disproportionate share of all research publications. Environmental prestige is known to drive such epistemic disparity, but the mechanisms by which it causes increased faculty productivity remain unknown. Here we combine employment, publication, and federal survey data for 78,802 tenure-track faculty at 262 PhD-granting institutions in the American university system between 2008 and 2017 to show through multiple lines of evidence that the greater availability of funded graduate and postdoctoral labor at more prestigious institutions drives the environmental effect of prestige on productivity. In particular, we show that greater environmental prestige leads to larger faculty-led research groups, which drive higher faculty productivity, primarily in disciplines with research group collaboration norms. In contrast, we show that productivity does not increase substantially with prestige either for faculty papers published without group members or for group members themselves. The disproportionate scientific productivity of elite researchers is thus largely explained by their substantial labor advantage, indicating a more limited role for prestige itself in predicting scientific contributions.

preprint 2022 arXiv

Subfield prestige and gender inequality in computing

Women and people of color remain dramatically underrepresented among computing faculty, and improvements in demographic diversity are slow and uneven. Effective diversification strategies depend on quantifying the correlates, causes, and trends of diversity in the field. But field-level demographic changes are driven by subfield hiring dynamics because faculty searches are typically at the subfield level. Here, we quantify and forecast variations in the demographic composition of the subfields of computing using a comprehensive database of training and employment records for 6882 tenure-track faculty from 269 PhD-granting computing departments in the United States, linked with 327,969 publications. We find that subfield prestige correlates with gender inequality, such that faculty working in computing subfields with more women tend to hold positions at less prestigious institutions. In contrast, we find no significant evidence of racial or socioeconomic differences by subfield. Tracking representation over time, we find steady progress toward gender equality in all subfields, but more prestigious subfields tend to be roughly 25 years behind the less prestigious subfields in gender representation. These results illustrate how the choice of subfield in a faculty search can shape a department's gender diversity.

preprint 2021 arXiv

The Dynamics of Faculty Hiring Networks

Faculty hiring networks (who hires whose graduates as faculty) exhibit steep hierarchies, which can reinforce both social and epistemic inequalities in academia. Understanding the mechanisms driving these patterns would inform efforts to diversify the academy and shed new light on the role of hiring in shaping which scientific discoveries are made. Here, we investigate the degree to which structural mechanisms can explain hierarchy and other network characteristics observed in empirical faculty hiring networks. We study a family of adaptive rewiring network models, which reinforce institutional prestige within the hierarchy in five distinct ways. Each mechanism determines the probability that a new hire comes from a particular institution according to that institution's prestige score, which is inferred from the hiring network's existing structure. We find that structural inequalities and centrality patterns in real hiring networks are best reproduced by a mechanism of global placement power, in which a new hire is drawn from a particular institution in proportion to the number of previously drawn hires anywhere. On the other hand, network measures of biased visibility are better recapitulated by a mechanism of local placement power, in which a new hire is drawn from a particular institution in proportion to the number of its previous hires already present at the hiring institution. These contrasting results suggest that the underlying structural mechanism reinforcing hierarchies in faculty hiring networks is a mixture of global and local preference for institutional prestige. Under these dynamics, we show that each institution's position in the hierarchy is remarkably stable, due to a dynamic competition that overwhelmingly favors more prestigious institutions.

preprint 2019 arXiv

Evaluating Overfit and Underfit in Models of Network Community Structure

A common data mining task on networks is community detection, which seeks an unsupervised decomposition of a network into structural groups based on statistical regularities in the network's connectivity. Although many methods exist, the No Free Lunch theorem for community detection implies that each makes some kind of tradeoff, and no algorithm can be optimal on all inputs. Thus, different algorithms will over- or underfit on different inputs, finding more, fewer, or just different communities than is optimal, and evaluation methods that use a metadata partition as a ground truth will produce misleading conclusions about general accuracy. Here, we present a broad evaluation of over- and underfitting in community detection, comparing the behavior of 16 state-of-the-art community detection algorithms on a novel and structurally diverse corpus of 406 real-world networks. We find that (i) algorithms vary widely both in the number of communities they find and in their corresponding composition, given the same input, (ii) algorithms can be clustered into distinct high-level groups based on similarities of their outputs on real-world networks, and (iii) these differences induce wide variation in accuracy on link prediction and link description tasks. We introduce a new diagnostic for evaluating overfitting and underfitting in practice, and use it to roughly divide community detection methods into general and specialized learning algorithms. Across methods and inputs, Bayesian techniques based on the stochastic block model and a minimum description length approach to regularization represent the best general learning approach, but can be outperformed under specific circumstances. These results introduce both a theoretically principled approach to evaluate over- and underfitting in models of network community structure and a realistic benchmark by which new methods may be evaluated and compared.

preprint 2016 arXiv

Eigenvector-Based Centrality Measures for Temporal Networks

Numerous centrality measures have been developed to quantify the importances of nodes in time-independent networks, and many of them can be expressed as the leading eigenvector of some matrix. With the increasing availability of network data that changes in time, it is important to extend such eigenvector-based centrality measures to time-dependent networks. In this paper, we introduce a principled generalization of network centrality measures that is valid for any eigenvector-based centrality. We consider a temporal network with N nodes as a sequence of T layers that describe the network during different time windows, and we couple centrality matrices for the layers into a supra-centrality matrix of size NT × NT whose dominant eigenvector gives the centrality of each node i at each time t. We refer to this eigenvector and its components as a joint centrality, as it reflects the importances of both the node i and the time layer t. We also introduce the concepts of marginal and conditional centralities, which facilitate the study of centrality trajectories over time. We find that the strength of coupling between layers is important for determining multiscale properties of centrality, such as localization phenomena and the time scale of centrality changes. In the strong-coupling regime, we derive expressions for time-averaged centralities, which are given by the zeroth-order terms of a singular perturbation expansion. We also study first-order terms to obtain first-order-mover scores, which concisely describe the magnitude of nodes' centrality changes over time. As examples, we apply our method to three empirical temporal networks: the United States Ph.D. exchange in mathematics, costarring relationships among top-billed actors during the Golden Age of Hollywood, and citations of decisions from the United States Supreme Court.
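
The supra-centrality construction described in this abstract can be illustrated with a small sketch. It is not the authors' implementation: the coupling scheme (a weight `omega` linking each node to its own copy in the adjacent layers), the function name `supra_centrality`, and the toy 3-node path network are all assumptions made here for illustration. Each layer's adjacency matrix stands in for its centrality matrix, which corresponds to eigenvector centrality.

```python
def supra_centrality(layers, omega, iters=500):
    """Joint centralities from a supra-centrality matrix (illustrative sketch).

    layers: list of T centrality matrices, each N x N (lists of lists).
    omega:  assumed coupling weight between copies of a node in adjacent layers.
    Returns joint[t][i], normalized so that all entries sum to 1.
    """
    T, N = len(layers), len(layers[0])
    size = N * T
    M = [[0.0] * size for _ in range(size)]
    for t, C in enumerate(layers):
        for i in range(N):
            for j in range(N):
                M[t * N + i][t * N + j] = float(C[i][j])  # diagonal block for layer t
            if t + 1 < T:  # couple node i in layer t with node i in layer t + 1
                M[t * N + i][(t + 1) * N + i] = omega
                M[(t + 1) * N + i][t * N + i] = omega
    # Power iteration on M + I: the shift keeps the eigenvectors of M
    # and prevents the iteration from oscillating.
    v = [1.0 / size] * size
    for _ in range(iters):
        w = [v[r] + sum(M[r][c] * v[c] for c in range(size)) for r in range(size)]
        total = sum(w)
        v = [x / total for x in w]
    return [v[t * N:(t + 1) * N] for t in range(T)]


# Hypothetical toy input: the same 3-node path observed in two time layers.
path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
joint = supra_centrality([path, path], omega=1.0)
marginal_node = [sum(joint[t][i] for t in range(2)) for i in range(3)]
marginal_layer = [sum(layer) for layer in joint]
```

Summing the joint centralities over layers or over nodes, as in the last two lines, is one simple way to obtain marginal scores per node or per layer.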

preprint 2016 arXiv

Gender, Productivity, and Prestige in Computer Science Faculty Hiring Networks

Women are dramatically underrepresented in computer science at all levels in academia and account for just 15% of tenure-track faculty. Understanding the causes of this gender imbalance would inform both policies intended to rectify it and employment decisions by departments and individuals. Progress in this direction, however, is complicated by the complexity and decentralized nature of faculty hiring and the non-independence of hires. Using comprehensive data on both hiring outcomes and scholarly productivity for 2659 tenure-track faculty across 205 Ph.D.-granting departments in North America, we investigate the multi-dimensional nature of gender inequality in computer science faculty hiring through a network model of the hiring process. Overall, we find that hiring outcomes are most directly affected by (i) the relative prestige between hiring and placing institutions and (ii) the scholarly productivity of the candidates. After including these, and other features, the addition of gender did not significantly reduce modeling error. However, gender differences do exist, e.g., in scholarly productivity, postdoctoral training rates, and in career movements up the rankings of universities, suggesting that the effects of gender are indirectly incorporated into hiring decisions through gender's covariates. Furthermore, we find evidence that more highly ranked departments recruit female faculty at higher than expected rates, which appears to inhibit similar efforts by lower ranked departments. These findings illustrate the subtle nature of gender inequality in faculty hiring networks and provide new insights to the underrepresentation of women in computer science.

preprint 2015 arXiv

Assembling thefacebook: Using heterogeneity to understand online social network assembly

Online social networks represent a popular and diverse class of social media systems. Despite this variety, each of these systems undergoes a general process of online social network assembly, which represents the complicated and heterogeneous changes that transform newly born systems into mature platforms. However, little is known about this process. For example, how much of a network's assembly is driven by simple growth? How does a network's structure change as it matures? How does network structure vary with adoption rates and user heterogeneity, and do these properties play different roles at different points in the assembly? We investigate these and other questions using a unique dataset of online connections among the roughly one million users at the first 100 colleges admitted to Facebook, captured just 20 months after its launch. We first show that different vintages and adoption rates across this population of networks reveal temporal dynamics of the assembly process, and that assembly is only loosely related to network growth. We then exploit natural experiments embedded in this dataset and complementary data obtained via Internet archaeology to show that different subnetworks matured at different rates toward similar end states. These results shed light on the processes and patterns of online social network assembly, and may facilitate more effective design for online social systems.

preprint 2015 arXiv

Detectability thresholds and optimal algorithms for community structure in dynamic networks

We study the fundamental limits on learning latent community structure in dynamic networks. Specifically, we study dynamic stochastic block models where nodes change their community membership over time, but where edges are generated independently at each time step. In this setting (which is a special case of several existing models), we are able to derive the detectability threshold exactly, as a function of the rate of change and the strength of the communities. Below this threshold, we claim that no algorithm can identify the communities better than chance. We then give two algorithms that are optimal in the sense that they succeed all the way down to this limit. The first uses belief propagation (BP), which gives asymptotically optimal accuracy, and the second is a fast spectral clustering algorithm, based on linearizing the BP equations. We verify our analytic and algorithmic results via numerical simulation, and close with a brief discussion of extensions and open questions.

preprint 2015 arXiv

Predicting sports scoring dynamics with restoration and anti-persistence

Professional team sports provide an excellent domain for studying the dynamics of social competitions. These games are constructed with simple, well-defined rules and payoffs that admit a high-dimensional set of possible actions and nontrivial scoring dynamics. The resulting gameplay and efforts to predict its evolution are the object of great interest to both sports professionals and enthusiasts. In this paper, we consider two online prediction problems for team sports: given a partially observed game, who will score next, and ultimately, who will win? We present novel interpretable generative models of within-game scoring that allow for dependence on lead size (restoration) and on the last team to score (anti-persistence). We then apply these models to comprehensive within-game scoring data for four sports leagues over a ten year period. By assessing these models' relative goodness-of-fit we shed new light on the underlying mechanisms driving the observed scoring dynamics of each sport. Furthermore, in both predictive tasks, our models consistently outperform baseline models, and our models make quantitative assessments of the latent team skill over time.

preprint 2015 arXiv

Structure and inference in annotated networks

For many networks of scientific interest we know both the connections of the network and information about the network nodes, such as the age or gender of individuals in a social network, geographic location of nodes in the Internet, or cellular function of nodes in a gene regulatory network. Here we demonstrate how this "metadata" can be used to improve our analysis and understanding of network structure. We focus in particular on the problem of community detection in networks and develop a mathematically principled approach that combines a network and its metadata to detect communities more accurately than can be done with either alone. Crucially, the method does not assume that the metadata are correlated with the communities we are trying to find. Instead the method learns whether a correlation exists and correctly uses or ignores the metadata depending on whether they contain useful information. The learned correlations are also of interest in their own right, allowing us to make predictions about the community membership of nodes whose network connections are unknown. We demonstrate our method on synthetic networks with known structure and on real-world networks, large and small, drawn from social, biological, and technological domains.

preprint 2015 arXiv

Untangling the roles of parasites in food webs with generative network models

Food webs represent the set of consumer-resource interactions among a set of species that co-occur in a habitat, but most food web studies have omitted parasites and their interactions. Recent studies have provided conflicting evidence on whether including parasites changes food web structure, with some suggesting that parasitic interactions are structurally distinct from those among free-living species while others claim the opposite. Here, we describe a principled method for understanding food web structure that combines an efficient optimization algorithm from statistical physics called parallel tempering with a probabilistic generalization of the empirically well-supported food web niche model. This generative model approach allows us to rigorously estimate the degree to which interactions that involve parasites are statistically distinguishable from interactions among free-living species, whether parasite niches behave similarly to free-living niches, and the degree to which existing hypotheses about food web structure are naturally recovered. We apply this method to the well-studied Flensburg Fjord food web and show that while predation on parasites, concomitant predation of parasites, and parasitic intraguild trophic interactions are largely indistinguishable from free-living predation interactions, parasite-host interactions are different. These results provide a powerful new tool for evaluating the impact of classes of species and interactions on food web structure to shed new light on the roles of parasites in food webs.

preprint 2014 arXiv

A unified view of generative models for networks: models, methods, opportunities, and challenges

Research on probabilistic models of networks now spans a wide variety of fields, including physics, sociology, biology, statistics, and machine learning. These efforts have produced a diverse ecology of models and methods. Despite this diversity, many of these models share a common underlying structure: pairwise interactions (edges) are generated with probability conditional on latent vertex attributes. Differences between models generally stem from different philosophical choices about how to learn from data or different empirically-motivated goals. The highly interdisciplinary nature of work on these generative models, however, has inhibited the development of a unified view of their similarities and differences. For instance, novel theoretical models and optimization techniques developed in machine learning are largely unknown within the social and biological sciences, which have instead emphasized model interpretability. Here, we describe a unified view of generative models for networks that draws together many of these disparate threads and highlights the fundamental similarities and differences that span these fields. We then describe a number of opportunities and challenges for future work that are revealed by this view.
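
The shared structure named in this abstract, edges generated with a probability that depends on latent vertex attributes, can be sketched with the simplest such model, a stochastic block model. The function name `sample_block_model`, the group labels, and the probability matrix below are hypothetical choices for illustration and are not taken from the paper.

```python
import random


def sample_block_model(labels, block_prob, seed=0):
    """Draw an undirected simple graph in which edge (u, v) appears with
    probability block_prob[labels[u]][labels[v]], independently per pair."""
    rng = random.Random(seed)
    n = len(labels)
    edges = []
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < block_prob[labels[u]][labels[v]]:
                edges.append((u, v))
    return edges


# Hypothetical example: two latent groups of three vertices each, with
# edges certain inside a group and impossible between groups.
labels = [0, 0, 0, 1, 1, 1]
assortative = [[1.0, 0.0], [0.0, 1.0]]
edges = sample_block_model(labels, assortative)
```

Replacing the extreme 0 and 1 entries with intermediate probabilities gives noisy community structure; inference methods then try to recover `labels` from `edges` alone.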

preprint 2014 arXiv

Detecting change points in the large-scale structure of evolving networks

Interactions among people or objects are often dynamic in nature and can be represented as a sequence of networks, each providing a snapshot of the interactions over a brief period of time. An important task in analyzing such evolving networks is change-point detection, in which we both identify the times at which the large-scale pattern of interactions changes fundamentally and quantify how large and what kind of change occurred. Here, we formalize for the first time the network change-point detection problem within an online probabilistic learning framework and introduce a method that can reliably solve it. This method combines a generalized hierarchical random graph model with a Bayesian hypothesis test to quantitatively determine if, when, and precisely how a change point has occurred. We analyze the detectability of our method using synthetic data with known change points of different types and magnitudes, and show that this method is more accurate than several previously used alternatives. Applied to two high-resolution evolving social networks, this method identifies a sequence of change points that align with known external "shocks" to these networks.

preprint 2014 arXiv

Efficiently inferring community structure in bipartite networks

Bipartite networks are a common type of network data in which there are two types of vertices, and only vertices of different types can be connected. While bipartite networks exhibit community structure like their unipartite counterparts, existing approaches to bipartite community detection have drawbacks, including implicit parameter choices, loss of information through one-mode projections, and lack of interpretability. Here we solve the community detection problem for bipartite networks by formulating a bipartite stochastic block model, which explicitly includes vertex type information and may be trivially extended to k-partite networks. This bipartite stochastic block model yields a projection-free and statistically principled method for community detection that makes clear assumptions and parameter choices and yields interpretable results. We demonstrate this model's ability to efficiently and accurately find community structure in synthetic bipartite networks with known structure and in real-world bipartite networks with unknown structure, and we characterize its performance in practical contexts.

preprint 2014 arXiv

Estimating the historical and future probabilities of large terrorist events

Quantities with right-skewed distributions are ubiquitous in complex social systems, including political conflict, economics and social networks, and these systems sometimes produce extremely large events. For instance, the 9/11 terrorist events produced nearly 3000 fatalities, nearly six times more than the next largest event. But, was this enormous loss of life statistically unlikely given modern terrorism's historical record? Accurately estimating the probability of such an event is complicated by the large fluctuations in the empirical distribution's upper tail. We present a generic statistical algorithm for making such estimates, which combines semi-parametric models of tail behavior and a nonparametric bootstrap. Applied to a global database of terrorist events, we estimate the worldwide historical probability of observing at least one 9/11-sized or larger event since 1968 to be 11-35%. These results are robust to conditioning on global variations in economic development, domestic versus international events, the type of weapon used and a truncated history that stops at 1998. We then use this procedure to make a data-driven statistical forecast of at least one similar event over the next decade.

preprint 2014 arXiv

Learning Latent Block Structure in Weighted Networks

Community detection is an important task in network analysis, in which we aim to learn a network partition that groups together vertices with similar community-level connectivity patterns. By finding such groups of vertices with similar structural roles, we extract a compact representation of the network's large-scale structure, which can facilitate its scientific interpretation and the prediction of unknown or future interactions. Popular approaches, including the stochastic block model, assume edges are unweighted, which limits their utility by throwing away potentially useful information. We introduce the 'weighted stochastic block model' (WSBM), which generalizes the stochastic block model to networks with edge weights drawn from any exponential family distribution. This model learns from both the presence and weight of edges, allowing it to discover structure that would otherwise be hidden when weights are discarded or thresholded. We describe a Bayesian variational algorithm for efficiently approximating this model's posterior distribution over latent block structures. We then evaluate the WSBM's performance on both edge-existence and edge-weight prediction tasks for a set of real-world weighted networks. In all cases, the WSBM performs as well as or better than the best alternatives on these tasks.

preprint 2014 arXiv

Power-law distributions in binned empirical data

Many man-made and natural phenomena, including the intensity of earthquakes, population of cities and size of international wars, are believed to follow power-law distributions. The accurate identification of power-law patterns has significant consequences for correctly understanding and modeling complex systems. However, statistical evidence for or against the power-law hypothesis is complicated by large fluctuations in the empirical distribution's tail, and these are worsened when information is lost from binning the data. We adapt the statistically principled framework for testing the power-law hypothesis, developed by Clauset, Shalizi and Newman, to the case of binned data. This approach includes maximum-likelihood fitting, a hypothesis test based on the Kolmogorov-Smirnov goodness-of-fit statistic and likelihood ratio tests for comparing against alternative explanations. We evaluate the effectiveness of these methods on synthetic binned data with known structure, quantify the loss of statistical power due to binning, and apply the methods to twelve real-world binned data sets with heavy-tailed patterns.
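
The maximum-likelihood step mentioned in this abstract can be sketched for a continuous power law with density proportional to x^(-alpha) above the lowest bin edge. Integrating that density over a bin [lo, hi) gives the bin probability (lo^(1-alpha) - hi^(1-alpha)) / b_min^(1-alpha), which the code below plugs into a multinomial log-likelihood. The function names, the golden-section optimizer, and the synthetic bin counts are assumptions made here for illustration; the goodness-of-fit and likelihood ratio tests from the abstract are not shown.

```python
import math


def binned_log_likelihood(alpha, edges, counts):
    """Log-likelihood of bin counts under a density proportional to
    x^(-alpha) for x >= edges[0]. edges has one more entry than counts,
    and the last edge may be math.inf for an open-ended final bin."""
    b_min = edges[0]
    ll = 0.0
    for lo, hi, h in zip(edges[:-1], edges[1:], counts):
        p = (lo ** (1 - alpha) - hi ** (1 - alpha)) / b_min ** (1 - alpha)
        ll += h * math.log(p)
    return ll


def fit_alpha(edges, counts, lo=1.01, hi=6.0, tol=1e-6):
    """Golden-section search for the alpha that maximizes the likelihood."""
    g = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c, d = b - g * (b - a), a + g * (b - a)
        if binned_log_likelihood(c, edges, counts) > binned_log_likelihood(d, edges, counts):
            b = d  # the maximum lies in [a, d]
        else:
            a = c  # the maximum lies in [c, b]
    return (a + b) / 2


# Synthetic check: expected counts generated from the model itself
# with alpha = 2.5 and logarithmic bins (hypothetical values).
edges = [1, 2, 4, 8, 16, math.inf]
true_alpha = 2.5
counts = [1000 * (lo ** (1 - true_alpha) - hi ** (1 - true_alpha))
          for lo, hi in zip(edges[:-1], edges[1:])]
alpha_hat = fit_alpha(edges, counts)
```

Because the counts here are exact model expectations rather than sampled data, the fitted exponent should land very close to the value used to generate them.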

preprint 2014 arXiv

Rejoinder of "Estimating the historical and future probabilities of large terrorist events" by Aaron Clauset and Ryan Woodard

Rejoinder of "Estimating the historical and future probabilities of large terrorist events" by Aaron Clauset and Ryan Woodard [arXiv:1209.0089].

preprint 2014 arXiv

Scoring dynamics across professional team sports: tempo, balance and predictability

Despite growing interest in quantifying and modeling the scoring dynamics within professional sports games, relatively little is known about what patterns or principles, if any, cut across different sports. Using a comprehensive data set of scoring events in nearly a dozen consecutive seasons of college and professional (American) football, professional hockey, and professional basketball, we identify several common patterns in scoring dynamics. Across these sports, scoring tempo (when scoring events occur) closely follows a common Poisson process, with a sport-specific rate. Similarly, scoring balance (how often a team wins an event) follows a common Bernoulli process, with a parameter that effectively varies with the size of the lead. Combining these processes within a generative model of gameplay, we find they both reproduce the observed dynamics in all four sports and accurately predict game outcomes. These results demonstrate common dynamical patterns underlying within-game scoring dynamics across professional team sports, and suggest specific mechanisms for driving them. We close with a brief discussion of the implications of our results for several popular hypotheses about sports dynamics.
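
The two ingredients named in this abstract, a Poisson process for tempo and a Bernoulli process for balance, can be combined in a short simulation. This is a minimal sketch and not the authors' model: the function name `simulate_game`, the `balance` callback that maps the current lead to a win probability, and the rate and duration values are all hypothetical.

```python
import random


def simulate_game(rate, duration, balance=lambda lead: 0.5, seed=0):
    """Simulate scoring events in one game.

    Event times follow a Poisson process with the given rate, and each event
    is won by team A with probability balance(lead), where lead is the number
    of events won by A minus the number won by B so far.
    """
    rng = random.Random(seed)
    t, lead, events = 0.0, 0, []
    while True:
        t += rng.expovariate(rate)  # exponential gaps give a Poisson process
        if t > duration:
            return events
        winner = "A" if rng.random() < balance(lead) else "B"
        lead += 1 if winner == "A" else -1
        events.append((t, winner))


# Hypothetical settings: one event per 20 seconds on average over 2880 seconds.
events = simulate_game(rate=0.05, duration=2880.0)

# A lead-dependent balance, for example one that favors the trailing team.
restoring = simulate_game(rate=0.05, duration=2880.0,
                          balance=lambda lead: 0.5 - 0.01 * max(-10, min(10, lead)))
```

Passing a different `balance` function is how this sketch expresses a Bernoulli parameter that varies with the size of the lead.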

preprint 2013 arXiv

A network approach to analyzing highly recombinant malaria parasite genes

The var genes of the human malaria parasite Plasmodium falciparum present a challenge to population geneticists due to their extreme diversity, which is generated by high rates of recombination. These genes encode a primary antigen protein called PfEMP1, which is expressed on the surface of infected red blood cells and elicits protective immune responses. Var gene sequences are characterized by pronounced mosaicism, precluding the use of traditional phylogenetic tools that require bifurcating tree-like evolutionary relationships. We present a new method that identifies highly variable regions (HVRs), and then maps each HVR to a complex network in which each sequence is a node and two nodes are linked if they share an exact match of significant length. Here, networks of var genes that recombine freely are expected to have a uniformly random structure, but constraints on recombination will produce network communities that we identify using a stochastic block model. We validate this method on synthetic data, showing that it correctly recovers populations of constrained recombination, before applying it to the Duffy Binding Like-α (DBLα) domain of var genes. We find nine HVRs whose network communities map in distinctive ways to known DBLα classifications and clinical phenotypes. We show that the recombinational constraints of some HVRs are correlated, while others are independent. These findings suggest that this micromodular structuring facilitates independent evolutionary trajectories of neighboring mosaic regions, allowing the parasite to retain protein function while generating enormous sequence diversity. Our approach therefore offers a rigorous method for analyzing evolutionary constraints in var genes, and is also flexible enough to be easily applied more generally to any highly recombinant sequences.
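
The network construction described in this abstract, where each sequence is a node and two nodes are linked if they share an exact match of significant length, can be sketched as follows. The fixed threshold `min_length`, the function name, and the toy sequences are assumptions for illustration; how the paper decides what counts as a significant length is not reproduced here.

```python
def shared_match_network(sequences, min_length):
    """Return edges (u, v) between sequences that share an exact substring
    of at least min_length characters."""
    def substrings(s):
        # Sharing any match of length >= min_length is equivalent to
        # sharing a match of length exactly min_length.
        return {s[i:i + min_length] for i in range(len(s) - min_length + 1)}

    blocks = [substrings(s) for s in sequences]
    edges = []
    for u in range(len(sequences)):
        for v in range(u + 1, len(sequences)):
            if blocks[u] & blocks[v]:
                edges.append((u, v))
    return edges


# Hypothetical toy sequences: the first two share the block "DEFGH".
sequences = ["ACDEFGHIK", "XXDEFGHYY", "MNPQRSTVW"]
edges = shared_match_network(sequences, min_length=5)
```

Community detection, for example with a stochastic block model as in the abstract, would then be run on the resulting edge list.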

preprint 2013 arXiv

Adapting the Stochastic Block Model to Edge-Weighted Networks

We generalize the stochastic block model to the important case in which edges are annotated with weights drawn from an exponential family distribution. This generalization introduces several technical difficulties for model estimation, which we solve using a Bayesian approach. We introduce a variational algorithm that efficiently approximates the model's posterior distribution for dense graphs. In specific numerical experiments on edge-weighted networks, this weighted stochastic block model outperforms the common approach of first applying a single threshold to all weights and then applying the classic stochastic block model, which can obscure latent block structure in networks. This model will enable the recovery of latent structure in a broader range of network data than was previously possible.

preprint 2013 arXiv

Detecting Friendship Within Dynamic Online Interaction Networks

In many complex social systems, the timing and frequency of interactions between individuals are observable but friendship ties are hidden. Recovering these hidden ties, particularly for casual users who are relatively less active, would enable a wide variety of friendship-aware applications in domains where labeled data are often unavailable, including online advertising and national security. Here, we investigate the accuracy of multiple statistical features, based either purely on temporal interaction patterns or on the cooperative nature of the interactions, for automatically extracting latent social ties. Using self-reported friendship and non-friendship labels derived from an anonymous online survey, we learn highly accurate predictors for recovering hidden friendships within a massive online data set encompassing 18 billion interactions among 17 million individuals of the popular online game Halo: Reach. We find that the accuracy of many features improves as more data accumulates, and cooperative features are generally reliable. However, periodicities in interaction time series are sufficient to correctly classify 95% of ties, even for casual users. These results clarify the nature of friendship in online social environments and suggest new opportunities and new privacy concerns for friendship-aware applications that do not require the disclosure of private friendship information.

preprint2013arXiv

Environmental structure and competitive scoring advantages in team competitions

In most professional sports, the structure of the environment is kept neutral so that scoring imbalances may be attributed to differences in team skill. It thus remains unknown what impact structural heterogeneities can have on scoring dynamics and producing competitive advantages. Applying a generative model of scoring dynamics to roughly 10 million team competitions drawn from an online game, we quantify the relationship between a competition's structure and its scoring dynamics. Despite wide structural variations, we find the same three-phase pattern in the tempo of events observed in many sports. Tempo and balance are highly predictable from a competition's structural features alone and teams exploit environmental heterogeneities for sustained competitive advantage. The most balanced competitions are associated with specific environmental heterogeneities, not with equally skilled teams. These results shed new light on the principles of balanced competition, and illustrate the potential of online game data for investigating social dynamics and competition.

preprint2013arXiv

Friends FTW! Friendship, Collaboration and Competition in Halo: Reach

How important are friendships in determining success by individuals and teams in complex collaborative environments? By combining a novel data set containing the dynamics of millions of ad hoc teams from the popular multiplayer online first person shooter Halo: Reach with survey data on player demographics, play style, psychometrics and friendships derived from an anonymous online survey, we investigate the impact of friendship on collaborative and competitive performance. In addition to finding significant differences in player behavior across these variables, we find that friendships exert a strong influence, leading to both improved individual and team performance--even after controlling for the overall expertise of the team--and increased pro-social behaviors. Players also structure their in-game activities around social opportunities, and as a result hidden friendship ties can be accurately inferred directly from behavioral time series. Virtual environments that enable such friendship effects will thus likely see improved collaboration and competition.

preprint2013arXiv

How large should whales be?

The evolution and distribution of species body sizes for terrestrial mammals is well-explained by a macroevolutionary tradeoff between short-term selective advantages and long-term extinction risks from increased species body size, unfolding above the 2g minimum size induced by thermoregulation in air. Here, we consider whether this same tradeoff, formalized as a constrained convection-reaction-diffusion system, can also explain the sizes of fully aquatic mammals, which have not previously been considered. By replacing the terrestrial minimum with a pelagic one, at roughly 7000g, the terrestrial mammal tradeoff model accurately predicts, with no tunable parameters, the observed body masses of all extant cetacean species, including the 175,000,000g Blue Whale. This strong agreement between theory and data suggests that a universal macroevolutionary tradeoff governs body size evolution for all mammals, regardless of their habitat. The dramatic sizes of cetaceans can thus be attributed mainly to the increased convective heat loss in water, which shifts the species size distribution upward and pushes its right tail into ranges inaccessible to terrestrial mammals. Under this macroevolutionary tradeoff, the largest expected species occurs where the rate at which smaller-bodied species move up into large-bodied niches approximately equals the rate at which extinction removes them.

preprint2013arXiv

Social Network Dynamics in a Massive Online Game: Network Turnover, Non-densification, and Team Engagement in Halo Reach

Online multiplayer games are a popular form of social interaction, used by hundreds of millions of individuals. However, little is known about the social networks within these online games, or how they evolve over time. Understanding human social dynamics within massive online games can shed new light on social interactions in general and inform the development of more engaging systems. Here, we study a novel, large friendship network, inferred from nearly 18 billion social interactions over 44 weeks between 17 million individuals in the popular online game Halo: Reach. This network is one of the largest, most detailed temporal interaction networks studied to date, and provides a novel perspective on the dynamics of online friendship networks, as opposed to mere interaction graphs. Initially, this network exhibits strong structural turnover and decays rapidly from a peak size. In the following period, however, both network size and turnover stabilize, producing a dynamic structural equilibrium. In contrast to other studies, we find that the Halo friendship network is non-densifying: both the mean degree and the average pairwise distance are stable, suggesting that densification cannot occur when maintaining friendships is costly. Finally, players with greater long-term engagement exhibit stronger local clustering, suggesting a group-level social engagement process. These results demonstrate the utility of online games for studying social networks, shed new light on empirical temporal graph patterns, and clarify the claims of universality of network densification.

preprint2012arXiv

Persistence and periodicity in a dynamic proximity network

The topology of social networks can be understood as being inherently dynamic, with edges having a distinct position in time. Most characterizations of dynamic networks discretize time by converting temporal information into a sequence of network "snapshots" for further analysis. Here we study a highly resolved data set of a dynamic proximity network of 66 individuals. We show that the topology of this network evolves over a very broad distribution of time scales, that its behavior is characterized by strong periodicities driven by external calendar cycles, and that the conversion of inherently continuous-time data into a sequence of snapshots can produce highly biased estimates of network structure. We suggest that dynamic social networks exhibit a natural time scale Δ_{nat}, and that the best conversion of such dynamic data to a discrete sequence of networks is done at this natural rate.

preprint2012arXiv

The developmental dynamics of terrorist organizations

We identify robust statistical patterns in the frequency and severity of violent attacks by terrorist organizations as they grow and age. Using group-level static and dynamic analyses of terrorist events worldwide from 1968-2008 and a simulation model of organizational dynamics, we show that the production of violent events tends to accelerate with increasing size and experience. This coupling of frequency, experience and size arises from a fundamental positive feedback loop in which attacks lead to growth which leads to increased production of new attacks. In contrast, event severity is independent of both size and experience. Thus larger, more experienced organizations are more deadly because they attack more frequently, not because their attacks are more deadly, and large events are equally likely to come from large and small organizations. These results hold across political ideologies and time, suggesting that the frequency and severity of terrorism may be constrained by fundamental processes.

preprint2011arXiv

Adapting to Non-stationarity with Growing Expert Ensembles

When dealing with time series with complex non-stationarities, low retrospective regret on individual realizations is a more appropriate goal than low prospective risk in expectation. Online learning algorithms provide powerful guarantees of this form, and have often been proposed for use with non-stationary processes because of their ability to switch between different forecasters or "experts". However, existing methods assume that the set of experts whose forecasts are to be combined are all given at the start, which is not plausible when dealing with a genuinely historical or evolutionary system. We show how to modify the "fixed shares" algorithm for tracking the best expert to cope with a steadily growing set of experts, obtained by fitting new models to new data as it becomes available, and obtain regret bounds for the growing ensemble.
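As background for the abstract above, here is a minimal sketch of one common textbook formulation of a fixed-share update over a fixed set of experts. It does not reproduce the paper's growing-ensemble modification, which the abstract does not specify. The function name `fixed_share_step` and the parameters `eta` (learning rate) and `alpha` (share rate) are illustrative assumptions.

```python
import math

def fixed_share_step(weights, losses, eta, alpha):
    """One round of a uniform fixed-share update (textbook form, fixed expert set).

    weights: current expert weights, summing to 1
    losses:  loss incurred by each expert this round
    eta:     learning rate (assumed name)
    alpha:   fraction of weight redistributed uniformly (assumed name)
    """
    # Exponential-weights loss update.
    updated = [w * math.exp(-eta * loss) for w, loss in zip(weights, losses)]
    total = sum(updated)
    updated = [u / total for u in updated]
    # Share step: mix with the uniform distribution so that no expert's
    # weight can fall below alpha / n, which allows fast switching later.
    n = len(updated)
    return [(1.0 - alpha) * u + alpha / n for u in updated]

# Two experts; the first one predicts better this round.
new_weights = fixed_share_step([0.5, 0.5], [0.0, 1.0], eta=1.0, alpha=0.1)
```

The share step is what distinguishes this from plain exponential weighting: a previously poor expert keeps a floor of weight and can recover quickly after a regime change.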

preprint2010arXiv

A generalized aggregation-disintegration model for the frequency of severe terrorist attacks

We present and analyze a model of the frequency of severe terrorist attacks, which generalizes the recently proposed model of Johnson et al. This model, which is based on the notion of self-organized criticality and which describes how terrorist cells might aggregate and disintegrate over time, predicts that the distribution of attack severities should follow a power-law form with an exponent of α = 5/2. This prediction is in good agreement with current empirical estimates for terrorist attacks worldwide, which give α = 2.4 ± 0.2, and which we show is independent of certain details of the model. We close by discussing the utility of this model for understanding terrorism and the behavior of terrorist organizations, and mention several productive ways it could be extended mathematically or tested empirically.
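To make the exponent in the abstract above concrete, the following sketch draws synthetic values from a continuous power law with exponent 5/2 and recovers the exponent with the standard continuous maximum-likelihood estimator. It is an illustration only: the abstract does not say how the empirical estimate was obtained, attack severities are discrete counts rather than continuous values, and the function names, sample size and `x_min = 1` are assumptions.

```python
import math
import random

def sample_power_law(n, alpha, x_min, rng):
    """Draw n values from p(x) proportional to x^(-alpha) for x >= x_min
    using inverse-transform sampling."""
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0))
            for _ in range(n)]

def estimate_alpha(xs, x_min):
    """Continuous maximum-likelihood estimate of alpha, with x_min given."""
    tail = [x for x in xs if x >= x_min]
    return 1.0 + len(tail) / sum(math.log(x / x_min) for x in tail)

rng = random.Random(0)  # fixed seed so the run is repeatable
data = sample_power_law(20000, 2.5, 1.0, rng)
alpha_hat = estimate_alpha(data, 1.0)
```

With 20,000 samples the standard error of the estimate is roughly (α − 1)/√n, about 0.01, so `alpha_hat` should land close to 2.5.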

preprint2010arXiv

The performance of modularity maximization in practical contexts

Although widely used in practice, the behavior and accuracy of the popular module identification technique called modularity maximization is not well understood in practical contexts. Here, we present a broad characterization of its performance in such situations. First, we revisit and clarify the resolution limit phenomenon for modularity maximization. Second, we show that the modularity function Q exhibits extreme degeneracies: it typically admits an exponential number of distinct high-scoring solutions and typically lacks a clear global maximum. Third, we derive the limiting behavior of the maximum modularity Q_max for one model of infinitely modular networks, showing that it depends strongly both on the size of the network and on the number of modules it contains. Finally, using three real-world metabolic networks as examples, we show that the degenerate solutions can fundamentally disagree on many, but not all, partition properties such as the composition of the largest modules and the distribution of module sizes. These results imply that the output of any modularity maximization procedure should be interpreted cautiously in scientific contexts. They also explain why many heuristics are often successful at finding high-scoring partitions in practice and why different heuristics can disagree on the modular structure of the same network. We conclude by discussing avenues for mitigating some of these behaviors, such as combining information from many degenerate solutions or using generative models.
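For readers unfamiliar with the quantity Q in the abstract above, here is a minimal sketch of the standard modularity score for a given partition of a simple undirected graph, written as a sum over modules of (fraction of edges inside the module) minus (fraction of edge endpoints in the module) squared. The function name and the toy graph are illustrative assumptions, and the sketch only evaluates Q. It does not maximize it.

```python
from collections import defaultdict

def modularity(edges, community):
    """Modularity Q of a partition of a simple undirected graph.

    edges:     list of (u, v) pairs, each edge listed once, no self-loops
    community: dict mapping each node to its module label
    """
    m = len(edges)
    internal = defaultdict(int)    # number of edges inside each module
    degree_sum = defaultdict(int)  # total degree of nodes in each module
    for u, v in edges:
        degree_sum[community[u]] += 1
        degree_sum[community[v]] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)

# Toy graph: two triangles joined by a single bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_modules = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_module = {node: "a" for node in range(6)}
q_split = modularity(edges, two_modules)   # 5/14, about 0.357
q_single = modularity(edges, one_module)   # 0.0
```

Placing every node in one module always gives Q = 0, which is why the split into two triangles scores higher here.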

preprint2004arXiv

Traceroute sampling makes random graphs appear to have power law degree distributions

The topology of the Internet has typically been measured by sampling traceroutes, which are roughly shortest paths from sources to destinations. The resulting measurements have been used to infer that the Internet's degree distribution is scale-free; however, many of these measurements have relied on sampling traceroutes from a small number of sources. It was recently argued that sampling in this way can introduce a fundamental bias in the degree distribution, for instance, causing random (Erdős-Rényi) graphs to appear to have power law degree distributions. We explain this phenomenon analytically using differential equations to model the growth of a breadth-first tree in a random graph G(n,p=c/n) of average degree c, and show that sampling from a single source gives an apparent power law degree distribution P(k) ~ 1/k for k < c.
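The sampling setup in the abstract above can be explored numerically. This sketch builds a random graph G(n, p = c/n), grows a breadth-first tree from a single source, and records each node's degree as seen in the tree alone. It is a simulation under assumptions, not the paper's analytical derivation: ties between equally short paths are broken by queue order, and the function names, n, c and the seed are illustrative.

```python
import random
from collections import Counter, deque

def random_graph(n, c, rng):
    """Adjacency lists of G(n, p) with p = c / n."""
    p = c / n
    adj = [[] for _ in range(n)]
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u].append(v)
                adj[v].append(u)
    return adj

def bfs_tree_degrees(adj, source):
    """Degree of every reached node within the breadth-first tree only."""
    tree_deg = Counter({source: 0})
    seen = {source}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                # The tree edge (u, v) is the only edge the sample observes.
                tree_deg[u] += 1
                tree_deg[v] += 1
                queue.append(v)
    return tree_deg

rng = random.Random(1)  # fixed seed so the run is repeatable
adj = random_graph(500, 5.0, rng)
observed = bfs_tree_degrees(adj, source=0)
# Histogram of apparent degrees: apparent degree -> number of nodes.
apparent_histogram = Counter(observed.values())
```

Comparing `apparent_histogram` with the true degrees `len(adj[v])` shows how much of each node's neighborhood a single-source sample misses.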

33 published item(s)

An Open-Source Cultural Consensus Approach to Name-Based Gender Classification

Labor advantages drive the greater productivity of faculty at elite universities

Subfield prestige and gender inequality in computing

The Dynamics of Faculty Hiring Networks

Evaluating Overfit and Underfit in Models of Network Community Structure

Eigenvector-Based Centrality Measures for Temporal Networks

Gender, Productivity, and Prestige in Computer Science Faculty Hiring Networks

Assembling thefacebook: Using heterogeneity to understand online social network assembly

Detectability thresholds and optimal algorithms for community structure in dynamic networks

Predicting sports scoring dynamics with restoration and anti-persistence

Structure and inference in annotated networks

Untangling the roles of parasites in food webs with generative network models

A unified view of generative models for networks: models, methods, opportunities, and challenges

Detecting change points in the large-scale structure of evolving networks

Efficiently inferring community structure in bipartite networks

Estimating the historical and future probabilities of large terrorist events

Learning Latent Block Structure in Weighted Networks

Power-law distributions in binned empirical data

Rejoinder of "Estimating the historical and future probabilities of large terrorist events" by Aaron Clauset and Ryan Woodard

Scoring dynamics across professional team sports: tempo, balance and predictability

A network approach to analyzing highly recombinant malaria parasite genes

Adapting the Stochastic Block Model to Edge-Weighted Networks

Detecting Friendship Within Dynamic Online Interaction Networks

Environmental structure and competitive scoring advantages in team competitions

Friends FTW! Friendship, Collaboration and Competition in Halo: Reach

How large should whales be?

Social Network Dynamics in a Massive Online Game: Network Turnover, Non-densification, and Team Engagement in Halo Reach

Persistence and periodicity in a dynamic proximity network

The developmental dynamics of terrorist organizations

Adapting to Non-stationarity with Growing Expert Ensembles

A generalized aggregation-disintegration model for the frequency of severe terrorist attacks

The performance of modularity maximization in practical contexts

Traceroute sampling makes random graphs appear to have power law degree distributions