Source author record

Cosma Rohilla Shalizi

Cosma Rohilla Shalizi appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Catalog footprint

What is connected

30works

22topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2022arXiv

A Simple Non-Stationary Mean Ergodic Theorem, with Bonus Weak Law of Large Numbers

This brief pedagogical note re-proves a simple theorem on the convergence, in $L_2$ and in probability, of time averages of non-stationary time series to the mean of expectation values. The basic condition is that the sum of covariances grows sub-quadratically with the length of the time series. I make no claim to originality; the result is widely, but unevenly, spread bit of folklore among users of applied probability. The goal of this note is merely to even out that distribution.

preprint2022arXiv

Evaluating Posterior Distributions by Selectively Breeding Prior Samples

Using Markov chain Monte Carlo to sample from posterior distributions was the key innovation which made Bayesian data analysis practical. Notoriously, however, MCMC is hard to tune, hard to diagnose, and hard to parallelize. This pedagogical note explores variants on a universal {\em non}-Markov-chain Monte Carlo scheme for sampling from posterior distributions. The basic idea is to draw parameter values from the prior distributions, evaluate the likelihood of each draw, and then copy that draw a number of times proportional to its likelihood. The distribution after copying is an approximation to the posterior which becomes exact as the number of initial samples goes to infinity; the convergence of the approximation is easily analyzed, and is uniform over Glivenko-Cantelli classes. While not {\em entirely} practical, the schemes are straightforward to implement (a few lines of R), easily parallelized, and require no rejection, burn-in, convergence diagnostics, or tuning of any control settings. I provide references to the prior art which deals with some of the practical obstacles, at some cost in computational and analytical simplicity.

preprint2016arXiv

Estimating beta-mixing coefficients via histograms

The literature on statistical learning for time series often assumes asymptotic independence or "mixing" of the data-generating process. These mixing assumptions are never tested, nor are there methods for estimating mixing coefficients from data. Additionally, for many common classes of processes (Markov processes, ARMA processes, etc.) general functional forms for various mixing rates are known, but not specific coefficients. We present the first estimator for beta-mixing coefficients based on a single stationary sample path and show that it is risk consistent. Since mixing rates depend on infinite-dimensional dependence, we use a Markov approximation based on only a finite memory length $d$. We present convergence rates for the Markov approximation and show that as $d\rightarrow\infty$, the Markov approximation converges to the true mixing coefficient. Our estimator is constructed using $d$-dimensional histogram density estimates. Allowing asymptotics in the bandwidth as well as the dimension, we prove $L^1$ concentration for the histogram as an intermediate step. Simulations wherein the mixing rates are calculable and a real-data example demonstrate our methodology.

preprint2016arXiv

Nonparametric risk bounds for time-series forecasting

We derive generalization error bounds for traditional time-series forecasting models. Our results hold for many standard forecasting tools including autoregressive models, moving average models, and, more generally, linear state-space models. These non-asymptotic bounds need only weak assumptions on the data-generating process, yet allow forecasters to select among competing models and to guarantee, with high probability, that their chosen model will perform well. We motivate our techniques with and apply them to standard economic and financial forecasting tools---a GARCH model for predicting equity volatility and a dynamic stochastic general equilibrium model (DSGE), the standard tool in macroeconomic forecasting. We demonstrate in particular how our techniques can aid forecasters and policy makers in choosing models which behave well under uncertainty and mis-specification.

preprint2016arXiv

Regularized brain reading with shrinkage and smoothing

Functional neuroimaging measures how the brain responds to complex stimuli. However, sample sizes are modest, noise is substantial, and stimuli are high dimensional. Hence, direct estimates are inherently imprecise and call for regularization. We compare a suite of approaches which regularize via shrinkage: ridge regression, the elastic net (a generalization of ridge regression and the lasso), and a hierarchical Bayesian model based on small area estimation (SAE). We contrast regularization with spatial smoothing and combinations of smoothing and shrinkage. All methods are tested on functional magnetic resonance imaging (fMRI) data from multiple subjects participating in two different experiments related to reading, for both predicting neural response to stimuli and decoding stimuli from responses. Interestingly, when the regularization parameters are chosen by cross-validation independently for every voxel, low/high regularization is chosen in voxels where the classification accuracy is high/low, indicating that the regularization intensity is a good tool for identification of relevant voxels for the cognitive task. Surprisingly, all the regularization methods work about equally well, suggesting that beating basic smoothing and shrinkage will take not only clever methods, but also careful modeling.

preprint2016arXiv

The LICORS Cabinet: Nonparametric Algorithms for Spatio-temporal Prediction

Spatio-temporal data is intrinsically high dimensional, so unsupervised modeling is only feasible if we can exploit structure in the process. When the dynamics are local in both space and time, this structure can be exploited by splitting the global field into many lower-dimensional "light cones". We review light cone decompositions for predictive state reconstruction, introducing three simple light cone algorithms. These methods allow for tractable inference of spatio-temporal data, such as full-frame video. The algorithms make few assumptions on the underlying process yet have good predictive performance and can provide distributions over spatio-temporal data, enabling sophisticated probabilistic inference.

preprint2014arXiv

Geometric Network Comparison

Network analysis has a crucial need for tools to compare networks and assess the significance of differences between networks. We propose a principled statistical approach to network comparison that approximates networks as probability distributions on negatively curved manifolds. We outline the theory, as well as implement the approach on simulated networks.

preprint2013arXiv

Consistency under sampling of exponential random graph models

The growing availability of network data and of scientific interest in distributed systems has led to the rapid development of statistical models of network structure. Typically, however, these are models for the entire network, while the data consists only of a sampled sub-network. Parameters for the whole network, which is what is of interest, are estimated by applying the model to the sub-network. This assumes that the model is consistent under sampling, or, in terms of the theory of stochastic processes, that it defines a projective family. Focusing on the popular class of exponential random graph models (ERGMs), we show that this apparently trivial condition is in fact violated by many popular and scientifically appealing models, and that satisfying it drastically limits ERGM's expressive power. These results are actually special cases of more general results about exponential families of dependent random variables, which we also prove. Using such results, we offer easily checked conditions for the consistency of maximum likelihood estimation in ERGMs, and discuss some possible constructive responses.

preprint2013arXiv

Mixed LICORS: A Nonparametric Algorithm for Predictive State Reconstruction

We introduce 'mixed LICORS', an algorithm for learning nonlinear, high-dimensional dynamics from spatio-temporal data, suitable for both prediction and simulation. Mixed LICORS extends the recent LICORS algorithm (Goerg and Shalizi, 2012) from hard clustering of predictive distributions to a non-parametric, EM-like soft clustering. This retains the asymptotic predictive optimality of LICORS, but, as we show in simulations, greatly improves out-of-sample forecasts with limited data. The new method is implemented in the publicly-available R package "LICORS" (http://cran.r-project.org/web/packages/LICORS/).

preprint2013arXiv

Model Selection for Degree-corrected Block Models

The proliferation of models for networks raises challenging problems of model selection: the data are sparse and globally dependent, and models are typically high-dimensional and have large numbers of latent variables. Together, these issues mean that the usual model-selection criteria do not work properly for networks. We illustrate these challenges, and show one way to resolve them, by considering the key network-analysis problem of dividing a graph into communities or blocks of nodes with homogeneous patterns of links to the rest of the network. The standard tool for doing this is the stochastic block model, under which the probability of a link between two nodes is a function solely of the blocks to which they belong. This imposes a homogeneous degree distribution within each block; this can be unrealistic, so degree-corrected block models add a parameter for each node, modulating its over-all degree. The choice between ordinary and degree-corrected block models matters because they make very different inferences about communities. We present the first principled and tractable approach to model selection between standard and degree-corrected block models, based on new large-graph asymptotics for the distribution of log-likelihood ratios under the stochastic block model, finding substantial departures from classical results for sparse graphs. We also develop linear-time approximations for log-likelihoods under both the stochastic block model and the degree-corrected model, using belief propagation. Applications to simulated and real networks show excellent agreement with our approximations. Our results thus both solve the practical problem of deciding on degree correction, and point to a general approach to model selection in network analysis.

preprint2013arXiv

Predictive PAC Learning and Process Decompositions

We informally call a stochastic process learnable if it admits a generalization error approaching zero in probability for any concept class with finite VC-dimension (IID processes are the simplest example). A mixture of learnable processes need not be learnable itself, and certainly its generalization error need not decay at the same rate. In this paper, we argue that it is natural in predictive PAC to condition not on the past observations but on the mixture component of the sample path. This definition not only matches what a realistic learner might demand, but also allows us to sidestep several otherwise grave problems in learning from dependent data. In particular, we give a novel PAC generalization bound for mixtures of learnable processes with a generalization error that is not worse than that of each mixture component. We also provide a characterization of mixtures of absolutely regular ($β$-mixing) processes, of independent probability-theoretic interest.

preprint2012arXiv

LICORS: Light Cone Reconstruction of States for Non-parametric Forecasting of Spatio-Temporal Systems

We present a new, non-parametric forecasting method for data where continuous values are observed discretely in space and time. Our method, "light-cone reconstruction of states" (LICORS), uses physical principles to identify predictive states which are local properties of the system, both in space and time. LICORS discovers the number of predictive states and their predictive distributions automatically, and consistently, under mild assumptions on the data source. We provide an algorithm to implement our method, along with a cross-validation scheme to pick control settings. Simulations show that CV-tuned LICORS outperforms standard methods in forecasting challenging spatio-temporal dynamics. Our work provides applied researchers with a new, highly automatic method to analyze and forecast spatio-temporal data.

preprint2011arXiv

Adapting to Non-stationarity with Growing Expert Ensembles

When dealing with time series with complex non-stationarities, low retrospective regret on individual realizations is a more appropriate goal than low prospective risk in expectation. Online learning algorithms provide powerful guarantees of this form, and have often been proposed for use with non-stationary processes because of their ability to switch between different forecasters or ``experts''. However, existing methods assume that the set of experts whose forecasts are to be combined are all given at the start, which is not plausible when dealing with a genuinely historical or evolutionary system. We show how to modify the ``fixed shares'' algorithm for tracking the best expert to cope with a steadily growing set of experts, obtained by fitting new models to new data as it becomes available, and obtain regret bounds for the growing ensemble.

preprint2011arXiv

Estimated VC dimension for risk bounds

Vapnik-Chervonenkis (VC) dimension is a fundamental measure of the generalization capacity of learning algorithms. However, apart from a few special cases, it is hard or impossible to calculate analytically. Vapnik et al. [10] proposed a technique for estimating the VC dimension empirically. While their approach behaves well in simulations, it could not be used to bound the generalization risk of classifiers, because there were no bounds for the estimation error of the VC dimension itself. We rectify this omission, providing high probability concentration results for the proposed estimator and deriving corresponding generalization bounds.

preprint2011arXiv

Estimating $β$-mixing coefficients

The literature on statistical learning for time series assumes the asymptotic independence or ``mixing' of the data-generating process. These mixing assumptions are never tested, nor are there methods for estimating mixing rates from data. We give an estimator for the $β$-mixing rate based on a single stationary sample path and show it is $L_1$-risk consistent.

preprint2011arXiv

Generalization error bounds for stationary autoregressive models

We derive generalization error bounds for stationary univariate autoregressive (AR) models. We show that imposing stationarity is enough to control the Gaussian complexity without further regularization. This lets us use structural risk minimization for model selection. We demonstrate our methods by predicting interest rate movements.

preprint2011arXiv

Philosophy and the practice of Bayesian statistics

A substantial school in the philosophy of science identifies Bayesian inference with inductive inference and even rationality as such, and seems to be strengthened by the rise and practical success of Bayesian statistics. We argue that the most successful forms of Bayesian statistics do not actually support that particular philosophy but rather accord much better with sophisticated forms of hypothetico-deductivism. We examine the actual role played by prior distributions in Bayesian models, and the crucial aspects of model checking and model revision, which fall outside the scope of Bayesian confirmation theory. We draw on the literature on the consistency of Bayesian updating and also on our experience of applied work in social science. Clarity about these matters should benefit not just philosophy of science, but also statistical practice. At best, the inductivist view has encouraged researchers to fit and compare models without checking them; at worst, theorists have actively discouraged practitioners from performing model checking because it does not fit into their framework.

preprint2011arXiv

Scaling and Hierarchy in Urban Economies

In several recent publications, Bettencourt, West and collaborators claim that properties of cities such as gross economic production, personal income, numbers of patents filed, number of crimes committed, etc., show super-linear power-scaling with total population, while measures of resource use show sub-linear power-law scaling. Re-analysis of the gross economic production and personal income for cities in the United States, however, shows that the data cannot distinguish between power laws and other functional forms, including logarithmic growth, and that size predicts relatively little of the variation between cities. The striking appearance of scaling in previous work is largely artifact of using extensive quantities (city-wide totals) rather than intensive ones (per-capita rates). The remaining dependence of productivity on city size is explained by concentration of specialist service industries, with high value-added per worker, in larger cities, in accordance with the long-standing economic notion of the "hierarchy of central places".

preprint2010arXiv

Approximate Methods for State-Space Models

State-space models provide an important body of techniques for analyzing time-series, but their use requires estimating unobserved states. The optimal estimate of the state is its conditional expectation given the observation histories, and computing this expectation is hard when there are nonlinearities. Existing filtering methods, including sequential Monte Carlo, tend to be either inaccurate or slow. In this paper, we study a nonlinear filter for nonlinear/non-Gaussian state-space models, which uses Laplace's method, an asymptotic series expansion, to approximate the state's conditional mean and variance, together with a Gaussian conditional distribution. This {\em Laplace-Gaussian filter} (LGF) gives fast, recursive, deterministic state estimates, with an error which is set by the stochastic characteristics of the model and is, we show, stable over time. We illustrate the estimation ability of the LGF by applying it to the problem of neural decoding and compare it to sequential Monte Carlo both in simulations and with real data. We find that the LGF can deliver superior results in a small fraction of the computing time.

preprint2010arXiv

Homophily and Contagion Are Generically Confounded in Observational Social Network Studies

We consider processes on social networks that can potentially involve three factors: homophily, or the formation of social ties due to matching individual traits; social contagion, also known as social influence; and the causal effect of an individual's covariates on their behavior or other measurable responses. We show that, generically, all of these are confounded with each other. Distinguishing them from one another requires strong assumptions on the parametrization of the social process or on the adequacy of the covariates used (or both). In particular we demonstrate, with simple examples, that asymmetries in regression coefficients cannot identify causal effects, and that very simple models of imitation (a form of social contagion) can produce substantial correlations between an individual's enduring traits and their choices, even when there is no intrinsic affinity between them. We also suggest some possible constructive responses to these results.

preprint2009arXiv

Dynamics of Bayesian Updating with Dependent Data and Misspecified Models

Much is now known about the consistency of Bayesian updating on infinite-dimensional parameter spaces with independent or Markovian data. Necessary conditions for consistency include the prior putting enough weight on the correct neighborhoods of the data-generating distribution; various sufficient conditions further restrict the prior in ways analogous to capacity control in frequentist nonparametrics. The asymptotics of Bayesian updating with mis-specified models or priors, or non-Markovian data, are far less well explored. Here I establish sufficient conditions for posterior convergence when all hypotheses are wrong, and the data have complex dependencies. The main dynamical assumption is the asymptotic equipartition (Shannon-McMillan-Breiman) property of information theory. This, along with Egorov's Theorem on uniform convergence, lets me build a sieve-like structure for the prior. The main statistical assumption, also a form of capacity control, concerns the compatibility of the prior and the data-generating process, controlling the fluctuations in the log-likelihood when averaged over the sieve-like sets. In addition to posterior convergence, I derive a kind of large deviations principle for the posterior measure, extending in some cases to rates of convergence, and discuss the advantages of predicting using a combination of models known to be wrong. An appendix sketches connections between these results and the replicator dynamics of evolutionary theory.

preprint2009arXiv

The Computational Structure of Spike Trains

Neurons perform computations, and convey the results of those computations through the statistical structure of their output spike trains. Here we present a practical method, grounded in the information-theoretic analysis of prediction, for inferring a minimal representation of that structure and for characterizing its complexity. Starting from spike trains, our approach finds their causal state models (CSMs), the minimal hidden Markov models or stochastic automata capable of generating statistically identical time series. We then use these CSMs to objectively quantify both the generalizable structure and the idiosyncratic randomness of the spike train. Specifically, we show that the expected algorithmic information content (the information needed to describe the spike train exactly) can be split into three parts describing (1) the time-invariant structure (complexity) of the minimal spike-generating process, which describes the spike train statistically; (2) the randomness (internal entropy rate) of the minimal spike-generating process; and (3) a residual pure noise term not described by the minimal spike-generating process. We use CSMs to approximate each of these quantities. The CSMs are inferred nonparametrically from the data, making only mild regularity assumptions, via the causal state splitting reconstruction algorithm. The methods presented here complement more traditional spike train analyses by describing not only spiking probability and spike train entropy, but also the complexity of a spike train's structure. We demonstrate our approach using both simulated spike trains and experimental data recorded in rat barrel cortex during vibrissa stimulation.

preprint2006arXiv

Discovering Functional Communities in Dynamical Networks

Many networks are important because they are substrates for dynamical systems, and their pattern of functional connectivity can itself be dynamic -- they can functionally reorganize, even if their underlying anatomical structure remains fixed. However, the recent rapid progress in discovering the community structure of networks has overwhelmingly focused on that constant anatomical connectivity. In this paper, we lay out the problem of discovering_functional communities_, and describe an approach to doing so. This method combines recent work on measuring information sharing across stochastic networks with an existing and successful community-discovery algorithm for weighted networks. We illustrate it with an application to a large biophysical model of the transition from beta to gamma rhythms in the hippocampus.

preprint2005arXiv

Automatic Filters for the Detection of Coherent Structure in Spatiotemporal Systems

Most current methods for identifying coherent structures in spatially-extended systems rely on prior information about the form which those structures take. Here we present two new approaches to automatically filter the changing configurations of spatial dynamical systems and extract coherent structures. One, local sensitivity filtering, is a modification of the local Lyapunov exponent approach suitable to cellular automata and other discrete spatial systems. The other, local statistical complexity filtering, calculates the amount of information needed for optimal prediction of the system's behavior in the vicinity of a given point. By examining the changing spatiotemporal distributions of these quantities, we can find the coherent structures in a variety of pattern-forming cellular automata, without needing to guess or postulate the form of that structure. We apply both filters to elementary and cyclical cellular automata (ECA and CCA) and find that they readily identify particles, domains and other more complicated structures. We compare the results from ECA with earlier ones based upon the theory of formal languages, and the results from CCA with a more traditional approach based on an order parameter and free energy. While sensitivity and statistical complexity are equally adept at uncovering structure, they are based on different system properties (dynamical and probabilistic, respectively), and provide complementary information.

preprint2005arXiv

Measuring Shared Information and Coordinated Activity in Neuronal Networks

Most nervous systems encode information about stimuli in the responding activity of large neuronal networks. This activity often manifests itself as dynamically coordinated sequences of action potentials. Since multiple electrode recordings are now a standard tool in neuroscience research, it is important to have a measure of such network-wide behavioral coordination and information sharing, applicable to multiple neural spike train data. We propose a new statistic, informational coherence, which measures how much better one unit can be predicted by knowing the dynamical state of another. We argue informational coherence is a measure of association and shared information which is superior to traditional pairwise measures of synchronization and correlation. To find the dynamical states, we use a recently-introduced algorithm which reconstructs effective state spaces from stochastic time series. We then extend the pairwise measure to a multivariate analysis of the network by estimating the network multi-information. We illustrate our method by testing it on a detailed model of the transition from gamma to beta rhythms.

preprint2004arXiv

Blind Construction of Optimal Nonlinear Recursive Predictors for Discrete Sequences

We present a new method for nonlinear prediction of discrete random sequences under minimal structural assumptions. We give a mathematical construction for optimal predictors of such processes, in the form of hidden Markov models. We then describe an algorithm, CSSR (Causal-State Splitting Reconstruction), which approximates the ideal predictor from data. We discuss the reliability of CSSR, its data requirements, and its performance in simulations. Finally, we compare our approach to existing methods using variable-length Markov models and cross-validated hidden Markov models, and show theoretically and experimentally that our method delivers results superior to the former and at least comparable to the latter.

preprint2004arXiv

Quantifying Self-Organization with Optimal Predictors

Despite broad interest in self-organizing systems, there are few quantitative, experimentally-applicable criteria for self-organization. The existing criteria all give counter-intuitive results for important cases. In this Letter, we propose a new criterion, namely an internally-generated increase in the statistical complexity, the amount of information required for optimal prediction of the system's dynamics. We precisely define this complexity for spatially-extended dynamical systems, using the probabilistic ideas of mutual information and minimal sufficient statistics. This leads to a general method for predicting such systems, and a simple algorithm for estimating statistical complexity. The results of applying this algorithm to a class of models of excitable media (cyclic cellular automata) strongly support our proposal.

preprint2003arXiv

Optimal Nonlinear Prediction of Random Fields on Networks

It is increasingly common to encounter time-varying random fields on networks (metabolic networks, sensor arrays, distributed computing, etc.). This paper considers the problem of optimal, nonlinear prediction of these fields, showing from an information-theoretic perspective that it is formally identical to the problem of finding minimal local sufficient statistics. I derive general properties of these statistics, show that they can be composed into global predictors, and explore their recursive estimation properties. For the special case of discrete-valued fields, I describe a convergent algorithm to identify the local predictors from empirical data, with minimal prior information about the field, and no distributional assumptions.

preprint2000arXiv

Computational Mechanics: Pattern and Prediction, Structure and Simplicity

Computational mechanics, an approach to structural complexity, defines a process's causal states and gives a procedure for finding them. We show that the causal-state representation--an $ε$-machine--is the minimal one consistent with accurate prediction. We establish several results on $ε$-machine optimality and uniqueness and on how $ε$-machines compare to alternative representations. Further results relate measures of randomness and structural complexity obtained from $ε$-machines to those from ergodic and information theories.

preprint2000arXiv

Information Bottlenecks, Causal States, and Statistical Relevance Bases: How to Represent Relevant Information in Memoryless Transduction

Discovering relevant, but possibly hidden, variables is a key step in constructing useful and predictive theories about the natural world. This brief note explains the connections between three approaches to this problem: the recently introduced information-bottleneck method, the computational mechanics approach to inferring optimal models, and Salmon's statistical relevance basis.

Cosma Rohilla Shalizi

What is connected

Connect this record

See the researcher in context

Building this map preview

30 published item(s)

A Simple Non-Stationary Mean Ergodic Theorem, with Bonus Weak Law of Large Numbers

Evaluating Posterior Distributions by Selectively Breeding Prior Samples

Estimating beta-mixing coefficients via histograms

Nonparametric risk bounds for time-series forecasting

Regularized brain reading with shrinkage and smoothing

The LICORS Cabinet: Nonparametric Algorithms for Spatio-temporal Prediction

Geometric Network Comparison

Consistency under sampling of exponential random graph models

Mixed LICORS: A Nonparametric Algorithm for Predictive State Reconstruction

Model Selection for Degree-corrected Block Models

Predictive PAC Learning and Process Decompositions

LICORS: Light Cone Reconstruction of States for Non-parametric Forecasting of Spatio-Temporal Systems

Adapting to Non-stationarity with Growing Expert Ensembles

Estimated VC dimension for risk bounds

Estimating $β$-mixing coefficients

Generalization error bounds for stationary autoregressive models

Philosophy and the practice of Bayesian statistics

Scaling and Hierarchy in Urban Economies

Approximate Methods for State-Space Models

Homophily and Contagion Are Generically Confounded in Observational Social Network Studies

Dynamics of Bayesian Updating with Dependent Data and Misspecified Models

The Computational Structure of Spike Trains

Discovering Functional Communities in Dynamical Networks

Automatic Filters for the Detection of Coherent Structure in Spatiotemporal Systems

Measuring Shared Information and Coordinated Activity in Neuronal Networks

Blind Construction of Optimal Nonlinear Recursive Predictors for Discrete Sequences

Quantifying Self-Organization with Optimal Predictors

Optimal Nonlinear Prediction of Random Fields on Networks

Computational Mechanics: Pattern and Prediction, Structure and Simplicity

Information Bottlenecks, Causal States, and Statistical Relevance Bases: How to Represent Relevant Information in Memoryless Transduction