Source author record

Maxim Raginsky

Maxim Raginsky appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Information Theory math.IT Machine Learning math.OC math.PR Systems and Control math.ST Statistics Theory eess.SY Applications Artificial Intelligence Computer Science and Game Theory Distributed, Parallel, and Cluster Computing math.DS

Catalog footprint

What is connected

37works

14topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2026arXiv

A Behavioral Framework for Data-Driven Modeling of Nonlinear Systems in Vector-Valued Reproducing Kernel Hilbert Spaces

We generalize Jan Willems' behavioral approach to a class of discrete-time nonlinear systems in a vector-valued reproducing kernel Hilbert space (RKHS). Apart from linear time-invariant systems, this class covers nonlinear systems modeled by Volterra series and their autoregressive variants, as well as systems admitting Hammerstein-type state-space realizations. We apply the proposed framework to the problem of data-driven modeling of such systems, i.e., when simulation or control objectives for an unknown system are carried out without an explicit system identification step. To that end, we link the behavioral approach to two data-driven modeling methods in a vector-valued RKHS: (1) minimum-norm interpolation and (2) subspace identification.

preprint2026arXiv

Information-theoretic Limits of Learning and Estimation

Information theory plays a central role in establishing fundamental limits on what any learning or estimation algorithm can -- and cannot -- achieve, regardless of computational power. In this chapter, we provide an introduction to these connections. End-of-chapter exercises makes the material suitable for both classroom use and self-study. We begin by introducing concentration inequalities along with the notions of covering and packing in metric spaces, and the associated concept of metric entropy. These tools are essential for our analysis. We then introduce the learning-theoretic framework and derive upper bounds on generalization error in terms of metric entropy, Rademacher complexity, and the VC dimension, as well as mutual information and relative entropy. Finally we discuss the minimax estimation framework and establish lower bounds on minimax risk using Fano's inequality, yielding bounds in terms of relative entropy and covering and packing numbers. This manuscript contains preprint of a chapter under consideration for inclusion in the forthcoming third edition of Cover and Thomas's Elements of Information Theory, posted with permission from Wiley. It would follow the chapter posted at arXiv:2605.02989 . The table of contents of the new edition can be found at: https://docs.google.com/document/d/1L-m4oQEJw1PJhoxBeMwrrBD8S_HmvzMEkPbYvS24980/edit?usp=sharing . For feedback, please contact abbas@ee.stanford.edu.

preprint2026arXiv

Spatially-Coupled Network RNA Velocities: A Control-Theoretic Perspective

RNA velocity is an important model that combines cellular spliced and unspliced RNA counts to infer dynamical properties of various regulatory functions. Despite its wide applicability and many variants used in practice, the model has not been adequately designed to directly account for both intracellular gene regulatory network interactions and spatial intercellular communications. Here, we propose a new RNA velocity approach that jointly and directly captures two new network structures: an intracellular gene regulatory network (GRN) and an intercellular interaction network that captures interactions between (neighboring) cells, with relevance to spatial transcriptomics. We theoretically analyze this two-level network system through the lens of control and consensus theory. In particular, we investigate network equilibria, stability, cellular network consensus, and optimal control approaches for targeted drug intervention.

preprint2026arXiv

Talagrand Meets Talagrand: Upper and Lower Bounds on Expected Soft Maxima of Gaussian Processes with Finite Index Sets

Analysis of extremal behavior of stochastic processes is a key ingredient in a wide variety of applications, including probability, statistical physics, theoretical computer science, and learning theory. In this paper, we consider centered Gaussian processes on finite index sets and investigate expected values of their smoothed, or ``soft,'' maxima. We obtain upper and lower bounds for these expected values using a combination of ideas from statistical physics (the Gibbs variational principle for the equilibrium free energy and replica-symmetric representations of Gibbs averages) and from probability theory (Sudakov minoration). These bounds are parametrized by an inverse temperature $β> 0$ and reduce to the usual Gaussian maximal inequalities in the zero-temperature limit $β\to \infty$. We provide an illustration of our methods in the context of the Random Energy Model, one of the simplest models of physical systems with random disorder.

preprint2022arXiv

Fitting an immersed submanifold to data via Sussmann's orbit theorem

This paper describes an approach for fitting an immersed submanifold of a finite-dimensional Euclidean space to random samples. The reconstruction mapping from the ambient space to the desired submanifold is implemented as a composition of an encoder that maps each point to a tuple of (positive or negative) times and a decoder given by a composition of flows along finitely many vector fields starting from a fixed initial point. The encoder supplies the times for the flows. The encoder-decoder map is obtained by empirical risk minimization, and a high-probability bound is given on the excess risk relative to the minimum expected reconstruction error over a given class of encoder-decoder maps. The proposed approach makes fundamental use of Sussmann's orbit theorem, which guarantees that the image of the reconstruction map is indeed contained in an immersed submanifold.

preprint2022arXiv

Input-to-State Stable Neural Ordinary Differential Equations with Applications to Transient Modeling of Circuits

This paper proposes a class of neural ordinary differential equations parametrized by provably input-to-state stable continuous-time recurrent neural networks. The model dynamics are defined by construction to be input-to-state stable (ISS) with respect to an ISS-Lyapunov function that is learned jointly with the dynamics. We use the proposed method to learn cheap-to-simulate behavioral models for electronic circuits that can accurately reproduce the behavior of various digital and analog circuits when simulated by a commercial circuit simulator, even when interconnected with circuit components not encountered during training. We also demonstrate the feasibility of learning ISS-preserving perturbations to the dynamics for modeling degradation effects due to circuit aging.

preprint2021arXiv

Minimum Excess Risk in Bayesian Learning

We analyze the best achievable performance of Bayesian learning under generative models by defining and upper-bounding the minimum excess risk (MER): the gap between the minimum expected loss attainable by learning from data and the minimum expected loss that could be achieved if the model realization were known. The definition of MER provides a principled way to define different notions of uncertainties in Bayesian learning, including the aleatoric uncertainty and the minimum epistemic uncertainty. Two methods for deriving upper bounds for the MER are presented. The first method, generally suitable for Bayesian learning with a parametric generative model, upper-bounds the MER by the conditional mutual information between the model parameters and the quantity being predicted given the observed data. It allows us to quantify the rate at which the MER decays to zero as more data becomes available. Under realizable models, this method also relates the MER to the richness of the generative function class, notably the VC dimension in binary classification. The second method, particularly suitable for Bayesian learning with a parametric predictive model, relates the MER to the minimum estimation error of the model parameters from data. It explicitly shows how the uncertainty in model parameter estimation translates to the MER and to the final prediction uncertainty. We also extend the definition and analysis of MER to the setting with multiple model families and the setting with nonparametric models. Along the discussions we draw some comparisons between the MER in Bayesian learning and the excess risk in frequentist learning.

preprint2020arXiv

Model-Augmented Estimation of Conditional Mutual Information for Feature Selection

Markov blanket feature selection, while theoretically optimal, is generally challenging to implement. This is due to the shortcomings of existing approaches to conditional independence (CI) testing, which tend to struggle either with the curse of dimensionality or computational complexity. We propose a novel two-step approach which facilitates Markov blanket feature selection in high dimensions. First, neural networks are used to map features to low-dimensional representations. In the second step, CI testing is performed by applying the $k$-NN conditional mutual information estimator to the learned feature maps. The mappings are designed to ensure that mapped samples both preserve information and share similar information about the target variable if and only if they are close in Euclidean distance. We show that these properties boost the performance of the $k$-NN estimator in the second step. The performance of the proposed method is evaluated on both synthetic and real data.

preprint2020arXiv

Non-signaling Approximations of Stochastic Team Problems

In this paper, we consider non-signaling approximation of finite stochastic teams. We first introduce a hierarchy of team decision rules that can be classified in an increasing order as randomized policies, quantum-correlated policies, and non-signaling policies. Then, we establish an approximation of team-optimal policies for sequential teams via extendible non-signaling policies. We prove that the distance between extendible non-signaling policies and decentralized policies is small if the extension is sufficiently large. Using this result, we establish a linear programming (LP) approximation of sequential teams. Finally, we state an open problem regarding computation of optimal value of quantum-correlated policies.

preprint2018arXiv

Local Optimality and Generalization Guarantees for the Langevin Algorithm via Empirical Metastability

We study the detailed path-wise behavior of the discrete-time Langevin algorithm for non-convex Empirical Risk Minimization (ERM) through the lens of metastability, adopting some techniques from Berglund and Gentz (2003. For a particular local optimum of the empirical risk, with an arbitrary initialization, we show that, with high probability, at least one of the following two events will occur: (1) the Langevin trajectory ends up somewhere outside the $\varepsilon$-neighborhood of this particular optimum within a short recurrence time; (2) it enters this $\varepsilon$-neighborhood by the recurrence time and stays there until a potentially exponentially long escape time. We call this phenomenon empirical metastability. This two-timescale characterization aligns nicely with the existing literature in the following two senses. First, the effective recurrence time (i.e., number of iterations multiplied by stepsize) is dimension-independent, and resembles the convergence time of continuous-time deterministic Gradient Descent (GD). However unlike GD, the Langevin algorithm does not require strong conditions on local initialization, and has the possibility of eventually visiting all optima. Second, the scaling of the escape time is consistent with the Eyring-Kramers law, which states that the Langevin scheme will eventually visit all local minima, but it will take an exponentially long time to transit among them. We apply this path-wise concentration result in the context of statistical learning to examine local notions of generalization and optimality.

preprint2018arXiv

Minimax Statistical Learning with Wasserstein Distances

As opposed to standard empirical risk minimization (ERM), distributionally robust optimization aims to minimize the worst-case risk over a larger ambiguity set containing the original empirical distribution of the training data. In this work, we describe a minimax framework for statistical learning with ambiguity sets given by balls in Wasserstein space. In particular, we prove generalization bounds that involve the covering number properties of the original ERM problem. As an illustrative example, we provide generalization guarantees for transport-based domain adaptation problems where the Wasserstein distance between the source and target domain distributions can be reliably estimated from unlabeled samples.

preprint2017arXiv

Information-theoretic lower bounds for distributed function computation

We derive information-theoretic converses (i.e., lower bounds) for the minimum time required by any algorithm for distributed function computation over a network of point-to-point channels with finite capacity, where each node of the network initially has a random observation and aims to compute a common function of all observations to a given accuracy with a given confidence by exchanging messages with its neighbors. We obtain the lower bounds on computation time by examining the conditional mutual information between the actual function value and its estimate at an arbitrary node, given the observations in an arbitrary subset of nodes containing that node. The main contributions include: 1) A lower bound on the conditional mutual information via so-called small ball probabilities, which captures the dependence of the computation time on the joint distribution of the observations at the nodes, the structure of the function, and the accuracy requirement. For linear functions, the small ball probability can be expressed by Lévy concentration functions of sums of independent random variables, for which tight estimates are available that lead to strict improvements over existing lower bounds on computation time. 2) An upper bound on the conditional mutual information via strong data processing inequalities, which complements and strengthens existing cutset-capacity upper bounds. 3) A multi-cutset analysis that quantifies the loss (dissipation) of the information needed for computation as it flows across a succession of cutsets in the network. This analysis is based on reducing a general network to a line network with bidirectional links and self-links, and the results highlight the dependence of the computation time on the diameter of the network, a fundamental parameter that is missing from most of the existing lower bounds on computation time.

preprint2016arXiv

Concentration of measure without independence: a unified approach via the martingale method

The concentration of measure phenomenon may be summarized as follows: a function of many weakly dependent random variables that is not too sensitive to any of its individual arguments will tend to take values very close to its expectation. This phenomenon is most completely understood when the arguments are mutually independent random variables, and there exist several powerful complementary methods for proving concentration inequalities, such as the martingale method, the entropy method, and the method of transportation inequalities. The setting of dependent arguments is much less well understood. This chapter focuses on the martingale method for deriving concentration inequalities without independence assumptions. In particular, we use the machinery of so-called Wasserstein matrices to show that the Azuma-Hoeffding concentration inequality for martingales with almost surely bounded differences, when applied in a sufficiently abstract setting, is powerful enough to recover and sharpen several known concentration results for nonproduct measures. Wasserstein matrices provide a natural formalism for capturing the interplay between the metric and the probabilistic structures, which is fundamental to the concentration phenomenon.

preprint2016arXiv

Coordinate Dual Averaging for Decentralized Online Optimization with Nonseparable Global Objectives

We consider a decentralized online convex optimization problem in a network of agents, where each agent controls only a coordinate (or a part) of the global decision vector. For such a problem, we propose two decentralized variants (ODA-C and ODA-PS) of Nesterov's primal-dual algorithm with dual averaging. In ODA-C, to mitigate the disagreements on the primal-vector updates, the agents implement a generalization of the local information-exchange dynamics recently proposed by Li and Marden over a static undirected graph. In ODA-PS, the agents implement the broadcast-based push-sum dynamics over a time-varying sequence of uniformly connected digraphs. We show that the regret bounds in both cases have sublinear growth of $O(\sqrt{T})$, with the time horizon $T$, when the stepsize is of the form $1/\sqrt{t}$ and the objective functions are Lipschitz-continuous convex functions with Lipschitz gradients. We also implement the proposed algorithms on a sensor network to complement our theoretical analysis.

preprint2016arXiv

Information-Theoretic Lower Bounds on Bayes Risk in Decentralized Estimation

We derive lower bounds on the Bayes risk in decentralized estimation, where the estimator does not have direct access to the random samples generated conditionally on the random parameter of interest, but only to the data received from local processors that observe the samples. The received data are subject to communication constraints, due to quantization and the noise in the communication channels from the processors to the estimator. We first derive general lower bounds on the Bayes risk using information-theoretic quantities, such as mutual information, information density, small ball probability, and differential entropy. We then apply these lower bounds to the decentralized case, using strong data processing inequalities to quantify the contraction of information due to communication constraints. We treat the cases of a single processor and of multiple processors, where the samples observed by different processors may be conditionally dependent given the parameter, for noninteractive and interactive communication protocols. Our results recover and improve recent lower bounds on the Bayes risk and the minimax risk for certain decentralized estimation problems, where previously only conditionally independent sample sets and noiseless channels have been considered. Moreover, our results provide a general way to quantify the degradation of estimation performance caused by distributing resources to multiple processors, which is only discussed for specific examples in existing works.

preprint2016arXiv

Rationally inattentive control of Markov processes

The article poses a general model for optimal control subject to information constraints, motivated in part by recent work of Sims and others on information-constrained decision-making by economic agents. In the average-cost optimal control framework, the general model introduced in this paper reduces to a variant of the linear-programming representation of the average-cost optimal control problem, subject to an additional mutual information constraint on the randomized stationary policy. The resulting optimization problem is convex and admits a decomposition based on the Bellman error, which is the object of study in approximate dynamic programming. The theory is illustrated through the example of information-constrained linear-quadratic-Gaussian (LQG) control problem. Some results on the infinite-horizon discounted-cost criterion are also presented.

preprint2016arXiv

Strong data processing inequalities and $Φ$-Sobolev inequalities for discrete channels

The noisiness of a channel can be measured by comparing suitable functionals of the input and output distributions. For instance, the worst-case ratio of output relative entropy to input relative entropy for all possible pairs of input distributions is bounded from above by unity, by the data processing theorem. However, for a fixed reference input distribution, this quantity may be strictly smaller than one, giving so-called strong data processing inequalities (SDPIs). The same considerations apply to an arbitrary $Φ$-divergence. This paper presents a systematic study of optimal constants in SDPIs for discrete channels, including their variational characterizations, upper and lower bounds, structural results for channels on product probability spaces, and the relationship between SDPIs and so-called $Φ$-Sobolev inequalities (another class of inequalities that can be used to quantify the noisiness of a channel by controlling entropy-like functionals of the input distribution by suitable measures of input-output correlation). Several applications to information theory, discrete probability, and statistical physics are discussed.

preprint2015arXiv

Concentration of Measure Inequalities and Their Communication and Information-Theoretic Applications

During the last two decades, concentration of measure has been a subject of various exciting developments in convex geometry, functional analysis, statistical physics, high-dimensional statistics, probability theory, information theory, communications and coding theory, computer science, and learning theory. One common theme which emerges in these fields is probabilistic stability: complicated, nonlinear functions of a large number of independent or weakly dependent random variables often tend to concentrate sharply around their expected values. Information theory plays a key role in the derivation of concentration inequalities. Indeed, both the entropy method and the approach based on transportation-cost inequalities are two major information-theoretic paths toward proving concentration. This brief survey is based on a recent monograph of the authors in the Foundations and Trends in Communications and Information Theory (online available at http://arxiv.org/pdf/1212.4663v8.pdf), and a tutorial given by the authors at ISIT 2015. It introduces information theorists to three main techniques for deriving concentration inequalities: the martingale method, the entropy method, and the transportation-cost inequalities. Some applications in information theory, communications, and coding theory are used to illustrate the main ideas.

preprint2015arXiv

Concentration of Measure Inequalities in Information Theory, Communications and Coding (Second Edition)

During the last two decades, concentration inequalities have been the subject of exciting developments in various areas, including convex geometry, functional analysis, statistical physics, high-dimensional statistics, pure and applied probability theory, information theory, theoretical computer science, and learning theory. This monograph focuses on some of the key modern mathematical tools that are used for the derivation of concentration inequalities, on their links to information theory, and on their various applications to communications and coding. In addition to being a survey, this monograph also includes various new recent results derived by the authors. The first part of the monograph introduces classical concentration inequalities for martingales, as well as some recent refinements and extensions. The power and versatility of the martingale approach is exemplified in the context of codes defined on graphs and iterative decoding algorithms, as well as codes for wireless communication. The second part of the monograph introduces the entropy method, an information-theoretic technique for deriving concentration inequalities. The basic ingredients of the entropy method are discussed first in the context of logarithmic Sobolev inequalities, which underlie the so-called functional approach to concentration of measure, and then from a complementary information-theoretic viewpoint based on transportation-cost inequalities and probability in metric spaces. Some representative results on concentration for dependent random variables are briefly summarized, with emphasis on their connections to the entropy method. Finally, we discuss several applications of the entropy method to problems in communications and coding, including strong converses, empirical distributions of good channel codes, and an information-theoretic converse for concentration of measure.

preprint2015arXiv

Converses for distributed estimation via strong data processing inequalities

We consider the problem of distributed estimation, where local processors observe independent samples conditioned on a common random parameter of interest, map the observations to a finite number of bits, and send these bits to a remote estimator over independent noisy channels. We derive converse results for this problem, such as lower bounds on Bayes risk. The main technical tools include a lower bound on the Bayes risk via mutual information and small ball probability, as well as strong data processing inequalities for the relative entropy. Our results can recover and improve some existing results on distributed estimation with noiseless channels, and also capture the effect of noisy channels on the estimation performance.

preprint2015arXiv

On MMSE estimation from quantized observations in the nonasymptotic regime

This paper studies MMSE estimation on the basis of quantized noisy observations. It presents nonasymptotic bounds on MMSE regret due to quantization for two settings: (1) estimation of a scalar random variable given a quantized vector of $n$ conditionally independent observations, and (2) estimation of a $p$-dimensional random vector given a quantized vector of $n$ observations (not necessarily independent) when the full MMSE estimator has a subgaussian concentration property.

preprint2015arXiv

Online discrete optimization in social networks in the presence of Knightian uncertainty

We study a model of collective real-time decision-making (or learning) in a social network operating in an uncertain environment, for which no a priori probabilistic model is available. Instead, the environment's impact on the agents in the network is seen through a sequence of cost functions, revealed to the agents in a causal manner only after all the relevant actions are taken. There are two kinds of costs: individual costs incurred by each agent and local-interaction costs incurred by each agent and its neighbors in the social network. Moreover, agents have inertia: each agent has a default mixed strategy that stays fixed regardless of the state of the environment, and must expend effort to deviate from this strategy in order to respond to cost signals coming from the environment. We construct a decentralized strategy, wherein each agent selects its action based only on the costs directly affecting it and on the decisions made by its neighbors in the network. In this setting, we quantify social learning in terms of regret, which is given by the difference between the realized network performance over a given time horizon and the best performance that could have been achieved in hindsight by a fictitious centralized entity with full knowledge of the environment's evolution. We show that our strategy achieves the regret that scales polylogarithmically with the time horizon and polynomially with the number of agents and the maximum number of neighbors of any agent in the social network.

preprint2015arXiv

Relax but stay in control: from value to algorithms for online Markov decision processes

Online learning algorithms are designed to perform in non-stationary environments, but generally there is no notion of a dynamic state to model constraints on current and future actions as a function of past actions. State-based models are common in stochastic control settings, but commonly used frameworks such as Markov Decision Processes (MDPs) assume a known stationary environment. In recent years, there has been a growing interest in combining the above two frameworks and considering an MDP setting in which the cost function is allowed to change arbitrarily after each time step. However, most of the work in this area has been algorithmic: given a problem, one would develop an algorithm almost from scratch. Moreover, the presence of the state and the assumption of an arbitrarily varying environment complicate both the theoretical analysis and the development of computationally efficient methods. This paper describes a broad extension of the ideas proposed by Rakhlin et al. to give a general framework for deriving algorithms in an MDP setting with arbitrarily changing costs. This framework leads to a unifying view of existing methods and provides a general procedure for constructing new ones. Several new methods are presented, and one of them is shown to have important advantages over a similar method developed from scratch via an online version of approximate dynamic programming.

preprint2014arXiv

Online Markov decision processes with Kullback-Leibler control cost

This paper considers an online (real-time) control problem that involves an agent performing a discrete-time random walk over a finite state space. The agent's action at each time step is to specify the probability distribution for the next state given the current state. Following the set-up of Todorov, the state-action cost at each time step is a sum of a state cost and a control cost given by the Kullback-Leibler (KL) divergence between the agent's next-state distribution and that determined by some fixed passive dynamics. The online aspect of the problem is due to the fact that the state cost functions are generated by a dynamic environment, and the agent learns the current state cost only after selecting an action. An explicit construction of a computationally efficient strategy with small regret (i.e., expected difference between its actual total cost and the smallest cost attainable using noncausal knowledge of the state costs) under mild regularity conditions is presented, along with a demonstration of the performance of the proposed strategy on a simulated target tracking problem. A number of new results on Markov decision processes with KL control cost are also obtained.

preprint2014arXiv

Poisson's equation in nonlinear filtering

The aim of this paper is to provide a variational interpretation of the nonlinear filter in continuous time. A time-stepping procedure is introduced, consisting of successive minimization problems in the space of probability densities. The weak form of the nonlinear filter is derived via analysis of the first-order optimality conditions for these problems. The derivation shows the nonlinear filter dynamics may be regarded as a gradient flow, or a steepest descent, for a certain energy functional with respect to the Kullback-Leibler divergence. The second part of the paper is concerned with derivation of the feedback particle filter algorithm, based again on the analysis of the first variation. The algorithm is shown to be exact. That is, the posterior distribution of the particle matches exactly the true posterior, provided the filter is initialized with the true prior.

preprint2012arXiv

A recursive procedure for density estimation on the binary hypercube

This paper describes a recursive estimation procedure for multivariate binary densities (probability distributions of vectors of Bernoulli random variables) using orthogonal expansions. For $d$ covariates, there are $2^d$ basis coefficients to estimate, which renders conventional approaches computationally prohibitive when $d$ is large. However, for a wide class of densities that satisfy a certain sparsity condition, our estimator runs in probabilistic polynomial time and adapts to the unknown sparsity of the underlying density in two key ways: (1) it attains near-minimax mean-squared error for moderate sample sizes, and (2) the computational complexity is lower for sparser densities. Our method also allows for flexible control of the trade-off between mean-squared error and computational complexity.

preprint2012arXiv

Empirical processes, typical sequences and coordinated actions in standard Borel spaces

This paper proposes a new notion of typical sequences on a wide class of abstract alphabets (so-called standard Borel spaces), which is based on approximations of memoryless sources by empirical distributions uniformly over a class of measurable "test functions." In the finite-alphabet case, we can take all uniformly bounded functions and recover the usual notion of strong typicality (or typicality under the total variation distance). For a general alphabet, however, this function class turns out to be too large, and must be restricted. With this in mind, we define typicality with respect to any Glivenko-Cantelli function class (i.e., a function class that admits a Uniform Law of Large Numbers) and demonstrate its power by giving simple derivations of the fundamental limits on the achievable rates in several source coding scenarios, in which the relevant operational criteria pertain to reproducing empirical averages of a general-alphabet stationary memoryless source with respect to a suitable function class.

preprint2012arXiv

Sequential anomaly detection in the presence of noise and limited feedback

This paper describes a methodology for detecting anomalies from sequentially observed and potentially noisy data. The proposed approach consists of two main elements: (1) {\em filtering}, or assigning a belief or likelihood to each successive measurement based upon our ability to predict it from previous noisy observations, and (2) {\em hedging}, or flagging potential anomalies by comparing the current belief against a time-varying and data-adaptive threshold. The threshold is adjusted based on the available feedback from an end user. Our algorithms, which combine universal prediction with recent work on online convex programming, do not require computing posterior distributions given all current observations and involve simple primal-dual parameter updates. At the heart of the proposed approach lie exponential-family models which can be used in a wide variety of contexts and applications, and which yield methods that achieve sublinear per-round regret against both static and slowly varying product distributions with marginals drawn from the same exponential family. Moreover, the regret against static distributions coincides with the minimax value of the corresponding online strongly convex game. We also prove bounds on the number of mistakes made during the hedging step relative to the best offline choice of the threshold with access to all estimated beliefs and feedback signals. We validate the theory on synthetic data drawn from a time-varying distribution over binary vectors of high dimensionality, as well as on the Enron email dataset.

preprint2012arXiv

Target Detection Performance Bounds in Compressive Imaging

This paper describes computationally efficient approaches and associated theoretical performance guarantees for the detection of known targets and anomalies from few projection measurements of the underlying signals. The proposed approaches accommodate signals of different strengths contaminated by a colored Gaussian background, and perform detection without reconstructing the underlying signals from the observations. The theoretical performance bounds of the target detector highlight fundamental tradeoffs among the number of measurements collected, amount of background signal present, signal-to-noise ratio, and similarity among potential targets coming from a known dictionary. The anomaly detector is designed to control the number of false discoveries. The proposed approach does not depend on a known sparse representation of targets; rather, the theoretical performance bounds exploit the structure of a known dictionary of targets and the distance preservation property of the measurement matrix. Simulation experiments illustrate the practicality and effectiveness of the proposed approaches.

preprint2011arXiv

Directed information and Pearl's causal calculus

Probabilistic graphical models are a fundamental tool in statistics, machine learning, signal processing, and control. When such a model is defined on a directed acyclic graph (DAG), one can assign a partial ordering to the events occurring in the corresponding stochastic system. Based on the work of Judea Pearl and others, these DAG-based "causal factorizations" of joint probability measures have been used for characterization and inference of functional dependencies (causal links). This mostly expository paper focuses on several connections between Pearl's formalism (and in particular his notion of "intervention") and information-theoretic notions of causality and feedback (such as causal conditioning, directed stochastic kernels, and directed information). As an application, we show how conditional directed information can be used to develop an information-theoretic version of Pearl's "back-door" criterion for identifiability of causal effects from passive observations. This suggests that the back-door criterion can be thought of as a causal analog of statistical sufficiency.

preprint2011arXiv

Information-based complexity, feedback and dynamics in convex programming

We study the intrinsic limitations of sequential convex optimization through the lens of feedback information theory. In the oracle model of optimization, an algorithm queries an {\em oracle} for noisy information about the unknown objective function, and the goal is to (approximately) minimize every function in a given class using as few queries as possible. We show that, in order for a function to be optimized, the algorithm must be able to accumulate enough information about the objective. This, in turn, puts limits on the speed of optimization under specific assumptions on the oracle and the type of feedback. Our techniques are akin to the ones used in statistical literature to obtain minimax lower bounds on the risks of estimation procedures; the notable difference is that, unlike in the case of i.i.d. data, a sequential optimization algorithm can gather observations in a {\em controlled} manner, so that the amount of information at each step is allowed to change in time. In particular, we show that optimization algorithms often obey the law of diminishing returns: the signal-to-noise ratio drops as the optimization algorithm approaches the optimum. To underscore the generality of the tools, we use our approach to derive fundamental lower bounds for a certain active learning problem. Overall, the present work connects the intuitive notions of information in optimization, experimental design, estimation, and active learning to the quantitative notion of Shannon information.

preprint2010arXiv

Compressed sensing performance bounds under Poisson noise

This paper describes performance bounds for compressed sensing (CS) where the underlying sparse or compressible (sparsely approximable) signal is a vector of nonnegative intensities whose measurements are corrupted by Poisson noise. In this setting, standard CS techniques cannot be applied directly for several reasons. First, the usual signal-independent and/or bounded noise models do not apply to Poisson noise, which is non-additive and signal-dependent. Second, the CS matrices typically considered are not feasible in real optical systems because they do not adhere to important constraints, such as nonnegativity and photon flux preservation. Third, the typical $\ell_2$--$\ell_1$ minimization leads to overfitting in the high-intensity regions and oversmoothing in the low-intensity areas. In this paper, we describe how a feasible positivity- and flux-preserving sensing matrix can be constructed, and then analyze the performance of a CS reconstruction approach for Poisson data that minimizes an objective function consisting of a negative Poisson log likelihood term and a penalty term which measures signal sparsity. We show that, as the overall intensity of the underlying signal increases, an upper bound on the reconstruction error decays at an appropriate rate (depending on the compressibility of the signal), but that for a fixed signal intensity, the signal-dependent part of the error bound actually grows with the number of measurements or sensors. This surprising fact is both proved theoretically and justified based on physical intuition.

preprint2010arXiv

Divergence-based characterization of fundamental limitations of adaptive dynamical systems

Adaptive dynamical systems arise in a multitude of contexts, e.g., optimization, control, communications, signal processing, and machine learning. A precise characterization of their fundamental limitations is therefore of paramount importance. In this paper, we consider the general problem of adaptively controlling and/or identifying a stochastic dynamical system, where our {\em a priori} knowledge allows us to place the system in a subset of a metric space (the uncertainty set). We present an information-theoretic meta-theorem that captures the trade-off between the metric complexity (or richness) of the uncertainty set, the amount of information acquired online in the process of controlling and observing the system, and the residual uncertainty remaining after the observations have been collected. Following the approach of Zames, we quantify {\em a priori} information by the Kolmogorov (metric) entropy of the uncertainty set, while the information acquired online is expressed as a sum of information divergences. The general theory is used to derive new minimax lower bounds on the metric identification error, as well as to give a simple derivation of the minimum time needed to stabilize an uncertain stochastic linear system.

preprint2010arXiv

Fishing in Poisson streams: focusing on the whales, ignoring the minnows

This paper describes a low-complexity approach for reconstructing average packet arrival rates and instantaneous packet counts at a router in a communication network, where the arrivals of packets in each flow follow a Poisson process. Assuming that the rate vector of this Poisson process is sparse or approximately sparse, the goal is to maintain a compressed summary of the process sample paths using a small number of counters, such that at any time it is possible to reconstruct both the total number of packets in each flow and the underlying rate vector. We show that these tasks can be accomplished efficiently and accurately using compressed sensing with expander graphs. In particular, the compressive counts are a linear transformation of the underlying counting process by the adjacency matrix of an unbalanced expander. Such a matrix is binary and sparse, which allows for efficient incrementing when new packets arrive. We describe, analyze, and compare two methods that can be used to estimate both the current vector of total packet counts and the underlying vector of arrival rates.

preprint2009arXiv

Joint universal lossy coding and identification of stationary mixing sources with general alphabets

We consider the problem of joint universal variable-rate lossy coding and identification for parametric classes of stationary $β$-mixing sources with general (Polish) alphabets. Compression performance is measured in terms of Lagrangians, while identification performance is measured by the variational distance between the true source and the estimated source. Provided that the sources are mixing at a sufficiently fast rate and satisfy certain smoothness and Vapnik-Chervonenkis learnability conditions, it is shown that, for bounded metric distortions, there exist universal schemes for joint lossy compression and identification whose Lagrangian redundancies converge to zero as $\sqrt{V_n \log n /n}$ as the block length $n$ tends to infinity, where $V_n$ is the Vapnik-Chervonenkis dimension of a certain class of decision regions defined by the $n$-dimensional marginal distributions of the sources; furthermore, for each $n$, the decoder can identify $n$-dimensional marginal of the active source up to a ball of radius $O(\sqrt{V_n\log n/n})$ in variational distance, eventually with probability one. The results are supplemented by several examples of parametric sources satisfying the regularity conditions.

preprint2007arXiv

Learning from compressed observations

The problem of statistical learning is to construct a predictor of a random variable $Y$ as a function of a related random variable $X$ on the basis of an i.i.d. training sample from the joint distribution of $(X,Y)$. Allowable predictors are drawn from some specified class, and the goal is to approach asymptotically the performance (expected loss) of the best predictor in the class. We consider the setting in which one has perfect observation of the $X$-part of the sample, while the $Y$-part has to be communicated at some finite bit rate. The encoding of the $Y$-values is allowed to depend on the $X$-values. Under suitable regularity conditions on the admissible predictors, the underlying family of probability distributions and the loss function, we give an information-theoretic characterization of achievable predictor performance in terms of conditional distortion-rate functions. The ideas are illustrated on the example of nonparametric regression in Gaussian noise.

preprint2006arXiv

Joint universal lossy coding and identification of i.i.d. vector sources

The problem of joint universal source coding and modeling, addressed by Rissanen in the context of lossless codes, is generalized to fixed-rate lossy coding of continuous-alphabet memoryless sources. We show that, for bounded distortion measures, any compactly parametrized family of i.i.d. real vector sources with absolutely continuous marginals (satisfying appropriate smoothness and Vapnik--Chervonenkis learnability conditions) admits a joint scheme for universal lossy block coding and parameter estimation, and give nonasymptotic estimates of convergence rates for distortion redundancies and variational distances between the active source and the estimated source. We also present explicit examples of parametric sources admitting such joint universal compression and modeling schemes.

Maxim Raginsky

What is connected

Connect this record

See the researcher in context

Building this map preview

37 published item(s)

A Behavioral Framework for Data-Driven Modeling of Nonlinear Systems in Vector-Valued Reproducing Kernel Hilbert Spaces

Information-theoretic Limits of Learning and Estimation

Spatially-Coupled Network RNA Velocities: A Control-Theoretic Perspective

Talagrand Meets Talagrand: Upper and Lower Bounds on Expected Soft Maxima of Gaussian Processes with Finite Index Sets

Fitting an immersed submanifold to data via Sussmann's orbit theorem

Input-to-State Stable Neural Ordinary Differential Equations with Applications to Transient Modeling of Circuits

Minimum Excess Risk in Bayesian Learning

Model-Augmented Estimation of Conditional Mutual Information for Feature Selection

Non-signaling Approximations of Stochastic Team Problems

Local Optimality and Generalization Guarantees for the Langevin Algorithm via Empirical Metastability

Minimax Statistical Learning with Wasserstein Distances

Information-theoretic lower bounds for distributed function computation

Concentration of measure without independence: a unified approach via the martingale method

Coordinate Dual Averaging for Decentralized Online Optimization with Nonseparable Global Objectives

Information-Theoretic Lower Bounds on Bayes Risk in Decentralized Estimation

Rationally inattentive control of Markov processes

Strong data processing inequalities and $Φ$-Sobolev inequalities for discrete channels

Concentration of Measure Inequalities and Their Communication and Information-Theoretic Applications

Concentration of Measure Inequalities in Information Theory, Communications and Coding (Second Edition)

Converses for distributed estimation via strong data processing inequalities

On MMSE estimation from quantized observations in the nonasymptotic regime

Online discrete optimization in social networks in the presence of Knightian uncertainty

Relax but stay in control: from value to algorithms for online Markov decision processes

Online Markov decision processes with Kullback-Leibler control cost

Poisson's equation in nonlinear filtering

A recursive procedure for density estimation on the binary hypercube

Empirical processes, typical sequences and coordinated actions in standard Borel spaces

Sequential anomaly detection in the presence of noise and limited feedback

Target Detection Performance Bounds in Compressive Imaging

Directed information and Pearl's causal calculus

Information-based complexity, feedback and dynamics in convex programming

Compressed sensing performance bounds under Poisson noise

Divergence-based characterization of fundamental limitations of adaptive dynamical systems

Fishing in Poisson streams: focusing on the whales, ignoring the minnows

Joint universal lossy coding and identification of stationary mixing sources with general alphabets

Learning from compressed observations

Joint universal lossy coding and identification of i.i.d. vector sources