Source author record

Andrea Montanari

Andrea Montanari appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Catalog footprint

What is connected

87 works

31 topics

4 close collaborators

preprint, 2022, arXiv

Fundamental Barriers to High-Dimensional Regression with Convex Penalties

In high-dimensional regression, we attempt to estimate a parameter vector $β_0\in\mathbb{R}^p$ from $n\lesssim p$ observations $\{(y_i,x_i)\}_{i\leq n}$ where $x_i\in\mathbb{R}^p$ is a vector of predictors and $y_i$ is a response variable. A well-established approach uses convex regularizers to promote specific structures (e.g. sparsity) of the estimate $\widehat{β}$, while allowing for practical algorithms. Theoretical analysis implies that convex penalization schemes have nearly optimal estimation properties in certain settings. However, in general the gaps between statistically optimal estimation (with unbounded computational resources) and convex methods are poorly understood. We show that when the statistician has very simple structural information about the distribution of the entries of $β_0$, a large gap frequently exists between the best performance achieved by any convex regularizer satisfying a mild technical condition and either (i) the optimal statistical error or (ii) the statistical error achieved by optimal approximate message passing algorithms. Remarkably, a gap occurs at high enough signal-to-noise ratio if and only if the distribution of the coordinates of $β_0$ is not log-concave. These conclusions follow from an analysis of standard Gaussian designs. Our lower bounds are expected to be generally tight, and we prove tightness under certain conditions.

preprint, 2022, arXiv

Optimization of random high-dimensional functions: Structure and algorithms

Replica symmetry breaking postulates that near optima of spin glass Hamiltonians have an ultrametric structure. Namely, near optima can be associated to leaves of a tree, and the Euclidean distance between them corresponds to the distance along this tree. We survey recent progress towards a rigorous proof of this picture in the context of mixed $p$-spin spin glass models. We focus in particular on the following topics: $(i)$ The structure of critical points of the Hamiltonian; $(ii)$ The realization of the ultrametric tree as near optima of a suitable TAP free energy; $(iii)$ The construction of efficient optimization algorithms that exploit this picture.

preprint, 2022, arXiv

Statistically Optimal First Order Algorithms: A Proof via Orthogonalization

We consider a class of statistical estimation problems in which we are given a random data matrix ${\boldsymbol X}\in {\mathbb R}^{n\times d}$ (and possibly some labels ${\boldsymbol y}\in{\mathbb R}^n$) and would like to estimate a coefficient vector ${\boldsymbol θ}\in{\mathbb R}^d$ (or possibly a constant number of such vectors). Special cases include low-rank matrix estimation and regularized estimation in generalized linear models (e.g., sparse regression). First order methods proceed by iteratively multiplying current estimates by ${\boldsymbol X}$ or its transpose. Examples include gradient descent or its accelerated variants. Celentano, Montanari, Wu proved that for any constant number of iterations (matrix vector multiplications), the optimal first order algorithm is a specific approximate message passing algorithm (known as 'Bayes AMP'). The error of this estimator can be characterized in the high-dimensional asymptotics $n,d\to\infty$, $n/d\toδ$, and provides a lower bound to the estimation error of any first order algorithm. Here we present a simpler proof of the same result, and generalize it to broader classes of data distributions and of first order algorithms, including algorithms with non-separable nonlinearities. Most importantly, the new proof technique does not require constructing an equivalent tree-structured estimation problem, and is therefore amenable to a broader range of applications.

preprint, 2022, arXiv

The Interpolation Phase Transition in Neural Networks: Memorization and Generalization under Lazy Training

Modern neural networks are often operated in a strongly overparametrized regime: they comprise so many parameters that they can interpolate the training set, even if actual labels are replaced by purely random ones. Despite this, they achieve good prediction error on unseen data: interpolating the training set does not lead to a large generalization error. Further, overparametrization appears to be beneficial in that it simplifies the optimization landscape. Here we study these phenomena in the context of two-layers neural networks in the neural tangent (NT) regime. We consider a simple data model, with isotropic covariates vectors in $d$ dimensions, and $N$ hidden neurons. We assume that both the sample size $n$ and the dimension $d$ are large, and they are polynomially related. Our first main result is a characterization of the eigenstructure of the empirical NT kernel in the overparametrized regime $Nd\gg n$. This characterization implies as a corollary that the minimum eigenvalue of the empirical NT kernel is bounded away from zero as soon as $Nd\gg n$, and therefore the network can exactly interpolate arbitrary labels in the same regime. Our second main result is a characterization of the generalization error of NT ridge regression including, as a special case, min-$\ell_2$ norm interpolation. We prove that, as soon as $Nd\gg n$, the test error is well approximated by the one of kernel ridge regression with respect to the infinite-width kernel. The latter is in turn well approximated by the error of polynomial ridge regression, whereby the regularization parameter is increased by a `self-induced' term related to the high-degree components of the activation function. The polynomial degree depends on the sample size and the dimension (in particular on $\log n/\log d$).

preprint, 2021, arXiv

Generalization error of random features and kernel methods: hypercontractivity and kernel matrix concentration

Consider the classical supervised learning problem: we are given data $(y_i,{\boldsymbol x}_i)$, $i\le n$, with $y_i$ a response and ${\boldsymbol x}_i\in {\mathcal X}$ a covariates vector, and try to learn a model $f:{\mathcal X}\to{\mathbb R}$ to predict future responses. Random features methods map the covariates vector ${\boldsymbol x}_i$ to a point ${\boldsymbol ϕ}({\boldsymbol x}_i)$ in a higher dimensional space ${\mathbb R}^N$, via a random featurization map ${\boldsymbol ϕ}$. We study the use of random features methods in conjunction with ridge regression in the feature space ${\mathbb R}^N$. This can be viewed as a finite-dimensional approximation of kernel ridge regression (KRR), or as a stylized model for neural networks in the so-called lazy training regime. We define a class of problems satisfying certain spectral conditions on the underlying kernels, and a hypercontractivity assumption on the associated eigenfunctions. These conditions are verified by classical high-dimensional examples. Under these conditions, we prove a sharp characterization of the error of random features ridge regression. In particular, we address two fundamental questions: $(1)$ What is the generalization error of KRR? $(2)$ How big should $N$ be for the random features approximation to achieve the same error as KRR? In this setting, we prove that KRR is well approximated by a projection onto the top $\ell$ eigenfunctions of the kernel, where $\ell$ depends on the sample size $n$. We show that the test error of random features ridge regression is dominated by its approximation error and is larger than the error of KRR as long as $N\le n^{1-δ}$ for some $δ>0$. We characterize this gap. For $N\ge n^{1+δ}$, random features achieve the same error as the corresponding KRR, and further increasing $N$ does not lead to a significant change in test error.

preprint, 2021, arXiv

Learning with invariances in random features and kernel models

A number of machine learning tasks entail a high degree of invariance: the data distribution does not change if we act on the data with a certain group of transformations. For instance, labels of images are invariant under translations of the images. Certain neural network architectures -- for instance, convolutional networks -- are believed to owe their success to the fact that they exploit such invariance properties. With the objective of quantifying the gain achieved by invariant architectures, we introduce two classes of models: invariant random features and invariant kernel methods. The latter includes, as a special case, the neural tangent kernel for convolutional networks with global average pooling. We consider uniform covariates distributions on the sphere and hypercube and a general invariant target function. We characterize the test error of invariant methods in a high-dimensional regime in which the sample size and number of hidden units scale as polynomials in the dimension, for a class of groups that we call `degeneracy $α$', with $α\leq 1$. We show that exploiting invariance in the architecture saves a $d^α$ factor ($d$ stands for the dimension) in sample size and number of hidden units to achieve the same test error as for unstructured architectures. Finally, we show that output symmetrization of an unstructured kernel estimator does not give a significant statistical improvement; on the other hand, data augmentation with an unstructured kernel estimator is equivalent to an invariant kernel estimator and enjoys the same improvement in statistical efficiency.

preprint, 2021, arXiv

When Do Neural Networks Outperform Kernel Methods?

For a certain scaling of the initialization of stochastic gradient descent (SGD), wide neural networks (NN) have been shown to be well approximated by reproducing kernel Hilbert space (RKHS) methods. Recent empirical work showed that, for some classification tasks, RKHS methods can replace NNs without a large loss in performance. On the other hand, two-layers NNs are known to encode richer smoothness classes than RKHS and we know of special examples for which SGD-trained NN provably outperform RKHS. This is true even in the wide network limit, for a different scaling of the initialization. How can we reconcile the above claims? For which tasks do NNs outperform RKHS? If covariates are nearly isotropic, RKHS methods suffer from the curse of dimensionality, while NNs can overcome it by learning the best low-dimensional representation. Here we show that this curse of dimensionality becomes milder if the covariates display the same low-dimensional structure as the target function, and we precisely characterize this tradeoff. Building on these results, we present the spiked covariates model that can capture in a unified framework both behaviors observed in earlier work. We hypothesize that such a latent low-dimensional structure is present in image classification. We test this hypothesis numerically by showing that specific perturbations of the training distribution degrade the performance of RKHS methods much more significantly than that of NNs.

preprint, 2020, arXiv

Imputation for High-Dimensional Linear Regression

We study high-dimensional regression with missing entries in the covariates. A common strategy in practice is to impute the missing entries with an appropriate substitute and then implement a standard statistical procedure acting as if the covariates were fully observed. Recent literature on this subject proposes instead to design a specific, often complicated or non-convex, algorithm tailored to the case of missing covariates. We investigate a simpler approach where we fill in the missing entries with their conditional mean given the observed covariates. We show that this imputation scheme coupled with standard off-the-shelf procedures such as the LASSO and square-root LASSO retains the minimax estimation rate in the random-design setting where the covariates are i.i.d. sub-Gaussian. We further show that the square-root LASSO remains pivotal in this setting. It is often the case that the conditional expectation cannot be computed exactly and must be approximated from data. We study two cases where the covariates either follow an autoregressive (AR) process, or are jointly Gaussian with sparse precision matrix. We propose tractable estimators for the conditional expectation and then perform linear regression via LASSO, and show similar estimation rates in both cases. We complement our theoretical results with simulations on synthetic and semi-synthetic examples, illustrating not only the sharpness of our bounds, but also the broader utility of this strategy beyond our theoretical assumptions.
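
As a rough illustration of conditional-mean imputation in the autoregressive case, the sketch below fills isolated gaps in one covariate sequence. It is not the paper's estimator: it assumes a stationary, zero-mean Gaussian AR(1) process with a known coefficient `phi`, and the function name `impute_ar1` and the use of `None` for missing entries are hypothetical choices made for this example.

```python
def impute_ar1(row, phi):
    """Fill isolated None entries of an AR(1) sequence with their conditional mean.

    Assumptions (illustrative only): the sequence is a stationary, zero-mean
    Gaussian AR(1) process x_t = phi * x_{t-1} + noise with known phi, and no
    two missing entries are adjacent.
    """
    filled = list(row)
    last = len(row) - 1
    for t, value in enumerate(row):
        if value is not None:
            continue
        has_left = t > 0
        has_right = t < last
        if (has_left and row[t - 1] is None) or (has_right and row[t + 1] is None):
            raise ValueError("adjacent missing entries are not handled in this sketch")
        if has_left and has_right:
            # Interior gap: by the Markov property only the two neighbours matter.
            filled[t] = phi * (row[t - 1] + row[t + 1]) / (1.0 + phi ** 2)
        elif has_left:
            # Gap at the end of the sequence: E[x_t | x_{t-1}] = phi * x_{t-1}.
            filled[t] = phi * row[t - 1]
        elif has_right:
            # Gap at the start: the stationary process is reversible in time.
            filled[t] = phi * row[t + 1]
        else:
            raise ValueError("no observed neighbour to condition on")
    return filled
```

The imputed rows can then be passed to any off-the-shelf regression routine as if they were fully observed, which is the workflow the abstract describes.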

preprint, 2020, arXiv

Linearized two-layers neural networks in high dimension

We consider the problem of learning an unknown function $f_{\star}$ on the $d$-dimensional sphere with respect to the square loss, given i.i.d. samples $\{(y_i,{\boldsymbol x}_i)\}_{i\le n}$ where ${\boldsymbol x}_i$ is a feature vector uniformly distributed on the sphere and $y_i=f_{\star}({\boldsymbol x}_i)+\varepsilon_i$. We study two popular classes of models that can be regarded as linearizations of two-layers neural networks around a random initialization: the random features model of Rahimi-Recht (RF); the neural tangent kernel model of Jacot-Gabriel-Hongler (NT). Both these approaches can also be regarded as randomized approximations of kernel ridge regression (with respect to different kernels), and enjoy universal approximation properties when the number of neurons $N$ diverges, for a fixed dimension $d$. We consider two specific regimes: the approximation-limited regime, in which $n=\infty$ while $d$ and $N$ are large but finite; and the sample size-limited regime in which $N=\infty$ while $d$ and $n$ are large but finite. In the first regime we prove that if $d^{\ell + δ} \le N\le d^{\ell+1-δ}$ for small $δ> 0$, then RF effectively fits a degree-$\ell$ polynomial in the raw features, and NT fits a degree-$(\ell+1)$ polynomial. In the second regime, both RF and NT reduce to kernel methods with rotationally invariant kernels. We prove that, if the number of samples is $d^{\ell + δ} \le n \le d^{\ell +1-δ}$, then kernel methods can fit at most a degree-$\ell$ polynomial in the raw features. This lower bound is achieved by kernel ridge regression. Optimal prediction error is achieved for vanishing ridge regularization.

preprint, 2020, arXiv

Optimization of Mean-field Spin Glasses

Mean-field spin glasses are families of random energy functions (Hamiltonians) on high-dimensional product spaces. In this paper we consider the case of Ising mixed $p$-spin models, namely Hamiltonians $H_N:Σ_N\to {\mathbb R}$ on the Hamming hypercube $Σ_N = \{\pm 1\}^N$, which are defined by the property that $\{H_N({\boldsymbol σ})\}_{{\boldsymbol σ}\in Σ_N}$ is a centered Gaussian process with covariance ${\mathbb E}\{H_N({\boldsymbol σ}_1)H_N({\boldsymbol σ}_2)\}$ depending only on the scalar product $\langle {\boldsymbol σ}_1,{\boldsymbol σ}_2\rangle$. The asymptotic value of the optimum $\max_{{\boldsymbol σ}\in Σ_N}H_N({\boldsymbol σ})$ was characterized in terms of a variational principle known as the Parisi formula, first proved by Talagrand and, in a more general setting, by Panchenko. The structure of superlevel sets is extremely rich and has been studied by a number of authors. Here we ask whether a near optimal configuration ${\boldsymbol σ}$ can be computed in polynomial time. We develop a message passing algorithm whose per-iteration complexity is of the same order as the complexity of evaluating the gradient of $H_N$, and characterize the typical energy value it achieves. When the $p$-spin model $H_N$ satisfies a certain no-overlap gap assumption, for any $\varepsilon>0$, the algorithm outputs ${\boldsymbol σ}\inΣ_N$ such that $H_N({\boldsymbol σ})\ge (1-\varepsilon)\max_{{\boldsymbol σ}'} H_N({\boldsymbol σ}')$, with high probability. The number of iterations is bounded in $N$ and depends only on $\varepsilon$. More generally, regardless of whether the no-overlap gap assumption holds, the energy achieved is given by an extended variational principle, which generalizes the Parisi formula.

preprint, 2020, arXiv

Surprises in High-Dimensional Ridgeless Least Squares Interpolation

Interpolators -- estimators that achieve zero training error -- have attracted growing attention in machine learning, mainly because state-of-the-art neural networks appear to be models of this type. In this paper, we study minimum $\ell_2$ norm ("ridgeless") interpolation in high-dimensional least squares regression. We consider two different models for the feature distribution: a linear model, where the feature vectors $x_i \in {\mathbb R}^p$ are obtained by applying a linear transform to a vector of i.i.d. entries, $x_i = Σ^{1/2} z_i$ (with $z_i \in {\mathbb R}^p$); and a nonlinear model, where the feature vectors are obtained by passing the input through a random one-layer neural network, $x_i = φ(W z_i)$ (with $z_i \in {\mathbb R}^d$, $W \in {\mathbb R}^{p \times d}$ a matrix of i.i.d. entries, and $φ$ an activation function acting componentwise on $W z_i$). We recover -- in a precise quantitative way -- several phenomena that have been observed in large-scale neural networks and kernel machines, including the "double descent" behavior of the prediction risk, and the potential benefits of overparametrization.
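
To make "minimum $\ell_2$ norm interpolation" concrete, here is a minimal sketch of the estimator in the overparametrized case $n < p$, where it can be written as $\hat{β} = X^{\top}(XX^{\top})^{-1}y$ provided $XX^{\top}$ is invertible. The helper names `solve` and `min_norm_interpolator` are hypothetical, and the plain-Python linear solver is chosen only to keep the example self-contained; it is not how one would compute this at scale.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular to working precision")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def min_norm_interpolator(X, y):
    """Return the minimum l2-norm beta with X beta = y (assumes X has full row rank)."""
    n, p = len(X), len(X[0])
    # Gram matrix X X^T (n x n), small when n < p.
    gram = [[sum(X[i][k] * X[j][k] for k in range(p)) for j in range(n)]
            for i in range(n)]
    alpha = solve(gram, y)
    # beta = X^T alpha lies in the row space of X, hence has minimal norm.
    return [sum(X[i][k] * alpha[i] for i in range(n)) for k in range(p)]
```

Because the result lies in the row space of `X`, any other interpolating vector differs from it by a component orthogonal to that space and therefore has a norm at least as large.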

preprint, 2020, arXiv

TAP free energy, spin glasses, and variational inference

We consider the Sherrington-Kirkpatrick model of spin glasses with ferromagnetically biased couplings. For a specific choice of the mean of the couplings, the resulting Gibbs measure is equivalent to the Bayesian posterior for a high-dimensional estimation problem known as '$Z_2$ synchronization'. Statistical physics suggests computing the expectation with respect to this Gibbs measure (the posterior mean in the synchronization problem) by minimizing the so-called Thouless-Anderson-Palmer (TAP) free energy, instead of the mean field (MF) free energy. We prove that this identification is correct, provided the ferromagnetic bias is larger than a constant (i.e. the noise level is small enough in synchronization). Namely, we prove that the scaled $\ell_2$ distance between any low-energy local minimizer of the TAP free energy and the mean of the Gibbs measure vanishes in the large size limit. Our proof technique is based on upper bounding the expected number of critical points of the TAP free energy using the Kac-Rice formula.

preprint, 2020, arXiv

The estimation error of general first order methods

Modern large-scale statistical models require estimating thousands to millions of parameters. This is often accomplished by iterative algorithms such as gradient descent, projected gradient descent or their accelerated versions. What are the fundamental limits to these approaches? This question is well understood from an optimization viewpoint when the underlying objective is convex. Work in this area characterizes the gap to global optimality as a function of the number of iterations. However, these results have only indirect implications in terms of the gap to statistical optimality. Here we consider two families of high-dimensional estimation problems: high-dimensional regression and low-rank matrix estimation, and introduce a class of 'general first order methods' that aim at efficiently estimating the underlying parameters. This class of algorithms is broad enough to include classical first order optimization (for convex and non-convex objectives), but also other types of algorithms. Under a random design assumption, we derive lower bounds on the estimation error that hold in the high-dimensional asymptotics in which both the number of observations and the number of parameters diverge. These lower bounds are optimal in the sense that there exist algorithms whose estimation error matches the lower bounds up to asymptotically negligible terms. We illustrate our general results through applications to sparse phase retrieval and sparse principal component analysis.

preprint, 2018, arXiv

A Mean Field View of the Landscape of Two-Layers Neural Networks

Multi-layer neural networks are among the most powerful models in machine learning, yet the fundamental reasons for this success defy mathematical understanding. Learning a neural network requires optimizing a non-convex high-dimensional objective (risk function), a problem which is usually attacked using stochastic gradient descent (SGD). Does SGD converge to a global optimum of the risk or only to a local optimum? In the first case, does this happen because local minima are absent, or because SGD somehow avoids them? In the second, why do local minima reached by SGD have good generalization properties? In this paper we consider a simple case, namely two-layers neural networks, and prove that --in a suitable scaling limit-- SGD dynamics is captured by a certain non-linear partial differential equation (PDE) that we call distributional dynamics (DD). We then consider several specific examples, and show how DD can be used to prove convergence of SGD to networks with nearly ideal generalization error. This description allows one to 'average out' some of the complexities of the landscape of neural networks, and can be used to prove a general convergence result for noisy SGD.

preprint, 2016, arXiv

A Grothendieck-type inequality for local maxima

A large number of problems in optimization, machine learning, and signal processing can be effectively addressed by suitable semidefinite programming (SDP) relaxations. Unfortunately, generic SDP solvers hardly scale beyond instances with a few hundred variables (in the underlying combinatorial problem). On the other hand, it has been observed empirically that an effective strategy amounts to introducing a (non-convex) rank constraint, and solving the resulting smooth optimization problem by ascent methods. This non-convex problem has --generically-- a large number of local maxima, and the reason for this success is therefore unclear. This paper provides rigorous support for this approach. For the problem of maximizing a linear functional over the elliptope, we prove that all local maxima are within a small gap from the SDP optimum. In several problems of interest, arbitrarily small relative error can be achieved by taking the rank constraint $k$ to be of order one, independently of the problem size.
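
The rank-constrained problem can be sketched as follows: write the elliptope variable as $X_{ij} = \langle σ_i, σ_j\rangle$ with unit vectors $σ_i \in \mathbb{R}^k$ and maximize $\sum_{i,j} A_{ij}\langle σ_i, σ_j\rangle$. The abstract only says "ascent methods"; the block coordinate ascent used below is one simple choice picked for this illustration, and the names `objective` and `rank_k_ascent` are hypothetical. The matrix `A` is assumed symmetric.

```python
import math
import random


def objective(A, sigma):
    """Value of sum_{i,j} A[i][j] * <sigma_i, sigma_j>."""
    n = len(A)
    return sum(A[i][j] * sum(a * b for a, b in zip(sigma[i], sigma[j]))
               for i in range(n) for j in range(n))


def rank_k_ascent(A, k, sweeps=50, seed=0):
    """Block coordinate ascent over n unit vectors in R^k (A assumed symmetric)."""
    rng = random.Random(seed)
    n = len(A)
    sigma = []
    for _ in range(n):
        v = [rng.gauss(0.0, 1.0) for _ in range(k)]
        norm = math.sqrt(sum(x * x for x in v))
        sigma.append([x / norm for x in v])
    for _ in range(sweeps):
        for i in range(n):
            # With the other vectors fixed, the objective is linear in sigma_i,
            # so the best unit vector points along g = sum_{j != i} A[i][j] sigma_j.
            g = [sum(A[i][j] * sigma[j][c] for j in range(n) if j != i)
                 for c in range(k)]
            norm = math.sqrt(sum(x * x for x in g))
            if norm > 1e-12:
                sigma[i] = [x / norm for x in g]
    return sigma
```

Each coordinate update maximizes the objective over one vector exactly, so the objective never decreases along the iteration. The sketch only reaches a local maximum; the abstract's result concerns how far such local maxima can be from the SDP optimum.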

preprint, 2016, arXiv

De-biasing the Lasso: Optimal Sample Size for Gaussian Designs

Performing statistical inference in high dimensions is an outstanding challenge. A major source of difficulty is the absence of precise information on the distribution of high-dimensional estimators. Here, we consider linear regression in the high-dimensional regime $p\gg n$. In this context, we would like to perform inference on a high-dimensional parameter vector $θ^*\in{\mathbb R}^p$. Important progress has been achieved in computing confidence intervals for single coordinates $θ^*_i$. A key role in these new methods is played by a certain debiased estimator $\hat{θ}^{\rm d}$ that is constructed from the Lasso. Earlier work establishes that, under suitable assumptions on the design matrix, the coordinates of $\hat{θ}^{\rm d}$ are asymptotically Gaussian provided $θ^*$ is $s_0$-sparse with $s_0 = o(\sqrt{n}/\log p )$. The condition $s_0 = o(\sqrt{n}/ \log p )$ is stronger than the one for consistent estimation, namely $s_0 = o(n/ \log p)$. We study Gaussian designs with known or unknown population covariance. When the covariance is known, we prove that the debiased estimator is asymptotically Gaussian under the nearly optimal condition $s_0 = o(n/ (\log p)^2)$. Note that earlier work was limited to $s_0 = o(\sqrt{n}/\log p)$ even for perfectly known covariance. The same conclusion holds if the population covariance is unknown but can be estimated sufficiently well, e.g. under the same sparsity conditions on the inverse covariance as assumed by earlier work. For intermediate regimes, we describe the trade-off between sparsity in the coefficients and in the inverse covariance of the design. We further discuss several applications of our results to high-dimensional inference. In particular, we propose a new estimator that is minimax optimal up to a factor $1+o_n(1)$ for i.i.d. Gaussian designs.
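
The abstract does not spell out the debiased estimator. In the standard form from the debiasing literature it is a one-step correction of a Lasso estimate, $\hat{θ}^{\rm d} = \hat{θ} + \frac{1}{n} M X^{\top}(y - X\hat{θ})$, where $M$ stands in for the inverse population covariance (known, or estimated). The sketch below implements only this correction step and takes both the initial estimate and `M` as inputs; the function name `debias` is hypothetical, and the exact construction used in the paper may differ in its details.

```python
def debias(theta_hat, X, y, M):
    """One-step debiasing: theta_hat + (1/n) * M @ X^T @ (y - X @ theta_hat).

    theta_hat: initial (e.g. Lasso) estimate, length p
    X: n x p design matrix as a list of rows
    y: responses, length n
    M: p x p matrix standing in for the inverse covariance of the design
    """
    n, p = len(X), len(theta_hat)
    residual = [y[i] - sum(X[i][j] * theta_hat[j] for j in range(p))
                for i in range(n)]
    # score = X^T residual / n
    score = [sum(X[i][j] * residual[i] for i in range(n)) / n for j in range(p)]
    correction = [sum(M[j][k] * score[k] for k in range(p)) for j in range(p)]
    return [theta_hat[j] + correction[j] for j in range(p)]
```

When the initial estimate already fits the data exactly the residual is zero and the correction vanishes; otherwise the correction pushes shrunken coordinates back toward the values suggested by the residual.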

preprint, 2016, arXiv

How Well Do Local Algorithms Solve Semidefinite Programs?

Several probabilistic models from high-dimensional statistics and machine learning reveal an intriguing --and yet poorly understood-- dichotomy. Either simple local algorithms succeed in estimating the object of interest, or even sophisticated semi-definite programming (SDP) relaxations fail. In order to explore this phenomenon, we study a classical SDP relaxation of the minimum graph bisection problem, when applied to Erdős-Rényi random graphs with bounded average degree $d>1$, and obtain several types of results. First, we use a dual witness construction (using the so-called non-backtracking matrix of the graph) to upper bound the SDP value. Second, we prove that a simple local algorithm approximately solves the SDP to within a factor $2d^2/(2d^2+d-1)$ of the upper bound. In particular, the local algorithm is at most $8/9$ suboptimal, and $1+O(1/d)$ suboptimal for large degree. We then analyze a more sophisticated local algorithm, which aggregates information according to the harmonic measure on the limiting Galton-Watson (GW) tree. The resulting lower bound is expressed in terms of the conductance of the GW tree and matches surprisingly well the empirically determined SDP values on large-scale Erdős-Rényi graphs. We finally consider the planted partition model. In this case, purely local algorithms are known to fail, but they do succeed if a small amount of side information is available. Our results imply quantitative bounds on the threshold for partial recovery using SDP in this model.

preprint, 2016, arXiv

Performance of a community detection algorithm based on semidefinite programming

The problem of detecting communities in a graph is perhaps one of the most studied inference problems, given its simplicity and widespread diffusion among several disciplines. A very common benchmark for this problem is the stochastic block model or planted partition problem, where a phase transition takes place in the detection of the planted partition by changing the signal-to-noise ratio. Optimal algorithms for the detection exist which are based on spectral methods, but we show these are extremely sensitive to slight modifications in the generative model. Recently Javanmard, Montanari and Ricci-Tersenghi (arXiv:1511.08769) have used statistical physics arguments and numerical simulations to show that finding communities in the stochastic block model via semidefinite programming is quasi optimal. Further, the resulting semidefinite relaxation can be solved efficiently, and is very robust with respect to changes in the generative model. In this paper we study in detail several practical aspects of this new algorithm based on semidefinite programming for the detection of the planted partition. The algorithm turns out to be very fast, allowing the solution of problems with $O(10^5)$ variables in a few seconds on a laptop computer.

preprint, 2016, arXiv

Sparse PCA via Covariance Thresholding

In sparse principal component analysis we are given noisy observations of a low-rank matrix of dimension $n\times p$ and seek to reconstruct it under additional sparsity assumptions. In particular, we assume here each of the principal components $\mathbf{v}_1,\dots,\mathbf{v}_r$ has at most $s_0$ non-zero entries. We are particularly interested in the high dimensional regime wherein $p$ is comparable to, or even much larger than $n$. In an influential paper, \cite{johnstone2004sparse} introduced a simple algorithm that estimates the support of the principal vectors $\mathbf{v}_1,\dots,\mathbf{v}_r$ by the largest entries in the diagonal of the empirical covariance. This method can be shown to identify the correct support with high probability if $s_0\le K_1\sqrt{n/\log p}$, and to fail with high probability if $s_0\ge K_2 \sqrt{n/\log p}$ for two constants $0<K_1,K_2<\infty$. Despite a considerable amount of work over the last ten years, no practical algorithm exists with provably better support recovery guarantees. Here we analyze a covariance thresholding algorithm that was recently proposed by \cite{KrauthgamerSPCA}. On the basis of numerical simulations (for the rank-one case), these authors conjectured that covariance thresholding correctly recovers the support with high probability for $s_0\le K\sqrt{n}$ (assuming $n$ of the same order as $p$). We prove this conjecture, and in fact establish a more general guarantee including higher-rank as well as $n$ much smaller than $p$. Recent lower bounds \cite{berthet2013computational, ma2015sum} suggest that no polynomial time algorithm can do significantly better. The key technical component of our analysis develops new bounds on the norm of kernel random matrices, in regimes that were not considered before.
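
The baseline mentioned in the abstract, which selects the support from the largest diagonal entries of the empirical covariance, is simple enough to sketch directly. This is the diagonal method only, not the covariance thresholding algorithm the paper analyzes. The function name `diagonal_thresholding_support` is hypothetical, and the data are assumed to be centered so that the empirical covariance is $X^{\top}X/n$.

```python
def diagonal_thresholding_support(X, s0):
    """Estimate the support as the s0 coordinates with the largest empirical variance.

    X: n x p data matrix as a list of rows, assumed centered (mean zero).
    Returns the selected coordinate indices in increasing order.
    """
    n, p = len(X), len(X[0])
    # Diagonal of the empirical covariance X^T X / n.
    diag = [sum(row[j] ** 2 for row in X) / n for j in range(p)]
    ranked = sorted(range(p), key=lambda j: diag[j], reverse=True)
    return sorted(ranked[:s0])
```

Coordinates in the support of a principal component carry signal variance on top of the noise variance, which is why their diagonal entries tend to stand out when the signal is strong enough.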

preprint, 2016, arXiv

Spectral algorithms for tensor completion

In the tensor completion problem, one seeks to estimate a low-rank tensor based on a random sample of revealed entries. In terms of the required sample size, earlier work revealed a large gap between estimation with unbounded computational resources (using, for instance, tensor nuclear norm minimization) and polynomial-time algorithms. Among the latter, the best statistical guarantees have been proved, for third-order tensors, using the sixth level of the sum-of-squares (SOS) semidefinite programming hierarchy (Barak and Moitra, 2014). However, the SOS approach does not scale well to large problem instances. By contrast, spectral methods --- based on unfolding or matricizing the tensor --- are attractive for their low complexity, but have been believed to require a much larger sample size. This paper presents two main contributions. First, we propose a new unfolding-based method, which outperforms naive ones for symmetric $k$-th order tensors of rank $r$. For this result we make a study of singular space estimation for partially revealed matrices of large aspect ratio, which may be of independent interest. For third-order tensors, our algorithm matches the SOS method in terms of sample size (requiring about $rd^{3/2}$ revealed entries), subject to a worse rank condition ($r\ll d^{3/4}$ rather than $r\ll d^{3/2}$). We complement this result with a different spectral algorithm for third-order tensors in the overcomplete ($r\ge d$) regime. Under a random model, this second approach succeeds in estimating tensors of rank $d\le r \ll d^{3/2}$ from about $rd^{3/2}$ revealed entries.

preprint, 2015, arXiv

A Perspective on Future Research Directions in Information Theory

Information theory is rapidly approaching its 70th birthday. What are promising future directions for research in information theory? Where will information theory be having the most impact in 10-20 years? What new and emerging areas are ripe for the most impact, of the sort that information theory has had on the telecommunications industry over the last 60 years? How should the IEEE Information Theory Society promote high-risk new research directions and broaden the reach of information theory, while continuing to be true to its ideals and insisting on the intellectual rigor that makes its breakthroughs so powerful? These are some of the questions that an ad hoc committee (composed of the present authors) explored over the past two years. We have discussed and debated these questions, and solicited detailed inputs from experts in fields including genomics, biology, economics, and neuroscience. This report is the result of these discussions.

preprint2015arXiv

Asymptotic Mutual Information for the Two-Groups Stochastic Block Model

We develop an information-theoretic view of the stochastic block model, a popular statistical model for the large-scale structure of complex networks. A graph $G$ from such a model is generated by first assigning vertex labels at random from a finite alphabet, and then connecting vertices with edge probabilities depending on the labels of the endpoints. In the case of the symmetric two-group model, we establish an explicit `single-letter' characterization of the per-vertex mutual information between the vertex labels and the graph. The explicit expression of the mutual information is intimately related to estimation-theoretic quantities, and --in particular-- reveals a phase transition at the critical point for community detection. Below the critical point the per-vertex mutual information is asymptotically the same as if edges were independent. Correspondingly, no algorithm can estimate the partition better than random guessing. Conversely, above the threshold, the per-vertex mutual information is strictly smaller than the independent-edges upper bound. In this regime there exists a procedure that estimates the vertex labels better than random guessing.
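
The generative model described above can be sketched in a few lines. This is a minimal illustration, assuming two groups with labels in {+1, -1}; the function name and the parameters `p_in` and `p_out` (edge probabilities for equal and different labels) are hypothetical names, not notation from the paper.

```python
import random

def sample_two_group_sbm(n, p_in, p_out, rng):
    """Sample vertex labels and an undirected edge set from a symmetric two-group model."""
    # Assign each vertex a label in {+1, -1} uniformly at random.
    labels = [rng.choice((+1, -1)) for _ in range(n)]
    edges = set()
    for i in range(n):
        for j in range(i + 1, n):
            # The edge probability depends only on whether the endpoint labels agree.
            p = p_in if labels[i] == labels[j] else p_out
            if rng.random() < p:
                edges.add((i, j))
    return labels, edges

# Illustrative parameters (assumed): 30 vertices, denser connections within groups.
labels, edges = sample_two_group_sbm(n=30, p_in=0.5, p_out=0.1, rng=random.Random(0))
```

The estimation question studied in the abstract is how much `edges` reveals about `labels`; the sampler itself says nothing about where the critical point lies.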

preprint2015arXiv

Computational Implications of Reducing Data to Sufficient Statistics

Given a large dataset and an estimation task, it is common to pre-process the data by reducing them to a set of sufficient statistics. This step is often regarded as straightforward and advantageous (in that it simplifies statistical analysis). I show that -on the contrary- reducing data to sufficient statistics can change a computationally tractable estimation problem into an intractable one. I discuss connections with recent work in theoretical computer science, and implications for some techniques to estimate graphical models.

preprint2015arXiv

Convergence rates of sub-sampled Newton methods

We consider the problem of minimizing a sum of $n$ functions over a convex parameter set $\mathcal{C} \subset \mathbb{R}^p$ where $n\gg p\gg 1$. In this regime, algorithms which utilize sub-sampling techniques are known to be effective. In this paper, we use sub-sampling techniques together with low-rank approximation to design a new randomized batch algorithm which possesses comparable convergence rate to Newton's method, yet has much smaller per-iteration cost. The proposed algorithm is robust in terms of starting point and step size, and enjoys a composite convergence rate, namely, quadratic convergence at start and linear convergence when the iterate is close to the minimizer. We develop its theoretical analysis which also allows us to select near-optimal algorithm parameters. Our theoretical results can be used to obtain convergence rates of previously proposed sub-sampling based algorithms as well. We demonstrate how our results apply to well-known machine learning problems. Lastly, we evaluate the performance of our algorithm on several datasets under various scenarios.
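
To make the idea of a sub-sampled second-order step concrete, here is a deliberately simplified sketch in one dimension. It assumes a least-squares objective, uses the full gradient with a curvature estimate from a random subset, and omits both the low-rank approximation and the constraint set $\mathcal{C}$ mentioned above. The function name and all parameters are hypothetical; this is not the algorithm proposed in the paper.

```python
import random

def subsampled_newton_1d(a, b, sample_size, steps, rng):
    """Minimize (1/n) * sum_i 0.5 * (a[i] * theta - b[i])**2 over a scalar theta."""
    n = len(a)
    theta = 0.0
    for _ in range(steps):
        # Full gradient of the average loss.
        grad = sum(ai * (ai * theta - bi) for ai, bi in zip(a, b)) / n
        # Curvature estimated from a random subset of the n terms.
        subset = rng.sample(range(n), sample_size)
        hess = sum(a[i] ** 2 for i in subset) / sample_size
        theta -= grad / hess
    return theta

rng = random.Random(0)
a = [1.0 + 0.3 * rng.random() for _ in range(200)]
b = [2.0 * ai + 0.1 * (rng.random() - 0.5) for ai in a]
theta_hat = subsampled_newton_1d(a, b, sample_size=20, steps=50, rng=rng)
# Closed-form least-squares solution, for comparison.
theta_exact = sum(ai * bi for ai, bi in zip(a, b)) / sum(ai * ai for ai in a)
```

For this quadratic toy problem the error shrinks by the factor $|1 - H/H_S|$ per step, where $H$ is the true curvature and $H_S$ its sub-sampled estimate, which is a simple instance of linear convergence near the minimizer.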

preprint2015arXiv

Finding One Community in a Sparse Graph

We consider a random sparse graph with bounded average degree, in which a subset of vertices has higher connectivity than the background. In particular, the average degree inside this subset of vertices is larger than outside (but still bounded). Given a realization of such a graph, we aim at identifying the hidden subset of vertices. This can be regarded as a model for the problem of finding a tightly knit community in a social network, or a cluster in a relational dataset. In this paper we present two sets of contributions: $(i)$ We use the cavity method from spin glass theory to derive an exact phase diagram for the reconstruction problem. In particular, as the difference in edge probability increases, the problem undergoes two phase transitions, a static phase transition and a dynamic one. $(ii)$ We establish rigorous bounds on the dynamic phase transition and prove that, above a certain threshold, a local algorithm (belief propagation) correctly identifies most of the hidden set. Below the same threshold \emph{no local algorithm} can achieve this goal. However, in this regime the subset can be identified by exhaustive search. For small hidden sets and large average degree, the phase transition for local algorithms takes an intriguingly simple form. Local algorithms succeed with high probability for ${\rm deg}_{\rm in} - {\rm deg}_{\rm out} > \sqrt{{\rm deg}_{\rm out}/e}$ and fail for ${\rm deg}_{\rm in} - {\rm deg}_{\rm out} < \sqrt{{\rm deg}_{\rm out}/e}$ (with ${\rm deg}_{\rm in}$, ${\rm deg}_{\rm out}$ the average degrees inside and outside the community). We argue that spectral algorithms are also ineffective in the latter regime. It is an open problem whether any polynomial time algorithms might succeed for ${\rm deg}_{\rm in} - {\rm deg}_{\rm out} < \sqrt{{\rm deg}_{\rm out}/e}$.

preprint2015arXiv

Improved Sum-of-Squares Lower Bounds for Hidden Clique and Hidden Submatrix Problems

Given a large data matrix $A\in\mathbb{R}^{n\times n}$, we consider the problem of determining whether its entries are i.i.d. with some known marginal distribution $A_{ij}\sim P_0$, or instead $A$ contains a principal submatrix $A_{{\sf Q},{\sf Q}}$ whose entries have marginal distribution $A_{ij}\sim P_1\neq P_0$. As a special case, the hidden (or planted) clique problem requires finding a planted clique in an otherwise uniformly random graph. Assuming unbounded computational resources, this hypothesis testing problem is statistically solvable provided $|{\sf Q}|\ge C \log n$ for a suitable constant $C$. However, despite substantial effort, no polynomial time algorithm is known that succeeds with high probability when $|{\sf Q}| = o(\sqrt{n})$. Recently, Meka and Wigderson \cite{meka2013association} proposed a method to establish lower bounds within the Sum of Squares (SOS) semidefinite hierarchy. Here we consider the degree-$4$ SOS relaxation, and study the construction of \cite{meka2013association} to prove that SOS fails unless $k\ge C\, n^{1/3}/\log n$. An argument presented by Barak implies that this lower bound cannot be substantially improved unless the witness construction is changed in the proof. Our proof uses the moments method to bound the spectrum of a certain random association scheme, i.e. a symmetric random matrix whose rows and columns are indexed by the edges of an Erdős-Rényi random graph.

preprint2015arXiv

On Online Control of False Discovery Rate

Multiple hypotheses testing is a core problem in statistical inference and arises in almost every scientific field. Given a sequence of null hypotheses $\mathcal{H}(n) = (H_1,..., H_n)$, Benjamini and Hochberg \cite{benjamini1995controlling} introduced the false discovery rate (FDR) criterion, which is the expected proportion of false positives among rejected null hypotheses, and proposed a testing procedure that controls FDR below a pre-assigned significance level. They also proposed a different criterion, called mFDR, which does not control a property of the realized set of tests; rather it controls the ratio of the expected number of false discoveries to the expected number of discoveries. In this paper, we propose two procedures for multiple hypotheses testing that we will call "LOND" and "LORD". These procedures control FDR and mFDR in an \emph{online manner}. Concretely, we consider an ordered --possibly infinite-- sequence of null hypotheses $\mathcal{H} = (H_1,H_2,H_3,...)$ where, at each step $i$, the statistician must decide whether to reject hypothesis $H_i$ having access only to the previous decisions. To the best of our knowledge, our work is the first that controls FDR in this setting. This model was introduced by Foster and Stine \cite{alpha-investing}, whose alpha-investing rule only controls mFDR in an online manner. In order to compare different procedures, we develop lower bounds on the total discovery rate under the mixture model and prove that both LOND and LORD have a nearly linear number of discoveries. We further propose an adjustment to LOND to address arbitrary correlation among the $p$-values. Finally, we evaluate the performance of our procedures on both synthetic and real data, comparing them with the alpha-investing rule, the Benjamini-Hochberg method and a Bonferroni procedure.
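
The online setting can be illustrated with a short sketch. The abstract does not spell out the decision rule, so the code below is an assumption: a LOND-style rule in which hypothesis $H_i$ is rejected when $p_i \le \beta_i\,(D_{i-1}+1)$, where $D_{i-1}$ is the number of discoveries made so far and the $\beta_i$ sum to the target level. The specific choice $\beta_i = 6\alpha/(\pi^2 i^2)$ and the function name are illustrative only.

```python
import math

def lond_style_decisions(p_values, alpha):
    """Online rule: reject H_i when p_i <= beta_i * (discoveries so far + 1)."""
    decisions = []
    discoveries = 0
    for i, p in enumerate(p_values, start=1):
        # beta_i = alpha * 6 / (pi^2 * i^2); these weights sum to alpha over i = 1, 2, ...
        beta_i = alpha * 6.0 / (math.pi ** 2 * i ** 2)
        # The threshold uses only the count of earlier rejections, never later p-values.
        reject = p <= beta_i * (discoveries + 1)
        decisions.append(reject)
        discoveries += reject
    return decisions

decisions = lond_style_decisions([0.001, 0.5, 0.004, 0.9, 0.002], alpha=0.05)
```

In this example the third and fifth hypotheses are rejected only because earlier discoveries raised their thresholds; with a fixed threshold $\beta_i$ both would be retained.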

preprint2015arXiv

The LASSO risk for gaussian matrices

We consider the problem of learning a coefficient vector $x_0\in \mathbb{R}^N$ from noisy linear observations $y=Ax_0+w \in \mathbb{R}^n$. In many contexts (ranging from model selection to image processing) it is desirable to construct a sparse estimator $x'$. In this case, a popular approach consists in solving an $\ell_1$-penalized least squares problem known as the LASSO or Basis Pursuit DeNoising (BPDN). For sequences of matrices $A$ of increasing dimensions, with independent Gaussian entries, we prove that the normalized risk of the LASSO converges to a limit, and we obtain an explicit expression for this limit. Our result is the first rigorous derivation of an explicit formula for the asymptotic mean square error of the LASSO for random instances. The proof technique is based on the analysis of AMP, a recently developed efficient algorithm that is inspired by ideas from graphical models. Simulations on real data matrices suggest that our results can be relevant in a broad array of practical applications.
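
As a small illustration of the $\ell_1$-penalized least squares objective, the sketch below treats only the special case of an identity design ($A = I$), where the minimizer is known in closed form as coordinate-wise soft thresholding. The Gaussian designs studied in the paper require an iterative solver, which is not shown; the function names and the penalty level `lam` are hypothetical.

```python
def soft_threshold(v, lam):
    """Shrink v toward zero by lam, returning 0.0 when |v| <= lam."""
    if v > lam:
        return v - lam
    if v < -lam:
        return v + lam
    return 0.0

def lasso_objective_identity(x, y, lam):
    """0.5 * ||y - x||^2 + lam * ||x||_1, i.e. the LASSO cost when A is the identity."""
    fit = 0.5 * sum((yi - xi) ** 2 for xi, yi in zip(x, y))
    penalty = lam * sum(abs(xi) for xi in x)
    return fit + penalty

y = [3.0, -0.5, 1.2]
lam = 1.0
# For A = I the objective separates across coordinates, so each entry is thresholded on its own.
x_hat = [soft_threshold(yi, lam) for yi in y]
```

Entries of `y` with magnitude below `lam` are set exactly to zero, which is the mechanism by which the penalty produces sparse estimates.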

preprint2015arXiv

The set of solutions of random XORSAT formulae

The XOR-satisfiability (XORSAT) problem requires finding an assignment of $n$ Boolean variables that satisfies $m$ exclusive OR (XOR) clauses, whereby each clause constrains a subset of the variables. We consider random XORSAT instances, drawn uniformly at random from the ensemble of formulae containing $n$ variables and $m$ clauses of size $k$. This model presents several structural similarities to other ensembles of constraint satisfaction problems, such as $k$-satisfiability ($k$-SAT), hypergraph bicoloring and graph coloring. For many of these ensembles, as the number of constraints per variable grows, the set of solutions shatters into an exponential number of well-separated components. This phenomenon appears to be related to the difficulty of solving random instances of such problems. We prove a complete characterization of this clustering phase transition for random $k$-XORSAT. In particular, we prove that the clustering threshold is sharp and determine its exact location. We prove that the set of solutions has large conductance below this threshold and that each of the clusters has large conductance above the same threshold. Our proof constructs a very sparse basis for the set of solutions (or the subset within a cluster). This construction is intimately tied to the construction of specific subgraphs of the hypergraph associated with an instance of $k$-XORSAT. In order to study such subgraphs, we establish novel local weak convergence results for them.
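
Because each XOR clause is a linear equation over GF(2), the solution set of a fixed instance can be examined with Gaussian elimination. The sketch below counts solutions of a small instance; it illustrates the problem definition only, not the clustering analysis of the paper. The clause encoding (a list of variable indices plus a parity bit) and the function name are assumptions.

```python
def count_xorsat_solutions(n, clauses):
    """Count assignments of n Boolean variables satisfying every XOR clause.

    Each clause is (variables, parity): the XOR of the listed variables must equal parity.
    """
    pivots = {}  # highest set bit -> reduced row (bitmask of variables, right-hand side)
    for variables, parity in clauses:
        mask = 0
        for v in variables:
            mask ^= 1 << v
        rhs = parity
        # Eliminate against stored rows, always cancelling the highest remaining bit.
        while mask:
            top = mask.bit_length() - 1
            if top not in pivots:
                pivots[top] = (mask, rhs)
                break
            pivot_mask, pivot_rhs = pivots[top]
            mask ^= pivot_mask
            rhs ^= pivot_rhs
        else:
            # The row reduced to 0 = rhs; a nonzero rhs means the system is inconsistent.
            if rhs:
                return 0
    # A consistent system of rank r over GF(2) has 2^(n - r) solutions.
    return 2 ** (n - len(pivots))

# x0^x1 = 1, x1^x2 = 0, x0^x2 = 1: the third clause is implied by the first two.
num_solutions = count_xorsat_solutions(3, [([0, 1], 1), ([1, 2], 0), ([0, 2], 1)])
```

Storing rows as integer bitmasks keeps elimination to a few XOR operations per clause, which is convenient for experimenting with random instances at small $n$.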

preprint2015arXiv

Universality in polytope phase transitions and message passing algorithms

We consider a class of nonlinear mappings $\mathsf{F}_{A,N}$ in $\mathbb{R}^N$ indexed by symmetric random matrices $A\in\mathbb{R}^{N\times N}$ with independent entries. Within spin glass theory, special cases of these mappings correspond to iterating the TAP equations and were studied by Bolthausen [Comm. Math. Phys. 325 (2014) 333-366]. Within information theory, they are known as "approximate message passing" algorithms. We study the high-dimensional (large $N$) behavior of the iterates of $\mathsf{F}$ for polynomial functions $\mathsf{F}$, and prove that it is universal; that is, it depends only on the first two moments of the entries of $A$, under a sub-Gaussian tail condition. As an application, we prove the universality of a certain phase transition arising in polytope geometry and compressed sensing. This solves, for a broad class of random projections, a conjecture by David Donoho and Jared Tanner.

preprint2015arXiv

Variance Breakdown of Huber (M)-estimators: $n/p \rightarrow m \in (1,\infty)$

A half century ago, Huber evaluated the minimax asymptotic variance in scalar location estimation, $ \min_ψ\max_{F \in {\cal F}_ε} V(ψ, F) = \frac{1}{I(F_ε^*)} $, where $V(ψ,F)$ denotes the asymptotic variance of the $(M)$-estimator for location with score function $ψ$, and $I(F_ε^*)$ is the minimal Fisher information $ \min_{{\cal F}_ε} I(F)$ over the class of $ε$-Contaminated Normal distributions. We consider the linear regression model $Y = Xθ_0 + W$, $W_i\sim_{\text{i.i.d.}}F$, and iid Normal predictors $X_{i,j}$, working in the high-dimensional-limit asymptotic where the number $n$ of observations and $p$ of variables both grow large, while $n/p \rightarrow m \in (1,\infty)$; hence $m$ plays the role of `asymptotic number of observations per parameter estimated'. Let $V_m(ψ,F)$ denote the per-coordinate asymptotic variance of the $(M)$-estimator of regression in the $n/p \rightarrow m$ regime. Then $V_m \neq V$; however $V_m \rightarrow V$ as $m \rightarrow \infty$. In this paper we evaluate the minimax asymptotic variance of the Huber $(M)$-estimate. The statistician minimizes over the family $(ψ_λ)_{λ> 0}$ of all tunings of Huber $(M)$-estimates of regression, and Nature maximizes over gross-error contaminations $F \in {\cal F}_ε$. Suppose that $I(F_ε^*) \cdot m > 1$. Then $ \min_λ\max_{F \in {\cal F}_ε} V_m(ψ_λ, F) = \frac{1}{I(F_ε^*) - 1/m} $. Strikingly, if $I(F_ε^*) \cdot m \leq 1$, then the minimax asymptotic variance is $+\infty$. The breakdown point is where the Fisher information per parameter equals unity.
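
The minimax formula stated above is easy to evaluate numerically. The sketch below encodes the two cases ($I(F_ε^*)\cdot m > 1$ and $I(F_ε^*)\cdot m \le 1$) as given in the abstract; the function name is hypothetical, and the Fisher information value is passed in as an input rather than computed from $ε$.

```python
import math

def huber_minimax_variance(fisher_info, m):
    """Minimax asymptotic variance 1 / (I - 1/m), or +infinity when I * m <= 1."""
    if fisher_info * m <= 1.0:
        # Breakdown regime: the Fisher information per parameter is at most one.
        return math.inf
    return 1.0 / (fisher_info - 1.0 / m)

# With I = 1 and two observations per parameter, the variance doubles relative to 1 / I.
variance_at_m2 = huber_minimax_variance(1.0, 2.0)
```

As $m$ grows the correction $1/m$ vanishes and the value approaches the classical $1/I(F_ε^*)$, matching the statement that $V_m \rightarrow V$ as $m \rightarrow \infty$.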

preprint2014arXiv

A statistical model for tensor PCA

We consider the Principal Component Analysis problem for large tensors of arbitrary order $k$ under a single-spike (or rank-one plus noise) model. On the one hand, we use information theory, and recent results in probability theory, to establish necessary and sufficient conditions under which the principal component can be estimated using unbounded computational resources. It turns out that this is possible as soon as the signal-to-noise ratio $β$ becomes larger than $C\sqrt{k\log k}$ (and in particular $β$ can remain bounded as the problem dimensions increase). On the other hand, we analyze several polynomial-time estimation algorithms, based on tensor unfolding, power iteration and message passing ideas from graphical models. We show that, unless the signal-to-noise ratio diverges in the system dimensions, none of these approaches succeeds. This is possibly related to a fundamental limitation of computationally tractable estimators for this problem. We discuss various initializations for tensor power iteration, and show that a tractable initialization based on the spectrum of the matricized tensor outperforms significantly baseline methods, statistically and computationally. Finally, we consider the case in which additional side information is available about the unknown signal. We characterize the amount of side information that allows the iterative algorithms to converge to a good estimate.
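
Tensor power iteration, one of the algorithm families mentioned above, can be sketched for order $k=3$ as follows. The noise scaling, the dimension, and the initialization are all assumptions chosen for illustration; in particular the spectral initialization from the matricized tensor discussed in the abstract is not implemented here.

```python
import math
import random

def spiked_tensor(v, beta, sigma, rng):
    """Order-3 tensor beta * v (x) v (x) v plus i.i.d. Gaussian noise of standard deviation sigma."""
    n = len(v)
    return [[[beta * v[i] * v[j] * v[k] + sigma * rng.gauss(0.0, 1.0)
              for k in range(n)] for j in range(n)] for i in range(n)]

def tensor_power_iteration(T, u, steps):
    """Repeat u <- T(., u, u) / ||T(., u, u)|| starting from the vector u."""
    n = len(u)
    for _ in range(steps):
        # Contract the tensor with u along two modes.
        w = [sum(T[i][j][k] * u[j] * u[k] for j in range(n) for k in range(n))
             for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        u = [x / norm for x in w]
    return u

n = 5
scale = math.sqrt(sum(t * t for t in range(1, n + 1)))
v = [t / scale for t in range(1, n + 1)]   # unit-norm planted vector (assumed)
u0 = [1.0 / math.sqrt(n)] * n              # initialization with positive overlap (assumed)
T = spiked_tensor(v, beta=1.0, sigma=0.05, rng=random.Random(0))
u_hat = tensor_power_iteration(T, u0, steps=10)
overlap = sum(a * b for a, b in zip(u_hat, v))
```

Here the initialization already correlates with `v` and the noise is small, so the iteration lands near the planted vector; the abstract concerns the much harder regime where neither holds.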

preprint2014arXiv

Confidence Intervals and Hypothesis Testing for High-Dimensional Regression

Fitting high-dimensional statistical models often requires the use of non-linear parameter estimation procedures. As a consequence, it is generally impossible to obtain an exact characterization of the probability distribution of the parameter estimates. This in turn implies that it is extremely challenging to quantify the \emph{uncertainty} associated with a certain parameter estimate. Concretely, no commonly accepted procedure exists for computing classical measures of uncertainty and statistical significance such as confidence intervals or $p$-values for these models. We consider here the high-dimensional linear regression problem, and propose an efficient algorithm for constructing confidence intervals and $p$-values. The resulting confidence intervals have nearly optimal size. When testing for the null hypothesis that a certain parameter is vanishing, our method has nearly optimal power. Our approach is based on constructing a `de-biased' version of regularized M-estimators. The new construction improves over recent work in the field in that it does not assume a special structure on the design matrix. We test our method on synthetic data and a high-throughput genomic data set about riboflavin production rate.

preprint2014arXiv

Guess Who Rated This Movie: Identifying Users Through Subspace Clustering

It is often the case that, within an online recommender system, multiple users share a common account. Can such shared accounts be identified solely on the basis of the user-provided ratings? Once a shared account is identified, can the different users sharing it be identified as well? Whenever such user identification is feasible, it opens the way to possible improvements in personalized recommendations, but also raises privacy concerns. We develop a model for composite accounts based on unions of linear subspaces, and use subspace clustering for carrying out the identification task. We show that a significant fraction of such accounts is identifiable in a reliable manner, and illustrate potential uses for personalized recommendation.

preprint2014arXiv

Hypothesis Testing in High-Dimensional Regression under the Gaussian Random Design Model: Asymptotic Theory

We consider linear regression in the high-dimensional regime where the number of observations $n$ is smaller than the number of parameters $p$. A very successful approach in this setting uses $\ell_1$-penalized least squares (a.k.a. the Lasso) to search for a subset of $s_0< n$ parameters that best explain the data, while setting the other parameters to zero. A considerable amount of work has been devoted to characterizing the estimation and model selection problems within this approach. In this paper we consider instead the fundamental, but far less understood, question of \emph{statistical significance}. More precisely, we address the problem of computing p-values for single regression coefficients. On one hand, we develop a general upper bound on the minimax power of tests with a given significance level. On the other, we prove that this upper bound is (nearly) achievable through a practical procedure in the case of random design matrices with independent entries. Our approach is based on a debiasing of the Lasso estimator. The analysis builds on a rigorous characterization of the asymptotic distribution of the Lasso estimator and its debiased version. Our result holds for optimal sample size, i.e., when $n$ is at least on the order of $s_0 \log(p/s_0)$. We generalize our approach to random design matrices with i.i.d. Gaussian rows $x_i\sim N(0,Σ)$. In this case we prove that a similar distributional characterization (termed `standard distributional limit') holds for $n$ much larger than $s_0(\log p)^2$. Finally, we show that for optimal sample size, $n$ being at least of order $s_0 \log(p/s_0)$, the standard distributional limit for general Gaussian designs can be derived from the replica heuristics in statistical physics.

preprint2014arXiv

Information-theoretically Optimal Sparse PCA

Sparse Principal Component Analysis (PCA) is a dimensionality reduction technique wherein one seeks a low-rank representation of a data matrix with additional sparsity constraints on the obtained representation. We consider two probabilistic formulations of sparse PCA: a spiked Wigner and spiked Wishart (or spiked covariance) model. We analyze an Approximate Message Passing (AMP) algorithm to estimate the underlying signal and show, in the high dimensional limit, that the AMP estimates are information-theoretically optimal. As an immediate corollary, our results demonstrate that the posterior expectation of the underlying signal, which is often intractable to compute, can be obtained using a polynomial-time scheme. Our results also effectively provide a single-letter characterization of the sparse PCA problem.

preprint2014arXiv

Learning Mixtures of Linear Classifiers

We consider a discriminative learning (regression) problem, whereby the regression function is a convex combination of k linear classifiers. Existing approaches are based on the EM algorithm, or similar techniques, without provable guarantees. We develop a simple method based on spectral techniques and a `mirroring' trick, that discovers the subspace spanned by the classifiers' parameter vectors. Under a probabilistic assumption on the feature vector distribution, we prove that this approach has nearly optimal statistical efficiency.

preprint2014arXiv

Non-negative Principal Component Analysis: Message Passing Algorithms and Sharp Asymptotics

Principal component analysis (PCA) aims at estimating the direction of maximal variability of a high-dimensional dataset. A natural question is: does this task become easier, and estimation more accurate, when we exploit additional knowledge on the principal vector? We study the case in which the principal vector is known to lie in the positive orthant. Similar constraints arise in a number of applications, ranging from analysis of gene expression data to spike sorting in neural signal processing. In the unconstrained case, the estimation performance of PCA has been precisely characterized using random matrix theory, under a statistical model known as the `spiked model.' It is known that the estimation error undergoes a phase transition as the signal-to-noise ratio crosses a certain threshold. Unfortunately, tools from random matrix theory have no bearing on the constrained problem. Despite this challenge, we develop an analogous characterization in the constrained case, within a one-spike model. In particular: $(i)$~We prove that the estimation error undergoes a similar phase transition, albeit at a different threshold in signal-to-noise ratio that we determine exactly; $(ii)$~We prove that --unlike in the unconstrained case-- estimation error depends on the spike vector, and characterize the least favorable vectors; $(iii)$~We show that a non-negative principal component can be approximately computed --under the spiked model-- in nearly linear time. This despite the fact that the problem is non-convex and, in general, NP-hard to solve exactly.

preprint2014arXiv

On the limitation of spectral methods: From the Gaussian hidden clique problem to rank one perturbations of Gaussian tensors

We consider the following detection problem: given a realization of a symmetric matrix ${\mathbf{X}}$ of dimension $n$, distinguish between the hypothesis that all upper triangular variables are i.i.d. Gaussian variables with mean 0 and variance $1$ and the hypothesis where ${\mathbf{X}}$ is the sum of such a matrix and an independent rank-one perturbation. This setup applies to the situation where under the alternative, there is a planted principal submatrix ${\mathbf{B}}$ of size $L$ for which all upper triangular variables are i.i.d. Gaussians with mean $1$ and variance $1$, whereas all other upper triangular elements of ${\mathbf{X}}$ not in ${\mathbf{B}}$ are i.i.d. Gaussian variables with mean 0 and variance $1$. We refer to this as the `Gaussian hidden clique problem.' When $L=(1+ε)\sqrt{n}$ ($ε>0$), it is possible to solve this detection problem with probability $1-o_n(1)$ by computing the spectrum of ${\mathbf{X}}$ and considering the largest eigenvalue of ${\mathbf{X}}$. We prove that this condition is tight in the following sense: when $L<(1-ε)\sqrt{n}$ no algorithm that examines only the eigenvalues of ${\mathbf{X}}$ can detect the existence of a hidden Gaussian clique, with error probability vanishing as $n\to\infty$. We prove this result as an immediate consequence of a more general result on rank-one perturbations of $k$-dimensional Gaussian tensors. In this context we establish a lower bound on the critical signal-to-noise ratio below which a rank-one signal cannot be detected.

preprint2014arXiv

Privacy Tradeoffs in Predictive Analytics

Online services routinely mine user data to predict user preferences, make recommendations, and place targeted ads. Recent research has demonstrated that several private user attributes (such as political affiliation, sexual orientation, and gender) can be inferred from such data. Can a privacy-conscious user benefit from personalization while simultaneously protecting her private attributes? We study this question in the context of a rating prediction service based on matrix factorization. We construct a protocol of interactions between the service and users that has remarkable optimality properties: it is privacy-preserving, in that no inference algorithm can succeed in inferring a user's private attribute with a probability better than random guessing; it has maximal accuracy, in that no other privacy-preserving protocol improves rating prediction; and, finally, it involves a minimal disclosure, as the prediction accuracy strictly decreases when the service reveals less information. We extensively evaluate our protocol using several rating datasets, demonstrating that it successfully blocks the inference of gender, age and political affiliation, while incurring less than 5% decrease in the accuracy of rating prediction.

preprint2014arXiv

Statistical Estimation: From Denoising to Sparse Regression and Hidden Cliques

These notes review six lectures given by Prof. Andrea Montanari on the topic of statistical estimation for linear models. The first two lectures cover the principles of signal recovery from linear measurements in terms of minimax risk. Subsequent lectures demonstrate the application of these principles to several practical problems in science and engineering. Specifically, these topics include denoising of error-laden signals, recovery of compressively sensed signals, reconstruction of low-rank matrices, and also the discovery of hidden cliques within large networks.

preprint2013arXiv

Accelerated Time-of-Flight Mass Spectrometry

We study a simple modification to the conventional time of flight mass spectrometry (TOFMS) where a \emph{variable} and (pseudo)-\emph{random} pulsing rate is used which allows for traces from different pulses to overlap. This modification requires little alteration to the currently employed hardware. However, it requires a reconstruction method to recover the spectrum from highly aliased traces. We propose and demonstrate an efficient algorithm that can process massive TOFMS data using computational resources that can be considered modest with today's standards. This approach can be used to improve duty cycle, speed, and mass resolving power of TOFMS at the same time. We expect this to extend the applicability of TOFMS to new domains.

preprint2013arXiv

Accurate Prediction of Phase Transitions in Compressed Sensing via a Connection to Minimax Denoising

Compressed sensing posits that, within limits, one can undersample a sparse signal and yet reconstruct it accurately. Knowing the precise limits to such undersampling is important both for theory and practice. We present a formula that characterizes the allowed undersampling of generalized sparse objects. The formula applies to Approximate Message Passing (AMP) algorithms for compressed sensing, which are here generalized to employ denoising operators besides the traditional scalar soft thresholding denoiser. This paper gives several examples including scalar denoisers not derived from convex penalization -- the firm shrinkage nonlinearity and the minimax nonlinearity -- and also nonscalar denoisers -- block thresholding, monotone regression, and total variation minimization. Let the variables $\epsilon = k/N$ and $\delta = n/N$ denote the generalized sparsity and undersampling fractions for sampling the $k$-generalized-sparse $N$-vector $x_0$ according to $y=Ax_0$. Here $A$ is an $n\times N$ measurement matrix whose entries are iid standard Gaussian. The formula states that the phase transition curve $\delta = \delta(\epsilon)$ separating successful from unsuccessful reconstruction of $x_0$ by AMP is given by: $\delta = M(\epsilon\,|\,\text{Denoiser})$, where $M(\epsilon\,|\,\text{Denoiser})$ denotes the per-coordinate minimax mean squared error (MSE) of the specified, optimally-tuned denoiser in the directly observed problem $y = x + z$. In short, the phase transition of a noiseless undersampling problem is identical to the minimax MSE in a denoising problem.

preprint2013arXiv

Conditional Random Fields, Planted Constraint Satisfaction, and Entropy Concentration

This paper studies a class of probabilistic models on graphs, where edge variables depend on incident node variables through a fixed probability kernel. The class includes planted constraint satisfaction problems (CSPs), as well as more general structures motivated by coding and community clustering problems. It is shown that under mild assumptions on the kernel and for sparse random graphs, the conditional entropy of the node variables given the edge variables concentrates around a deterministic threshold. This implies in particular the concentration of the number of solutions in a broad class of planted CSPs, the existence of a threshold function for the disassortative stochastic block model, and the proof of a conjecture on parity check codes. It also establishes new connections among coding, clustering and satisfiability.

preprint2013arXiv

Factor models on locally tree-like graphs

We consider homogeneous factor models on uniformly sparse graph sequences converging locally to a (unimodular) random tree $T$, and study the existence of the free energy density $ϕ$, the limit of the log-partition function divided by the number of vertices $n$ as $n$ tends to infinity. We provide a new interpolation scheme and use it to prove existence of, and to explicitly compute, the quantity $ϕ$ subject to uniqueness of a relevant Gibbs measure for the factor model on $T$. By way of example we compute $ϕ$ for the independent set (or hard-core) model at low fugacity, for the ferromagnetic Ising model at all parameter values, and for the ferromagnetic Potts model with both weak enough and strong enough interactions. Even beyond uniqueness regimes our interpolation provides useful explicit bounds on $ϕ$. In the regimes in which we establish existence of the limit, we show that it coincides with the Bethe free energy functional evaluated at a suitable fixed point of the belief propagation (Bethe) recursions on $T$. In the special case that $T$ has a Galton-Watson law, this formula coincides with the nonrigorous "Bethe prediction" obtained by statistical physicists using the "replica" or "cavity" methods. Thus our work is a rigorous generalization of these heuristic calculations to the broader class of sparse graph sequences converging locally to trees. We also provide a variational characterization for the Bethe prediction in this general setting, which is of independent interest.

preprint2013arXiv

Finding Hidden Cliques of Size \sqrt{N/e} in Nearly Linear Time

Consider an Erdős-Rényi random graph in which each edge is present independently with probability 1/2, except for a subset $\mathsf{C}_N$ of the vertices that form a clique (a completely connected subgraph). We consider the problem of identifying the clique, given a realization of such a random graph. The best known algorithm provably finds the clique in linear time with high probability, provided $|\mathsf{C}_N|\ge 1.261\sqrt{N}$ \cite{dekel2011finding}. Spectral methods can be shown to fail on cliques smaller than $\sqrt{N}$. In this paper we describe a nearly linear time algorithm that succeeds with high probability for $|\mathsf{C}_N|\ge (1+\varepsilon)\sqrt{N/e}$ for any $\varepsilon>0$. This is the first algorithm that provably improves over spectral methods. We further generalize the hidden clique problem to other background graphs (the standard case corresponding to the complete graph on $N$ vertices). For large girth regular graphs of degree $(Δ+1)$ we prove that `local' algorithms succeed if $|\mathsf{C}_N|\ge (1+\varepsilon)N/\sqrt{eΔ}$ and fail if $|\mathsf{C}_N|\le(1-\varepsilon)N/\sqrt{eΔ}$.
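
To get a feel for the three clique-size scales quoted above, the following sketch evaluates them for a given number of vertices. The function and dictionary key names are hypothetical; the constants $1.261$, $1$ and $1/\sqrt{e}$ are taken directly from the abstract, and the $(1+\varepsilon)$ factor is ignored.

```python
import math

def clique_size_scales(num_vertices):
    """Clique sizes at which each approach in the abstract is stated to succeed or fail."""
    root = math.sqrt(num_vertices)
    return {
        "earlier_linear_time": 1.261 * root,                # 1.261 * sqrt(N)
        "spectral_limit": root,                             # sqrt(N)
        "this_paper": math.sqrt(num_vertices / math.e),     # sqrt(N / e)
    }

scales = clique_size_scales(10_000)
```

For $N = 10{,}000$ these come out to roughly 126, 100 and 61 vertices respectively, so the improvement over the spectral scale is a constant factor of $1/\sqrt{e} \approx 0.61$.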

preprint2013arXiv

High Dimensional Robust M-Estimation: Asymptotic Variance via Approximate Message Passing

In a recent article (Proc. Natl. Acad. Sci., 110(36), 14557-14562), El Karoui et al. study the distribution of robust regression estimators in the regime in which the number of parameters p is of the same order as the number of samples n. Using numerical simulations and `highly plausible' heuristic arguments, they unveil a striking new phenomenon. Namely, the regression coefficients contain an extra Gaussian noise component that is not explained by classical concepts such as the Fisher information matrix. We show here that this phenomenon can be characterized rigorously using techniques that were developed by the authors to analyze the Lasso estimator under high-dimensional asymptotics. We introduce an approximate message passing (AMP) algorithm to compute M-estimators and deploy state evolution to evaluate the operating characteristics of AMP and so also M-estimates. Our analysis clarifies that the `extra Gaussian noise' encountered in this problem is fundamentally similar to phenomena already studied for regularized least squares in the setting n<p.

preprint2013arXiv

Information-Theoretically Optimal Compressed Sensing via Spatial Coupling and Approximate Message Passing

We study the compressed sensing reconstruction problem for a broad class of random, band-diagonal sensing matrices. This construction is inspired by the idea of spatial coupling in coding theory. As demonstrated heuristically and numerically by Krzakala et al. \cite{KrzakalaEtAl}, message passing algorithms can effectively solve the reconstruction problem for spatially coupled measurements with undersampling rates close to the fraction of non-zero coordinates. We use an approximate message passing (AMP) algorithm and analyze it through the state evolution method. We give a rigorous proof that this approach is successful as soon as the undersampling rate $δ$ exceeds the (upper) Rényi information dimension of the signal, $\uRenyi(p_X)$. More precisely, for a sequence of signals of diverging dimension $n$ whose empirical distribution converges to $p_X$, reconstruction is with high probability successful from $\uRenyi(p_X)\, n+o(n)$ measurements taken according to a band diagonal matrix. For sparse signals, i.e., sequences of dimension $n$ and $k(n)$ non-zero entries, this implies reconstruction from $k(n)+o(n)$ measurements. For `discrete' signals, i.e., signals whose coordinates take a fixed finite set of values, this implies reconstruction from $o(n)$ measurements. The result is robust with respect to noise, does not apply uniquely to random signals, but requires the knowledge of the empirical distribution of the signal $p_X$.

preprint2013arXiv

Linear Bandits in High Dimension and Recommendation Systems

A large number of online services provide automated recommendations to help users to navigate through a large collection of items. New items (products, videos, songs, advertisements) are suggested on the basis of the user's past history and --when available-- her demographic profile. Recommendations have to satisfy the dual goal of helping the user to explore the space of available items, while allowing the system to probe the user's preferences. We model this trade-off using linearly parametrized multi-armed bandits, propose a policy and prove upper and lower bounds on the cumulative "reward" that coincide up to constants in the data poor (high-dimensional) regime. Prior work on linear bandits has focused on the data rich (low-dimensional) regime and used cumulative "risk" as the figure of merit. For this data rich regime, we provide a simple modification for our policy that achieves near-optimal risk performance under more restrictive assumptions on the geometry of the problem. We test (a variation of) the scheme used for establishing achievability on the Netflix and MovieLens datasets and obtain good agreement with the qualitative predictions of the theory we develop.

preprint2013arXiv

Model Selection for High-Dimensional Regression under the Generalized Irrepresentability Condition

In the high-dimensional regression model a response variable is linearly related to $p$ covariates, but the sample size $n$ is smaller than $p$. We assume that only a small subset of covariates is `active' (i.e., the corresponding coefficients are non-zero), and consider the model-selection problem of identifying the active covariates. A popular approach is to estimate the regression coefficients through the Lasso ($\ell_1$-regularized least squares). This is known to correctly identify the active set only if the irrelevant covariates are roughly orthogonal to the relevant ones, as quantified through the so-called `irrepresentability' condition. In this paper we study the `Gauss-Lasso' selector, a simple two-stage method that first solves the Lasso, and then performs ordinary least squares restricted to the Lasso active set. We formulate the `generalized irrepresentability condition' (GIC), an assumption that is substantially weaker than irrepresentability. We prove that, under GIC, the Gauss-Lasso correctly recovers the active set.
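
The two-stage procedure described in this abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names (`lasso_cd`, `solve_linear`, `gauss_lasso`), the coordinate-descent Lasso solver, the objective scaling $(1/2n)\|y - Xb\|^2 + λ\|b\|_1$, and the toy orthogonal design are all choices made here for the example.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]  # row-major: X[i][j] is covariate j of sample i


def soft_threshold(z: float, lam: float) -> float:
    """Scalar soft thresholding: shrink z toward zero by lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_cd(X: Matrix, y: Vector, lam: float, sweeps: int = 200) -> Vector:
    """Coordinate descent for (1/(2n)) * ||y - X b||^2 + lam * ||b||_1."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = y[:]  # residual y - X beta, kept up to date
    for _ in range(sweeps):
        for j in range(p):
            col = [X[i][j] for i in range(n)]
            scale = sum(c * c for c in col) / n
            if scale == 0.0:
                continue
            # Correlation of column j with the residual that excludes coordinate j.
            z = sum(c * r for c, r in zip(col, resid)) / n + scale * beta[j]
            new = soft_threshold(z, lam) / scale
            delta = new - beta[j]
            if delta != 0.0:
                resid = [r - c * delta for r, c in zip(resid, col)]
                beta[j] = new
    return beta


def solve_linear(A: Matrix, c: Vector) -> Vector:
    """Gaussian elimination with partial pivoting (assumes A is invertible)."""
    n = len(A)
    M = [row[:] + [ci] for row, ci in zip(A, c)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= f * M[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def gauss_lasso(X: Matrix, y: Vector, lam: float) -> Vector:
    """Stage 1: Lasso. Stage 2: least squares restricted to the Lasso active set."""
    p = len(X[0])
    beta_lasso = lasso_cd(X, y, lam)
    active = [j for j in range(p) if beta_lasso[j] != 0.0]
    beta = [0.0] * p
    if not active:
        return beta
    # Normal equations restricted to the active columns.
    gram = [[sum(row[a] * row[b] for row in X) for b in active] for a in active]
    rhs = [sum(row[a] * yi for row, yi in zip(X, y)) for a in active]
    for j, coef in zip(active, solve_linear(gram, rhs)):
        beta[j] = coef
    return beta


if __name__ == "__main__":
    # Toy orthogonal design (an assumption of this example), noiseless response.
    X = [[1.0, 1.0, 1.0], [1.0, -1.0, 1.0], [1.0, 1.0, -1.0], [1.0, -1.0, -1.0]]
    y = [1.0, 1.0, 3.0, 3.0]  # equals X applied to (2, 0, -1)
    print(lasso_cd(X, y, 0.5), gauss_lasso(X, y, 0.5))
```

On this toy design the Lasso estimate is shrunk toward zero on the active coordinates, while the second stage removes that shrinkage; this is the motivation for refitting by least squares on the selected set.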

preprint2013arXiv

Nearly Optimal Sample Size in Hypothesis Testing for High-Dimensional Regression

We consider the problem of fitting the parameters of a high-dimensional linear regression model. In the regime where the number of parameters $p$ is comparable to or exceeds the sample size $n$, a successful approach uses an $\ell_1$-penalized least squares estimator, known as Lasso. Unfortunately, unlike for linear estimators (e.g., ordinary least squares), no well-established method exists to compute confidence intervals or p-values on the basis of the Lasso estimator. Very recently, a line of work \cite{javanmard2013hypothesis, confidenceJM, GBR-hypothesis} has addressed this problem by constructing a debiased version of the Lasso estimator. In this paper, we study this approach for the random design model, under the assumption that a good estimator exists for the precision matrix of the design. Our analysis improves over the state of the art in that it establishes nearly optimal \emph{average} testing power if the sample size $n$ asymptotically dominates $s_0 (\log p)^2$, with $s_0$ being the sparsity level (number of non-zero coefficients). Earlier work obtains provable guarantees only for much larger sample size, namely it requires $n$ to asymptotically dominate $(s_0 \log p)^2$. In particular, for random designs with a sparse precision matrix we show that an estimator thereof having the required properties can be computed efficiently. Finally, we evaluate this approach on synthetic data and compare it with earlier proposals.

preprint2013arXiv

The Phase Transition of Matrix Recovery from Gaussian Measurements Matches the Minimax MSE of Matrix Denoising

Let $X_0$ be an unknown $M$ by $N$ matrix. In matrix recovery, one takes $n < MN$ linear measurements $y_1,..., y_n$ of $X_0$, where $y_i = \Tr(a_i^T X_0)$ and each $a_i$ is an $M$ by $N$ matrix. For measurement matrices with i.i.d. Gaussian entries, it is known that if $X_0$ is of low rank, it is recoverable from just a few measurements. A popular approach for matrix recovery is Nuclear Norm Minimization (NNM). Empirical work reveals a \emph{phase transition} curve, stated in terms of the undersampling fraction $δ(n,M,N) = n/(MN)$, rank fraction $ρ=r/N$ and aspect ratio $β=M/N$. Specifically, a curve $δ^* = δ^*(ρ;β)$ exists such that, if $δ> δ^*(ρ;β)$, NNM typically succeeds, while if $δ< δ^*(ρ;β)$, it typically fails. An apparently quite different problem is matrix denoising in Gaussian noise, where an unknown $M$ by $N$ matrix $X_0$ is to be estimated based on direct noisy measurements $Y = X_0 + Z$, where the matrix $Z$ has i.i.d. Gaussian entries. It has been empirically observed that, if $X_0$ has low rank, it may be recovered quite accurately from the noisy measurement $Y$. A popular matrix denoising scheme solves the unconstrained optimization problem $\text{min} \| Y - X \|_F^2/2 + λ\|X\|_* $. When optimally tuned, this scheme achieves the asymptotic minimax MSE $\cM(ρ) = \lim_{N \goto \infty} \inf_λ\sup_{\rank(X) \leq ρ\cdot N} MSE(X,\hat{X}_λ)$. We report extensive experiments showing that the phase transition $δ^*(ρ)$ in the first problem coincides with the minimax risk curve $\cM(ρ)$ in the second problem, for {\em any} rank fraction $0 < ρ< 1$.
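
For readers unfamiliar with the denoising scheme mentioned in this abstract, the minimizer of the nuclear-norm penalized problem has a well-known closed form, singular value soft thresholding. The display below is a sketch of that standard fact, not a formula quoted from the paper; the singular value decomposition notation $σ_i, u_i, v_i$ and the symbol $(s)_+$ are introduced here for illustration.

```latex
Y = \sum_{i} \sigma_i\, u_i v_i^{T}
\quad\Longrightarrow\quad
\hat{X}_\lambda
  = \arg\min_{X} \Big\{ \tfrac{1}{2}\,\| Y - X \|_F^2 + \lambda \| X \|_* \Big\}
  = \sum_{i} (\sigma_i - \lambda)_+ \, u_i v_i^{T},
\qquad (s)_+ = \max(s, 0).
```

In words, the estimator keeps the singular vectors of $Y$ and shrinks each singular value by $λ$, setting to zero those below the threshold.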

preprint2012arXiv

Guess Who Rated This Movie: Identifying Users Through Subspace Clustering

It is often the case that, within an online recommender system, multiple users share a common account. Can such shared accounts be identified solely on the basis of the user-provided ratings? Once a shared account is identified, can the different users sharing it be identified as well? Whenever such user identification is feasible, it opens the way to possible improvements in personalized recommendations, but also raises privacy concerns. We develop a model for composite accounts based on unions of linear subspaces, and use subspace clustering for carrying out the identification task. We show that a significant fraction of such accounts is identifiable in a reliable manner, and illustrate potential uses for personalized recommendation.

preprint2012arXiv

Identifying Users From Their Rating Patterns

This paper reports on our analysis of the 2011 CAMRa Challenge dataset (Track 2) for context-aware movie recommendation systems. The training dataset comprises 4,536,891 ratings provided by 171,670 users on 23,974 movies, as well as the household groupings of a subset of the users. The test dataset comprises 5,450 ratings for which the user label is missing, but the household label is provided. The challenge required identifying the user labels for the ratings in the test set. Our main finding is that temporal information (time labels of the ratings) is significantly more useful for achieving this objective than the user preferences (the actual ratings). Using a model that leverages this fact, we are able to identify users within a known household with an accuracy of approximately 96% (i.e., a misclassification rate around 4%).

preprint2012arXiv

Localization from Incomplete Noisy Distance Measurements

We consider the problem of positioning a cloud of points in the Euclidean space $\mathbb{R}^d$, using noisy measurements of a subset of pairwise distances. This task has applications in various areas, such as sensor network localization and reconstruction of protein conformations from NMR measurements. Also, it is closely related to dimensionality reduction problems and manifold learning, where the goal is to learn the underlying global geometry of a data set using local (or partial) metric information. Here we propose a reconstruction algorithm based on semidefinite programming. For a random geometric graph model and uniformly bounded noise, we provide a precise characterization of the algorithm's performance: In the noiseless case, we find a radius $r_0$ beyond which the algorithm reconstructs the exact positions (up to rigid transformations). In the presence of noise, we obtain upper and lower bounds on the reconstruction error that match up to a factor that depends only on the dimension $d$, and the average degree of the nodes in the graph.

preprint2012arXiv

Matrix Completion from Noisy Entries

Given a matrix M of low rank, we consider the problem of reconstructing it from noisy observations of a small, random subset of its entries. The problem arises in a variety of applications, from collaborative filtering (the `Netflix problem') to structure-from-motion and positioning. We study a low complexity algorithm introduced by Keshavan et al. (2009), based on a combination of spectral techniques and manifold optimization, that we call here OptSpace. We prove performance guarantees that are order-optimal in a number of circumstances.

preprint2012arXiv

State Evolution for General Approximate Message Passing Algorithms, with Applications to Spatial Coupling

We consider a class of approximate message passing (AMP) algorithms and characterize their high-dimensional behavior in terms of a suitable state evolution recursion. Our proof applies to Gaussian matrices with independent but not necessarily identically distributed entries. It covers --in particular-- the analysis of generalized AMP, introduced by Rangan, and of AMP reconstruction in compressed sensing with spatially coupled sensing matrices. The proof technique builds on that of [BM11], while simplifying and generalizing several steps.

preprint2012arXiv

Subsampling at Information Theoretically Optimal Rates

We study the problem of sampling a random signal with sparse support in frequency domain. Shannon famously considered a scheme that instantaneously samples the signal at equispaced times. He proved that the signal can be reconstructed as long as the sampling rate exceeds twice the bandwidth (Nyquist rate). Candès, Romberg, Tao introduced a scheme that acquires instantaneous samples of the signal at random times. They proved that the signal can be uniquely and efficiently reconstructed, provided the sampling rate exceeds the frequency support of the signal, times logarithmic factors. In this paper we consider a probabilistic model for the signal, and a sampling scheme inspired by the idea of spatial coupling in coding theory. Namely, we propose to acquire non-instantaneous samples at random times. Mathematically, this is implemented by acquiring a small random subset of Gabor coefficients. We show empirically that this scheme achieves correct reconstruction as soon as the sampling rate exceeds the frequency support of the signal, thus reaching the information theoretic limit.

preprint2012arXiv

The replica symmetric solution for Potts models on d-regular graphs

We provide an explicit formula for the limiting free energy density (log-partition function divided by the number of vertices) for ferromagnetic Potts models on uniformly sparse graph sequences converging locally to the d-regular tree for d even, covering all temperature regimes. This formula coincides with the Bethe free energy functional evaluated at a suitable fixed point of the belief propagation recursion on the d-regular tree, the so-called replica symmetric solution. For uniformly random d-regular graphs we further show that the replica symmetric Bethe formula is an upper bound for the asymptotic free energy for any model with permissive interactions.

preprint2011arXiv

Bargaining dynamics in exchange networks

We consider a one-sided assignment market or exchange network with transferable utility and propose a model for the dynamics of bargaining in such a market. Our dynamical model is local, involving iterative updates of 'offers' based on estimated best alternative matches, in the spirit of pairwise Nash bargaining. We establish that when a balanced outcome (a generalization of the pairwise Nash bargaining solution to networks) exists, our dynamics converges rapidly to such an outcome. We extend our results to the cases of (i) general agent 'capacity constraints', i.e., an agent may be allowed to participate in multiple matches, and (ii) 'unequal bargaining powers' (where we also find a surprising change in rate of convergence).

preprint2011arXiv

Compressed Sensing over $\ell_p$-balls: Minimax Mean Square Error

We consider the compressed sensing problem, where the object $x_0 \in \bR^N$ is to be recovered from incomplete measurements $y = Ax_0 + z$; here the sensing matrix $A$ is an $n \times N$ random matrix with iid Gaussian entries and $n < N$. A popular method of sparsity-promoting reconstruction is $\ell^1$-penalized least-squares reconstruction (aka LASSO, Basis Pursuit). It is currently popular to consider the strict sparsity model, where the object $x_0$ is nonzero in only a small fraction of entries. In this paper, we instead consider the much more broadly applicable $\ell_p$-sparsity model, where $x_0$ is sparse in the sense of having $\ell_p$ norm bounded by $ξ\cdot N^{1/p}$ for some fixed $0 < p \leq 1$ and $ξ> 0$. We study an asymptotic regime in which $n$ and $N$ both tend to infinity with limiting ratio $n/N = δ\in (0,1)$, both in the noisy ($z \neq 0$) and noiseless ($z=0$) cases. Under weak assumptions on $x_0$, we are able to precisely evaluate the worst-case asymptotic minimax mean-squared reconstruction error (AMSE) for $\ell^1$ penalized least-squares: min over penalization parameters, max over $\ell_p$-sparse objects $x_0$. We exhibit the asymptotically least-favorable object (hardest sparse signal to recover) and the maximin penalization. Our explicit formulas unexpectedly involve quantities appearing classically in statistical decision theory. Occurring in the present setting, they reflect a deeper connection between penalized $\ell^1$ minimization and scalar soft thresholding. This connection, which follows from earlier work of the authors and collaborators on the AMP iterative thresholding algorithm, is carefully explained. Our approach also gives precise results under weak-$\ell_p$ ball coefficient constraints, as we show here.
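
The connection between $\ell^1$-penalized least squares and scalar soft thresholding mentioned in this abstract is easiest to see in one dimension. The following Python sketch is an illustration only; the function names and the brute-force grid check are assumptions of this example, not part of the paper. It shows that the soft thresholding function minimizes the scalar objective $\frac{1}{2}(y-x)^2 + λ|x|$.

```python
def soft_threshold(y: float, lam: float) -> float:
    """Closed-form minimizer over x of 0.5 * (y - x)**2 + lam * abs(x)."""
    if y > lam:
        return y - lam
    if y < -lam:
        return y + lam
    return 0.0


def scalar_objective(x: float, y: float, lam: float) -> float:
    """One-dimensional l1-penalized least-squares objective."""
    return 0.5 * (y - x) ** 2 + lam * abs(x)


def grid_minimizer(y: float, lam: float, half_width: float = 3.0, step: float = 1e-3) -> float:
    """Brute-force minimizer on a uniform grid, used only as a sanity check."""
    n = int(round(half_width / step))
    grid = [i * step for i in range(-n, n + 1)]
    return min(grid, key=lambda x: scalar_objective(x, y, lam))


if __name__ == "__main__":
    for y in (-2.0, -0.3, 0.0, 0.3, 1.5):
        print(y, soft_threshold(y, 0.4), grid_minimizer(y, 0.4))
```

Observations with magnitude below the threshold are mapped exactly to zero, which is the mechanism by which the penalty promotes sparsity.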

preprint2011arXiv

Distributed Storage for Intermittent Energy Sources: Control Design and Performance Limits

One of the most important challenges in the integration of renewable energy sources into the power grid lies in their `intermittent' nature. The power output of sources like wind and solar varies with time and location due to factors that cannot be controlled by the provider. Two strategies have been proposed to hedge against this variability: 1) use energy storage systems to effectively average the produced power over time; 2) exploit distributed generation to effectively average production over location. We introduce a network model to study the optimal use of storage and transmission resources in the presence of random energy sources. We propose a Linear-Quadratic based methodology to design control strategies, and we show that these strategies are asymptotically optimal for some simple network topologies. For these topologies, the dependence of optimal performance on storage and transmission capacity is explicitly quantified.

preprint2011arXiv

Gossip PCA

Eigenvectors of data matrices play an important role in many computational problems, ranging from signal processing to machine learning and control. For instance, algorithms that compute positions of the nodes of a wireless network on the basis of pairwise distance measurements require a few leading eigenvectors of the distance matrix. While eigenvector calculation is a standard topic in numerical linear algebra, it becomes challenging under severe communication or computation constraints, or in the absence of central scheduling. In this paper we investigate the possibility of computing the leading eigenvectors of a large data matrix through gossip algorithms. The proposed algorithm amounts to iteratively multiplying a vector by independent random sparsifications of the original matrix and averaging the resulting normalized vectors. This can be viewed as a generalization of gossip algorithms for consensus, but the resulting dynamics is significantly more intricate. Our analysis is based on controlling the convergence to stationarity of the associated Kesten-Furstenberg Markov chain.
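
A rough Python sketch of the kind of iteration described in this abstract follows. It is one possible reading, not the authors' algorithm: the sparsification rule (keep each entry independently with probability `p` and rescale by `1/p`), the time average of normalized iterates, the deterministic starting vector, and all function names are assumptions made for illustration. The sketch also assumes a matrix with nonnegative entries so that iterates do not flip sign.

```python
import math
import random
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def sparsify(M: Matrix, p: float, rng: random.Random) -> Matrix:
    """Keep each entry independently with probability p, rescaled so the mean is M."""
    return [[(m / p if rng.random() < p else 0.0) for m in row] for row in M]


def matvec(M: Matrix, x: Vector) -> Vector:
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def normalize(x: Vector) -> Vector:
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x]


def gossip_pca(M: Matrix, p: float, steps: int, rng: random.Random) -> Vector:
    """Average of normalized iterates x <- S_t x, with S_t a fresh sparsification of M."""
    n = len(M)
    x = normalize([1.0 + i for i in range(n)])  # arbitrary positive start
    total = [0.0] * n
    count = 0
    for _ in range(steps):
        y = matvec(sparsify(M, p, rng), x)
        if all(v == 0.0 for v in y):
            continue  # degenerate draw: every needed entry was dropped
        x = normalize(y)
        total = [t + v for t, v in zip(total, x)]
        count += 1
    return normalize(total) if count > 0 else x


if __name__ == "__main__":
    M = [[2.0, 1.0], [1.0, 2.0]]  # leading eigenvector is proportional to (1, 1)
    print(gossip_pca(M, 1.0, 50, random.Random(0)))  # p = 1: plain power iteration
    print(gossip_pca(M, 0.7, 2000, random.Random(0)))  # randomly sparsified version
```

With `p = 1` the sketch reduces to ordinary power iteration, which gives a deterministic baseline against which the sparsified runs can be compared.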

preprint2011arXiv

Graphical Models Concepts in Compressed Sensing

This paper surveys recent work in applying ideas from graphical models and message passing algorithms to solve large scale regularized regression problems. In particular, the focus is on compressed sensing reconstruction via $\ell_1$-penalized least-squares (known as LASSO or BPDN). We discuss how to derive fast approximate message passing algorithms to solve this problem. Surprisingly, the analysis of such algorithms allows one to prove exact high-dimensional limit results for the LASSO risk. This paper will appear as a chapter in a book on `Compressed Sensing' edited by Yonina Eldar and Gitta Kutyniok.

preprint2011arXiv

Information Theoretic Limits on Learning Stochastic Differential Equations

Consider the problem of learning the drift coefficient of a stochastic differential equation from a sample path. In this paper, we assume that the drift is parametrized by a high dimensional vector. We address the question of how long the system needs to be observed in order to learn this vector of parameters. We prove a general lower bound on this time complexity by using a characterization of mutual information as time integral of conditional variance, due to Kadota, Zakai, and Ziv. This general lower bound is applied to specific classes of linear and non-linear stochastic differential equations. In the linear case, the problem under consideration is the one of learning a matrix of interaction coefficients. We evaluate our lower bound for ensembles of sparse and dense random matrices. The resulting estimates match the qualitative behavior of upper bounds achieved by computationally efficient procedures.

preprint2011arXiv

Majority dynamics on trees and the dynamic cavity method

A voter sits on each vertex of an infinite tree of degree $k$, and has to decide between two alternative opinions. At each time step, each voter switches to the opinion of the majority of her neighbors. We analyze this majority process when opinions are initialized to independent and identically distributed random variables. In particular, we bound the threshold value of the initial bias such that the process converges to consensus. In order to prove an upper bound, we characterize the process of a single node in the large $k$-limit. This approach is inspired by the theory of mean field spin-glass and can potentially be generalized to a wider class of models. We also derive a lower bound that is nontrivial for small, odd values of $k$.
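
The update rule described in this abstract can be simulated on a finite graph. The Python sketch below is an illustration under assumptions that are not in the paper: a finite adjacency list stands in for the infinite tree, opinions are encoded as +1 and -1, and a voter keeps her current opinion when her neighbors are tied (ties cannot occur when every degree is odd).

```python
from typing import Dict, List

Graph = Dict[int, List[int]]
Opinions = Dict[int, int]  # each opinion is +1 or -1


def majority_step(adj: Graph, opinions: Opinions) -> Opinions:
    """Synchronous update: every voter adopts the majority opinion of her neighbors."""
    updated = {}
    for v, neighbors in adj.items():
        total = sum(opinions[u] for u in neighbors)
        if total == 0:
            updated[v] = opinions[v]  # tie: keep current opinion (assumption)
        else:
            updated[v] = 1 if total > 0 else -1
    return updated


def run_majority(adj: Graph, opinions: Opinions, steps: int) -> Opinions:
    """Apply the majority update for a fixed number of time steps."""
    for _ in range(steps):
        opinions = majority_step(adj, opinions)
    return opinions


if __name__ == "__main__":
    # Complete graph on 4 vertices: every vertex has odd degree 3.
    k4 = {v: [u for u in range(4) if u != v] for v in range(4)}
    print(run_majority(k4, {0: 1, 1: 1, 2: 1, 3: -1}, 1))
```

Because updates are synchronous, the process need not converge on every graph: on a single edge with opposite initial opinions the two voters swap opinions at every step.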

preprint2011arXiv

On the trade-off between complexity and correlation decay in structural learning algorithms

We consider the problem of learning the structure of Ising models (pairwise binary Markov random fields) from i.i.d. samples. While several methods have been proposed to accomplish this task, their relative merits and limitations remain somewhat obscure. By analyzing a number of concrete examples, we show that low-complexity algorithms often fail when the Markov random field develops long-range correlations. More precisely, this phenomenon appears to be related to the Ising model phase transition (although it does not coincide with it).

preprint2011arXiv

Optimal coding for the deletion channel with small deletion probability

The deletion channel is the simplest point-to-point communication channel that models lack of synchronization. Input bits are deleted independently with probability d, and when they are not deleted, they are not affected by the channel. Despite significant effort, little is known about the capacity of this channel, and even less about optimal coding schemes. In this paper we develop a new systematic approach to this problem, by demonstrating that capacity can be computed in a series expansion for small deletion probability. We compute three leading terms of this expansion, and find an input distribution that achieves capacity up to this order. This constitutes the first optimal coding result for the deletion channel. The key idea employed is the following: We understand perfectly the deletion channel with deletion probability d=0. It has capacity 1 and the optimal input distribution is i.i.d. Bernoulli(1/2). It is natural to expect that the channel with small deletion probabilities has a capacity that varies smoothly with d, and that the optimal input distribution is obtained by smoothly perturbing the i.i.d. Bernoulli(1/2) process. Our results show that this is indeed the case. We think that this general strategy can be useful in a number of capacity calculations.
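
The channel model described in this abstract is simple to simulate. The Python sketch below is only an illustration; the function names and the subsequence check are assumptions of this example. Each input bit is deleted independently with probability `d`, and surviving bits pass through unchanged, so the output is always a subsequence of the input and the receiver does not learn which positions were deleted.

```python
import random
from typing import List, Optional


def deletion_channel(bits: List[int], d: float, rng: Optional[random.Random] = None) -> List[int]:
    """Delete each bit independently with probability d; surviving bits are unchanged."""
    rng = rng or random.Random()
    # rng.random() lies in [0, 1), so d = 0 keeps everything and d = 1 deletes everything.
    return [b for b in bits if rng.random() >= d]


def is_subsequence(short: List[int], long: List[int]) -> bool:
    """True if `short` can be obtained from `long` by deleting elements."""
    it = iter(long)
    return all(any(x == y for y in it) for x in short)


if __name__ == "__main__":
    rng = random.Random(0)
    # Input drawn i.i.d. Bernoulli(1/2), the optimal input distribution at d = 0.
    x = [rng.randint(0, 1) for _ in range(20)]
    y = deletion_channel(x, 0.1, rng)
    print(len(x), len(y), is_subsequence(y, x))
```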

preprint2011arXiv

Robust Max-Product Belief Propagation

We study the problem of optimizing a graph-structured objective function under \emph{adversarial} uncertainty. This problem can be modeled as a two-person zero-sum game between an Engineer and Nature. The Engineer controls a subset of the variables (nodes in the graph), and tries to assign their values to maximize an objective function. Nature controls the complementary subset of variables and tries to minimize the same objective. This setting encompasses estimation and optimization problems under model uncertainty, and strategic problems with a graph structure. Von Neumann's minimax theorem guarantees the existence of a (minimax) pair of randomized strategies that provide optimal robustness for each player against its adversary. We prove several structural properties of this strategy pair in the case of a graph-structured payoff function. In particular, the randomized minimax strategies (distributions over variable assignments) can be chosen in such a way as to satisfy the Markov property with respect to the graph. This significantly reduces the problem dimensionality. Finally we introduce a message passing algorithm to solve this minimax problem. The algorithm generalizes max-product belief propagation to this new domain.

preprint2011arXiv

Subexponential convergence for information aggregation on regular trees

We consider the decentralized binary hypothesis testing problem on trees of bounded degree and increasing depth. For a regular tree of depth t and branching factor k>=2, we assume that the leaves have access to independent and identically distributed noisy observations of the 'state of the world' s. Starting with the leaves, each node makes a decision in a finite alphabet M, that it sends to its parent in the tree. Finally, the root decides between the two possible states of the world based on the information it receives. We prove that the error probability vanishes only subexponentially in the number of available observations, under quite general hypotheses. More precisely, in the case of binary messages, decay is subexponential for any decision rule. For a general (finite) message alphabet M, decay is subexponential for 'node-oblivious' decision rules that satisfy a mild irreducibility condition. In the latter case, we propose a family of decision rules with close-to-optimal asymptotic behavior.
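
The subexponential decay stated in this abstract can be observed numerically for one specific rule. The Python sketch below assumes binary messages, an odd branching factor `k`, and simple majority voting at every node; this rule and the function names are choices made here for illustration, whereas the paper's result covers any decision rule in the binary case. If each leaf errs independently with probability `p0`, the error at depth `t` follows a one-dimensional recursion, and the exponent per observation, `-log(error) / k**t`, shrinks as the depth grows.

```python
from math import comb, log


def majority_error(p: float, k: int) -> float:
    """Probability that a majority of k independent children err, each w.p. p (k odd)."""
    return sum(comb(k, j) * p**j * (1 - p) ** (k - j) for j in range(k // 2 + 1, k + 1))


def root_error(p0: float, k: int, depth: int) -> float:
    """Error probability at the root of a k-ary tree of the given depth."""
    p = p0
    for _ in range(depth):
        p = majority_error(p, k)
    return p


def exponent_per_observation(p0: float, k: int, depth: int) -> float:
    """-log(root error) divided by the number of leaf observations k**depth."""
    return -log(root_error(p0, k, depth)) / k**depth


if __name__ == "__main__":
    # Keep the depth moderate: the error underflows to 0.0 for deep trees.
    for t in range(7):
        print(t, root_error(0.1, 3, t), exponent_per_observation(0.1, 3, t))
```

A strictly exponential decay in the number of observations would keep the last column bounded away from zero; under majority voting it instead decreases with the depth.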

preprint2011arXiv

The dynamics of message passing on dense graphs, with applications to compressed sensing

Approximate message passing algorithms proved to be extremely effective in reconstructing sparse signals from a small number of incoherent linear measurements. Extensive numerical experiments further showed that their dynamics is accurately tracked by a simple one-dimensional iteration termed state evolution. In this paper we provide the first rigorous foundation to state evolution. We prove that indeed it holds asymptotically in the large system limit for sensing matrices with independent and identically distributed Gaussian entries. While our focus is on message passing algorithms for compressed sensing, the analysis extends beyond this setting, to a general class of algorithms on dense graphs. In this context, state evolution plays the role that density evolution has for sparse graphs. The proof technique is fundamentally different from the standard approach to density evolution, in that it copes with a large number of short loops in the underlying factor graph. It relies instead on a conditioning technique recently developed by Erwin Bolthausen in the context of spin glass theory.
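
For orientation, the one-dimensional iteration referred to in this abstract is usually written as follows in the compressed sensing setting. The notation is assumed here rather than taken from the abstract: $δ$ is the ratio of measurements to signal dimension, $σ^2$ the noise variance, $η(\,\cdot\,;θ)$ the soft thresholding function with threshold $θ_t$, and $X_0$ a random variable distributed according to the empirical distribution of the signal entries.

```latex
\tau_{t+1}^2 \;=\; \sigma^2 + \frac{1}{\delta}\,
\mathbb{E}\Big\{ \big[ \eta(X_0 + \tau_t Z;\, \theta_t) - X_0 \big]^2 \Big\},
\qquad Z \sim \mathsf{N}(0,1) \ \text{independent of } X_0 .
```

Here $τ_t^2$ plays the role of an effective noise variance at iteration $t$: the AMP estimate behaves as if each signal entry were observed in Gaussian noise of that variance.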

preprint2010arXiv

A Natural Dynamics for Bargaining on Exchange Networks

Bargaining networks model the behavior of a set of players that need to reach pairwise agreements for making profits. Nash bargaining solutions are special outcomes of such games that are both stable and balanced. Kleinberg and Tardos proved a sharp algorithmic characterization of such outcomes, but left open the problem of how the actual bargaining process converges to them. A partial answer was provided by Azar et al. who proposed a distributed algorithm for constructing Nash bargaining solutions, but without polynomial bounds on its convergence rate. In this paper, we introduce a simple and natural model for this process, and study its convergence rate to Nash bargaining solutions. At each time step, each player proposes a deal to each of her neighbors. The proposal consists of a share of the potential profit in case of agreement. The share is chosen to be balanced in Nash's sense as far as this is feasible (with respect to the current best alternatives for both players). We prove that, whenever the Nash bargaining solution is unique (and satisfies a positive gap condition) this dynamics converges to it in polynomial time. Our analysis is based on an approximate decoupling phenomenon between the dynamics on different substructures of the network. This approach may be of general interest for the analysis of local algorithms on networks.

preprint2010arXiv

Applications of Lindeberg Principle in Communications and Statistical Learning

We use a generalization of the Lindeberg principle developed by Sourav Chatterjee to prove universality properties for various problems in communications, statistical learning and random matrix theory. We also show that these systems can be viewed as the limiting case of a properly defined sparse system. The latter result is useful when the sparse systems are easier to analyze than their dense counterparts. The list of problems we consider is by no means exhaustive. We believe that the ideas can be used in many other problems relevant for information theory.

preprint2010arXiv

Ising models on locally tree-like graphs

We consider ferromagnetic Ising models on graphs that converge locally to trees. Examples include random regular graphs with bounded degree and uniformly random graphs with bounded average degree. We prove that the "cavity" prediction for the limiting free energy per spin is correct for any positive temperature and external field. Further, local marginals can be approximated by iterating a set of mean field (cavity) equations. Both results are achieved by proving the local convergence of the Boltzmann distribution on the original graph to the Boltzmann distribution on the appropriate infinite random tree.
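
The cavity (mean field) equations mentioned in this abstract take a particularly simple form on a regular tree. The Python sketch below is an illustration under assumptions made here, not code from the paper: every vertex has degree `k`, the interaction is ferromagnetic with inverse temperature `beta`, the external field `B` is positive, and `beta` is moderate so that `tanh(beta)` is strictly below 1 in floating point. The cavity field `h` is iterated to a fixed point, from which the magnetization of a vertex follows.

```python
from math import atanh, tanh


def cavity_field(beta: float, B: float, k: int, max_iter: int = 10000, tol: float = 1e-13) -> float:
    """Iterate h <- B + (k - 1) * atanh(tanh(beta) * tanh(h)) to a fixed point."""
    t = tanh(beta)
    h = B  # start from the external field alone
    for _ in range(max_iter):
        h_new = B + (k - 1) * atanh(t * tanh(h))
        if abs(h_new - h) < tol:
            return h_new
        h = h_new
    return h


def magnetization(beta: float, B: float, k: int) -> float:
    """Mean spin at a vertex: all k neighbors contribute their cavity message."""
    h = cavity_field(beta, B, k)
    return tanh(B + k * atanh(tanh(beta) * tanh(h)))


if __name__ == "__main__":
    for beta in (0.0, 0.3, 0.6, 0.9):
        print(beta, cavity_field(beta, 0.1, 3), magnetization(beta, 0.1, 3))
```

At `beta = 0` the spins decouple and the magnetization reduces to `tanh(B)`, which provides a simple check of the recursion.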

preprint2010arXiv

Learning Networks of Stochastic Differential Equations

We consider linear models for stochastic dynamics. To any such model can be associated a network (namely a directed graph) describing which degrees of freedom interact under the dynamics. We tackle the problem of learning such a network from observation of the system trajectory over a time interval $T$. We analyze the $\ell_1$-regularized least squares algorithm and, in the setting in which the underlying network is sparse, we prove performance guarantees that are \emph{uniform in the sampling rate} as long as this is sufficiently high. This result substantiates the notion of a well defined `time complexity' for the network inference problem.

preprint2010arXiv

Lossy compression of discrete sources via Viterbi algorithm

We present a new lossy compressor for discrete-valued sources. For coding a sequence $x^n$, the encoder starts by assigning a certain cost to each possible reconstruction sequence. It then finds the one that minimizes this cost and describes it losslessly to the decoder via a universal lossless compressor. The cost of each sequence is a linear combination of its distance from the sequence $x^n$ and a linear function of its $k^{\rm th}$ order empirical distribution. The structure of the cost function allows the encoder to employ the Viterbi algorithm to recover the minimizer of the cost. We identify a choice of the coefficients comprising the linear function of the empirical distribution used in the cost function which ensures that the algorithm universally achieves the optimum rate-distortion performance of any stationary ergodic source in the limit of large $n$, provided that $k$ diverges as $o(\log n)$. Iterative techniques for approximating the coefficients, which alleviate the computational burden of finding the optimal coefficients, are proposed and studied.

preprint2010arXiv

On the concentration of the number of solutions of random satisfiability formulas

Let $Z(F)$ be the number of solutions of a random $k$-satisfiability formula $F$ with $n$ variables and clause density $α$. Assume that the probability that $F$ is unsatisfiable is $O(1/\log(n)^{1+ε})$ for some $ε>0$. We show that (possibly excluding a countable set of 'exceptional' $α$'s) the number of solutions concentrates on the logarithmic scale, i.e., there exists a non-random function $ϕ(α)$ such that, for any $δ>0$, $(1/n)\log Z(F)\in [ϕ-δ,ϕ+δ]$ with high probability. In particular, the assumption holds for all $α<1$, which proves the above concentration claim in the whole satisfiability regime of random $2$-SAT. We also extend these results to a broad class of constraint satisfaction problems. The proof is based on an interpolation technique from spin-glass theory, and on an application of Friedgut's theorem on sharp thresholds for graph properties.
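To make the quantity $(1/n)\log Z(F)$ concrete, the sketch below samples a random $k$-SAT formula and counts its solutions by brute force. It is only an illustration for very small $n$. The clause model (distinct variables per clause, each negated with probability 1/2) and the function names are assumptions, not taken from the paper.

```python
import itertools
import math
import random


def random_ksat(n, k, alpha, rng):
    """Sample int(alpha * n) clauses; each clause is a list of
    (variable, required_value) pairs over k distinct variables."""
    formula = []
    for _ in range(int(alpha * n)):
        variables = rng.sample(range(n), k)
        formula.append([(v, rng.random() < 0.5) for v in variables])
    return formula


def count_solutions(formula, n):
    """Z(F): number of assignments satisfying every clause (exhaustive, 2^n work)."""
    count = 0
    for assignment in itertools.product([False, True], repeat=n):
        if all(any(assignment[v] == val for v, val in clause) for clause in formula):
            count += 1
    return count


def log_solutions_per_variable(formula, n):
    """(1/n) log Z(F), or None when the formula is unsatisfiable."""
    z = count_solutions(formula, n)
    return math.log(z) / n if z > 0 else None
```

Repeating `log_solutions_per_variable(random_ksat(n, 2, 0.5, random.Random(seed)), n)` over several seeds gives a rough feel for how little the value fluctuates, although the theorem itself concerns the limit of large $n$.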

preprint2010arXiv

On the deletion channel with small deletion probability

The deletion channel is the simplest point-to-point communication channel that models lack of synchronization. Despite significant effort, little is known about its capacity, and even less about optimal coding schemes. In this paper we initiate a new systematic approach to this problem, by demonstrating that capacity can be computed in a series expansion for small deletion probability. We compute two leading terms of this expansion, and show that capacity is achieved, up to this order, by i.i.d. uniform random distribution of the input. We think that this strategy can be useful in a number of capacity calculations.

preprint2010arXiv

Regularization for Matrix Completion

We consider the problem of reconstructing a low rank matrix from noisy observations of a subset of its entries. This task has applications in statistical learning, computer vision, and signal processing. In these contexts, "noise" generically refers to any contribution to the data that is not captured by the low-rank model. In most applications, the noise level is large compared to the underlying signal and it is important to avoid overfitting. In order to tackle this problem, we define a regularized cost function well suited for spectral reconstruction methods. Within a random noise model, and in the large system limit, we prove that the resulting accuracy undergoes a phase transition depending on the noise level and on the fraction of observed entries. The cost function can be minimized using OPTSPACE (a manifold gradient descent algorithm). Numerical simulations show that this approach is competitive with state-of-the-art alternatives.

preprint2010arXiv

The Noise-Sensitivity Phase Transition in Compressed Sensing

Consider the noisy underdetermined system of linear equations: y = A x0 + z0, with n × N measurement matrix A, n < N, and Gaussian white noise z0 ~ N(0, σ^2 I). Both y and A are known, both x0 and z0 are unknown, and we seek an approximation to x0. When x0 has few nonzeros, useful approximations are obtained by l1-penalized l2 minimization, in which the reconstruction \hat{x}^{1,λ} solves min ||y - Ax||_2^2/2 + λ||x||_1. Evaluate performance by mean-squared error (MSE = E ||\hat{x}^{1,λ} - x0||_2^2/N). Consider matrices A with iid Gaussian entries and a large-system limit in which n, N → ∞ with n/N → δ and k/n → ρ. Call the ratio MSE/σ^2 the noise sensitivity. We develop formal expressions for the MSE of \hat{x}^{1,λ}, and evaluate its worst-case formal noise sensitivity over all types of k-sparse signals. The phase space 0 < δ, ρ < 1 is partitioned by the curve ρ = ρ_MSE(δ) into two regions. Formal noise sensitivity is bounded throughout the region ρ < ρ_MSE(δ) and is unbounded throughout the region ρ > ρ_MSE(δ). The phase boundary ρ = ρ_MSE(δ) is identical to the previously-known phase transition curve for equivalence of l1 - l0 minimization in the k-sparse noiseless case. Hence a single phase boundary describes the fundamental phase transitions both for the noiseless and noisy cases. Extensive computational experiments validate the predictions of this formalism, including the existence of game theoretical structures underlying it. Underlying our formalism is the AMP algorithm introduced earlier by the authors. Other papers by the authors detail expressions for the formal MSE of AMP and its close connection to l1-penalized reconstruction. Here we derive the minimax formal MSE of AMP and then read out results for l1-penalized reconstruction.

preprint2010arXiv

Tight Thresholds for Cuckoo Hashing via XORSAT

We settle the question of tight thresholds for offline cuckoo hashing. The problem can be stated as follows: we have n keys to be hashed into m buckets each capable of holding a single key. Each key has k >= 3 (distinct) associated buckets chosen uniformly at random and independently of the choices of other keys. A hash table can be constructed successfully if each key can be placed into one of its buckets. We seek thresholds alpha_k such that, as n goes to infinity, if n/m <= alpha for some alpha < alpha_k then a hash table can be constructed successfully with high probability, and if n/m >= alpha for some alpha > alpha_k a hash table cannot be constructed successfully with high probability. Here we are considering the offline version of the problem, where all keys and hash values are given, so the problem is equivalent to previous models of multiple-choice hashing. We find the thresholds for all values of k > 2 by showing that they are in fact the same as the previously known thresholds for the random k-XORSAT problem. We then extend these results to the setting where keys can have differing number of choices, and provide evidence in the form of an algorithm for a conjecture extending this result to cuckoo hash tables that store multiple keys in a bucket.
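In the offline setting described above, deciding whether a table can be built is a bipartite matching question: every key must be matched to a distinct bucket among its choices. The sketch below uses a standard augmenting-path matching to check one instance. It illustrates the problem statement only, not the threshold analysis, and the function name is hypothetical.

```python
def offline_cuckoo_placement(choices, m):
    """choices[i] lists the candidate buckets of key i; each of the m buckets
    holds one key. Returns {key: bucket} if every key can be placed, else None."""
    owner = [None] * m  # owner[b] is the key currently stored in bucket b

    def try_place(key, visited):
        # Try each candidate bucket; evict and re-place the current owner if needed.
        for b in choices[key]:
            if b in visited:
                continue
            visited.add(b)
            if owner[b] is None or try_place(owner[b], visited):
                owner[b] = key
                return True
        return False

    for key in range(len(choices)):
        if not try_place(key, set()):
            return None
    return {owner[b]: b for b in range(m) if owner[b] is not None}
```

For example, `offline_cuckoo_placement([[0, 1, 2]] * 4, 3)` returns `None`, since four keys cannot fit into three single-key buckets. Drawing the choices uniformly at random and varying n/m gives an empirical view of the threshold behavior the abstract describes.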

preprint2009arXiv

Message Passing Algorithms for Compressed Sensing

Compressed sensing aims to undersample certain high-dimensional signals, yet accurately reconstruct them by exploiting signal characteristics. Accurate reconstruction is possible when the object to be recovered is sufficiently sparse in a known basis. Currently, the best known sparsity-undersampling tradeoff is achieved when reconstructing by convex optimization -- which is expensive in important large-scale applications. Fast iterative thresholding algorithms have been intensively studied as alternatives to convex optimization for large-scale problems. Unfortunately known fast algorithms offer substantially worse sparsity-undersampling tradeoffs than convex optimization. We introduce a simple costless modification to iterative thresholding making the sparsity-undersampling tradeoff of the new algorithms equivalent to that of the corresponding convex optimization procedures. The new iterative-thresholding algorithms are inspired by belief propagation in graphical models. Our empirical measurements of the sparsity-undersampling tradeoff for the new algorithms agree with theoretical calculations. We show that a state evolution formalism correctly derives the true sparsity-undersampling tradeoff. There is a surprising agreement between earlier calculations based on random convex polytopes and this new, apparently very different theoretical formalism.
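The modification to iterative thresholding mentioned above is commonly written as a correction term added to the residual. The following is a minimal pure-Python sketch under simplifying assumptions: a fixed threshold `theta` rather than an iteration-dependent threshold policy, and dense lists instead of an optimized linear algebra library. The function names are hypothetical.

```python
def soft_threshold(v, theta):
    """Shrink v toward zero by theta; values within [-theta, theta] become 0."""
    if v > theta:
        return v - theta
    if v < -theta:
        return v + theta
    return 0.0


def amp(A, y, theta, iterations=20):
    """Iterative soft thresholding with the extra residual correction term.

    A is an n x N matrix given as a list of rows, y has length n.
    Assumption: theta is held fixed across iterations for simplicity.
    """
    n, N = len(A), len(A[0])
    x = [0.0] * N
    z = list(y)
    for _ in range(iterations):
        # Pseudo-data: current estimate plus back-projected residual.
        r = [x[j] + sum(A[i][j] * z[i] for i in range(n)) for j in range(N)]
        x = [soft_threshold(v, theta) for v in r]
        # Correction weight: (1/delta) times the average derivative of the
        # threshold function, which equals (number of nonzeros) / n.
        weight = sum(1 for v in x if v != 0.0) / n
        residual = [y[i] - sum(A[i][j] * x[j] for j in range(N)) for i in range(n)]
        z = [residual[i] + weight * z[i] for i in range(n)]
    return x
```

Setting `weight` to zero recovers plain iterative soft thresholding, which makes it easy to compare the two variants on the same instance.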

preprint2005arXiv

Tight bounds for LDPC and LDGM codes under MAP decoding

A new method for analyzing low density parity check (LDPC) codes and low density generator matrix (LDGM) codes under bit maximum a posteriori probability (MAP) decoding is introduced. The method is based on a rigorous approach to spin glasses developed by Francesco Guerra. It allows one to construct lower bounds on the entropy of the transmitted message conditional on the received one. Based on heuristic statistical mechanics calculations, we conjecture such bounds to be tight. The result holds for standard irregular ensembles when used over binary input output symmetric channels. The method is first developed for Tanner graph ensembles with Poisson left degree distribution. It is then generalized to 'multi-Poisson' graphs, and, by a completion procedure, to arbitrary degree distribution.

preprint2002arXiv

The Dynamic Phase Transition for Decoding Algorithms

The state-of-the-art error correcting codes are based on large random constructions (random graphs, random permutations, ...) and are decoded by linear-time iterative algorithms. Because of these features, they are remarkable examples of diluted mean-field spin glasses, both from the static and from the dynamic points of view. We analyze the behavior of decoding algorithms using the mapping onto statistical-physics models. This allows one to understand the intrinsic (i.e. algorithm independent) features of this behavior.

preprint2001arXiv

Spin models on Platonic solids and asymptotic freedom

We consider a two-dimensional sigma-model with discrete icosahedral/dodecahedral symmetry. We present high-precision finite-size numerical results that show that the continuum limit of this model is different from the continuum limit of the rotationally invariant O(3) sigma-model.

preprint2000arXiv

Operator Product Expansion on the Lattice: a Numerical Test in the Two-Dimensional Non-Linear Sigma-Model

We consider the short-distance behaviour of the product of the Noether O(N) currents in the lattice nonlinear sigma-model. We compare the numerical results with the predictions of the operator product expansion, using one-loop perturbative renormalization-group improved Wilson coefficients. We find that, even on quite small lattices (m a \approx 1/6), the perturbative operator product expansion describes the data with an error of 5-10% in a large window 2a \lesssim x \lesssim m^{-1}. We present a detailed discussion of the possible systematic errors.

preprint1999arXiv

Composite operators from the operator product expansion: what can go wrong?

The operator product expansion is used to compute the matrix elements of composite renormalized operators on the lattice. We study the product of two fundamental fields in the two-dimensional sigma-model and discuss the possible sources of systematic errors. The key problem turns out to be the violation of asymptotic scaling.

87 published item(s)

Fundamental Barriers to High-Dimensional Regression with Convex Penalties

Optimization of random high-dimensional functions: Structure and algorithms

Statistically Optimal First Order Algorithms: A Proof via Orthogonalization

The Interpolation Phase Transition in Neural Networks: Memorization and Generalization under Lazy Training

Generalization error of random features and kernel methods: hypercontractivity and kernel matrix concentration

Learning with invariances in random features and kernel models

When Do Neural Networks Outperform Kernel Methods?

Imputation for High-Dimensional Linear Regression

Linearized two-layers neural networks in high dimension

Optimization of Mean-field Spin Glasses

Surprises in High-Dimensional Ridgeless Least Squares Interpolation

TAP free energy, spin glasses, and variational inference

The estimation error of general first order methods

A Mean Field View of the Landscape of Two-Layers Neural Networks

A Grothendieck-type inequality for local maxima

De-biasing the Lasso: Optimal Sample Size for Gaussian Designs

How Well Do Local Algorithms Solve Semidefinite Programs?

Performance of a community detection algorithm based on semidefinite programming

Sparse PCA via Covariance Thresholding

Spectral algorithms for tensor completion

A Perspective on Future Research Directions in Information Theory

Asymptotic Mutual Information for the Two-Groups Stochastic Block Model

Computational Implications of Reducing Data to Sufficient Statistics

Convergence rates of sub-sampled Newton methods

Finding One Community in a Sparse Graph

Improved Sum-of-Squares Lower Bounds for Hidden Clique and Hidden Submatrix Problems

On Online Control of False Discovery Rate

The LASSO risk for gaussian matrices

The set of solutions of random XORSAT formulae

Universality in polytope phase transitions and message passing algorithms

Variance Breakdown of Huber (M)-estimators: $n/p \rightarrow m \in (1,\infty)$

A statistical model for tensor PCA

Confidence Intervals and Hypothesis Testing for High-Dimensional Regression

Guess Who Rated This Movie: Identifying Users Through Subspace Clustering

Hypothesis Testing in High-Dimensional Regression under the Gaussian Random Design Model: Asymptotic Theory

Information-theoretically Optimal Sparse PCA

Learning Mixtures of Linear Classifiers

Non-negative Principal Component Analysis: Message Passing Algorithms and Sharp Asymptotics

On the limitation of spectral methods: From the Gaussian hidden clique problem to rank one perturbations of Gaussian tensors

Privacy Tradeoffs in Predictive Analytics

Statistical Estimation: From Denoising to Sparse Regression and Hidden Cliques

Accelerated Time-of-Flight Mass Spectrometry

Accurate Prediction of Phase Transitions in Compressed Sensing via a Connection to Minimax Denoising

Conditional Random Fields, Planted Constraint Satisfaction, and Entropy Concentration

Factor models on locally tree-like graphs

Finding Hidden Cliques of Size \sqrt{N/e} in Nearly Linear Time

High Dimensional Robust M-Estimation: Asymptotic Variance via Approximate Message Passing

Information-Theoretically Optimal Compressed Sensing via Spatial Coupling and Approximate Message Passing

Linear Bandits in High Dimension and Recommendation Systems

Model Selection for High-Dimensional Regression under the Generalized Irrepresentability Condition

Nearly Optimal Sample Size in Hypothesis Testing for High-Dimensional Regression

The Phase Transition of Matrix Recovery from Gaussian Measurements Matches the Minimax MSE of Matrix Denoising

Guess Who Rated This Movie: Identifying Users Through Subspace Clustering

Identifying Users From Their Rating Patterns

Localization from Incomplete Noisy Distance Measurements

Matrix Completion from Noisy Entries

State Evolution for General Approximate Message Passing Algorithms, with Applications to Spatial Coupling

Subsampling at Information Theoretically Optimal Rates

The replica symmetric solution for Potts models on d-regular graphs

Bargaining dynamics in exchange networks

Compressed Sensing over $\ell_p$-balls: Minimax Mean Square Error

Distributed Storage for Intermittent Energy Sources: Control Design and Performance Limits

Gossip PCA

Graphical Models Concepts in Compressed Sensing

Information Theoretic Limits on Learning Stochastic Differential Equations

Majority dynamics on trees and the dynamic cavity method

On the trade-off between complexity and correlation decay in structural learning algorithms

Optimal coding for the deletion channel with small deletion probability

Robust Max-Product Belief Propagation

Subexponential convergence for information aggregation on regular trees

The dynamics of message passing on dense graphs, with applications to compressed sensing

A Natural Dynamics for Bargaining on Exchange Networks

Applications of Lindeberg Principle in Communications and Statistical Learning

Ising models on locally tree-like graphs