Source author record

David L. Donoho

David L. Donoho appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Information Theory math.IT math.ST Statistics Theory Computation Machine Learning Artificial Intelligence Computer Vision cond-mat.dis-nn cond-mat.stat-mech math.DG math.FA math.NA math.OC Mathematical Software Methodology Numerical Analysis physics.data-an

Catalog footprint

What is connected

12works

18topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2022arXiv

Neural Collapse Under MSE Loss: Proximity to and Dynamics on the Central Path

The recently discovered Neural Collapse (NC) phenomenon occurs pervasively in today's deep net training paradigm of driving cross-entropy (CE) loss towards zero. During NC, last-layer features collapse to their class-means, both classifiers and class-means collapse to the same Simplex Equiangular Tight Frame, and classifier behavior collapses to the nearest-class-mean decision rule. Recent works demonstrated that deep nets trained with mean squared error (MSE) loss perform comparably to those trained with CE. As a preliminary, we empirically establish that NC emerges in such MSE-trained deep nets as well through experiments on three canonical networks and five benchmark datasets. We provide, in a Google Colab notebook, PyTorch code for reproducing MSE-NC and CE-NC: at https://colab.research.google.com/github/neuralcollapse/neuralcollapse/blob/main/neuralcollapse.ipynb. The analytically-tractable MSE loss offers more mathematical opportunities than the hard-to-analyze CE loss, inspiring us to leverage MSE loss towards the theoretical investigation of NC. We develop three main contributions: (I) We show a new decomposition of the MSE loss into (A) terms directly interpretable through the lens of NC and which assume the last-layer classifier is exactly the least-squares classifier; and (B) a term capturing the deviation from this least-squares classifier. (II) We exhibit experiments on canonical datasets and networks demonstrating that term-(B) is negligible during training. This motivates us to introduce a new theoretical construct: the central path, where the linear classifier stays MSE-optimal for feature activations throughout the dynamics. (III) By studying renormalized gradient flow along the central path, we derive exact dynamics that predict NC.

preprint2020arXiv

Prevalence of Neural Collapse during the terminal phase of deep learning training

Modern practice for training classification deepnets involves a Terminal Phase of Training (TPT), which begins at the epoch where training error first vanishes; During TPT, the training error stays effectively zero while training loss is pushed towards zero. Direct measurements of TPT, for three prototypical deepnet architectures and across seven canonical classification datasets, expose a pervasive inductive bias we call Neural Collapse, involving four deeply interconnected phenomena: (NC1) Cross-example within-class variability of last-layer training activations collapses to zero, as the individual activations themselves collapse to their class-means; (NC2) The class-means collapse to the vertices of a Simplex Equiangular Tight Frame (ETF); (NC3) Up to rescaling, the last-layer classifiers collapse to the class-means, or in other words to the Simplex ETF, i.e. to a self-dual configuration; (NC4) For a given activation, the classifier's decision collapses to simply choosing whichever class has the closest train class-mean, i.e. the Nearest Class Center (NCC) decision rule. The symmetric and very simple geometry induced by the TPT confers important benefits, including better generalization performance, better robustness, and better interpretability.

preprint2016arXiv

Optimal Shrinkage of Singular Values

We consider recovery of low-rank matrices from noisy data by shrinkage of singular values, in which a single, univariate nonlinearity is applied to each of the empirical singular values. We adopt an asymptotic framework, in which the matrix size is much larger than the rank of the signal matrix to be recovered, and the signal-to-noise ratio of the low-rank piece stays constant. For a variety of loss functions, including Mean Square Error (MSE - square Frobenius norm), the nuclear norm loss and the operator norm loss, we show that in this framework there is a well-defined asymptotic loss that we evaluate precisely in each case. In fact, each of the loss functions we study admits a unique admissible shrinkage nonlinearity dominating all other nonlinearities. We provide a general method for evaluating these optimal nonlinearities, and demonstrate our framework by working out simple, explicit formulas for the optimal nonlinearities in the Frobenius, nuclear and operator norm cases. For example, for a square low-rank n-by-n matrix observed in white noise with level $σ$, the optimal nonlinearity for MSE loss simply shrinks each data singular value $y$ to $\sqrt{y^2-4nσ^2 }$ (or to 0 if $y<2\sqrt{n}σ$). This optimal nonlinearity guarantees an asymptotic MSE of $2nrσ^2$, which compares favorably with optimally tuned hard thresholding and optimally tuned soft thresholding, providing guarantees of $3nrσ^2$ and $6nrσ^2$, respectively. Our general method also allows one to evaluate optimal shrinkers numerically to arbitrary precision. As an example, we compute optimal shrinkers for the Schatten-p norm loss, for any p>0.

preprint2015arXiv

Variance Breakdown of Huber (M)-estimators: $n/p \rightarrow m \in (1,\infty)$

A half century ago, Huber evaluated the minimax asymptotic variance in scalar location estimation, $ \min_ψ\max_{F \in {\cal F}_ε} V(ψ, F) = \frac{1}{I(F_ε^*)} $, where $V(ψ,F)$ denotes the asymptotic variance of the $(M)$-estimator for location with score function $ψ$, and $I(F_ε^*)$ is the minimal Fisher information $ \min_{{\cal F}_ε} I(F)$ over the class of $ε$-Contaminated Normal distributions. We consider the linear regression model $Y = Xθ_0 + W$, $W_i\sim_{\text{i.i.d.}}F$, and iid Normal predictors $X_{i,j}$, working in the high-dimensional-limit asymptotic where the number $n$ of observations and $p$ of variables both grow large, while $n/p \rightarrow m \in (1,\infty)$; hence $m$ plays the role of `asymptotic number of observations per parameter estimated'. Let $V_m(ψ,F)$ denote the per-coordinate asymptotic variance of the $(M)$-estimator of regression in the $n/p \rightarrow m$ regime. Then $V_m \neq V$; however $V_m \rightarrow V$ as $m \rightarrow \infty$. In this paper we evaluate the minimax asymptotic variance of the Huber $(M)$-estimate. The statistician minimizes over the family $(ψ_λ)_{λ> 0}$ of all tunings of Huber $(M)$-estimates of regression, and Nature maximizes over gross-error contaminations $F \in {\cal F}_ε$. Suppose that $I(F_ε^*) \cdot m > 1$. Then $ \min_λ\max_{F \in {\cal F}_ε} V_m(ψ_λ, F) = \frac{1}{I(F_ε^*) - 1/m} $. Strikingly, if $I(F_ε^*) \cdot m \leq 1$, then the minimax asymptotic variance is $+\infty$. The breakdown point is where the Fisher information per parameter equals unity.

preprint2014arXiv

The Optimal Hard Threshold for Singular Values is 4/sqrt(3)

We consider recovery of low-rank matrices from noisy data by hard thresholding of singular values, where singular values below a prescribed threshold $λ$ are set to 0. We study the asymptotic MSE in a framework where the matrix size is large compared to the rank of the matrix to be recovered, and the signal-to-noise ratio of the low-rank piece stays constant. The AMSE-optimal choice of hard threshold, in the case of n-by-n matrix in noise level σ, is simply $(4/\sqrt{3}) \sqrt{n}σ\approx 2.309 \sqrt{n}σ$ when $σ$ is known, or simply $2.858\cdot y_{med}$ when $σ$ is unknown, where $y_{med}$ is the median empirical singular value. For nonsquare $m$ by $n$ matrices with $m \neq n$, these thresholding coefficients are replaced with different provided constants. In our asymptotic framework, this thresholding rule adapts to unknown rank and to unknown noise level in an optimal manner: it is always better than hard thresholding at any other value, no matter what the matrix is that we are trying to recover, and is always better than ideal Truncated SVD (TSVD), which truncates at the true rank of the low-rank matrix we are trying to recover. Hard thresholding at the recommended value to recover an n-by-n matrix of rank r guarantees an AMSE at most $3nrσ^2$. In comparison, the guarantee provided by TSVD is $5nrσ^2$, the guarantee provided by optimally tuned singular value soft thresholding is $6nrσ^2$, and the best guarantee achievable by any shrinkage of the data singular values is $2nrσ^2$. Empirical evidence shows that these AMSE properties of the $4/\sqrt{3}$ thresholding rule remain valid even for relatively small n, and that performance improvement over TSVD and other shrinkage rules is substantial, turning it into the practical hard threshold of choice.

preprint2013arXiv

Information-Theoretically Optimal Compressed Sensing via Spatial Coupling and Approximate Message Passing

We study the compressed sensing reconstruction problem for a broad class of random, band-diagonal sensing matrices. This construction is inspired by the idea of spatial coupling in coding theory. As demonstrated heuristically and numerically by Krzakala et al. \cite{KrzakalaEtAl}, message passing algorithms can effectively solve the reconstruction problem for spatially coupled measurements with undersampling rates close to the fraction of non-zero coordinates. We use an approximate message passing (AMP) algorithm and analyze it through the state evolution method. We give a rigorous proof that this approach is successful as soon as the undersampling rate $δ$ exceeds the (upper) Rényi information dimension of the signal, $\uRenyi(p_X)$. More precisely, for a sequence of signals of diverging dimension $n$ whose empirical distribution converges to $p_X$, reconstruction is with high probability successful from $\uRenyi(p_X)\, n+o(n)$ measurements taken according to a band diagonal matrix. For sparse signals, i.e., sequences of dimension $n$ and $k(n)$ non-zero entries, this implies reconstruction from $k(n)+o(n)$ measurements. For `discrete' signals, i.e., signals whose coordinates take a fixed finite set of values, this implies reconstruction from $o(n)$ measurements. The result is robust with respect to noise, does not apply uniquely to random signals, but requires the knowledge of the empirical distribution of the signal $p_X$.

preprint2013arXiv

The Phase Transition of Matrix Recovery from Gaussian Measurements Matches the Minimax MSE of Matrix Denoising

Let $X_0$ be an unknown $M$ by $N$ matrix. In matrix recovery, one takes $n < MN$ linear measurements $y_1,..., y_n$ of $X_0$, where $y_i = \Tr(a_i^T X_0)$ and each $a_i$ is a $M$ by $N$ matrix. For measurement matrices with Gaussian i.i.d entries, it known that if $X_0$ is of low rank, it is recoverable from just a few measurements. A popular approach for matrix recovery is Nuclear Norm Minimization (NNM). Empirical work reveals a \emph{phase transition} curve, stated in terms of the undersampling fraction $δ(n,M,N) = n/(MN)$, rank fraction $ρ=r/N$ and aspect ratio $β=M/N$. Specifically, a curve $δ^* = δ^*(ρ;β)$ exists such that, if $δ> δ^*(ρ;β)$, NNM typically succeeds, while if $δ< δ^*(ρ;β)$, it typically fails. An apparently quite different problem is matrix denoising in Gaussian noise, where an unknown $M$ by $N$ matrix $X_0$ is to be estimated based on direct noisy measurements $Y = X_0 + Z$, where the matrix $Z$ has iid Gaussian entries. It has been empirically observed that, if $X_0$ has low rank, it may be recovered quite accurately from the noisy measurement $Y$. A popular matrix denoising scheme solves the unconstrained optimization problem $\text{min} \| Y - X \|_F^2/2 + λ\|X\|_* $. When optimally tuned, this scheme achieves the asymptotic minimax MSE $\cM(ρ) = \lim_{N \goto \infty} \inf_λ\sup_{\rank(X) \leq ρ\cdot N} MSE(X,\hat{X}_λ)$. We report extensive experiments showing that the phase transition $δ^*(ρ)$ in the first problem coincides with the minimax risk curve $\cM(ρ)$ in the second problem, for {\em any} rank fraction $0 < ρ< 1$.

preprint2010arXiv

Microlocal Analysis of the Geometric Separation Problem

Image data are often composed of two or more geometrically distinct constituents; in galaxy catalogs, for instance, one sees a mixture of pointlike structures (galaxy superclusters) and curvelike structures (filaments). It would be ideal to process a single image and extract two geometrically `pure' images, each one containing features from only one of the two geometric constituents. This seems to be a seriously underdetermined problem, but recent empirical work achieved highly persuasive separations. We present a theoretical analysis showing that accurate geometric separation of point and curve singularities can be achieved by minimizing the $\ell_1$ norm of the representing coefficients in two geometrically complementary frames: wavelets and curvelets. Driving our analysis is a specific property of the ideal (but unachievable) representation where each content type is expanded in the frame best adapted to it. This ideal representation has the property that important coefficients are clustered geometrically in phase space, and that at fine scales, there is very little coherence between a cluster of elements in one frame expansion and individual elements in the complementary frame. We formally introduce notions of cluster coherence and clustered sparsity and use this machinery to show that the underdetermined systems of linear equations can be stably solved by $\ell_1$ minimization; microlocal phase space helps organize the calculations that cluster coherence requires.

preprint2010arXiv

The Noise-Sensitivity Phase Transition in Compressed Sensing

Consider the noisy underdetermined system of linear equations: y=Ax0 + z0, with n x N measurement matrix A, n < N, and Gaussian white noise z0 ~ N(0,σ^2 I). Both y and A are known, both x0 and z0 are unknown, and we seek an approximation to x0. When x0 has few nonzeros, useful approximations are obtained by l1-penalized l2 minimization, in which the reconstruction \hxl solves min || y - Ax||^2/2 + λ||x||_1. Evaluate performance by mean-squared error (MSE = E ||\hxl - x0||_2^2/N). Consider matrices A with iid Gaussian entries and a large-system limit in which n,N\to\infty with n/N \to δand k/n \to ρ. Call the ratio MSE/σ^2 the noise sensitivity. We develop formal expressions for the MSE of \hxl, and evaluate its worst-case formal noise sensitivity over all types of k-sparse signals. The phase space 0 < δ, ρ< 1 is partitioned by curve ρ= \rhoMSE(δ) into two regions. Formal noise sensitivity is bounded throughout the region ρ< \rhoMSE(δ) and is unbounded throughout the region ρ> \rhoMSE(δ). The phase boundary ρ= \rhoMSE(δ) is identical to the previously-known phase transition curve for equivalence of l1 - l0 minimization in the k-sparse noiseless case. Hence a single phase boundary describes the fundamental phase transitions both for the noiseless and noisy cases. Extensive computational experiments validate the predictions of this formalism, including the existence of game theoretical structures underlying it. Underlying our formalism is the AMP algorithm introduced earlier by the authors. Other papers by the authors detail expressions for the formal MSE of AMP and its close connection to l1-penalized reconstruction. Here we derive the minimax formal MSE of AMP and then read out results for l1-penalized reconstruction.

preprint2009arXiv

Message Passing Algorithms for Compressed Sensing

Compressed sensing aims to undersample certain high-dimensional signals, yet accurately reconstruct them by exploiting signal characteristics. Accurate reconstruction is possible when the object to be recovered is sufficiently sparse in a known basis. Currently, the best known sparsity-undersampling tradeoff is achieved when reconstructing by convex optimization -- which is expensive in important large-scale applications. Fast iterative thresholding algorithms have been intensively studied as alternatives to convex optimization for large-scale problems. Unfortunately known fast algorithms offer substantially worse sparsity-undersampling tradeoffs than convex optimization. We introduce a simple costless modification to iterative thresholding making the sparsity-undersampling tradeoff of the new algorithms equivalent to that of the corresponding convex optimization procedures. The new iterative-thresholding algorithms are inspired by belief propagation in graphical models. Our empirical measurements of the sparsity-undersampling tradeoff for the new algorithms agree with theoretical calculations. We show that a state evolution formalism correctly derives the true sparsity-undersampling tradeoff. There is a surprising agreement between earlier calculations based on random convex polytopes and this new, apparently very different theoretical formalism.

preprint2009arXiv

Observed Universality of Phase Transitions in High-Dimensional Geometry, with Implications for Modern Data Analysis and Signal Processing

We review connections between phase transitions in high-dimensional combinatorial geometry and phase transitions occurring in modern high-dimensional data analysis and signal processing. In data analysis, such transitions arise as abrupt breakdown of linear model selection, robust data fitting or compressed sensing reconstructions, when the complexity of the model or the number of outliers increases beyond a threshold. In combinatorial geometry these transitions appear as abrupt changes in the properties of face counts of convex polytopes when the dimensions are varied. The thresholds in these very different problems appear in the same critical locations after appropriate calibration of variables. These thresholds are important in each subject area: for linear modelling, they place hard limits on the degree to which the now-ubiquitous high-throughput data analysis can be successful; for robustness, they place hard limits on the degree to which standard robust fitting methods can tolerate outliers before breaking down; for compressed sensing, they define the sharp boundary of the undersampling/sparsity tradeoff in undersampling theorems. Existing derivations of phase transitions in combinatorial geometry assume the underlying matrices have independent and identically distributed (iid) Gaussian elements. In applications, however, it often seems that Gaussianity is not required. We conducted an extensive computational experiment and formal inferential analysis to test the hypothesis that these phase transitions are {\it universal} across a range of underlying matrix ensembles. The experimental results are consistent with an asymptotic large-$n$ universality across matrix ensembles; finite-sample universality can be rejected.

preprint2009arXiv

Optimally Tuned Iterative Reconstruction Algorithms for Compressed Sensing

We conducted an extensive computational experiment, lasting multiple CPU-years, to optimally select parameters for two important classes of algorithms for finding sparse solutions of underdetermined systems of linear equations. We make the optimally tuned implementations available at {\tt sparselab.stanford.edu}; they run `out of the box' with no user tuning: it is not necessary to select thresholds or know the likely degree of sparsity. Our class of algorithms includes iterative hard and soft thresholding with or without relaxation, as well as CoSaMP, subspace pursuit and some natural extensions. As a result, our optimally tuned algorithms dominate such proposals. Our notion of optimality is defined in terms of phase transitions, i.e. we maximize the number of nonzeros at which the algorithm can successfully operate. We show that the phase transition is a well-defined quantity with our suite of random underdetermined linear systems. Our tuning gives the highest transition possible within each class of algorithms.

David L. Donoho

What is connected

Connect this record

See the researcher in context

Building this map preview

12 published item(s)

Neural Collapse Under MSE Loss: Proximity to and Dynamics on the Central Path

Prevalence of Neural Collapse during the terminal phase of deep learning training

Optimal Shrinkage of Singular Values

Variance Breakdown of Huber (M)-estimators: $n/p \rightarrow m \in (1,\infty)$

The Optimal Hard Threshold for Singular Values is 4/sqrt(3)

Information-Theoretically Optimal Compressed Sensing via Spatial Coupling and Approximate Message Passing

The Phase Transition of Matrix Recovery from Gaussian Measurements Matches the Minimax MSE of Matrix Denoising

Microlocal Analysis of the Geometric Separation Problem

The Noise-Sensitivity Phase Transition in Compressed Sensing

Message Passing Algorithms for Compressed Sensing

Observed Universality of Phase Transitions in High-Dimensional Geometry, with Implications for Modern Data Analysis and Signal Processing

Optimally Tuned Iterative Reconstruction Algorithms for Compressed Sensing