Source author record

Rachel Ward

Rachel Ward appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher: unclaimed source record

Catalog footprint

What is connected

47 works

24 topics

4 close collaborators

preprint, 2022, arXiv

Bootstrapping the error of Oja's algorithm

We consider the problem of quantifying uncertainty for the estimation error of the leading eigenvector from Oja's algorithm for streaming principal component analysis, where the data are generated IID from some unknown distribution. By combining classical tools from the U-statistics literature with recent results on high-dimensional central limit theorems for quadratic forms of random vectors and concentration of matrix products, we establish a weighted $χ^2$ approximation result for the $\sin^2$ error between the population eigenvector and the output of Oja's algorithm. Since estimating the covariance matrix associated with the approximating distribution requires knowledge of unknown model parameters, we propose a multiplier bootstrap algorithm that may be updated in an online manner. We establish conditions under which the bootstrap distribution is close to the corresponding sampling distribution with high probability, thereby establishing the bootstrap as a consistent inferential method in an appropriate asymptotic regime.
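For orientation, the basic Oja update and the $\sin^2$ error it is measured by can be sketched in a few lines. This is only a minimal illustration of the streaming iteration, not the bootstrap procedure from the abstract; the step-size schedule, the synthetic data stream and all function names are assumptions made for the example.

```python
import math
import random


def oja_leading_eigenvector(samples, dim, step=lambda t: 1.0 / (10.0 + t), seed=0):
    """Streaming estimate of the top eigenvector of E[x x^T] via Oja's rule.

    The step-size schedule is an illustrative assumption, not a tuned choice.
    """
    rng = random.Random(seed)
    w = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(v * v for v in w))
    w = [v / norm for v in w]
    for t, x in enumerate(samples):
        proj = sum(xi * wi for xi, wi in zip(x, w))          # x^T w
        eta = step(t)
        w = [wi + eta * proj * xi for wi, xi in zip(w, x)]   # w + eta * x x^T w
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]                            # project back to the unit sphere
    return w


def sin2_error(w, v):
    """sin^2 of the angle between two unit vectors."""
    cos = sum(a * b for a, b in zip(w, v))
    return 1.0 - cos * cos


def synthetic_stream(n, scales, seed=1):
    """Hypothetical IID stream with covariance diag(scales[i]^2)."""
    rng = random.Random(seed)
    for _ in range(n):
        yield [s * rng.gauss(0.0, 1.0) for s in scales]


scales = [2.0, 1.0, 0.5]   # covariance diag(4, 1, 0.25); top eigenvector is e_1
w_hat = oja_leading_eigenvector(synthetic_stream(5000, scales), dim=3)
err = sin2_error(w_hat, [1.0, 0.0, 0.0])
```

The estimate is only defined up to sign, which is why the error is measured through $\sin^2$ rather than a Euclidean distance.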

preprint, 2022, arXiv

How catastrophic can catastrophic forgetting be in linear regression?

To better understand catastrophic forgetting, we study fitting an overparameterized linear model to a sequence of tasks with different input distributions. We analyze how much the model forgets the true labels of earlier tasks after training on subsequent tasks, obtaining exact expressions and bounds. We establish connections between continual learning in the linear setting and two other research areas: alternating projections and the Kaczmarz method. In specific settings, we highlight differences between forgetting and convergence to the offline solution as studied in those areas. In particular, when $T$ tasks in $d$ dimensions are presented cyclically for $k$ iterations, we prove an upper bound of $T^2 \min\{1/\sqrt{k}, d/k\}$ on the forgetting. This stands in contrast to the convergence to the offline solution, which can be arbitrarily slow according to existing alternating projection results. We further show that the $T^2$ factor can be lifted when tasks are presented in a random ordering.

preprint, 2022, arXiv

Implicit Regularization and Convergence for Weight Normalization

Normalization methods such as batch [Ioffe and Szegedy, 2015], weight [Salimans and Kingma, 2016], instance [Ulyanov et al., 2016], and layer normalization [Ba et al., 2016] have been widely used in modern machine learning. Here, we study the weight normalization (WN) method [Salimans and Kingma, 2016] and a variant called reparametrized projected gradient descent (rPGD) for overparametrized least-squares regression. WN and rPGD reparametrize the weights with a scale g and a unit vector w and thus the objective function becomes non-convex. We show that this non-convex formulation has beneficial regularization effects compared to gradient descent on the original objective. These methods adaptively regularize the weights and converge close to the minimum $\ell_2$ norm solution, even for initializations far from zero. For certain stepsizes of g and w, we show that they can converge close to the minimum norm solution. This is different from the behavior of gradient descent, which converges to the minimum norm solution only when started at a point in the range space of the feature matrix, and is thus more sensitive to initialization.

preprint, 2022, arXiv

Sample Efficiency of Data Augmentation Consistency Regularization

Data augmentation is popular in the training of large neural networks; currently, however, there is no clear theoretical comparison between different algorithmic choices on how to use augmented data. In this paper, we take a step in this direction - we first present a simple and novel analysis for linear regression with label invariant augmentations, demonstrating that data augmentation consistency (DAC) is intrinsically more efficient than empirical risk minimization on augmented data (DA-ERM). The analysis is then extended to misspecified augmentations (i.e., augmentations that change the labels), which again demonstrates the merit of DAC over DA-ERM. Further, we extend our analysis to non-linear models (e.g., neural networks) and present generalization bounds. Finally, we perform experiments that make a clean and apples-to-apples comparison (i.e., with no extra modeling or data tweaks) between DAC and DA-ERM using CIFAR-100 and WideResNet; these together demonstrate the superior efficacy of DAC.

preprint, 2022, arXiv

The Power of Adaptivity in SGD: Self-Tuning Step Sizes with Unbounded Gradients and Affine Variance

We study convergence rates of AdaGrad-Norm as an exemplar of adaptive stochastic gradient methods (SGD), where the step sizes change based on observed stochastic gradients, for minimizing non-convex, smooth objectives. Despite their popularity, the analysis of adaptive SGD lags behind that of non adaptive methods in this setting. Specifically, all prior works rely on some subset of the following assumptions: (i) uniformly-bounded gradient norms, (ii) uniformly-bounded stochastic gradient variance (or even noise support), (iii) conditional independence between the step size and stochastic gradient. In this work, we show that AdaGrad-Norm exhibits an order optimal convergence rate of $\mathcal{O}\left(\frac{\mathrm{poly}\log(T)}{\sqrt{T}}\right)$ after $T$ iterations under the same assumptions as optimally-tuned non adaptive SGD (unbounded gradient norms and affine noise variance scaling), and crucially, without needing any tuning parameters. We thus establish that adaptive gradient methods exhibit order-optimal convergence in much broader regimes than previously understood.
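The AdaGrad-Norm update itself is short: a single scalar accumulates squared gradient norms and divides the step size. The sketch below shows that update on a toy quadratic with noisy gradients; the objective, noise level, starting point and the parameter names `eta` and `b0` are assumptions chosen for illustration and do not reproduce the paper's setting or analysis.

```python
import math
import random


def adagrad_norm(grad, x0, steps, eta=1.0, b0=0.1):
    """AdaGrad-Norm: one scalar step size driven by accumulated gradient norms.

    Update: b_{t+1}^2 = b_t^2 + ||g_t||^2,  x_{t+1} = x_t - (eta / b_{t+1}) * g_t.
    """
    x = list(x0)
    b_sq = b0 * b0
    for _ in range(steps):
        g = grad(x)
        b_sq += sum(gi * gi for gi in g)
        lr = eta / math.sqrt(b_sq)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x


# Toy problem (an assumption for the example): f(x) = 0.5 * sum_i a_i * x_i^2.
CURVATURES = [1.0, 10.0]


def objective(x):
    return 0.5 * sum(a * xi * xi for a, xi in zip(CURVATURES, x))


_noise = random.Random(0)


def noisy_grad(x):
    """Exact gradient plus small Gaussian noise, standing in for a stochastic oracle."""
    return [a * xi + 0.1 * _noise.gauss(0.0, 1.0) for a, xi in zip(CURVATURES, x)]


x_start = [5.0, 5.0]
x_final = adagrad_norm(noisy_grad, x_start, steps=2000)
```

Note that neither the smoothness constant nor the noise level is passed to `adagrad_norm`; the accumulated norm `b_sq` adapts the step size on its own, which is the self-tuning behavior the abstract refers to.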

preprint, 2021, arXiv

Arbitrary-length analogs to de Bruijn sequences

Let $\widetilde{\alpha}$ be a length-$L$ cyclic sequence of characters from a size-$K$ alphabet $\mathcal{A}$ such that the number of occurrences of any length-$m$ string on $\mathcal{A}$ as a substring of $\widetilde{\alpha}$ is $\lfloor L / K^m \rfloor$ or $\lceil L / K^m \rceil$. When $L = K^N$ for any positive integer $N$, $\widetilde{\alpha}$ is a de Bruijn sequence of order $N$, and when $L \neq K^N$, $\widetilde{\alpha}$ shares many properties with de Bruijn sequences. We describe an algorithm that outputs some $\widetilde{\alpha}$ for any combination of $K \geq 2$ and $L \geq 1$ in $O(L)$ time using $O(L \log K)$ space. This algorithm extends Lempel's recursive construction of a binary de Bruijn sequence. An implementation written in Python is available at https://github.com/nelloreward/pkl.
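The defining balance property is easy to check by brute force for small inputs. The function below is a hypothetical checker written for this page, not the construction algorithm from the paper or its linked implementation: for a given $m$ it counts every length-$m$ cyclic substring and confirms each count is $\lfloor L / K^m \rfloor$ or $\lceil L / K^m \rceil$.

```python
from collections import Counter
from itertools import product


def has_balanced_substrings(seq, alphabet, m):
    """Check the balance property for length-m substrings of a cyclic sequence.

    Brute force over all K^m strings, so only suitable for small K and m.
    """
    L, K = len(seq), len(alphabet)
    if not 1 <= m <= L:
        raise ValueError("m must satisfy 1 <= m <= len(seq)")
    wrapped = seq + seq[:m - 1]                      # unroll the cyclic wrap-around
    counts = Counter(wrapped[i:i + m] for i in range(L))
    lo = L // K**m                                   # floor(L / K^m)
    hi = -(-L // K**m)                               # ceil(L / K^m)
    return all(counts["".join(p)] in (lo, hi) for p in product(alphabet, repeat=m))
```

For example, the cyclic sequence `"0011"` is a binary de Bruijn sequence of order 2 and passes the check for every `m` from 1 to 4, while `"0001"` already fails at `m = 1` because it contains three zeros and one one.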

preprint, 2021, arXiv

Generalization Bounds for Sparse Random Feature Expansions

Random feature methods have been successful in various machine learning tasks, are easy to compute, and come with theoretical accuracy bounds. They serve as an alternative approach to standard neural networks since they can represent similar function spaces without a costly training phase. However, for accuracy, random feature methods require more measurements than trainable parameters, limiting their use for data-scarce applications or problems in scientific machine learning. This paper introduces the sparse random feature expansion to obtain parsimonious random feature models. Specifically, we leverage ideas from compressive sensing to generate random feature expansions with theoretical guarantees even in the data-scarce setting. In particular, we provide generalization bounds for functions in a certain class (that is dense in a reproducing kernel Hilbert space) depending on the number of samples and the distribution of features. The generalization bounds improve with additional structural conditions, such as coordinate sparsity, compact clusters of the spectrum, or rapid spectral decay. In particular, by introducing sparse features, i.e. features with random sparse weights, we provide improved bounds for low order functions. We show that the sparse random feature expansions outperform shallow networks in several scientific machine learning tasks.

preprint, 2021, arXiv

Streaming k-PCA: Efficient guarantees for Oja's algorithm, beyond rank-one updates

We analyze Oja's algorithm for streaming $k$-PCA and prove that it achieves performance nearly matching that of an optimal offline algorithm. Given access to a sequence of i.i.d. $d \times d$ symmetric matrices, we show that Oja's algorithm can obtain an accurate approximation to the subspace of the top $k$ eigenvectors of their expectation using a number of samples that scales polylogarithmically with $d$. Previously, such a result was only known in the case where the updates have rank one. Our analysis is based on recently developed matrix concentration tools, which allow us to prove strong bounds on the tails of the random matrices which arise in the course of the algorithm's execution.

preprint, 2020, arXiv

Linear Convergence of Adaptive Stochastic Gradient Descent

We prove that the norm version of the adaptive stochastic gradient method (AdaGrad-Norm) achieves a linear convergence rate for a subset of either strongly convex functions or non-convex functions that satisfy the Polyak-Łojasiewicz (PL) inequality. The paper introduces the notion of Restricted Uniform Inequality of Gradients (RUIG)---which is a measure of the balanced-ness of the stochastic gradient norms---to depict the landscape of a function. RUIG plays a key role in proving the robustness of AdaGrad-Norm to its hyper-parameter tuning in the stochastic setting. On top of RUIG, we develop a two-stage framework to prove the linear convergence of AdaGrad-Norm without knowing the parameters of the objective functions. This framework can likely be extended to other adaptive stepsize algorithms. The numerical experiments validate the theory and suggest future directions for improvement.

preprint, 2020, arXiv

Matrix Concentration for Products

This paper develops nonasymptotic growth and concentration bounds for a product of independent random matrices. These results sharpen and generalize recent work of Henriksen-Ward, and they are similar in spirit to the results of Ahlswede-Winter and of Tropp for a sum of independent random matrices. The argument relies on the uniform smoothness properties of the Schatten trace classes.

preprint, 2016, arXiv

A polynomial-time relaxation of the Gromov-Hausdorff distance

The Gromov-Hausdorff distance provides a metric on the set of isometry classes of compact metric spaces. Unfortunately, computing this metric directly is believed to be computationally intractable. Motivated by applications in shape matching and point-cloud comparison, we study a semidefinite programming relaxation of the Gromov-Hausdorff metric. This relaxation can be computed in polynomial time, and somewhat surprisingly is itself a pseudometric. We describe the induced topology on the set of compact metric spaces. Finally, we demonstrate the numerical performance of various algorithms for computing the relaxed distance and apply these algorithms to several relevant data sets. In particular we propose a greedy algorithm for finding the best correspondence between finite metric spaces that can handle hundreds of points.

preprint, 2016, arXiv

Clustering subgaussian mixtures by semidefinite programming

We introduce a model-free relax-and-round algorithm for k-means clustering based on a semidefinite relaxation due to Peng and Wei. The algorithm interprets the SDP output as a denoised version of the original data and then rounds this output to a hard clustering. We provide a generic method for proving performance guarantees for this algorithm, and we analyze the algorithm in the context of subgaussian mixture models. We also study the fundamental limits of estimating Gaussian centers by k-means clustering in order to compare our approximation guarantee to the theoretically optimal k-means clustering solution.

preprint, 2016, arXiv

Exact Recovery of Chaotic Systems from Highly Corrupted Data

Learning the governing equations in dynamical systems from time-varying measurements is of great interest across different scientific fields. This task becomes prohibitive when such data is moreover highly corrupted, for example, due to the recording mechanism failing over unknown intervals of time. When the underlying system exhibits chaotic behavior, such as sensitivity to initial conditions, it is crucial to recover the governing equations with high precision. In this work, we consider continuous time dynamical systems $\dot{x} = f(x)$ where each component of $f: \mathbb{R}^{d} \rightarrow \mathbb{R}^d$ is a multivariate polynomial of maximal degree $p$; we aim to identify $f$ exactly from possibly highly corrupted measurements $x(t_1), x(t_2), \dots, x(t_m)$. As our main theoretical result, we show that if the system is sufficiently ergodic that this data satisfies a strong central limit theorem (as is known to hold for chaotic Lorenz systems), then the governing equations $f$ can be exactly recovered as the solution to an $\ell_1$ minimization problem -- even if a large percentage of the data is corrupted by outliers. Numerically, we apply the alternating minimization method to solve the corresponding constrained optimization problem. Through several examples of 3D chaotic systems and higher dimensional hyperchaotic systems, we illustrate the power, generality, and efficiency of the algorithm for recovering governing equations from noisy and highly corrupted measurement data.

preprint, 2016, arXiv

Fast Cross-Polytope Locality-Sensitive Hashing

We provide a variant of cross-polytope locality-sensitive hashing with respect to angular distance which is provably optimal in asymptotic sensitivity and enjoys $\mathcal{O}(d \ln d )$ hash computation time. Building on a recent result (by Andoni, Indyk, Laarhoven, Razenshteyn, Schmidt, 2015), we show that optimal asymptotic sensitivity for cross-polytope LSH is retained even when the dense Gaussian matrix is replaced by a fast Johnson-Lindenstrauss transform followed by discrete pseudo-rotation, reducing the hash computation time from $\mathcal{O}(d^2)$ to $\mathcal{O}(d \ln d )$. Moreover, our scheme achieves the optimal rate of convergence for sensitivity. By incorporating a low-randomness Johnson-Lindenstrauss transform, our scheme can be modified to require only $\mathcal{O}(\ln^9(d))$ random bits.

preprint, 2016, arXiv

One-bit compressive sensing with norm estimation

Consider the recovery of an unknown signal ${x}$ from quantized linear measurements. In the one-bit compressive sensing setting, one typically assumes that ${x}$ is sparse, and that the measurements are of the form $\operatorname{sign}(\langle {a}_i, {x} \rangle) \in \{\pm1\}$. Since such measurements give no information on the norm of ${x}$, recovery methods from such measurements typically assume that $\| {x} \|_2=1$. We show that if one allows more generally for quantized affine measurements of the form $\operatorname{sign}(\langle {a}_i, {x} \rangle + b_i)$, and if the vectors ${a}_i$ are random, an appropriate choice of the affine shifts $b_i$ allows norm recovery to be easily incorporated into existing methods for one-bit compressive sensing. Additionally, we show that for arbitrary fixed ${x}$ in the annulus $r \leq \| {x} \|_2 \leq R$, one may estimate the norm $\| {x} \|_2$ up to additive error $δ$ from $m \gtrsim R^4 r^{-2} δ^{-2}$ such binary measurements through a single evaluation of the inverse Gaussian error function. Finally, all of our recovery guarantees can be made universal over sparse vectors, in the sense that with high probability, one set of measurements and thresholds can successfully estimate all sparse vectors ${x}$ within a Euclidean ball of known radius.
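To see why affine shifts make the norm recoverable, consider a simplified setting that is an assumption of this sketch rather than the paper's exact scheme: standard Gaussian vectors $a_i$ and one constant shift $b > 0$. Then $\langle a_i, x \rangle$ is distributed as $N(0, \|x\|_2^2)$, so the fraction of $+1$ measurements estimates $\Phi(b / \|x\|_2)$, and a single inverse normal CDF evaluation gives $\|x\|_2 \approx b / \Phi^{-1}(\hat{p})$. The function name and parameters below are hypothetical.

```python
import random
from statistics import NormalDist


def estimate_norm_one_bit(x, shift, m, seed=0):
    """Estimate ||x||_2 from m one-bit measurements sign(<a_i, x> + shift).

    Simplifying assumptions: a_i has IID standard Gaussian entries and the
    shift is a single positive constant. Breaks down if the empirical
    fraction is 0, 1 or exactly 0.5 (inv_cdf undefined or zero divisor).
    """
    rng = random.Random(seed)
    positives = 0
    for _ in range(m):
        inner = sum(rng.gauss(0.0, 1.0) * xi for xi in x)   # <a_i, x>
        if inner + shift > 0:
            positives += 1
    p_hat = positives / m                                    # estimates Phi(shift / ||x||)
    return shift / NormalDist().inv_cdf(p_hat)


x_signal = [3.0, 4.0]                                        # true norm is 5
norm_estimate = estimate_norm_one_bit(x_signal, shift=5.0, m=20000)
```

Without the shift, the fraction of positive signs is close to one half regardless of the scale of `x`, which is the reason unshifted one-bit measurements carry no norm information.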

preprint, 2016, arXiv

The local convexity of solving systems of quadratic equations

This paper considers the recovery of a rank $r$ positive semidefinite matrix $X X^T\in\mathbb{R}^{n\times n}$ from $m$ scalar measurements of the form $y_i := a_i^T X X^T a_i$ (i.e., quadratic measurements of $X$). Such problems arise in a variety of applications, including covariance sketching of high-dimensional data streams, quadratic regression, and quantum state tomography, among others. A natural approach to this problem is to minimize the loss function $f(U) = \sum_i (y_i - a_i^TUU^Ta_i)^2$ which has an entire manifold of solutions given by $\{XO\}_{O\in\mathcal{O}_r}$ where $\mathcal{O}_r$ is the orthogonal group of $r\times r$ orthogonal matrices; this is non-convex in the $n\times r$ matrix $U$, but methods like gradient descent are simple and easy to implement (as compared to semidefinite relaxation approaches). In this paper we show that once we have $m \geq C nr \log^2(n)$ samples from isotropic Gaussian $a_i$, with high probability (a) this function admits a dimension-independent region of local strong convexity on lines perpendicular to the solution manifold, and (b) with an additional polynomial factor of $r$ samples, a simple spectral initialization will land within the region of convexity with high probability. Together, this implies that gradient descent with initialization (but no re-sampling) will converge linearly to the correct $X$, up to an orthogonal transformation. We believe that this general technique (local convexity reachable by spectral initialization) should prove applicable to a broader class of nonconvex optimization problems.

preprint, 2016, arXiv

The sample complexity of weighted sparse approximation

For Gaussian sampling matrices, we provide bounds on the minimal number of measurements $m$ required to achieve robust weighted sparse recovery guarantees in terms of how well a given prior model for the sparsity support aligns with the true underlying support. Our main contribution is that for a sparse vector $\mathbf{x} \in \mathbb{R}^N$ supported on an unknown set $\mathcal{S} \subset \{1, \dots, N\}$ with $|\mathcal{S}|\leq k$, if $\mathcal{S}$ has weighted cardinality $ω(\mathcal{S}) := \sum_{j \in \mathcal{S}} ω_j^2$, and if the weights on $\mathcal{S}^c$ exhibit mild growth, $ω_j^2 \geq γ\log(j/ω(\mathcal{S}))$ for $j\in\mathcal{S}^c$ and $γ> 0$, then the sample complexity for sparse recovery via weighted $\ell_1$-minimization using weights $ω_j$ is linear in the weighted sparsity level, and $m = \mathcal{O}(ω(\mathcal{S})/γ)$. This main result is a generalization of special cases including a) the standard sparse recovery setting where all weights $ω_j \equiv 1$, and $m = \mathcal{O}\left(k\log\left(N/k\right)\right)$; b) the setting where the support is known a priori, and $m = \mathcal{O}(k)$; and c) the setting of sparse recovery with prior information, and $m$ depends on how well the weights are aligned with the support set $\mathcal{S}$. We further extend the results in case c) to the setting of additive noise. Our results are nonuniform; that is, they apply for a fixed support, unknown a priori, and the weights on $\mathcal{S}$ do not all have to be smaller than the weights on $\mathcal{S}^c$ for our recovery results to hold.

preprint, 2015, arXiv

A unified framework for linear dimensionality reduction in L1

For a family of interpolation norms $\| \cdot \|_{1,2,s}$ on $\mathbb{R}^n$, we provide a distribution over random matrices $Φ_s \in \mathbb{R}^{m \times n}$ parametrized by sparsity level $s$ such that for a fixed set $X$ of $K$ points in $\mathbb{R}^n$, if $m \geq C s \log(K)$ then with high probability, $\frac{1}{2} \| x \|_{1,2,s} \leq \| Φ_s (x) \|_1 \leq 2 \| x\|_{1,2,s}$ for all $x\in X$. Several existing results in the literature reduce to special cases of this result at different values of $s$: for $s=n$, $\| x\|_{1,2,n} \equiv \| x \|_{1}$ and we recover that dimension reducing linear maps can preserve the $\ell_1$-norm up to a distortion proportional to the dimension reduction factor, which is known to be the best possible such result. For $s=1$, $\|x \|_{1,2,1} \equiv \| x \|_{2}$, and we recover an $\ell_2 / \ell_1$ variant of the Johnson-Lindenstrauss Lemma for Gaussian random matrices. Finally, if $x$ is $s$-sparse, then $\| x \|_{1,2,s} = \| x \|_1$ and we recover that $s$-sparse vectors in $\ell_1^n$ embed into $\ell_1^{\mathcal{O}(s \log(n))}$ via sparse random matrix constructions.

preprint, 2015, arXiv

An arithmetic-geometric mean inequality for products of three matrices

Consider the following noncommutative arithmetic-geometric mean inequality: given positive-semidefinite matrices $\mathbf{A}_1, \dots, \mathbf{A}_n$, the following holds for each integer $m \leq n$: $$ \frac{1}{n^m}\sum_{j_1, j_2, \dots, j_m = 1}^{n} ||| \mathbf{A}_{j_1} \mathbf{A}_{j_2} \dots \mathbf{A}_{j_m} ||| \geq \frac{(n-m)!}{n!} \sum_{\substack{j_1, j_2, \dots, j_m = 1 \\ \text{all distinct}}}^{n} ||| \mathbf{A}_{j_1} \mathbf{A}_{j_2} \dots \mathbf{A}_{j_m} |||,$$ where $||| \cdot |||$ denotes a unitarily invariant norm, including the operator norm and Schatten p-norms as special cases. While this inequality in full generality remains a conjecture, we prove that the inequality holds for products of up to three matrices, $m \leq 3$. The proofs for $m = 1,2$ are straightforward; to derive the proof for $m=3$, we appeal to a variant of the classic Araki-Lieb-Thirring inequality for permutations of matrix products.

preprint, 2015, arXiv

Compressive Sensing with Redundant Dictionaries and Structured Measurements

Consider the problem of recovering an unknown signal from undersampled measurements, given the knowledge that the signal has a sparse representation in a specified dictionary $D$. This problem is now understood to be well-posed and efficiently solvable under suitable assumptions on the measurements and dictionary, if the number of measurements scales roughly with the sparsity level. One sufficient condition for such is the $D$-restricted isometry property ($D$-RIP), which asks that the sampling matrix approximately preserve the norm of all signals which are sufficiently sparse in $D$. While many classes of random matrices are known to satisfy such conditions, such matrices are not representative of the structural constraints imposed by practical sensing systems. We close this gap in the theory by demonstrating that one can subsample a fixed orthogonal matrix in such a way that the $D$-RIP will hold, provided this basis is sufficiently incoherent with the sparsifying dictionary $D$. We also extend this analysis to allow for weighted sparse expansions. Consequently, we arrive at compressive sensing recovery guarantees for structured measurements and redundant dictionaries, opening the door to a wide array of practical applications.

preprint, 2015, arXiv

Interpolation via weighted $l_1$ minimization

Functions of interest are often smooth and sparse in some sense, and both priors should be taken into account when interpolating sampled data. Classical linear interpolation methods are effective under strong regularity assumptions, but cannot incorporate nonlinear sparsity structure. At the same time, nonlinear methods such as $l_1$ minimization can reconstruct sparse functions from very few samples, but do not necessarily encourage smoothness. Here we show that weighted $l_1$ minimization effectively merges the two approaches, promoting both sparsity and smoothness in reconstruction. More precisely, we provide specific choices of weights in the $l_1$ objective to achieve rates for functions with coefficient sequences in weighted $l_p$ spaces, $p \leq 1$. We consider the implications of these results for spherical harmonic and polynomial interpolation, in the univariate and multivariate setting. Along the way, we extend concepts from compressive sensing such as the restricted isometry property and null space property to accommodate weighted sparse expansions; these developments should be of independent interest in the study of structured sparse approximations and continuous-time compressive sensing problems.

preprint, 2015, arXiv

Relax, no need to round: integrality of clustering formulations

We study exact recovery conditions for convex relaxations of point cloud clustering problems, focusing on two of the most common optimization problems for unsupervised clustering: $k$-means and $k$-median clustering. Motivations for focusing on convex relaxations are: (a) they come with a certificate of optimality, and (b) they are generic tools which are relatively parameter-free, not tailored to specific assumptions over the input. More precisely, we consider the distributional setting where there are $k$ clusters in $\mathbb{R}^m$ and data from each cluster consists of $n$ points sampled from a symmetric distribution within a ball of unit radius. We ask: what is the minimal separation distance between cluster centers needed for convex relaxations to exactly recover these $k$ clusters as the optimal integral solution? For the $k$-median linear programming relaxation we show a tight bound: exact recovery is obtained given arbitrarily small pairwise separation $ε> 0$ between the balls. In other words, the pairwise center separation is $Δ> 2+ε$. Under the same distributional model, the $k$-means LP relaxation fails to recover such clusters at separation as large as $Δ= 4$. Yet, if we enforce PSD constraints on the $k$-means LP, we get exact cluster recovery at center separation $Δ> 2\sqrt2(1+\sqrt{1/m})$. In contrast, common heuristics such as Lloyd's algorithm (a.k.a. the $k$-means algorithm) can fail to recover clusters in this setting; even with arbitrarily large cluster separation, k-means++ with overseeding by any constant factor fails with high probability at exact cluster recovery. To complement the theoretical analysis, we provide an experimental study of the recovery guarantees for these various methods, and discuss several open problems which these experiments suggest.

preprint, 2015, arXiv

Stochastic Gradient Descent, Weighted Sampling, and the Randomized Kaczmarz algorithm

We obtain an improved finite-sample guarantee on the linear convergence of stochastic gradient descent for smooth and strongly convex objectives, improving from a quadratic dependence on the conditioning $(L/μ)^2$ (where $L$ is a bound on the smoothness and $μ$ on the strong convexity) to a linear dependence on $L/μ$. Furthermore, we show how reweighting the sampling distribution (i.e. importance sampling) is necessary in order to further improve convergence, and obtain a linear dependence in the average smoothness, dominating previous results. We also discuss importance sampling for SGD more broadly and show how it can improve convergence also in other scenarios. Our results are based on a connection we make between SGD and the randomized Kaczmarz algorithm, which allows us to transfer ideas between the separate bodies of literature studying each of the two methods. In particular, we recast the randomized Kaczmarz algorithm as an instance of SGD, and apply our results to prove its exponential convergence, but to the solution of a weighted least squares problem rather than the original least squares problem. We then present a modified Kaczmarz algorithm with partially biased sampling which does converge to the original least squares solution with the same exponential convergence rate.
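The connection to weighted sampling is visible in the standard randomized Kaczmarz iteration, where row $i$ is drawn with probability proportional to $\|a_i\|_2^2$ and the iterate is projected onto the hyperplane $\langle a_i, x \rangle = b_i$. The sketch below runs that iteration on a small consistent system; the problem sizes, seeds and function name are assumptions for the example, and the partially biased variant mentioned in the abstract is not shown.

```python
import random


def randomized_kaczmarz(A, b, iters, seed=0):
    """Randomized Kaczmarz with rows sampled proportionally to ||a_i||^2.

    Each step projects the iterate onto the hyperplane <a_i, x> = b_i.
    """
    rng = random.Random(seed)
    row_sq = [sum(a * a for a in row) for row in A]          # squared row norms
    indices = range(len(A))
    x = [0.0] * len(A[0])
    for _ in range(iters):
        i = rng.choices(indices, weights=row_sq)[0]
        residual = b[i] - sum(a * xj for a, xj in zip(A[i], x))
        scale = residual / row_sq[i]
        x = [xj + scale * a for xj, a in zip(x, A[i])]
    return x


# Small consistent system (an assumption for the example): b = A @ x_true.
_gen = random.Random(1)
A = [[_gen.gauss(0.0, 1.0) for _ in range(3)] for _ in range(20)]
x_true = [1.0, -2.0, 0.5]
b = [sum(a * xj for a, xj in zip(row, x_true)) for row in A]
x_hat = randomized_kaczmarz(A, b, iters=500)
```

Each projection is a stochastic gradient step on the least squares term for row $i$ with step size $1 / \|a_i\|_2^2$, which is the sense in which the method can be read as an instance of SGD with importance sampling.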

preprint, 2014, arXiv

Completing Any Low-rank Matrix, Provably

Matrix completion, i.e., the exact and provable recovery of a low-rank matrix from a small subset of its elements, is currently only known to be possible if the matrix satisfies a restrictive structural constraint---known as incoherence---on its row and column spaces. In these cases, the subset of elements is sampled uniformly at random. In this paper, we show that any rank-$r$ $n$-by-$n$ matrix can be exactly recovered from as few as $O(nr \log^2 n)$ randomly chosen elements, provided this random choice is made according to a specific biased distribution: the probability of any element being sampled should be proportional to the sum of the leverage scores of the corresponding row and column. Perhaps equally important, we show that this specific form of sampling is nearly necessary, in a natural precise sense; this implies that other perhaps more intuitive sampling schemes fail. We further establish three ways to use the above result for the setting when leverage scores are not known a priori: (a) a sampling strategy for the case when only one of the row or column spaces is incoherent, (b) a two-phase sampling procedure for general matrices that first samples to estimate leverage scores followed by sampling for exact recovery, and (c) an analysis showing the advantages of weighted nuclear/trace-norm minimization over the vanilla un-weighted formulation for the case of non-uniform sampling.

preprint, 2014, arXiv

Recovery guarantees for exemplar-based clustering

For a certain class of distributions, we prove that the linear programming relaxation of $k$-medoids clustering---a variant of $k$-means clustering where means are replaced by exemplars from within the dataset---distinguishes points drawn from nonoverlapping balls with high probability once the number of points drawn and the separation distance between any two balls are sufficiently large. Our results hold in the nontrivial regime where the separation distance is small enough that points drawn from different balls may be closer to each other than points drawn from the same ball; in this case, clustering by thresholding pairwise distances between points can fail. We also exhibit numerical evidence of high-probability recovery in a substantially more permissive regime.

preprint, 2013, arXiv

Near-optimal compressed sensing guarantees for total variation minimization

Consider the problem of reconstructing a multidimensional signal from an underdetermined set of measurements, as in the setting of compressed sensing. Without any additional assumptions, this problem is ill-posed. However, for signals such as natural images or movies, the minimal total variation estimate consistent with the measurements often produces a good approximation to the underlying signal, even if the number of measurements is far smaller than the ambient dimensionality. This paper extends recent reconstruction guarantees for two-dimensional images to signals of arbitrary dimension $d > 1$ and to isotropic total variation problems. To be precise, we show that a multidimensional signal $x$ can be reconstructed from $O(sd \log(N^d))$ linear measurements using total variation minimization to within a factor of the best $s$-term approximation of its gradient. The reconstruction guarantees we provide are necessarily optimal up to polynomial factors in the spatial dimension $d$.

preprint, 2013, arXiv

Significance testing without truth

A popular approach to significance testing proposes to decide whether the given hypothesized statistical model is likely to be true (or false). Statistical decision theory provides a basis for this approach by requiring every significance test to make a decision about the truth of the hypothesis/model under consideration. Unfortunately, many interesting and useful models are obviously false (that is, not exactly true) even before considering any data. Fortunately, in practice a significance test need only gauge the consistency (or inconsistency) of the observed data with the assumed hypothesis/model -- without enquiring as to whether the assumption is likely to be true (or false), or whether some alternative is likely to be true (or false). In this practical formulation, a significance test rejects a hypothesis/model only if the observed data is highly improbable when calculating the probability while assuming the hypothesis being tested; the significance test only gauges whether the observed data likely invalidates the assumed hypothesis, and cannot decide that the assumption -- however unmistakably false -- is likely to be false a priori, without any data.

preprint2013arXiv

Stable and robust sampling strategies for compressive imaging

In many signal processing applications, one wishes to acquire images that are sparse in transform domains such as spatial finite differences or wavelets using frequency domain samples. For such applications, overwhelming empirical evidence suggests that superior image reconstruction can be obtained through variable density sampling strategies that concentrate on lower frequencies. The wavelet and Fourier transform domains are not incoherent because low-order wavelets and low-order frequencies are correlated, so compressive sensing theory does not immediately imply sampling strategies and reconstruction guarantees. In this paper we turn to a more refined notion of coherence -- the so-called local coherence -- measuring for each sensing vector separately how correlated it is to the sparsity basis. For Fourier measurements and Haar wavelet sparsity, the local coherence can be controlled and bounded explicitly, so for matrices composed of frequencies sampled from a suitable inverse square power-law density, we can prove the restricted isometry property with near-optimal embedding dimensions. Consequently, the variable-density sampling strategy we provide allows for image reconstructions that are stable to sparsity defects and robust to measurement noise. Our results cover both reconstruction by $\ell_1$-minimization and by total variation minimization. The local coherence framework developed in this paper should be of independent interest in sparse recovery problems more generally, as it implies that for optimal sparse recovery results, it suffices to have bounded average coherence from sensing basis to sparsity basis -- as opposed to bounded maximal coherence -- as long as the sampling strategy is adapted accordingly.

preprint2013arXiv

Stable image reconstruction using total variation minimization

This article presents near-optimal guarantees for accurate and robust image recovery from under-sampled noisy measurements using total variation minimization. In particular, we show that from O(s log(N)) nonadaptive linear measurements, an image can be reconstructed to within the best s-term approximation of its gradient up to a logarithmic factor, and this factor can be removed by taking slightly more measurements. Along the way, we prove a strengthened Sobolev inequality for functions lying in the null space of suitably incoherent matrices.

preprint2013arXiv

Testing goodness-of-fit for logistic regression

Explicitly accounting for all applicable independent variables, even when the model being tested does not, is critical in testing goodness-of-fit for logistic regression. This can increase statistical power by orders of magnitude.

preprint2013arXiv

Testing Hardy-Weinberg equilibrium with a simple root-mean-square statistic

We provide evidence that a root-mean-square test of goodness-of-fit can be significantly more powerful than state-of-the-art exact tests in detecting deviations from Hardy-Weinberg equilibrium. Unlike Pearson's chi-square test, the log-likelihood-ratio test, and Fisher's exact test, which are sensitive to relative discrepancies between genotypic frequencies, the root-mean-square test is sensitive to absolute discrepancies. This can increase statistical power, as we demonstrate using benchmark datasets and through asymptotic analysis. With the aid of computers, exact P-values for the root-mean-square statistic can be calculated effortlessly, and the test can be easily implemented using the author's freely available code.

preprint2012arXiv

A comparison of the discrete Kolmogorov-Smirnov statistic and the Euclidean distance

Goodness-of-fit tests gauge whether a given set of observations is consistent (up to expected random fluctuations) with arising as independent and identically distributed (i.i.d.) draws from a user-specified probability distribution known as the "model." The standard gauges involve the discrepancy between the model and the empirical distribution of the observed draws. Some measures of discrepancy are cumulative; others are not. The most popular cumulative measure is the Kolmogorov-Smirnov statistic; when all probability distributions under consideration are discrete, a natural noncumulative measure is the Euclidean distance between the model and the empirical distributions. In the present paper, both mathematical analysis and its illustration via various data sets indicate that the Kolmogorov-Smirnov statistic tends to be more powerful than the Euclidean distance when there is a natural ordering for the values that the draws can take -- that is, when the data is ordinal -- whereas the Euclidean distance is more reliable and more easily understood than the Kolmogorov-Smirnov statistic when there is no natural ordering (or partial order) -- that is, when the data is nominal.
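
The contrast between a cumulative and a noncumulative discrepancy can be made concrete with a small sketch. The function names `discrete_ks` and `euclidean_distance` are hypothetical, and the inputs are assumed to be probability vectors over the same ordered list of bins; this is an illustration of the two definitions, not the paper's code.

```python
import math


def euclidean_distance(model, empirical):
    """Noncumulative discrepancy: l2 distance between two probability vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(model, empirical)))


def discrete_ks(model, empirical):
    """Cumulative discrepancy: largest gap between the two cumulative sums.

    The result depends on the order in which the bins are listed.
    """
    cum_model = 0.0
    cum_empirical = 0.0
    largest_gap = 0.0
    for p, q in zip(model, empirical):
        cum_model += p
        cum_empirical += q
        largest_gap = max(largest_gap, abs(cum_model - cum_empirical))
    return largest_gap
```

Reordering the bins leaves the Euclidean distance unchanged but can change the Kolmogorov-Smirnov value, which is one way to see why the cumulative statistic is natural for ordinal data and less meaningful for nominal data. For instance, against a uniform model on four bins, the empirical vectors `[0.35, 0.15, 0.35, 0.15]` and `[0.35, 0.35, 0.15, 0.15]` have the same Euclidean distance, while the cumulative gap is larger when the excess mass is grouped in adjacent bins.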

preprint2012arXiv

A symbol-based algorithm for decoding bar codes

We investigate the problem of decoding a bar code from a signal measured with a hand-held laser-based scanner. Rather than formulating the inverse problem as one of binary image reconstruction, we instead incorporate the symbology of the bar code into the reconstruction algorithm directly, and search for a sparse representation of the UPC bar code with respect to this known dictionary. Our approach significantly reduces the degrees of freedom in the problem, allowing for accurate reconstruction that is robust to noise and unknown parameters in the scanning device. We propose a greedy reconstruction algorithm and provide robust reconstruction guarantees. Numerical examples illustrate the insensitivity of our symbology-based reconstruction to both imprecise model parameters and noise on the scanned measurements.

preprint2012arXiv

Stability for second-order chaotic sigma delta quantization

We prove that second-order (double-loop) chaotic sigma-delta schemes are stable: within a certain parameter range, all state variables of the system are guaranteed to remain uniformly bounded. To our knowledge this is the first general stability result for chaotic sigma-delta schemes of order greater than one. Invariably as the amount of expansion added to the system is increased, the dynamic range of the input must get smaller for stability to be guaranteed. We give explicit bounds on this trade-off and verify through numerical simulation that these bounds are near-optimal.

preprint2012arXiv

Two-subspace Projection Method for Coherent Overdetermined Systems

We present a Projection onto Convex Sets (POCS) type algorithm for solving systems of linear equations. POCS methods have found many applications ranging from computer tomography to digital signal and image processing. The Kaczmarz method is one of the most popular solvers for overdetermined systems of linear equations due to its speed and simplicity. Here we introduce and analyze an extension of the Kaczmarz method that iteratively projects the estimate onto a solution space given by two randomly selected rows. We show that this projection algorithm provides exponential convergence to the solution in expectation. The convergence rate improves upon that of the standard randomized Kaczmarz method when the system has correlated rows. Experimental results confirm that in this case our method significantly outperforms the randomized Kaczmarz method.
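
The idea of projecting onto the solution space of two rows at once can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' algorithm as published: the names `two_row_projection` and `two_subspace_kaczmarz` are hypothetical, rows are assumed nonzero, the pair of rows is drawn uniformly at random, and nearly parallel pairs fall back to a single-row Kaczmarz step.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def two_row_projection(x, a1, b1, a2, b2):
    """Orthogonally project x onto {z : <a1, z> = b1 and <a2, z> = b2}."""
    r1 = b1 - dot(a1, x)
    r2 = b2 - dot(a2, x)
    g11, g12, g22 = dot(a1, a1), dot(a1, a2), dot(a2, a2)
    det = g11 * g22 - g12 * g12  # determinant of the 2x2 Gram matrix
    if det <= 1e-12 * g11 * g22:
        # Rows are (nearly) parallel: use an ordinary single-row step.
        c1, c2 = r1 / g11, 0.0
    else:
        # Solve the 2x2 system G c = r in closed form.
        c1 = (g22 * r1 - g12 * r2) / det
        c2 = (g11 * r2 - g12 * r1) / det
    return [xi + c1 * u + c2 * v for xi, u, v in zip(x, a1, a2)]


def two_subspace_kaczmarz(A, b, iterations=500, seed=0):
    """Iterate two-row projections from x = 0 (assumed uniform row sampling)."""
    rng = random.Random(seed)
    x = [0.0] * len(A[0])
    for _ in range(iterations):
        i, j = rng.sample(range(len(A)), 2)  # two distinct row indices
        x = two_row_projection(x, A[i], b[i], A[j], b[j])
    return x
```

Each step removes the component of the error lying in the span of the two selected rows, which is where the gain over single-row projections comes from when rows are highly correlated. The sampling distribution and any row normalization used in the paper may differ from the uniform choice assumed here.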

preprint2012arXiv

Two-subspace Projection Method for Coherent Overdetermined Systems (Technical Report)

In this technical report we present a Projection onto Convex Sets (POCS) type algorithm for solving systems of linear equations. POCS methods have found many applications ranging from computer tomography to digital signal and image processing. The Kaczmarz method is one of the most popular solvers for overdetermined systems of linear equations due to its speed and simplicity. Here we introduce and analyze an extension of the Kaczmarz method that iteratively projects the estimate onto a solution space given by two randomly selected rows. We show that this projection algorithm provides exponential convergence to the solution in expectation. The convergence rate significantly improves upon that of the standard randomized Kaczmarz method when the system has coherent rows. We also show that the method is robust to noise, and converges exponentially in expectation to the noise floor. Experimental results confirm that in the coherent case our method significantly outperforms the randomized Kaczmarz method.

preprint2011arXiv

Chi-square and classical exact tests often wildly misreport significance; the remedy lies in computers

If a discrete probability distribution in a model being tested for goodness-of-fit is not close to uniform, then forming the Pearson chi-square statistic can involve division by nearly zero. This often leads to serious trouble in practice -- even in the absence of round-off errors -- as the present article illustrates via numerous examples. Fortunately, with the now widespread availability of computers, avoiding all the trouble is simple and easy: without the problematic division by nearly zero, the actual values taken by goodness-of-fit statistics are not humanly interpretable, but black-box computer programs can rapidly calculate their precise significance.

preprint2011arXiv

Computing the confidence levels for a root-mean-square test of goodness-of-fit

The classic chi-squared statistic for testing goodness-of-fit has long been a cornerstone of modern statistical practice. The statistic consists of a sum in which each summand involves division by the probability associated with the corresponding bin in the distribution being tested for goodness-of-fit. Typically this division should precipitate rebinning to uniformize the probabilities associated with the bins, in order to make the test reasonably powerful. With the now widespread availability of computers, there is no longer any need for this. The present paper provides efficient black-box algorithms for calculating the asymptotic confidence levels of a variant on the classic chi-squared test which omits the problematic division. In many circumstances, it is also feasible to compute the exact confidence levels via Monte Carlo simulation.
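
The Monte Carlo route mentioned at the end of the abstract can be sketched in a few lines. This is only an illustration for a fully specified discrete model: the names `rms_statistic` and `monte_carlo_p_value` are hypothetical, the normalization of the statistic is one possible choice, and the paper's own black-box algorithms target asymptotic confidence levels rather than this simulation.

```python
import math
import random


def rms_statistic(counts, model):
    """Root-mean-square gap between empirical and model bin probabilities.

    Unlike the chi-squared statistic, no summand is divided by a model
    probability.
    """
    n = sum(counts)
    squared = sum((c / n - p) ** 2 for c, p in zip(counts, model))
    return math.sqrt(squared / len(model))


def monte_carlo_p_value(counts, model, trials=2000, seed=0):
    """Estimate the P-value by simulating i.i.d. draws from the model."""
    rng = random.Random(seed)
    n = sum(counts)
    observed = rms_statistic(counts, model)
    bins = range(len(model))
    at_least_as_extreme = 0
    for _ in range(trials):
        simulated = [0] * len(model)
        for bin_index in rng.choices(bins, weights=model, k=n):
            simulated[bin_index] += 1
        if rms_statistic(simulated, model) >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / trials
```

A call such as `monte_carlo_p_value([48, 35, 17], [0.5, 0.3, 0.2])` returns the fraction of simulated data sets whose statistic is at least as large as the observed one. The accuracy of the estimate is governed by `trials`, so small P-values need correspondingly many simulations.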

preprint2011arXiv

Computing the confidence levels for a root-mean-square test of goodness-of-fit, II

This paper extends our earlier article, "Computing the confidence levels for a root-mean-square test of goodness-of-fit;" unlike in the earlier article, the models in the present paper involve parameter estimation -- both the null and alternative hypotheses in the associated tests are composite. We provide efficient black-box algorithms for calculating the asymptotic confidence levels of a variant on the classic chi-squared test. In some circumstances, it is also feasible to compute the exact confidence levels via Monte Carlo simulation.

preprint2011arXiv

Low-rank matrix recovery via iteratively reweighted least squares minimization

We present and analyze an efficient implementation of an iteratively reweighted least squares algorithm for recovering a matrix from a small number of linear measurements. The algorithm is designed for the simultaneous promotion of both a minimal nuclear norm and an approximately low-rank solution. Under the assumption that the linear measurements fulfill a suitable generalization of the Null Space Property known in the context of compressed sensing, the algorithm is guaranteed to recover iteratively any matrix with an error of the order of the best rank-k approximation. In certain relevant cases, for instance for the matrix completion problem, our version of this algorithm can take advantage of the Woodbury matrix identity, which makes it possible to expedite the solution of the least squares problems required at each iteration. We present numerical experiments that confirm the robustness of the algorithm for the solution of matrix completion problems, and demonstrate its competitiveness with respect to other techniques proposed recently in the literature.

preprint2011arXiv

Sparse Legendre expansions via $\ell_1$ minimization

We consider the problem of recovering polynomials that are sparse with respect to the basis of Legendre polynomials from a small number of random samples. In particular, we show that a Legendre s-sparse polynomial of maximal degree N can be recovered from m = O(s log^4(N)) random samples that are chosen independently according to the Chebyshev probability measure. As an efficient recovery method, $\ell_1$-minimization can be used. We establish these results by verifying the restricted isometry property of a preconditioned random Legendre matrix. We then extend these results to a large class of orthogonal polynomial systems, including the Jacobi polynomials, of which the Legendre polynomials are a special case. Finally, we transpose these results into the setting of approximate recovery for functions in certain infinite-dimensional function spaces.

preprint2011arXiv

Sparse recovery for spherical harmonic expansions

We show that sparse spherical harmonic expansions can be efficiently recovered from a small number of randomly chosen samples on the sphere. To establish the main result, we verify the restricted isometry property of an associated preconditioned random measurement matrix using recent estimates on the uniform growth of Jacobi polynomials.

preprint2011arXiv

Weighted eigenfunction estimates with applications to compressed sensing

Using tools from semiclassical analysis, we give weighted L^\infty estimates for eigenfunctions of strictly convex surfaces of revolution. These estimates give rise to new sampling techniques and provide improved bounds on the number of samples necessary for recovering sparse eigenfunction expansions on surfaces of revolution. On the sphere, our estimates imply that any function having an s-sparse expansion in the first N spherical harmonics can be efficiently recovered from its values at m > s N^(1/6) log^4(N) sampling points.

preprint2010arXiv

Freedom through Imperfection: Exploiting the flexibility offered by redundancy in signal processing

This thesis consists of four chapters. The first two chapters pertain to the design of stable quantization methods for analog to digital conversion, while the third and fourth chapters concern problems related to compressive sensing.

preprint2010arXiv

Lower bounds for the error decay incurred by coarse quantization schemes

Several analog-to-digital conversion methods for bandlimited signals used in applications, such as Sigma Delta quantization schemes, employ coarse quantization coupled with oversampling. The standard mathematical model for the error accrued from such methods measures the performance of a given scheme by the rate at which the associated reconstruction error decays as a function of the oversampling ratio L > 1. It was recently shown that exponential accuracy of the form O(2^(-r L)) can be achieved by appropriate one-bit Sigma Delta modulation schemes. However, the best known achievable rate constants r in this setting differ significantly from the general information theoretic lower bound. In this paper, we provide the first lower bound specific to coarse quantization, thus narrowing the gap between existing upper and lower bounds. In particular, our results imply a quantitative correspondence between the maximal signal amplitude and the best possible error decay rate. Our method draws from the theory of large deviations.

preprint2010arXiv

On the complexity of Mumford-Shah type regularization, viewed as a relaxed sparsity constraint

We show that inverse problems with a truncated quadratic regularization are NP-hard in general to solve, or even approximate up to an additive error. This stands in contrast to the case corresponding to a finite-dimensional approximation to the Mumford-Shah functional, where the operator involved is the identity and for which polynomial-time solutions are known. Consequently, we confirm the infeasibility of any natural extension of the Mumford-Shah functional to general inverse problems. A connection between truncated quadratic minimization and sparsity-constrained minimization is also discussed.

preprint2010arXiv

Quiet sigma delta quantization, and global convergence for a class of asymmetric piecewise affine maps

In this paper, we introduce a family of second-order sigma delta quantization schemes for analog-to-digital conversion which are 'quiet': quantization output is guaranteed to fall to zero at the onset of vanishing input. In the process, we prove that the origin is a globally attractive fixed point for the related family of asymmetrically-damped piecewise affine maps. Our proof of convergence is twofold: first, we construct a trapping set using a Lyapunov-type argument; we then take advantage of the asymmetric structure of the maps under consideration to prove convergence to the origin from within this trapping set.

47 published item(s)

Bootstrapping the error of Oja's algorithm

How catastrophic can catastrophic forgetting be in linear regression?

Implicit Regularization and Convergence for Weight Normalization

Sample Efficiency of Data Augmentation Consistency Regularization

The Power of Adaptivity in SGD: Self-Tuning Step Sizes with Unbounded Gradients and Affine Variance

Arbitrary-length analogs to de Bruijn sequences

Generalization Bounds for Sparse Random Feature Expansions

Streaming k-PCA: Efficient guarantees for Oja's algorithm, beyond rank-one updates

Linear Convergence of Adaptive Stochastic Gradient Descent

Matrix Concentration for Products

A polynomial-time relaxation of the Gromov-Hausdorff distance

Clustering subgaussian mixtures by semidefinite programming

Exact Recovery of Chaotic Systems from Highly Corrupted Data

Fast Cross-Polytope Locality-Sensitive Hashing

One-bit compressive sensing with norm estimation

The local convexity of solving systems of quadratic equations

The sample complexity of weighted sparse approximation

A unified framework for linear dimensionality reduction in L1

An arithmetic-geometric mean inequality for products of three matrices

Compressive Sensing with Redundant Dictionaries and Structured Measurements

Interpolation via weighted $l_1$ minimization

Relax, no need to round: integrality of clustering formulations

Stochastic Gradient Descent, Weighted Sampling, and the Randomized Kaczmarz algorithm

Completing Any Low-rank Matrix, Provably

Recovery guarantees for exemplar-based clustering

Near-optimal compressed sensing guarantees for total variation minimization

Significance testing without truth

Stable and robust sampling strategies for compressive imaging

Stable image reconstruction using total variation minimization

Testing goodness-of-fit for logistic regression

Testing Hardy-Weinberg equilibrium with a simple root-mean-square statistic

A comparison of the discrete Kolmogorov-Smirnov statistic and the Euclidean distance

A symbol-based algorithm for decoding bar codes

Stability for second-order chaotic sigma delta quantization

Two-subspace Projection Method for Coherent Overdetermined Systems

Two-subspace Projection Method for Coherent Overdetermined Systems (Technical Report)

Chi-square and classical exact tests often wildly misreport significance; the remedy lies in computers

Computing the confidence levels for a root-mean-square test of goodness-of-fit

Computing the confidence levels for a root-mean-square test of goodness-of-fit, II

Low-rank matrix recovery via iteratively reweighted least squares minimization

Sparse Legendre expansions via $\ell_1$ minimization

Sparse recovery for spherical harmonic expansions

Weighted eigenfunction estimates with applications to compressed sensing

Freedom through Imperfection: Exploiting the flexibility offered by redundancy in signal processing

Lower bounds for the error decay incurred by coarse quantization schemes

On the complexity of Mumford-Shah type regularization, viewed as a relaxed sparsity constraint

Quiet sigma delta quantization, and global convergence for a class of asymmetric piecewise affine maps