Source author record

Zeyuan Allen-Zhu

Zeyuan Allen-Zhu appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Data Structures and Algorithms; math.OC; Machine Learning; math.NA; Neural and Evolutionary Computing; Distributed, Parallel, and Cluster Computing; math.PR; Numerical Analysis; Computational Geometry; Discrete Mathematics; Information Theory; math.IT; math.SP; math.ST; Networking and Internet Architecture; Neurons and Cognition; Statistics Theory

Catalog footprint

What is connected

16 works

17 topics

4 close collaborators

preprint · 2022 · arXiv

Feature Purification: How Adversarial Training Performs Robust Deep Learning

Despite the empirical success of using Adversarial Training to defend deep learning models against adversarial perturbations, so far, it still remains rather unclear what the principles are behind the existence of adversarial perturbations, and what adversarial training does to the neural network to remove them. In this paper, we present a principle that we call Feature Purification, where we show one of the causes of the existence of adversarial examples is the accumulation of certain small dense mixtures in the hidden weights during the training process of a neural network; and more importantly, one of the goals of adversarial training is to remove such mixtures to purify hidden weights. We present both experiments on the CIFAR-10 dataset to illustrate this principle, and a theoretical result proving that for certain natural classification tasks, training a two-layer neural network with ReLU activation using randomly initialized gradient descent indeed satisfies this principle. Technically, we give, to the best of our knowledge, the first result proving that the following two can hold simultaneously for training a neural network with ReLU activation. (1) Training over the original data is indeed non-robust to small adversarial perturbations of some radius. (2) Adversarial training, even with an empirical perturbation algorithm such as FGM, can in fact be provably robust against ANY perturbations of the same radius. Finally, we also prove a complexity lower bound, showing that low complexity models such as linear classifiers, low-degree polynomials, or even the neural tangent kernel for this network, CANNOT defend against perturbations of this same radius, no matter what algorithms are used to train them.
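The abstract mentions FGM as an empirical perturbation algorithm. As a rough illustration only, the sketch below applies one fast-gradient-method step to a linear logistic model: the input moves a fixed L2 distance along the normalized gradient of the loss. The linear model, the function names, and the example numbers are assumptions made for illustration; the paper itself analyzes two-layer ReLU networks.

```python
import math

def logistic_loss(w, x, y):
    """Logistic loss log(1 + exp(-y * <w, x>)) for a label y in {-1, +1}."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    return math.log1p(math.exp(-margin))

def input_gradient(w, x, y):
    """Gradient of the logistic loss with respect to the input x."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    coeff = -y / (1.0 + math.exp(margin))  # d(loss)/d(<w, x>)
    return [coeff * wi for wi in w]

def fgm_perturb(w, x, y, radius):
    """One FGM step: move x by `radius` (in L2 norm) along the loss gradient."""
    g = input_gradient(w, x, y)
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:
        return list(x)
    return [xi + radius * gi / norm for xi, gi in zip(x, g)]

# Hypothetical weights, input, label and radius.
w = [2.0, -1.0]
x = [1.0, 0.5]
y = 1
radius = 0.5
x_adv = fgm_perturb(w, x, y, radius)
```

For a linear model the perturbation simply pushes the input against the weight vector, which lowers the margin and raises the loss.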

preprint · 2020 · arXiv

Learning and Generalization in Overparameterized Neural Networks, Going Beyond Two Layers

The fundamental learning theory behind neural networks remains largely open. What classes of functions can neural networks actually learn? Why doesn't the trained network overfit when it is overparameterized? In this work, we prove that overparameterized neural networks can learn some notable concept classes, including two and three-layer networks with fewer parameters and smooth activations. Moreover, the learning can be simply done by SGD (stochastic gradient descent) or its variants in polynomial time using polynomially many samples. The sample complexity can also be almost independent of the number of parameters in the network. On the technique side, our analysis goes beyond the so-called NTK (neural tangent kernel) linearization of neural networks in prior works. We establish a new notion of quadratic approximation of the neural network (that can be viewed as a second-order variant of NTK), and connect it to the SGD theory of escaping saddle points.

preprint · 2020 · arXiv

What Can ResNet Learn Efficiently, Going Beyond Kernels?

How can neural networks such as ResNet efficiently learn CIFAR-10 with test accuracy above 96%, while other methods, especially kernel methods, fall relatively behind? Can we provide more theoretical justification for this gap? Recently, there has been an influential line of work relating neural networks to kernels in the over-parameterized regime, proving they can learn certain concept classes that are also learnable by kernels with similar test error. Yet, can neural networks provably learn some concept class BETTER than kernels? We answer this positively in the distribution-free setting. We prove neural networks can efficiently learn a notable class of functions, including those defined by three-layer residual networks with smooth activations, without any distributional assumption. At the same time, we prove there are simple functions in this class such that with the same number of training examples, the test error obtained by neural networks can be MUCH SMALLER than ANY kernel method, including neural tangent kernels (NTK). The main intuition is that multi-layer neural networks can implicitly perform hierarchical learning using different layers, which reduces the sample complexity compared to "one-shot" learning algorithms such as kernel methods. In a follow-up work [2], this theory of hierarchical learning is further strengthened to incorporate the "backward feature correction" process when training deep networks. In the end, we also prove a computational complexity advantage of ResNet with respect to other learning methods including linear regression over arbitrary feature mappings.

preprint · 2016 · arXiv

Even Faster Accelerated Coordinate Descent Using Non-Uniform Sampling

Accelerated coordinate descent is widely used in optimization due to its cheap per-iteration cost and scalability to large-scale problems. Up to a primal-dual transformation, it is also the same as accelerated stochastic gradient descent, which is one of the central methods used in machine learning. In this paper, we improve the best known running time of accelerated coordinate descent by a factor up to $\sqrt{n}$. Our improvement is based on a clean, novel non-uniform sampling that selects each coordinate with a probability proportional to the square root of its smoothness parameter. Our proof technique also deviates from the classical estimation sequence technique used in prior work. Our speed-up applies to important problems such as empirical risk minimization and solving linear systems, both in theory and in practice.
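The sampling rule described in the abstract can be sketched in a few lines: coordinate i is drawn with probability proportional to the square root of its smoothness parameter. The function names and the example smoothness values below are hypothetical, and the sketch covers only the sampling distribution, not the accelerated coordinate descent method itself.

```python
import math
import random

def sqrt_smoothness_probabilities(smoothness):
    """Return p_i proportional to sqrt(L_i) for coordinate smoothness values L_i."""
    roots = [math.sqrt(value) for value in smoothness]
    total = sum(roots)
    return [r / total for r in roots]

def sample_coordinate(probabilities, rng):
    """Draw one coordinate index according to the given probabilities."""
    return rng.choices(range(len(probabilities)), weights=probabilities, k=1)[0]

# Hypothetical smoothness parameters for a 4-coordinate problem.
L_values = [1.0, 4.0, 9.0, 16.0]
probs = sqrt_smoothness_probabilities(L_values)

rng = random.Random(0)
index = sample_coordinate(probs, rng)
```

With these values the square roots are 1, 2, 3 and 4, so the last coordinate is sampled four times as often as the first, rather than sixteen times as often under sampling proportional to the smoothness itself.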

preprint · 2016 · arXiv

Exploiting the Structure: Stochastic Gradient Methods Using Raw Clusters

The amount of data available in the world is growing faster than our ability to deal with it. However, if we take advantage of the internal structure, data may become much smaller for machine learning purposes. In this paper we focus on one of the fundamental machine learning tasks, empirical risk minimization (ERM), and provide faster algorithms with the help of the clustering structure of the data. We introduce a simple notion of raw clustering that can be efficiently computed from the data, and propose two algorithms based on clustering information. Our accelerated algorithm ClusterACDM is built on a novel Haar transformation applied to the dual space of the ERM problem, and our variance-reduction based algorithm ClusterSVRG introduces a new gradient estimator using clustering. Our algorithms outperform their classical counterparts ACDM and SVRG respectively.

preprint · 2016 · arXiv

Improved SVRG for Non-Strongly-Convex or Sum-of-Non-Convex Objectives

Many classical algorithms are found, only years later, to outlive the confines in which they were conceived, and continue to be relevant in unforeseen settings. In this paper, we show that SVRG is one such method: although originally designed for strongly convex objectives, it is also very robust in non-strongly convex or sum-of-non-convex settings. More precisely, we provide new analysis to improve the state-of-the-art running times in both settings by applying either SVRG or a novel variant of it. Since non-strongly convex objectives include important examples such as Lasso or logistic regression, and sum-of-non-convex objectives include famous examples such as stochastic PCA and are even believed to be related to training deep neural nets, our results also imply better performance in these applications.

preprint · 2016 · arXiv

Linear Coupling: An Ultimate Unification of Gradient and Mirror Descent

First-order methods play a central role in large-scale machine learning. Even though many variations exist, each suited to a particular problem, almost all such methods fundamentally rely on two types of algorithmic steps: gradient descent, which yields primal progress, and mirror descent, which yields dual progress. We observe that the performances of gradient and mirror descent are complementary, so that faster algorithms can be designed by LINEARLY COUPLING the two. We show how to reconstruct Nesterov's accelerated gradient methods using linear coupling, which gives a cleaner interpretation than Nesterov's original proofs. We also discuss the power of linear coupling by extending it to many other settings that Nesterov's methods cannot apply to.
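A minimal sketch of the coupling idea, assuming the unconstrained Euclidean case where the mirror step reduces to a plain gradient step with a larger step size. Each iteration forms a convex combination of the mirror-descent iterate `z` and the gradient-descent iterate `y`, queries the gradient there, and updates both sequences. The step-size schedule `alpha = (k + 2) / (2L)` with coupling weight `tau = 1 / (alpha * L)` is an assumption chosen to recover an accelerated method; the test objective and all names are hypothetical.

```python
def linear_coupling(grad, x0, L, steps):
    """Couple a gradient step and a (Euclidean) mirror step at every iteration.

    grad: function returning the gradient (list of floats) at a point
    L: smoothness constant of the objective
    """
    y = list(x0)  # gradient-descent sequence (primal progress)
    z = list(x0)  # mirror-descent sequence (dual progress)
    for k in range(steps):
        alpha = (k + 2) / (2.0 * L)  # mirror step size (assumed schedule)
        tau = 1.0 / (alpha * L)      # coupling weight, equals 2 / (k + 2)
        x = [tau * zi + (1.0 - tau) * yi for zi, yi in zip(z, y)]
        g = grad(x)
        y = [xi - gi / L for xi, gi in zip(x, g)]      # gradient step from x
        z = [zi - alpha * gi for zi, gi in zip(z, g)]  # mirror step from z
    return y

# Hypothetical test problem: f(x) = 0.5 * sum(a_i * x_i^2), minimized at 0.
a = [1.0, 10.0, 100.0]

def f(x):
    return 0.5 * sum(ai * xi * xi for ai, xi in zip(a, x))

def grad_f(x):
    return [ai * xi for ai, xi in zip(a, x)]

x0 = [1.0, 1.0, 1.0]
T = 200
y_T = linear_coupling(grad_f, x0, L=max(a), steps=T)
```

The two sequences use the same gradient but different step sizes: `y` takes the conservative `1/L` step, while `z` takes a step that grows with the iteration count, and the weight `tau` shifts trust from `z` to `y` over time.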

preprint · 2016 · arXiv

Optimal Black-Box Reductions Between Optimization Objectives

The diverse world of machine learning applications has given rise to a plethora of algorithms and optimization methods, finely tuned to the specific regression or classification task at hand. We reduce the complexity of algorithm design for machine learning by reductions: we develop reductions that take a method developed for one setting and apply it to the entire spectrum of smoothness and strong-convexity in applications. Furthermore, unlike existing results, our new reductions are OPTIMAL and more PRACTICAL. We show how these new reductions give rise to new and faster running times on training linear classifiers for various families of loss functions, and conclude with experiments showing their successes also in practice.

preprint · 2016 · arXiv

Optimization Algorithms for Faster Computational Geometry

We study two fundamental problems in computational geometry: finding the maximum inscribed ball (MaxIB) inside a bounded polyhedron defined by $m$ hyperplanes, and the minimum enclosing ball (MinEB) of a set of $n$ points, both in $d$-dimensional space. We improve the running time of iterative algorithms on MaxIB from $\tilde{O}(m d \alpha^3 / \varepsilon^3)$ to $\tilde{O}(md + m \sqrt{d} \alpha / \varepsilon)$, a speed-up up to $\tilde{O}(\sqrt{d} \alpha^2 / \varepsilon^2)$, and MinEB from $\tilde{O}(n d / \sqrt{\varepsilon})$ to $\tilde{O}(nd + n \sqrt{d} / \sqrt{\varepsilon})$, a speed-up up to $\tilde{O}(\sqrt{d})$. Our improvements are based on a novel saddle-point optimization framework. We propose a new algorithm $\mathtt{L1L2SPSolver}$ for solving a class of regularized saddle-point problems, and apply a randomized Hadamard space rotation, a technique borrowed from compressive sensing. Interestingly, the motivation for using Hadamard rotation comes solely from our optimization view and not from the original geometry problem: indeed, it is not immediately clear why MaxIB or MinEB, as a geometric problem, should be easier to solve if we rotate the space by a unitary matrix. We hope that our optimization perspective sheds light on solving other geometric problems as well.

preprint · 2016 · arXiv

Using Optimization to Obtain a Width-Independent, Parallel, Simpler, and Faster Positive SDP Solver

We study the design of polylogarithmic depth algorithms for approximately solving packing and covering semidefinite programs (or positive SDPs for short). This is a natural SDP generalization of the well-studied positive LP problem. Although positive LPs can be solved in polylogarithmic depth while using only $\tilde{O}(\log^{2} n/\varepsilon^2)$ parallelizable iterations, the best known positive SDP solvers due to Jain and Yao require $O(\log^{14} n /\varepsilon^{13})$ parallelizable iterations. Several alternative solvers have been proposed to reduce the exponents in the number of iterations. However, the correctness of the convergence analyses in these works has been called into question, as they both rely on algebraic monotonicity properties that do not generalize to matrix algebra. In this paper, we propose a very simple algorithm based on the optimization framework proposed for LP solvers. Our algorithm only needs $\tilde{O}(\log^2 n / \varepsilon^2)$ iterations, matching that of the best LP solver. To surmount the obstacles encountered by previous approaches, our analysis requires a new matrix inequality that extends Lieb-Thirring's inequality, and a sign-consistent, randomized variant of the gradient truncation technique proposed in.

preprint · 2016 · arXiv

Using Optimization to Solve Positive LPs Faster in Parallel

Positive linear programs (LP), also known as packing and covering linear programs, are an important class of problems that bridges computer science, operations research, and optimization. Despite the consistent efforts on this problem, all known nearly-linear-time algorithms require $\tilde{O}(\varepsilon^{-4})$ iterations to converge to $1\pm \varepsilon$ approximate solutions. This $\varepsilon^{-4}$ dependence has not been improved since 1993, and limits the performance of parallel implementations for such algorithms. Moreover, previous algorithms and their analyses rely on update steps and convergence arguments that are combinatorial in nature and do not seem to arise naturally from an optimization viewpoint. In this paper, we leverage new insights from optimization theory to construct a novel algorithm that breaks the longstanding $\varepsilon^{-4}$ barrier. Our algorithm has a simple analysis and a clear motivation. Our work introduces a number of novel techniques, such as the combined application of gradient descent and mirror descent, and a truncated, smoothed version of the standard multiplicative weight update, which may be of independent interest.

preprint · 2016 · arXiv

Variance Reduction for Faster Non-Convex Optimization

We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are sums of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimization with non-convex loss functions and on training neural nets.
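The variance reduction trick referred to above can be illustrated with the standard SVRG loop: each epoch stores a snapshot point and its full gradient, then corrects every stochastic gradient with the difference from the snapshot. This is a sketch of the textbook estimator, not the minibatch variant or the step sizes analyzed in the paper. The scalar setting, the toy quadratic components, and all names are assumptions chosen so the result is easy to check.

```python
import random

def svrg(grad_i, n, x0, step, epochs, inner_steps, rng):
    """Standard SVRG loop for minimizing (1/n) * sum_i f_i(x) over a scalar x."""
    x = x0
    for _ in range(epochs):
        snapshot = x
        # Full gradient at the snapshot, computed once per epoch.
        full_grad = sum(grad_i(i, snapshot) for i in range(n)) / n
        for _ in range(inner_steps):
            i = rng.randrange(n)
            # Variance-reduced estimator: unbiased for the full gradient at x.
            g = grad_i(i, x) - grad_i(i, snapshot) + full_grad
            x -= step * g
    return x

# Hypothetical components f_i(x) = 0.5 * (x - b_i)^2, minimized at mean(b).
b = [1.0, 2.0, 3.0, 6.0]

def grad_component(i, x):
    return x - b[i]

x_final = svrg(grad_component, n=len(b), x0=0.0, step=0.1,
               epochs=5, inner_steps=20, rng=random.Random(0))
```

For these particular quadratics the correction term cancels the sampling noise exactly, so the estimator equals the full gradient at every step. On general objectives the noise shrinks only as the iterate stays close to the snapshot.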

preprint · 2015 · arXiv

Expanders via Local Edge Flips

Designing distributed and scalable algorithms to improve network connectivity is a central topic in peer-to-peer networks. In this paper we focus on the following well-known problem: given an $n$-node $d$-regular network for $d=\Omega(\log n)$, we want to design a decentralized, local algorithm that transforms the graph into one that has good connectivity properties (low diameter, expansion, etc.) without affecting the sparsity of the graph. To this end, Mahlmann and Schindelhauer introduced the random "flip" transformation, where in each time step, a random pair of vertices that have an edge decide to "swap a neighbor". They conjectured that performing $O(n d)$ such flips at random would convert any connected $d$-regular graph into a $d$-regular expander graph, with high probability. However, the best known upper bound for the number of steps is roughly $O(n^{17} d^{23})$, obtained via a delicate Markov chain comparison argument. Our main result is to prove that a natural instantiation of the random flip produces an expander in at most $O(n^2 d^2 \sqrt{\log n})$ steps, with high probability. Our argument uses a potential-function analysis based on the matrix exponential, together with the recent beautiful results on the higher-order Cheeger inequality of graphs. We also show that our technique can be used to analyze another well-studied random process known as the "random switch", and show that it produces an expander in $O(n d)$ steps with high probability.
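The flip operation can be sketched as follows, assuming the usual reading of "swap a neighbor": for a random edge (u, v), u picks another neighbor a, v picks another neighbor b, and the edges (u, a) and (v, b) are replaced by (u, b) and (v, a). The rejection rule for moves that would create duplicate edges, the adjacency-set representation, and the ring-lattice starting graph are assumptions for illustration. The specific instantiation analyzed in the paper may differ in its details.

```python
import random

def try_flip(adj, rng):
    """Attempt one random flip on a simple graph stored as dict[node] -> set of neighbors.

    Returns True if the flip was applied, False if it was rejected.
    """
    u = rng.choice(sorted(adj))
    v = rng.choice(sorted(adj[u]))  # random edge (u, v), chosen via endpoint u
    a_choices = sorted(adj[u] - {v})
    b_choices = sorted(adj[v] - {u})
    if not a_choices or not b_choices:
        return False
    a = rng.choice(a_choices)
    b = rng.choice(b_choices)
    # Reject flips that would create a duplicate edge.
    if a == b or b in adj[u] or a in adj[v]:
        return False
    # u hands neighbor a to v, and v hands neighbor b to u.
    adj[u].remove(a)
    adj[a].remove(u)
    adj[v].remove(b)
    adj[b].remove(v)
    adj[u].add(b)
    adj[b].add(u)
    adj[v].add(a)
    adj[a].add(v)
    return True

# Hypothetical starting graph: a 4-regular ring lattice on 12 nodes.
n, d = 12, 4
adj = {i: {(i + s) % n for s in (-2, -1, 1, 2)} for i in range(n)}

rng = random.Random(1)
applied = sum(try_flip(adj, rng) for _ in range(200))
```

Every node touched by a flip loses one neighbor and gains one, so the graph stays d-regular. The edge (u, v) itself is kept, which is why a flip cannot disconnect the graph.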

preprint · 2015 · arXiv

Restricted Isometry Property for General p-Norms

The Restricted Isometry Property (RIP) is a fundamental property of a matrix which enables sparse recovery. Informally, an $m \times n$ matrix satisfies RIP of order $k$ for the $\ell_p$ norm, if $\|Ax\|_p \approx \|x\|_p$ for every vector $x$ with at most $k$ non-zero coordinates. For every $1 \leq p < \infty$ we obtain almost tight bounds on the minimum number of rows $m$ necessary for the RIP property to hold. Prior to this work, only the cases $p = 1$, $1 + 1 / \log k$, and $2$ were studied. Interestingly, our results show that the case $p = 2$ is a "singularity" point: the optimal number of rows $m$ is $\widetilde{\Theta}(k^{p})$ for all $p\in [1,\infty)\setminus \{2\}$, as opposed to $\widetilde{\Theta}(k)$ for $p=2$. We also obtain almost tight bounds for the column sparsity of RIP matrices and discuss implications of our results for the Stable Sparse Recovery problem.

preprint · 2015 · arXiv

Spectral Sparsification and Regret Minimization Beyond Matrix Multiplicative Updates

In this paper, we provide a novel construction of the linear-sized spectral sparsifiers of Batson, Spielman and Srivastava [BSS14]. While previous constructions required $\Omega(n^4)$ running time [BSS14, Zou12], our sparsification routine can be implemented in almost-quadratic running time $O(n^{2+\varepsilon})$. The fundamental conceptual novelty of our work is the leveraging of a strong connection between sparsification and a regret minimization problem over density matrices. This connection was known to provide an interpretation of the randomized sparsifiers of Spielman and Srivastava [SS11] via the application of matrix multiplicative weight updates (MWU) [CHS11, Vis14]. In this paper, we explain how matrix MWU naturally arises as an instance of the Follow-the-Regularized-Leader framework and generalize this approach to yield a larger class of updates. This new class allows us to accelerate the construction of linear-sized spectral sparsifiers, and give novel insights on the motivation behind Batson, Spielman and Srivastava [BSS14].

preprint · 2014 · arXiv

Johnson-Lindenstrauss Compression with Neuroscience-Based Constraints

Johnson-Lindenstrauss (JL) matrices implemented by sparse random synaptic connections are thought to be a prime candidate for how convergent pathways in the brain compress information. However, to date, there is no complete mathematical support for such implementations given the constraints of real neural tissue. The fact that neurons are either excitatory or inhibitory implies that every so implementable JL matrix must be sign-consistent (i.e., all entries in a single column must be either all non-negative or all non-positive), and the fact that any given neuron connects to a relatively small subset of other neurons implies that the JL matrix had better be sparse. We construct sparse JL matrices that are sign-consistent, and prove that our construction is essentially optimal. Our work answers a mathematical question that was triggered by earlier work and is necessary to justify the existence of JL compression in the brain, and emphasizes that inhibition is crucial if neurons are to perform efficient, correlation-preserving compression.
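To make the two constraints concrete, the sketch below builds a sparse matrix in which every column has a fixed number of nonzero entries that all share one randomly chosen sign, and then checks sign-consistency column by column. This is an illustrative construction under assumed parameters, not the construction from the paper. The function names, the scaling `1 / sqrt(s)`, and the dimensions are hypothetical.

```python
import math
import random

def sign_consistent_sparse_matrix(m, n, s, rng):
    """Build an m x n matrix whose columns each have s nonzeros sharing one random sign."""
    matrix = [[0.0] * n for _ in range(m)]
    for j in range(n):
        sign = rng.choice((-1.0, 1.0))     # one sign per column (excitatory or inhibitory)
        for i in rng.sample(range(m), s):  # s distinct rows receive a nonzero entry
            matrix[i][j] = sign / math.sqrt(s)
    return matrix

def is_sign_consistent(matrix):
    """Check that no column mixes strictly positive and strictly negative entries."""
    for column in zip(*matrix):
        if any(v > 0 for v in column) and any(v < 0 for v in column):
            return False
    return True

# Hypothetical dimensions: compress 50 inputs to 20 outputs, 5 nonzeros per column.
m, n, s = 20, 50, 5
A = sign_consistent_sparse_matrix(m, n, s, random.Random(0))
```

Each column here has unit Euclidean norm. Whether such a matrix preserves distances well enough, and how large `m` and `s` must be, is exactly the question the paper addresses.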
