Source author record

Damek Davis

Damek Davis appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

math.OC · Machine Learning · math.ST · Computer Vision · Statistics Theory

Catalog footprint

What is connected

17 works

5 topics

4 close collaborators


preprint · 2026 · arXiv

Average Gradient Outer Product in kernel regression provably recovers the central subspace for multi-index models

We study a prototypical situation when a learned predictor can discover useful low-dimensional structure in data, while using fewer samples than are needed for accurate prediction. Specifically, we consider the problem of recovering a multi-index polynomial $f^*(x)=h(Ux)$, with $U\in\mathbb{R}^{r\times d}$ and $r\ll d$, from finitely many data/label pairs. Importantly, the target function depends on input $x$ only through the projection onto an unknown $r$-dimensional central subspace. The algorithm we analyze is appealingly simple: fit kernel ridge regression (KRR) to the data and compute the Average Gradient Outer Product (AGOP) from the fitted predictor. Our main results show that under reasonable assumptions the top $r$-dimensional eigenspace of AGOP provably recovers the central subspace, even in regimes when the prediction error remains large. Specifically, if the target function $f^*$ has degree $p^*$, it is known that $n\asymp d^{p^*}$ samples are necessary for KRR to achieve accurate prediction. In contrast, we show that if a low degree $p$ component of $f^*$ already carries all relevant directions for prediction, subspace recovery occurs in the much lower sample regime $n\asymp d^{p+\delta}$ for any $\delta\in(0,1)$. Our results thus demonstrate a separation between prediction and representation, and provide an explanation for why iterative kernel methods such as Recursive Feature Machines (RFM) can be sample-efficient in practice.
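For orientation, the two-step procedure in this abstract can be written out as follows. This is a sketch under stated assumptions, not the paper's exact setup: the kernel $k$, the ridge parameter $\lambda$, the ridge convention $(K+\lambda I)$, and the $1/n$ normalization of AGOP are assumptions chosen as one common convention.

```latex
% Step 1 (KRR, one common convention): fit the predictor on data (x_i, y_i), i = 1..n.
\hat f(x) = \sum_{i=1}^{n} \alpha_i \, k(x, x_i),
\qquad \alpha = (K + \lambda I)^{-1} y,
\qquad K_{ij} = k(x_i, x_j).

% Step 2 (AGOP): average the outer products of the predictor's gradients.
\mathrm{AGOP}(\hat f) = \frac{1}{n} \sum_{i=1}^{n}
  \nabla \hat f(x_i) \, \nabla \hat f(x_i)^{\top} \in \mathbb{R}^{d \times d}.

% Why the target's own AGOP sees only the central subspace:
% with f^*(x) = h(Ux), the chain rule gives \nabla f^*(x) = U^{\top} \nabla h(Ux), so
\mathrm{AGOP}(f^*) = U^{\top}
  \Big( \frac{1}{n} \sum_{i=1}^{n} \nabla h(Ux_i) \, \nabla h(Ux_i)^{\top} \Big) U .
```

The last identity shows that the range of $\mathrm{AGOP}(f^*)$ lies in the row space of $U$, which is the central subspace. The abstract's claim concerns the harder question of when the AGOP of the fitted predictor $\hat f$, rather than of $f^*$ itself, has a top $r$-dimensional eigenspace close to that subspace.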

preprint · 2026 · arXiv

When do spectral gradient updates help in deep learning?

Spectral gradient methods, such as the recently popularized Muon optimizer, are a promising alternative to standard Euclidean gradient descent for training deep neural networks and transformers, but it is still unclear in which regimes they are expected to perform better. We propose a simple layerwise condition that predicts when a spectral update yields a larger decrease in the loss than a Euclidean gradient step. This condition compares, for each parameter block, the squared nuclear-to-Frobenius ratio of the gradient to the stable rank of the incoming activations. To understand when this condition may be satisfied, we first prove that post-activation matrices have low stable rank at Gaussian initialization in random feature regression, feedforward networks, and transformer blocks. In spiked random feature models we then show that, after a short burn-in, the Euclidean gradient's nuclear-to-Frobenius ratio grows with the data dimension while the stable rank of the activations remains bounded, so the predicted advantage of spectral updates scales with dimension. We validate these predictions in synthetic regression experiments and in NanoGPT-scale language model training, where we find that intermediate activations have low stable rank throughout training and the corresponding gradients maintain large nuclear-to-Frobenius ratios. Together, these results identify conditions for spectral gradient methods, such as Muon, to be effective in training deep networks and transformers.
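The two quantities compared by the layerwise condition have standard definitions, written out below. The symbols $G$ (gradient of one parameter block) and $A$ (matrix of incoming activations) are names assumed here for illustration. The direction of the comparison is inferred from the abstract's statement that the advantage grows when the gradient ratio is large and the stable rank stays bounded; the precise form of the condition is in the paper.

```latex
% Squared nuclear-to-Frobenius ratio of the gradient block G
% (\|\cdot\|_* is the sum of singular values):
\frac{\|G\|_*^{2}}{\|G\|_F^{2}} \in [1, \operatorname{rank}(G)].

% Stable rank of the incoming activation matrix A
% (\|\cdot\|_{\mathrm{op}} is the largest singular value):
\operatorname{st}(A) = \frac{\|A\|_F^{2}}{\|A\|_{\mathrm{op}}^{2}} \in [1, \operatorname{rank}(A)].

% Inferred reading of the condition: a spectral update is predicted to help when
\frac{\|G\|_*^{2}}{\|G\|_F^{2}} \;\gtrsim\; \operatorname{st}(A).
```

Both quantities equal 1 for a rank-one matrix and equal the rank when all nonzero singular values coincide, so the condition roughly asks that the gradient be spectrally spread out while the activations are spectrally concentrated.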

preprint · 2023 · arXiv

Active manifolds, stratifications, and convergence to local minima in nonsmooth optimization

We show that the subgradient method converges only to local minimizers when applied to generic Lipschitz continuous and subdifferentially regular functions that are definable in an o-minimal structure. At a high level, the argument we present is appealingly transparent: we interpret the nonsmooth dynamics as an approximate Riemannian gradient method on a certain distinguished submanifold that captures the nonsmooth activity of the function. In the process, we develop new regularity conditions in nonsmooth analysis that parallel the stratification conditions of Whitney, Kuo, and Verdier and extend stochastic processes techniques of Pemantle.

preprint · 2023 · arXiv

Asymptotic normality and optimality in nonsmooth stochastic approximation

In their seminal work, Polyak and Juditsky showed that stochastic approximation algorithms for solving smooth equations enjoy a central limit theorem. Moreover, it has since been argued that the asymptotic covariance of the method is best possible among any estimation procedure in a local minimax sense of Hájek and Le Cam. A long-standing open question in this line of work is whether similar guarantees hold for important non-smooth problems, such as stochastic nonlinear programming or stochastic variational inequalities. In this work, we show that this is indeed the case.

preprint · 2022 · arXiv

A gradient sampling method with complexity guarantees for Lipschitz functions in high and low dimensions

Zhang et al. introduced a novel modification of Goldstein's classical subgradient method, with an efficiency guarantee of $O(\varepsilon^{-4})$ for minimizing Lipschitz functions. Their work, however, makes use of a nonstandard subgradient oracle model and requires the function to be directionally differentiable. In this paper, we show that both of these assumptions can be dropped by simply adding a small random perturbation in each step of their algorithm. The resulting method works on any Lipschitz function whose value and gradient can be evaluated at points of differentiability. We additionally present a new cutting plane algorithm that achieves better efficiency in low dimensions: $O(d\varepsilon^{-3})$ for Lipschitz functions and $O(d\varepsilon^{-2})$ for those that are weakly convex.

preprint · 2022 · arXiv

A superlinearly convergent subgradient method for sharp semismooth problems

Subgradient methods comprise a fundamental class of nonsmooth optimization algorithms. Classical results show that certain subgradient methods converge sublinearly for general Lipschitz convex functions and converge linearly for convex functions that grow sharply away from solutions. Recent work has moreover extended these results to certain nonconvex problems. In this work we seek to improve the complexity of these algorithms, asking: is it possible to design a superlinearly convergent subgradient method? We provide a positive answer to this question for a broad class of sharp semismooth functions.

preprint · 2021 · arXiv

Conservative and semismooth derivatives are equivalent for semialgebraic maps

Subgradient and Newton algorithms for nonsmooth optimization require generalized derivatives to satisfy subtle approximation properties: conservativity for the former and semismoothness for the latter. Though these two properties originate in entirely different contexts, we show that in the semi-algebraic setting they are equivalent. Both properties for a generalized derivative simply require it to coincide with the standard directional derivative on the tangent spaces of some partition of the domain into smooth manifolds. An appealing byproduct is a new short proof that semi-algebraic maps are semismooth relative to the Clarke Jacobian.

preprint · 2021 · arXiv

Proximal methods avoid active strict saddles of weakly convex functions

We introduce a geometrically transparent strict saddle property for nonsmooth functions. This property guarantees that simple proximal algorithms on weakly convex problems converge only to local minimizers, when randomly initialized. We argue that the strict saddle property may be a realistic assumption in applications, since it provably holds for generic semi-algebraic optimization problems.

preprint · 2016 · arXiv

The Asynchronous PALM Algorithm for Nonsmooth Nonconvex Problems

We introduce the Asynchronous PALM algorithm, a new extension of the Proximal Alternating Linearized Minimization (PALM) algorithm for solving nonsmooth, nonconvex optimization problems. Like the PALM algorithm, each step of the Asynchronous PALM algorithm updates a single block of coordinates; but unlike the PALM algorithm, the Asynchronous PALM algorithm eliminates the need for sequential updates that occur one after the other. Instead, our new algorithm allows each of the coordinate blocks to be updated asynchronously and in any order, which means that any number of computing cores can compute updates in parallel without synchronizing their computations. In practice, this asynchronization strategy often leads to speedups that increase linearly with the number of computing cores. We introduce two variants of the Asynchronous PALM algorithm, one stochastic and one deterministic. In the stochastic and deterministic cases, we show that cluster points of the algorithm are stationary points. In the deterministic case, we show that the algorithm converges globally whenever the Kurdyka-Łojasiewicz property holds for a function closely related to the objective function, and we derive its convergence rate in a common special case. Finally, we provide a concrete case in which our assumptions hold.

preprint · 2016 · arXiv

The Sound of APALM Clapping: Faster Nonsmooth Nonconvex Optimization with Stochastic Asynchronous PALM

We introduce the Stochastic Asynchronous Proximal Alternating Linearized Minimization (SAPALM) method, a block coordinate stochastic proximal-gradient method for solving nonconvex, nonsmooth optimization problems. SAPALM is the first asynchronous parallel optimization method that provably converges on a large class of nonconvex, nonsmooth problems. We prove that SAPALM matches the best known rates of convergence, among synchronous or asynchronous methods, on this problem class. We provide upper bounds on the number of workers for which we can expect to see a linear speedup, which match the best bounds known for less complex problems, and show that in practice SAPALM achieves this linear speedup. We demonstrate state-of-the-art performance on several matrix factorization problems.

preprint · 2015 · arXiv

A Three-Operator Splitting Scheme and its Optimization Applications

Operator splitting schemes have been successfully used in computational sciences to reduce complex problems into a series of simpler subproblems. Since the 1950s, these schemes have been widely used to solve problems in PDE and control. Recently, large-scale optimization problems in machine learning, signal processing, and imaging have created a resurgence of interest in operator-splitting based algorithms because they often have simple descriptions, are easy to code, and have (nearly) state-of-the-art performance for large-scale optimization problems. Although operator splitting techniques were introduced over 60 years ago, their importance has significantly increased in the past decade. This paper introduces a new operator-splitting scheme for solving a variety of problems that are reduced to a monotone inclusion of three operators, one of which is cocoercive. Our scheme is very simple, and it does not reduce to any existing splitting schemes. Our scheme recovers the existing forward-backward, Douglas-Rachford, and forward-Douglas-Rachford splitting schemes as special cases. Our new splitting scheme leads to a set of new and simple algorithms for a variety of other problems, including the 3-set split feasibility problems, 3-objective minimization problems, and doubly and multiple regularization problems, as well as the simplest extension of the classic ADMM from 2 to 3 blocks of variables. In addition to the basic scheme, we introduce several modifications and enhancements that can improve the convergence rate in practice, including an acceleration that achieves the optimal rate of convergence for strongly monotone inclusions. Finally, we evaluate the algorithm on several applications.
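As a rough illustration of what a three-operator splitting iteration looks like, the sketch below applies the commonly cited form of the scheme to a small 3-objective problem: minimize $\tfrac12\|x-b\|^2 + \lambda\|x\|_1$ subject to $lo \le x \le hi$. The test problem, the function names, and the parameter choices (`gamma=1.0`, `relax=1.0`) are assumptions made for this example and are not taken from the paper. The smooth term plays the role of the cocoercive operator, and the other two terms enter through their proximal maps.

```python
import math

def soft_threshold(v, t):
    """Proximal map of t*||.||_1, applied elementwise."""
    return [math.copysign(max(abs(vi) - t, 0.0), vi) for vi in v]

def clip(v, lo, hi):
    """Projection onto the box [lo, hi]^n (prox of the box indicator)."""
    return [min(max(vi, lo), hi) for vi in v]

def three_operator_splitting(b, lam, lo, hi, gamma=1.0, relax=1.0, iters=200):
    """Sketch: minimize 0.5*||x-b||^2 + lam*||x||_1 subject to lo <= x <= hi.

    The gradient of 0.5*||x-b||^2 is 1-Lipschitz, so gamma is taken in (0, 2).
    """
    z = [0.0] * len(b)
    x_f = list(z)
    for _ in range(iters):
        # Backward step on the l1 term.
        x_g = soft_threshold(z, gamma * lam)
        # Forward (gradient) step on the smooth term, evaluated at x_g.
        grad = [xg - bi for xg, bi in zip(x_g, b)]
        # Backward step on the box constraint at the reflected, shifted point.
        x_f = clip([2.0 * xg - zi - gamma * gi
                    for xg, zi, gi in zip(x_g, z, grad)], lo, hi)
        # Relaxed update of the auxiliary variable.
        z = [zi + relax * (xf - xg) for zi, xf, xg in zip(z, x_f, x_g)]
    return x_f

if __name__ == "__main__":
    print(three_operator_splitting([3.0, -0.2, -4.0], lam=0.5, lo=-1.0, hi=2.0))
```

Because this test problem is separable, each coordinate of the solution is the soft-thresholded entry of `b` clipped to the box, which gives a simple check on the iterates. Setting the smooth term to zero leaves a two-operator iteration of Douglas-Rachford type, consistent with the special cases the abstract lists.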

preprint · 2015 · arXiv

An $O(n\log(n))$ Algorithm for Projecting Onto the Ordered Weighted $\ell_1$ Norm Ball

The ordered weighted $\ell_1$ (OWL) norm is a newly developed generalization of the Octagonal Shrinkage and Clustering Algorithm for Regression (OSCAR) norm. This norm has desirable statistical properties and can be used to perform simultaneous clustering and regression. In this paper, we show how to compute the projection of an $n$-dimensional vector onto the OWL norm ball in $O(n\log(n))$ operations. In addition, we illustrate the performance of our algorithm on a synthetic regression test.

preprint · 2015 · arXiv

Convergence rate analysis of primal-dual splitting schemes

Primal-dual splitting schemes are a class of powerful algorithms that solve complicated monotone inclusions and convex optimization problems that are built from many simpler pieces. They decompose problems that are built from sums, linear compositions, and infimal convolutions of simple functions so that each simple term is processed individually via proximal mappings, gradient mappings, and multiplications by the linear maps. This leads to easily implementable and highly parallelizable or distributed algorithms, which often obtain nearly state-of-the-art performance. In this paper, we analyze a monotone inclusion problem that captures a large class of primal-dual splittings as a special case. We introduce a unifying scheme and use some abstract analysis of the algorithm to prove convergence rates of the proximal point algorithm, forward-backward splitting, Peaceman-Rachford splitting, and forward-backward-forward splitting applied to the model problem. Our ergodic convergence rates are deduced under variable metrics, stepsizes, and relaxation. Our nonergodic convergence rates are the first shown in the literature. Finally, we apply our results to a large class of primal-dual algorithms that are a special case of our scheme and deduce their convergence rates.

preprint · 2015 · arXiv

Convergence rate analysis of several splitting schemes

Splitting schemes are a class of powerful algorithms that solve complicated monotone inclusions and convex optimization problems that are built from many simpler pieces. They give rise to algorithms in which the simple pieces of the decomposition are processed individually. This leads to easily implementable and highly parallelizable algorithms, which often obtain nearly state-of-the-art performance. In the first part of this paper, we analyze the convergence rates of several general splitting algorithms and provide examples to prove the tightness of our results. The most general rates are proved for the fixed-point residual (FPR) of the Krasnosel'skiĭ-Mann (KM) iteration of nonexpansive operators, where we improve the known big-$O$ rate to little-$o$. We show the tightness of this result and improve it in several special cases. In the second part of this paper, we use the convergence rates derived for the KM iteration to analyze the objective error convergence rates for the Douglas-Rachford (DRS), Peaceman-Rachford (PRS), and ADMM splitting algorithms under general convexity assumptions. We show, by way of example, that the rates obtained for these algorithms are tight in all cases and obtain the surprising statement: The DRS algorithm is nearly as fast as the proximal point algorithm (PPA) in the ergodic sense and nearly as slow as the subgradient method in the nonergodic sense. Finally, we provide several applications of our result to feasibility problems, model fitting, and distributed optimization. Our analysis is self-contained, and most results are deduced from a basic lemma that derives convergence rates for summable sequences, a simple diagram that decomposes each relaxed PRS iteration, and fundamental inequalities that relate the FPR to objective error.
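To make the KM iteration and its fixed-point residual concrete, here is a minimal sketch. The operator (a quarter-turn rotation of the plane, which is nonexpansive with the origin as its only fixed point), the starting point, and the relaxation value 0.5 are all assumptions chosen for illustration. This toy example happens to converge linearly, so it shows only how the quantities are computed, not the worst-case behavior analyzed in the paper.

```python
def km_iterate(T, z0, relax=0.5, iters=50):
    """Run z_{k+1} = z_k + relax * (T(z_k) - z_k).

    Returns the final iterate and the history of squared fixed-point
    residuals ||T(z_k) - z_k||^2.
    """
    z = list(z0)
    fpr = []
    for _ in range(iters):
        Tz = T(z)
        residual = [t - zi for t, zi in zip(Tz, z)]
        fpr.append(sum(r * r for r in residual))
        z = [zi + relax * r for zi, r in zip(z, residual)]
    return z, fpr

def rotate_quarter_turn(z):
    """A nonexpansive operator on R^2 (an isometry) whose only fixed point is 0."""
    return [-z[1], z[0]]

if __name__ == "__main__":
    z_final, fpr = km_iterate(rotate_quarter_turn, [1.0, 0.0])
    print(z_final, fpr[:4])
```

Iterating the rotation alone would circle forever without converging. The averaging step in the KM iteration is what drives the residual to zero.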

preprint · 2015 · arXiv

Convergence rate analysis of the forward-Douglas-Rachford splitting scheme

Operator splitting schemes are a class of powerful algorithms that solve complicated monotone inclusion and convex optimization problems that are built from many simpler pieces. They give rise to algorithms in which all simple pieces of the decomposition are processed individually. This leads to easily implementable and highly parallelizable or distributed algorithms, which often obtain nearly state-of-the-art performance. In this paper, we analyze the convergence rate of the forward-Douglas-Rachford splitting (FDRS) algorithm, which is a generalization of the forward-backward splitting (FBS) and Douglas-Rachford splitting (DRS) algorithms. Under general convexity assumptions, we derive the ergodic and nonergodic convergence rates of the FDRS algorithm, and show that these rates are the best possible. Under Lipschitz differentiability assumptions, we show that the best iterate of FDRS converges as quickly as the last iterate of the FBS algorithm. Under strong convexity assumptions, we derive convergence rates for a sequence that strongly converges to a minimizer. Under strong convexity and Lipschitz differentiability assumptions, we show that FDRS converges linearly. We also provide examples where the objective is strongly convex, yet FDRS converges arbitrarily slowly. Finally, we relate the FDRS algorithm to a primal-dual forward-backward splitting scheme and clarify its place among existing splitting methods. Our results show that the FDRS algorithm automatically adapts to the regularity of the objective functions and achieves rates that improve upon the sharp worst case rates that hold in the absence of smoothness and strong convexity.

preprint · 2015 · arXiv

Faster convergence rates of relaxed Peaceman-Rachford and ADMM under regularity assumptions

Splitting schemes are a class of powerful algorithms that solve complicated monotone inclusion and convex optimization problems that are built from many simpler pieces. They give rise to algorithms in which the simple pieces of the decomposition are processed individually. This leads to easily implementable and highly parallelizable algorithms, which often obtain nearly state-of-the-art performance. In this paper, we provide a comprehensive convergence rate analysis of the Douglas-Rachford splitting (DRS), Peaceman-Rachford splitting (PRS), and alternating direction method of multipliers (ADMM) algorithms under various regularity assumptions including strong convexity, Lipschitz differentiability, and bounded linear regularity. The main consequence of this work is that relaxed PRS and ADMM automatically adapt to the regularity of the problem and achieve convergence rates that improve upon the (tight) worst-case rates that hold in the absence of such regularity. All of the results are obtained using simple techniques.

preprint · 2013 · arXiv

On the Design and Analysis of Multiple View Descriptors

We propose an extension of popular descriptors based on gradient orientation histograms (HOG, computed in a single image) to multiple views. It hinges on interpreting HOG as a conditional density in the space of sampled images, where the effects of nuisance factors such as viewpoint and illumination are marginalized. However, such marginalization is performed with respect to a very coarse approximation of the underlying distribution. Our extension leverages the fact that multiple views of the same scene allow separating intrinsic from nuisance variability, and thus afford better marginalization of the latter. The result is a descriptor that has the same complexity as single-view HOG, and can be compared in the same manner, but exploits multiple views to better trade off insensitivity to nuisance variability with specificity to intrinsic variability. We also introduce a novel multi-view wide-baseline matching dataset, consisting of a mixture of real and synthetic objects with ground-truthed camera motion and dense three-dimensional geometry.
