Source author record

Dmitriy Drusvyatskiy

Dmitriy Drusvyatskiy appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

math.OC · Machine Learning · math.ST · Computer Science and Game Theory · math.AG · Computational Geometry · math.CA · math.CO · math.NA · Numerical Analysis · Statistics Theory

Catalog footprint

What is connected

28 works

11 topics

4 close collaborators


preprint · 2026 · arXiv

Average Gradient Outer Product in kernel regression provably recovers the central subspace for multi-index models

We study a prototypical situation when a learned predictor can discover useful low-dimensional structure in data, while using fewer samples than are needed for accurate prediction. Specifically, we consider the problem of recovering a multi-index polynomial $f^*(x)=h(Ux)$, with $U\in\mathbb{R}^{r\times d}$ and $r\ll d$, from finitely many data/label pairs. Importantly, the target function depends on input $x$ only through the projection onto an unknown $r$-dimensional central subspace. The algorithm we analyze is appealingly simple: fit kernel ridge regression (KRR) to the data and compute the Average Gradient Outer Product (AGOP) from the fitted predictor. Our main results show that under reasonable assumptions the top $r$-dimensional eigenspace of AGOP provably recovers the central subspace, even in regimes when the prediction error remains large. Specifically, if the target function $f^*$ has degree $p^*$, it is known that $n\asymp d^{p^*}$ samples are necessary for KRR to achieve accurate prediction. In contrast, we show that if a low degree $p$ component of $f^*$ already carries all relevant directions for prediction, subspace recovery occurs in the much lower sample regime $n\asymp d^{p+\delta}$ for any $\delta\in(0,1)$. Our results thus demonstrate a separation between prediction and representation, and provide an explanation for why iterative kernel methods such as Recursive Feature Machines (RFM) can be sample-efficient in practice.
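The two-step procedure named in this abstract (fit KRR, then take the top eigenspace of the AGOP) can be illustrated with a small sketch. This is only an illustration under assumptions that are not in the abstract: a Gaussian kernel, the bandwidth and ridge values, gradients averaged over the training points, and the toy target used in the hypothetical `demo` helper. It is not the authors' implementation.

```python
import numpy as np

def gaussian_kernel(X, Z, bandwidth):
    """Kernel matrix with entries exp(-||x - z||^2 / (2 * bandwidth^2))."""
    sq = (X**2).sum(1)[:, None] + (Z**2).sum(1)[None, :] - 2.0 * X @ Z.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth**2))

def fit_krr(X, y, bandwidth, ridge):
    """Solve (K + ridge * I) alpha = y for the KRR coefficients."""
    K = gaussian_kernel(X, X, bandwidth)
    return np.linalg.solve(K + ridge * np.eye(len(X)), y)

def agop(X, alpha, bandwidth):
    """Average of grad f(x_j) grad f(x_j)^T over the training points x_j."""
    K = gaussian_kernel(X, X, bandwidth)
    W = K * alpha[None, :]  # W[j, i] = alpha_i * k(x_j, x_i)
    # grad f(x_j) = sum_i alpha_i k(x_j, x_i) (x_i - x_j) / bandwidth^2
    grads = (W @ X - W.sum(1, keepdims=True) * X) / bandwidth**2
    return grads.T @ grads / len(X)

def top_eigenspace(M, r):
    """Orthonormal basis of the top-r eigenspace of a symmetric matrix."""
    _, vecs = np.linalg.eigh(M)  # eigenvalues come back in ascending order
    return vecs[:, -r:]

def demo(n=300, d=5, seed=0):
    """Toy single-index target (assumed for illustration): f*(x) = t + 0.5 t^2, t = x[0]."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    u = np.zeros(d)
    u[0] = 1.0
    t = X @ u
    y = t + 0.5 * t**2
    bandwidth = np.sqrt(d)  # assumed bandwidth choice
    alpha = fit_krr(X, y, bandwidth, ridge=1e-3)
    M = agop(X, alpha, bandwidth)
    v = top_eigenspace(M, 1)[:, 0]
    return M, abs(v @ u)  # alignment between estimated and true direction
```

Calling `demo()` returns the AGOP matrix and the absolute inner product between its top eigenvector and the planted direction; a value near 1 indicates that the direction was recovered in this toy setting.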

preprint · 2026 · arXiv

High-dimensional Limit of SGD for Diagonal Linear Networks

Understanding the behavior of stochastic gradient methods is a central problem in modern machine learning. Recent work has highlighted diagonal linear networks as a simplified yet expressive setting for analyzing the optimization and generalization properties of neural models. In this work, we show that in the high-dimensional regime, stochastic gradient descent on diagonal linear networks is well-approximated by continuous dynamics governed by a stochastic differential equation (SDE), which explicitly decouples the drift from the gradient noise. We further derive a deterministic partial differential equation whose solution propagates the relevant state of the iterates and characterizes the time evolution of a broad class of observable statistics, including the risk, curvature, and other metrics for optimality. Finally, we show that, under a suitable parametrization, the stochastic dynamics are globally well posed and converge exponentially fast to zero risk with high probability, yielding a fully explicit non-asymptotic description of their long-time behavior. Numerical simulations corroborate our theoretical findings.

preprint · 2026 · arXiv

When do spectral gradient updates help in deep learning?

Spectral gradient methods, such as the recently popularized Muon optimizer, are a promising alternative to standard Euclidean gradient descent for training deep neural networks and transformers, but it is still unclear in which regimes they are expected to perform better. We propose a simple layerwise condition that predicts when a spectral update yields a larger decrease in the loss than a Euclidean gradient step. This condition compares, for each parameter block, the squared nuclear-to-Frobenius ratio of the gradient to the stable rank of the incoming activations. To understand when this condition may be satisfied, we first prove that post-activation matrices have low stable rank at Gaussian initialization in random feature regression, feedforward networks, and transformer blocks. In spiked random feature models we then show that, after a short burn-in, the Euclidean gradient's nuclear-to-Frobenius ratio grows with the data dimension while the stable rank of the activations remains bounded, so the predicted advantage of spectral updates scales with dimension. We validate these predictions in synthetic regression experiments and in NanoGPT-scale language model training, where we find that intermediate activations have low-stable-rank throughout training and the corresponding gradients maintain large nuclear-to-Frobenius ratios. Together, these results identify conditions for spectral gradient methods, such as Muon, to be effective in training deep networks and transformers.
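The two quantities compared by the layerwise condition can be computed from singular values, as in the sketch below. The function names are hypothetical, and the direction of the comparison in `spectral_update_predicted_to_help` is an assumption inferred from the abstract (a large nuclear-to-Frobenius ratio together with a low stable rank favoring the spectral update); the paper's exact statement may include constants or further terms.

```python
import numpy as np

def stable_rank(A):
    """Stable rank ||A||_F^2 / ||A||_op^2 of a nonzero matrix."""
    s = np.linalg.svd(A, compute_uv=False)  # singular values, descending
    return float((s**2).sum() / s[0] ** 2)

def nuclear_to_frobenius_sq(G):
    """Squared ratio (||G||_* / ||G||_F)^2 of a nonzero matrix."""
    s = np.linalg.svd(G, compute_uv=False)
    return float(s.sum() ** 2 / (s**2).sum())

def spectral_update_predicted_to_help(G, A):
    """Assumed form of the check: gradient ratio exceeds activation stable rank."""
    return bool(nuclear_to_frobenius_sq(G) > stable_rank(A))
```

For a gradient block equal to the 4-by-4 identity the squared ratio is 4, and for a rank-one activation matrix the stable rank is 1, so the assumed check returns `True` in that case.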

preprint · 2023 · arXiv

Active manifolds, stratifications, and convergence to local minima in nonsmooth optimization

We show that the subgradient method converges only to local minimizers when applied to generic Lipschitz continuous and subdifferentially regular functions that are definable in an o-minimal structure. At a high level, the argument we present is appealingly transparent: we interpret the nonsmooth dynamics as an approximate Riemannian gradient method on a certain distinguished submanifold that captures the nonsmooth activity of the function. In the process, we develop new regularity conditions in nonsmooth analysis that parallel the stratification conditions of Whitney, Kuo, and Verdier and extend stochastic processes techniques of Pemantle.

preprint · 2023 · arXiv

Asymptotic normality and optimality in nonsmooth stochastic approximation

In their seminal work, Polyak and Juditsky showed that stochastic approximation algorithms for solving smooth equations enjoy a central limit theorem. Moreover, it has since been argued that the asymptotic covariance of the method is best possible among any estimation procedure in a local minimax sense of Hájek and Le Cam. A long-standing open question in this line of work is whether similar guarantees hold for important non-smooth problems, such as stochastic nonlinear programming or stochastic variational inequalities. In this work, we show that this is indeed the case.

preprint · 2022 · arXiv

A gradient sampling method with complexity guarantees for Lipschitz functions in high and low dimensions

Zhang et al. introduced a novel modification of Goldstein's classical subgradient method, with an efficiency guarantee of $O(\varepsilon^{-4})$ for minimizing Lipschitz functions. Their work, however, makes use of a nonstandard subgradient oracle model and requires the function to be directionally differentiable. In this paper, we show that both of these assumptions can be dropped by simply adding a small random perturbation in each step of their algorithm. The resulting method works on any Lipschitz function whose value and gradient can be evaluated at points of differentiability. We additionally present a new cutting plane algorithm that achieves better efficiency in low dimensions: $O(d\varepsilon^{-3})$ for Lipschitz functions and $O(d\varepsilon^{-2})$ for those that are weakly convex.

preprint · 2022 · arXiv

Decision-Dependent Risk Minimization in Geometrically Decaying Dynamic Environments

This paper studies the problem of expected loss minimization given a data distribution that is dependent on the decision-maker's action and evolves dynamically in time according to a geometric decay process. Novel algorithms for both the information setting in which the decision-maker has a first order gradient oracle and the setting in which they have simply a loss function oracle are introduced. The algorithms operate on the same underlying principle: the decision-maker repeatedly deploys a fixed decision over the length of an epoch, thereby allowing the dynamically changing environment to sufficiently mix before updating the decision. The iteration complexity in each of the settings is shown to match existing rates for first and zero order stochastic gradient methods up to logarithmic factors. The algorithms are evaluated on a "semi-synthetic" example using real world data from the SFpark dynamic pricing pilot study; it is shown that the announced prices result in an improvement for the institution's objective (target occupancy), while achieving an overall reduction in parking rates.

preprint · 2022 · arXiv

Improved Rates for Derivative Free Gradient Play in Strongly Monotone Games

The influential work of Bravo et al. 2018 shows that derivative free play in strongly monotone games has complexity $O(d^2/\varepsilon^3)$, where $\varepsilon$ is the target accuracy on the expected squared distance to the solution. This note shows that the efficiency estimate is actually $O(d^2/\varepsilon^2)$, which reduces to the known efficiency guarantee for the method in unconstrained optimization. The argument we present simply interprets the method as stochastic gradient play on a slightly perturbed strongly monotone game.

preprint · 2022 · arXiv

Multiplayer Performative Prediction: Learning in Decision-Dependent Games

Learning problems commonly exhibit an interesting feedback mechanism wherein the population data reacts to competing decision makers' actions. This paper formulates a new game theoretic framework for this phenomenon, called "multi-player performative prediction". We focus on two distinct solution concepts, namely (i) performatively stable equilibria and (ii) Nash equilibria of the game. The latter equilibria are arguably more informative, but can be found efficiently only when the game is monotone. We show that under mild assumptions, the performatively stable equilibria can be found efficiently by a variety of algorithms, including repeated retraining and the repeated (stochastic) gradient method. We then establish transparent sufficient conditions for strong monotonicity of the game and use them to develop algorithms for finding Nash equilibria. We investigate derivative free methods and adaptive gradient algorithms wherein each player alternates between learning a parametric description of their distribution and gradient steps on the empirical risk. Synthetic and semi-synthetic numerical experiments illustrate the results.

preprint · 2021 · arXiv

Conservative and semismooth derivatives are equivalent for semialgebraic maps

Subgradient and Newton algorithms for nonsmooth optimization require generalized derivatives to satisfy subtle approximation properties: conservativity for the former and semismoothness for the latter. Though these two properties originate in entirely different contexts, we show that in the semi-algebraic setting they are equivalent. Both properties for a generalized derivative simply require it to coincide with the standard directional derivative on the tangent spaces of some partition of the domain into smooth manifolds. An appealing byproduct is a new short proof that semi-algebraic maps are semismooth relative to the Clarke Jacobian.

preprint · 2021 · arXiv

Proximal methods avoid active strict saddles of weakly convex functions

We introduce a geometrically transparent strict saddle property for nonsmooth functions. This property guarantees that simple proximal algorithms on weakly convex problems converge only to local minimizers, when randomly initialized. We argue that the strict saddle property may be a realistic assumption in applications, since it provably holds for generic semi-algebraic optimization problems.

preprint · 2016 · arXiv

Error bounds, quadratic growth, and linear convergence of proximal methods

The proximal gradient algorithm for minimizing the sum of a smooth and a nonsmooth convex function often converges linearly even without strong convexity. One common reason is that a multiple of the step length at each iteration may linearly bound the "error" -- the distance to the solution set. We explain the observed linear convergence intuitively by proving the equivalence of such an error bound to a natural quadratic growth condition. Our approach generalizes to linear convergence analysis for proximal methods (of Gauss-Newton type) for minimizing compositions of nonsmooth functions with smooth mappings. We observe incidentally that short step-lengths in the algorithm indicate near-stationarity, suggesting a reliable termination criterion.
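The termination idea mentioned at the end of this abstract (stop once the step length is short) can be sketched for one concrete instance of the proximal gradient algorithm. The lasso objective, the constant step size 1/L, and the tolerance below are assumptions chosen for illustration; the abstract itself covers general sums of a smooth and a nonsmooth convex function.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal map of tau * ||.||_1, applied coordinatewise."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def proximal_gradient_lasso(A, b, lam, tol=1e-10, max_iter=10000):
    """Minimize 0.5 * ||A x - b||^2 + lam * ||x||_1 by proximal gradient steps.

    Stops when the step length ||x_next - x|| drops below tol, used here as a
    proxy for near-stationarity. Returns the iterate and the iteration count.
    """
    L = np.linalg.norm(A, 2) ** 2  # Lipschitz constant of the smooth gradient
    x = np.zeros(A.shape[1])
    for k in range(max_iter):
        grad = A.T @ (A @ x - b)
        x_next = soft_threshold(x - grad / L, lam / L)
        step = np.linalg.norm(x_next - x)
        x = x_next
        if step <= tol:
            break
    return x, k + 1
```

With `A` equal to the identity, the minimizer is the soft-thresholding of `b`, so the method reaches it in one step and the step-length test stops it on the next.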

preprint · 2016 · arXiv

Level-set methods for convex optimization

Convex optimization problems arising in applications often have favorable objective functions and complicated constraints, thereby precluding first-order methods from being immediately applicable. We describe an approach that exchanges the roles of the objective and constraint functions, and instead approximately solves a sequence of parametric level-set problems. A zero-finding procedure, based on inexact function evaluations and possibly inexact derivative information, leads to an efficient solution scheme for the original problem. We describe the theoretical and practical properties of this approach for a broad range of problems, including low-rank semidefinite optimization, sparse optimization, and generalized linear models for inference.

preprint · 2016 · arXiv

Nonsmooth optimization using Taylor-like models: error bounds, convergence, and termination criteria

We consider optimization algorithms that successively minimize simple Taylor-like models of the objective function. Methods of Gauss-Newton type for minimizing the composition of a convex function and a smooth map are common examples. Our main result is an explicit relationship between the step-size of any such algorithm and the slope of the function at a nearby point. Consequently, we (1) show that the step-sizes can be reliably used to terminate the algorithm, (2) prove that as long as the step-sizes tend to zero, every limit point of the iterates is stationary, and (3) show that conditions, akin to classical quadratic growth, imply that the step-sizes linearly bound the distance of the iterates to the solution set. The latter so-called error bound property is typically used to establish linear (or faster) convergence guarantees. Analogous results hold when the step-size is replaced by the square root of the decrease in the model's value. We complete the paper with extensions to when the models are minimized only inexactly.

preprint · 2016 · arXiv

The Euclidean Distance Degree of Orthogonally Invariant Matrix Varieties

We show that the Euclidean distance degree of a real orthogonally invariant matrix variety equals the Euclidean distance degree of its restriction to diagonal matrices. We illustrate how this result can greatly simplify calculations in concrete circumstances.

preprint · 2015 · arXiv

Counting real critical points of the distance to orthogonally invariant matrix sets

Minimizing the Euclidean distance to a set arises frequently in applications. When the set is algebraic, a measure of complexity of this optimization problem is its number of critical points. In this paper we provide a general framework to compute and count the real smooth critical points of a data matrix on an orthogonally invariant set of matrices. The technique relies on "transfer principles" that allow calculations to be done in the space of singular values of the matrices in the orthogonally invariant set. The calculations often simplify greatly and yield transparent formulas. We illustrate the method on several examples, and compare our results to the recently introduced notion of Euclidean distance degree of an algebraic variety.

preprint · 2015 · arXiv

Noisy Euclidean distance realization: robust facial reduction and the Pareto frontier

We present two algorithms for large-scale low-rank Euclidean distance matrix completion problems, based on semidefinite optimization. Our first method works by relating cliques in the graph of the known distances to faces of the positive semidefinite cone, yielding a combinatorial procedure that is provably robust and parallelizable. Our second algorithm is a first order method for maximizing the trace (a popular low-rank inducing regularizer) in the formulation of the problem with a constrained misfit. Both of the methods output a point configuration that can serve as a high-quality initialization for local optimization techniques. Numerical experiments on large-scale sensor localization problems illustrate the two approaches.

preprint · 2015 · arXiv

Sweeping by a tame process

We show that any semi-algebraic sweeping process admits piecewise absolutely continuous solutions, and any such bounded trajectory must have finite length. Analogous results hold more generally for sweeping processes definable in o-minimal structures. This extends previous work on (sub)gradient dynamical systems beyond monotone sweeping sets.

preprint · 2014 · arXiv

Projection methods in quantum information science

We consider the problem of constructing quantum operations or channels, if they exist, that transform a given set of quantum states $\{\rho_1, \dots, \rho_k\}$ to another such set $\{\hat\rho_1, \dots, \hat\rho_k\}$. In other words, we must find a completely positive linear map, if it exists, that maps a given set of density matrices to another given set of density matrices. This problem, in turn, is an instance of a positive semi-definite feasibility problem, but with highly structured constraints. The nature of the constraints makes projection based algorithms very appealing when the number of variables is huge and standard interior-point methods for semi-definite programming are not applicable. We provide empirical evidence to this effect. We moreover present heuristics for finding both high rank and low rank solutions. Our experiments are based on the method of alternating projections and the Douglas-Rachford reflection method.
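The method of alternating projections for a positive semi-definite feasibility problem can be sketched on a much simpler stand-in than the quantum channel problem. The example below is an assumption made purely for illustration: it looks for a positive semi-definite matrix with unit diagonal, so the affine constraint set and the function names are hypothetical and are not the structured constraints of the paper.

```python
import numpy as np

def project_psd(X):
    """Frobenius-norm projection onto the PSD cone: clip negative eigenvalues."""
    X = (X + X.T) / 2.0
    vals, vecs = np.linalg.eigh(X)
    return (vecs * np.maximum(vals, 0.0)) @ vecs.T

def project_unit_diagonal(X):
    """Projection onto the affine set {X : diag(X) = 1}: overwrite the diagonal."""
    Y = X.copy()
    np.fill_diagonal(Y, 1.0)
    return Y

def alternating_projections(X0, tol=1e-10, max_iter=5000):
    """Alternate the two projections until successive projections nearly agree."""
    X = X0.copy()
    for _ in range(max_iter):
        Y = project_psd(X)                 # point in the PSD cone
        X_next = project_unit_diagonal(Y)  # point in the affine set
        if np.linalg.norm(X_next - Y) <= tol:
            return X_next
        X = X_next
    return X
```

Starting from a symmetric indefinite matrix with unit diagonal, the returned matrix has unit diagonal exactly and is positive semi-definite up to the stopping tolerance. The iteration finds some feasible point, not necessarily the one closest to the starting matrix.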

preprint · 2013 · arXiv

Orbits of geometric descent

We prove that quasiconvex functions always admit descent trajectories bypassing all non-minimizing critical points.

preprint · 2013 · arXiv

Orthogonal Invariance and Identifiability

Orthogonally invariant functions of symmetric matrices often inherit properties from their diagonal restrictions: von Neumann's theorem on matrix norms is an early example. We discuss the example of "identifiability", a common property of nonsmooth functions associated with the existence of a smooth manifold of approximate critical points. Identifiability (or its synonym, "partial smoothness") is the key idea underlying active set methods in optimization. Polyhedral functions, in particular, are always partly smooth, and hence so are many standard examples from eigenvalue optimization.

preprint · 2012 · arXiv

Clarke subgradients for directionally Lipschitzian stratifiable functions

Using a geometric argument, we show that under a reasonable continuity condition, the Clarke subdifferential of a semi-algebraic (or more generally stratifiable) directionally Lipschitzian function admits a simple form: the normal cone to the domain and limits of gradients generate the entire Clarke subdifferential. The characterization formula we obtain unifies various apparently disparate results that have appeared in the literature. Our techniques also yield a simplified proof that closed semialgebraic functions on $\mathbb{R}^n$ have a limiting subdifferential graph of uniform local dimension $n$.

preprint · 2012 · arXiv

Optimality, identifiability, and sensitivity

Around a solution of an optimization problem, an "identifiable" subset of the feasible region is one containing all nearby solutions after small perturbations to the problem. A quest for only the most essential ingredients of sensitivity analysis leads us to consider identifiable sets that are "minimal". This new notion lays a broad and intuitive variational-analytic foundation for optimality conditions, sensitivity, and active set methods.

preprint · 2012 · arXiv

Tilt stability, uniform quadratic growth, and strong metric regularity of the subdifferential

We prove that uniform second order growth, tilt stability, and strong metric regularity of the limiting subdifferential (three notions that have appeared in entirely different settings) are all essentially equivalent for any lower-semicontinuous, extended-real-valued function.

preprint · 2011 · arXiv

Complexity of a Single Face in an Arrangement of s-Intersecting Curves

Consider a face $F$ in an arrangement of $n$ Jordan curves in the plane, no two of which intersect more than $s$ times. We prove that the combinatorial complexity of $F$ is $O(\lambda_s(n))$, $O(\lambda_{s+1}(n))$, and $O(\lambda_{s+2}(n))$, when the curves are bi-infinite, semi-infinite, or bounded, respectively; $\lambda_k(n)$ is the maximum length of a Davenport-Schinzel sequence of order $k$ on an alphabet of $n$ symbols. Our bounds asymptotically match the known worst-case lower bounds. Our proof settles the still apparently open case of semi-infinite curves. Moreover, it treats the three cases in a fairly uniform fashion.

preprint · 2011 · arXiv

Semi-algebraic functions have small subdifferentials

We prove that the subdifferential of any semi-algebraic extended-real-valued function on $\mathbb{R}^n$ has $n$-dimensional graph. We discuss consequences for generic semi-algebraic optimization problems.

preprint · 2011 · arXiv

The dimension of semialgebraic subdifferential graphs

Examples exist of extended-real-valued closed functions on $\mathbb{R}^n$ whose subdifferentials (in the standard, limiting sense) have large graphs. By contrast, if such a function is semi-algebraic, then its subdifferential graph must have everywhere constant local dimension $n$. This result is related to a celebrated theorem of Minty, and surprisingly may fail for the Clarke subdifferential.

preprint · 2010 · arXiv

Generic nondegeneracy in convex optimization

We show that minimizers of convex functions subject to almost all linear perturbations are nondegenerate. An analogous result holds more generally, for lower-$C^2$ functions.
