Source author record

Elliot Paquette

Elliot Paquette appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher, unclaimed source record

math.PR, Machine Learning, math.CO, math.OC, math.AT, math.ST, Statistics Theory, math.AP, math.DS, math.MG, math.NA, Numerical Analysis

Catalog footprint

What is connected

22 works

12 topics

4 close collaborators

preprint 2026 arXiv

Phases of Muon: When Muon Eclipses SignSGD

Recently, Muon and related spectral optimizers have demonstrated strong empirical performance as scalable stochastic methods, often outperforming Adam. Yet their behaviour remains poorly understood. We analyze stochastic spectral optimizers, including Muon, on a high-dimensional matrix-valued least squares problem. We derive explicit deterministic dynamics that provide a tractable framework for studying learning behaviour with a focus on (stochastic) SignSVD, which Muon approximates, and (stochastic) SignSGD, the latter serving as a proxy for Adam. Our analysis shows that for large batch size, SignSVD performs a square-root preconditioning with respect to the data covariance spectrum, while for small batch size smaller eigenmodes behave like SGD, slowing down convergence. We contrast with SignSGD which for generic covariance performs no preconditioning and has no transition, leading to different optimal learning rates and convergence characteristics. The two methods match up to a constant factor with isotropic data, but behave differently with anisotropic data. An analysis of a power law covariance model with data exponent $α$ and target exponent $β$ shows there are three phases in the $(α,β)$ plane: one where SignSGD is uniformly favored, one where SignSVD is uniformly favored, and a third where the two methods exhibit a trade-off in performance.

preprint 2026 arXiv

Spectral Lens: Activation and Gradient Spectra as Diagnostics of LLM Optimization

Training loss and throughput can hide distinct internal representations in language-model training. To examine these hidden mechanics, we use spectral measurements as practical and operational diagnostics. Using a controlled family of decoder-only models adapted from the modded NanoGPT codebase, we introduce an empirical protocol based on activation covariance and per-sample gradient SVD spectra. This dual view reveals three empirical findings and one mechanistic explanation. First, batch size acts as a latent determinant of representation geometry: runs that reach equal loss settle into systematically distinct activation spectra. Second, the activation covariance tail measured early in training reliably forecasts downstream token efficiency. Third, movement of the activation spectrum head (leading modes), together with gradient spectra, characterizes underlying learning-dynamics changes, separating learning-side architectural improvements from primarily execution-side gains. These predictive and diagnostic signals persist across the 12-, 36-, and 48-layer model tiers. Finally, a mechanistic model proves the main observations and explains how activation covariance spectra correlate with task-aligned feature learning.

preprint 2022 arXiv

Flatness of the nuclear norm sphere, simultaneous polarization, and uniqueness in nuclear norm minimization

In this paper we establish necessary and sufficient conditions for the existence of line segments (or flats) in the sphere of the nuclear norm via the notion of simultaneous polarization and a refined expression for the subdifferential of the nuclear norm. This is then leveraged to provide (point-based) necessary and sufficient conditions for uniqueness of solutions for minimizing the nuclear norm over an affine manifold. We further establish an alternative set of sufficient conditions for uniqueness, based on the interplay of the subdifferential of the nuclear norm and the range of the problem-defining linear operator. Finally, using convex duality, we show how to transfer the uniqueness results for the original problem to a whole class of nuclear norm-regularized minimization problems with a strictly convex fidelity term.

preprint 2022 arXiv

Homogenization of SGD in high-dimensions: Exact dynamics and generalization properties

We develop a stochastic differential equation, called homogenized SGD, for analyzing the dynamics of stochastic gradient descent (SGD) on a high-dimensional random least squares problem with $\ell^2$-regularization. We show that homogenized SGD is the high-dimensional equivalent of SGD: for any quadratic statistic (e.g., population risk with quadratic loss), the statistic under the iterates of SGD converges to the statistic under homogenized SGD when the number of samples $n$ and number of features $d$ are polynomially related ($d^c < n < d^{1/c}$ for some $c > 0$). By analyzing homogenized SGD, we provide exact non-asymptotic high-dimensional expressions for the generalization performance of SGD in terms of a solution of a Volterra integral equation. Further, we provide the exact value of the limiting excess risk in the case of quadratic losses when trained by SGD. The analysis is formulated for data matrices and target vectors that satisfy a family of resolvent conditions, which can roughly be viewed as a weak (non-quantitative) form of delocalization of sample-side singular vectors of the data. Several motivating applications are provided, including sample covariance matrices with independent samples and random features with non-generative model targets.

preprint 2022 arXiv

Implicit Regularization or Implicit Conditioning? Exact Risk Trajectories of SGD in High Dimensions

Stochastic gradient descent (SGD) is a pillar of modern machine learning, serving as the go-to optimization algorithm for a diverse array of problems. While the empirical success of SGD is often attributed to its computational efficiency and favorable generalization behavior, neither effect is well understood and disentangling them remains an open problem. Even in the simple setting of convex quadratic problems, worst-case analyses give an asymptotic convergence rate for SGD that is no better than full-batch gradient descent (GD), and the purported implicit regularization effects of SGD lack a precise explanation. In this work, we study the dynamics of multi-pass SGD on high-dimensional convex quadratics and establish an asymptotic equivalence to a stochastic differential equation, which we call homogenized stochastic gradient descent (HSGD), whose solutions we characterize explicitly in terms of a Volterra integral equation. These results yield precise formulas for the learning and risk trajectories, which reveal a mechanism of implicit conditioning that explains the efficiency of SGD relative to GD. We also prove that the noise from SGD negatively impacts generalization performance, ruling out the possibility of any type of implicit regularization in this context. Finally, we show how to adapt the HSGD formalism to include streaming SGD, which allows us to produce an exact prediction for the excess risk of multi-pass SGD relative to that of streaming SGD (bootstrap risk).

preprint 2022 arXiv

Spectra of Overlapping Wishart Matrices and the Gaussian Free Field

Consider a doubly-infinite array of iid centered variables with moment conditions, from which one can extract a finite number of rectangular, overlapping submatrices, and form the corresponding Wishart matrices. We show that under basic smoothness assumptions, centered linear eigenstatistics of such matrices converge jointly to a Gaussian vector with an interesting covariance structure. This structure, which is similar to those appearing in work of Borodin, Borodin and Gorin, and Johnson and Pal, can be described in terms of the height function, and leads to a connection with the Gaussian Free Field on the upper half-plane. Finally, we generalize our results from univariate polynomials to a special class of planar functions.

preprint 2022 arXiv

Strong approximation of Gaussian beta-ensemble characteristic polynomials: the hyperbolic regime

We investigate the characteristic polynomials $φ_N$ of the Gaussian $β$-ensemble for general $β>0$ through its transfer matrix recurrence. Our motivation is to obtain a (probabilistic) approximation for $φ_N$ in terms of a Gaussian log-correlated field in order to ultimately deduce some of its fine asymptotic properties. We distinguish between different types of transfer matrices and analyze completely the hyperbolic regime of the recurrence. As a result, we obtain a new coupling between $φ_N(z)$ and a Gaussian analytic function with an error which is uniform for $z \in \mathbb{C}$ separated from the support of the semicircle law. We use this as input to give the almost sure scaling limit of the characteristic polynomial at the edge in arXiv:2009.05003. This is also required to obtain analogous strong approximations inside of the bulk of the semicircle law. Our analysis relies on moderate deviation estimates for the product of transfer matrices and this approach might also be useful in different contexts.

preprint 2022 arXiv

Trajectory of Mini-Batch Momentum: Batch Size Saturation and Convergence in High Dimensions

We analyze the dynamics of large batch stochastic gradient descent with momentum (SGD+M) on the least squares problem when both the number of samples and dimensions are large. In this setting, we show that the dynamics of SGD+M converge to a deterministic discrete Volterra equation as dimension increases, which we analyze. We identify a stability measurement, the implicit conditioning ratio (ICR), which regulates the ability of SGD+M to accelerate the algorithm. When the batch size exceeds this ICR, SGD+M converges linearly at a rate of $\mathcal{O}(1/\sqrt{κ})$, matching optimal full-batch momentum (in particular performing as well as a full-batch but with a fraction of the size). For batch sizes smaller than the ICR, in contrast, SGD+M has rates that scale like a multiple of the single batch SGD rate. We give explicit choices for the learning rate and momentum parameter in terms of the Hessian spectra that achieve this performance.

preprint 2021 arXiv

SGD in the Large: Average-case Analysis, Asymptotics, and Stepsize Criticality

We propose a new framework, inspired by random matrix theory, for analyzing the dynamics of stochastic gradient descent (SGD) when both the number of samples and the number of dimensions are large. This framework applies to any fixed stepsize and the finite sum setting. Using this new framework, we show that the dynamics of SGD on a least squares problem with random data become deterministic in the large sample and dimensional limit. Furthermore, the limiting dynamics are governed by a Volterra integral equation. This model predicts that SGD undergoes a phase transition at an explicitly given critical stepsize that ultimately affects its convergence rate, which we also verify experimentally. Finally, when input data is isotropic, we provide explicit expressions for the dynamics and average-case convergence rates (i.e., the complexity of an algorithm averaged over all possible inputs). These rates show significant improvement over the worst-case complexities.
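The setting in this abstract (fixed-stepsize SGD on a finite-sum least squares problem with isotropic random data) can be sketched in a few lines. This is only an illustrative simulation, not the paper's experiment: the sizes `n` and `d`, the stepsize, the noiseless targets, and the function name `sgd_least_squares` are all assumptions chosen for the sketch, and it does not compute the Volterra equation itself.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sgd_least_squares(n=200, d=50, stepsize=0.5, epochs=20, seed=0):
    """Single-sample SGD on f(x) = (1/2n) * sum_i (a_i . x - b_i)^2.

    Assumed data model (hypothetical): rows a_i are i.i.d. Gaussian with
    variance 1/d per entry, and the targets b_i are noiseless.
    """
    rng = random.Random(seed)
    A = [[rng.gauss(0.0, 1.0) / d ** 0.5 for _ in range(d)] for _ in range(n)]
    x_star = [rng.gauss(0.0, 1.0) for _ in range(d)]
    b = [dot(row, x_star) for row in A]

    def loss(x):
        return sum((dot(row, x) - bi) ** 2 for row, bi in zip(A, b)) / (2 * n)

    x = [0.0] * d
    history = [loss(x)]                 # loss recorded once per epoch
    for _ in range(epochs):
        for _ in range(n):
            i = rng.randrange(n)        # sample one term of the finite sum
            residual = dot(A[i], x) - b[i]
            x = [xj - stepsize * residual * aj for xj, aj in zip(x, A[i])]
        history.append(loss(x))
    return history

history = sgd_least_squares()
print("initial loss:", history[0], "final loss:", history[-1])
```

Rerunning with different seeds and larger `n` and `d` at a fixed ratio is one way to see the loss curves concentrate, which is the kind of deterministic limit the abstract describes.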

preprint 2020 arXiv

Interval fragmentations with choice: equidistribution and the evolution of tagged fragments

We consider a Markovian evolution on point processes, the $Ψ$-process, on the unit interval in which points are added according to a rule that depends only on the spacings of the existing point configuration. Having chosen a spacing, a new point is added uniformly within it. Building on previous work of the authors and of Junge, we show that the empirical distribution of points in such a process is always equidistributed under mild assumptions on the rule, generalizing work of Junge. A major portion of this article is devoted to the study of a particular growth-fragmentation process, or cell process, which is a type of piecewise-deterministic Markov process (PDMP). This process represents a linearized version of a size-biased sampling from the $Ψ$-process. We show that this PDMP is ergodic and develop the semigroup theory of it, to show that it describes a linearized version of the $Ψ$-process. This PDMP has appeared in other contexts, and in some sense we develop its theory under minimal assumptions.

preprint 2020 arXiv

Topology of random 2-dimensional cubical complexes

We study a natural model of random 2-dimensional cubical complex which is a subcomplex of an $n$-dimensional cube, and where every possible square $2$-face is included independently with probability $p$. Our main result is to exhibit a sharp threshold $p = 1/2$ for homology vanishing as $n \to \infty$. This is a 2-dimensional analogue of the Burtin and Erdős-Spencer theorems characterizing the connectivity threshold for random cubical graphs. Our main result can also be seen as a cubical counterpart to the Linial-Meshulam theorem for random 2-dimensional simplicial complexes. However, the models exhibit strikingly different behaviors. We show that if $p > 1 - \sqrt{1/2} \approx 0.2929$, then with high probability the fundamental group is a free group with one generator for every maximal $1$-dimensional face. As a corollary, homology vanishing and simple connectivity have the same threshold, even in the strong "hitting time" sense. This is in contrast with the simplicial case, where the thresholds are far apart. The proof depends on an iterative algorithm for contracting cycles: we show that with high probability the algorithm rapidly and dramatically simplifies the fundamental group, converging after only a few steps.

preprint 2020 arXiv

Universality for the conjugate gradient and MINRES algorithms on sample covariance matrices

We present a probabilistic analysis of two Krylov subspace methods for solving linear systems. We prove a central limit theorem for norms of the residual vectors that are produced by the conjugate gradient and MINRES algorithms when applied to a wide class of sample covariance matrices satisfying some standard moment conditions. The proof involves establishing a four moment theorem for the so-called spectral measure, implying, in particular, universality for the matrix produced by the Lanczos iteration. The central limit theorem then implies an almost-deterministic iteration count for the iterative methods in question.
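As a rough illustration of the object studied here, the sketch below runs plain conjugate gradient on two independently drawn sample covariance matrices and prints the residual norms side by side. It is not the paper's setup: the dimensions, the Gaussian entries, the fixed right-hand side, and the helper names `sample_covariance` and `cg_residual_norms` are assumptions made for this example, and it uses only the Python standard library.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(M, v):
    return [dot(row, v) for row in M]

def sample_covariance(n, d, rng):
    """Return W = X^T X / n for an n-by-d matrix X of i.i.d. standard Gaussians."""
    X = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    return [[sum(X[k][i] * X[k][j] for k in range(n)) / n for j in range(d)]
            for i in range(d)]

def cg_residual_norms(W, b, iterations):
    """Run conjugate gradient on W x = b from x = 0; return the residual norms."""
    x = [0.0] * len(b)
    r = list(b)            # residual b - W x at x = 0
    p = list(r)            # first search direction
    rs = dot(r, r)
    norms = [rs ** 0.5]
    for _ in range(iterations):
        Wp = matvec(W, p)
        alpha = rs / dot(p, Wp)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * wi for ri, wi in zip(r, Wp)]
        rs_new = dot(r, r)
        norms.append(rs_new ** 0.5)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return norms

d, n = 40, 120
b = [1.0 / d ** 0.5] * d   # fixed unit-norm right-hand side
runs = [cg_residual_norms(sample_covariance(n, d, random.Random(seed)), b, 10)
        for seed in (0, 1)]
for k in (0, 5, 10):
    print(k, [f"{run[k]:.3e}" for run in runs])
```

Comparing the two columns across seeds, and again at larger `n` and `d`, gives an informal feel for the near-deterministic residual behaviour that the abstract makes precise.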

preprint 2019 arXiv

On the speed of distance stationary sequences

We prove a formula for the speed of distance stationary random sequences generalizing the law of large numbers of Karlsson and Ledrappier. A particular case is the classical formula for the largest Lyapunov exponent of i.i.d. matrix products, but our result has applications in various different contexts. In many situations it gives a method to estimate the speed, and in others it allows one to obtain results of dimension drop for escape measures related to random walks. We show applications to stationary reversible random trees with conductances, Bernoulli bond percolation of Cayley graphs, and random walks on cocompact Fuchsian groups.

preprint 2015 arXiv

Birkhoff sum fluctuations in substitution dynamical systems

We consider the deviation of Birkhoff sums along fixed orbits of substitution dynamical systems. We show distributional convergence for the Birkhoff sums of eigenfunctions of the substitution matrix. For noncoboundary eigenfunctions with eigenvalue of modulus 1, we obtain a central limit theorem. For other eigenfunctions, we show convergence to distributions supported on Cantor sets. We also give a new criterion for such an eigenfunction to be a coboundary, as well as a new characterization of substitution dynamical systems with bounded discrepancy.

preprint 2015 arXiv

Extremal eigenvalue fluctuations in the GUE minor process and the law of fractional logarithm

We consider the GUE minor process, where a sequence of GUE matrices is drawn from the corner of a doubly infinite array of i.i.d. standard normal variables subject to the symmetry constraint. From each matrix, we take its largest eigenvalue, appropriately rescaled to converge to the standard Tracy-Widom distribution. We show the analogue of the law of iterated logarithm for this sequence, i.e. we divide the normalized n-th eigenvalue by a logarithmic factor and show the limsup of this sequence is a constant almost surely. We also give almost sure bounds for the appropriately scaled liminf.

preprint 2015 arXiv

Quantitative Small Subgraph Conditioning

We revisit the method of small subgraph conditioning, used to establish that random regular graphs are Hamiltonian a.a.s. We refine this method using new technical machinery for random $d$-regular graphs on $n$ vertices that holds not just asymptotically, but for any values of $d$ and $n$. This lets us estimate how quickly the probability of containing a Hamiltonian cycle converges to 1, and it produces quantitative contiguity results between different models of random regular graphs. These results hold with $d$ held fixed or growing to infinity with $n$. As additional applications, we establish the distributional convergence of the number of Hamiltonian cycles when $d$ grows slowly to infinity, and we prove that the number of Hamiltonian cycles can be approximately computed from the graph's eigenvalues for almost all regular graphs.

preprint 2014 arXiv

Anchored expansion, speed, and the hyperbolic Poisson Voronoi tessellation

We show that random walk on a stationary random graph with positive anchored expansion and exponential volume growth has positive speed. We also show that two families of random triangulations of the hyperbolic plane, the hyperbolic Poisson Voronoi tessellation and the hyperbolic Poisson Delaunay triangulation, have 1-skeletons with positive anchored expansion. As a consequence, we show that the simple random walks on these graphs have positive speed. We include a section of open problems and conjectures on the topics of stationary geometric random graphs and the hyperbolic Poisson Voronoi tessellation.

preprint 2014 arXiv

Regularization of non-normal matrices by Gaussian noise

We consider the regularization of matrices $M^N$ written in Jordan form by additive Gaussian noise $N^{-γ}G^N$, where $G^N$ is a matrix of i.i.d. standard Gaussians and $γ>1/2$ so that the operator norm of the additive noise tends to $0$ with $N$. Under mild conditions on the structure of $M^N$ we evaluate the limit of the empirical measure of eigenvalues of $M^N+N^{-γ} G^N$ and show that it depends on $γ$, in contrast with the case of a single Jordan block.

preprint 2014 arXiv

The power of 2 choices over preferential attachment

We introduce a new type of preferential attachment tree that includes choices in its evolution, like with Achlioptas processes. At each step in the growth of the graph, a new vertex is introduced. Two possible neighbor vertices are selected independently and with probability proportional to degree. Between the two, the vertex with smaller degree is chosen, and a new edge is created. We determine with high probability the largest degree of this graph up to some additive error term.
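The growth rule described in this abstract is simple enough to simulate directly. The sketch below is a minimal illustration under stated assumptions: the tree starts from a single edge, ties between the two candidates go to the first one sampled, and the function name `grow_min_choice_tree` is hypothetical. None of these details are taken from the paper.

```python
import random

def grow_min_choice_tree(steps, seed=0):
    """Grow a tree where each new vertex attaches to the lower-degree of two
    candidates, each sampled independently with probability proportional to degree."""
    rng = random.Random(seed)
    degree = [1, 1]        # assumed start: one edge between vertices 0 and 1
    endpoints = [0, 1]     # every vertex appears once per unit of degree
    for _ in range(steps):
        # A uniform draw from the endpoint list is a degree-proportional draw.
        a = rng.choice(endpoints)
        b = rng.choice(endpoints)
        target = a if degree[a] <= degree[b] else b   # keep the smaller degree
        new_vertex = len(degree)
        degree.append(1)
        degree[target] += 1
        endpoints.extend((target, new_vertex))
    return degree

degrees = grow_min_choice_tree(10_000)
print("vertices:", len(degrees), "max degree:", max(degrees))
```

Replacing the comparison with `>=` would select the larger-degree candidate instead, which changes the model and should be expected to change how the maximum degree grows.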

preprint 2014 arXiv

The power of choice combined with preferential attachment

We prove almost sure convergence of the maximum degree in an evolving tree model combining local choice and preferential attachment. At each step in the growth of the graph, a new vertex is introduced. A fixed, finite number of possible neighbors are sampled from the existing vertices with probability proportional to degree. Of these possibilities, the vertex with the largest degree is chosen. The maximal degree in this model has linear or near-linear behavior. This contrasts sharply with what is seen in the same choice model without preferential attachment. The proof is based on showing the tree has a persistent hub by comparison with the standard preferential attachment model, as well as martingale and random walk arguments.

preprint 2014 arXiv

The threshold for integer homology in random d-complexes

Let $Y \sim Y_d(n,p)$ denote the Bernoulli random $d$-dimensional simplicial complex. We answer a question of Linial and Meshulam from 2003, showing that the threshold for vanishing of homology $H_{d-1}(Y; \mathbb{Z})$ is less than $80 d \log n / n$. This bound is tight, up to a constant factor.

preprint 2012 arXiv

Global Fluctuations for Linear Statistics of β-Jacobi Ensembles

We study the global fluctuations for linear statistics of the form $\sum_{i=1}^n f(λ_i)$ as $n \rightarrow \infty$, for $C^1$ functions $f$, and $λ_1, \ldots, λ_n$ being the eigenvalues of a (general) $β$-Jacobi ensemble, for which tridiagonal models were given by Killip and Nenciu as well as Edelman and Sutton. The fluctuation from the mean ($\sum_{i=1}^n f(λ_i) - \mathbb{E} \sum_{i=1}^n f(λ_i)$) is given asymptotically by a Gaussian process. We compute the covariance matrix for the process and show that it is diagonalized by a shifted Chebyshev polynomial basis; in addition, we analyze the deviation from the predicted mean for polynomial test functions, and we obtain a law of large numbers.

Institution

Affiliation not imported yet

This author record came from a source that does not expose affiliation metadata. Once the author claims the profile or we enrich the record from another provider, this section will link to the concrete institution.

Source provenance

Where this author record came from

arxiv, confidence 95%

external id: arxiv:2605.05683:author:2:elliot-paquette

Imported May 20, 2026; synced May 21, 2026

arxiv, confidence 95%

external id: arxiv:2605.09552:author:1:elliot-paquette

Imported May 20, 2026; synced May 21, 2026

5 works

Courtney Paquette

Researcher

2 works

Ben Adlam

Researcher

2 works

Ioana Dumitriu

Researcher

2 works

Jeffrey Pennington

Researcher
