Researcher profile

Elliot Paquette

Elliot Paquette contributes to research discovery and scholarly infrastructure.

Researcher, affiliation not imported, open to collaborate

Trust snapshot

Quick read

Trust 21 (Emerging), Verification L1, Unclaimed author
13 works
0 followers
10 topics
4 close collaborators

Published work

13 published items

Preprint, 2026, arXiv

Phases of Muon: When Muon Eclipses SignSGD

Recently, Muon and related spectral optimizers have demonstrated strong empirical performance as scalable stochastic methods, often outperforming Adam. Yet their behaviour remains poorly understood. We analyze stochastic spectral optimizers, including Muon, on a high-dimensional matrix-valued least squares problem. We derive explicit deterministic dynamics that provide a tractable framework for studying learning behaviour with a focus on (stochastic) SignSVD, which Muon approximates, and (stochastic) SignSGD, the latter serving as a proxy for Adam. Our analysis shows that for large batch size, SignSVD performs a square-root preconditioning with respect to the data covariance spectrum, while for small batch size smaller eigenmodes behave like SGD, slowing down convergence. We contrast with SignSGD which for generic covariance performs no preconditioning and has no transition, leading to different optimal learning rates and convergence characteristics. The two methods match up to a constant factor with isotropic data, but behave differently with anisotropic data. An analysis of a power law covariance model with data exponent $α$ and target exponent $β$ shows there are three phases in the $(α,β)$ plane: one where SignSGD is uniformly favored, one where SignSVD is uniformly favored, and a third where the two methods exhibit a trade-off in performance.
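The two update directions contrasted in this abstract can be illustrated with a small sketch. It assumes the usual definitions: SignSGD takes the entrywise sign of the gradient matrix, while SignSVD replaces the gradient G = U S V^T by its polar factor U V^T. The cubic Newton-Schulz iteration used here is a textbook way to approximate the polar factor without an explicit SVD; it is a stand-in chosen for illustration, not the paper's procedure or Muon's exact coefficients, and the helper names are hypothetical.

```python
def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def transpose(A):
    return [list(col) for col in zip(*A)]


def sign_entrywise(G):
    """SignSGD direction: the sign of each entry of the gradient matrix."""
    return [[(g > 0) - (g < 0) for g in row] for row in G]


def sign_svd(G, iters=30):
    """Approximate the polar factor U V^T of G (the SignSVD direction).

    Assumption: cubic Newton-Schulz iteration X <- 1.5 X - 0.5 X X^T X,
    started from G scaled by its Frobenius norm so that every singular
    value lies in (0, 1]. G is assumed to have full rank.
    """
    fro = sum(g * g for row in G for g in row) ** 0.5
    X = [[g / fro for g in row] for row in G]
    for _ in range(iters):
        XXtX = matmul(matmul(X, transpose(X)), X)
        X = [[1.5 * x - 0.5 * y for x, y in zip(rx, ry)]
             for rx, ry in zip(X, XXtX)]
    return X


G = [[3.0, 1.0], [-1.0, 2.0]]   # hypothetical gradient matrix
print(sign_entrywise(G))         # entrywise signs
print(sign_svd(G))               # approximately orthogonal matrix
```

The entrywise sign ignores the singular structure of G, whereas the polar factor sets every singular value to one, which is the sense in which the spectral update acts on the spectrum rather than on individual coordinates.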

Preprint, 2026, arXiv

Spectral Lens: Activation and Gradient Spectra as Diagnostics of LLM Optimization

Training loss and throughput can hide distinct internal representations in language-model training. To examine these hidden mechanics, we use spectral measurements as practical and operational diagnostics. Using a controlled family of decoder-only models adapted from the modded NanoGPT codebase, we introduce an empirical protocol based on activation covariance and per-sample gradient SVD spectra. This dual-view reveals three empirical findings and one mechanistic explanation. First, batch size acts as a latent determinant of representation geometry: runs that reach equal loss settle into systematically distinct activation spectra. Second, the activation covariance tail measured early in training reliably forecasts downstream token efficiency. Third, movement of the activation spectrum head (leading modes), together with gradient spectra, characterizes underlying learning-dynamics changes, separating learning-side architectural improvements from primarily execution-side gains. These predictive and diagnostic signals persist across the 12-, 36-, and 48-layer model tiers. Finally, a mechanistic model proves the main observations and explains how activation covariance spectra correlate with task-aligned feature learning.

Preprint, 2022, arXiv

Flatness of the nuclear norm sphere, simultaneous polarization, and uniqueness in nuclear norm minimization

In this paper we establish necessary and sufficient conditions for the existence of line segments (or flats) in the sphere of the nuclear norm via the notion of simultaneous polarization and a refined expression for the subdifferential of the nuclear norm. This is then leveraged to provide (point-based) necessary and sufficient conditions for uniqueness of solutions for minimizing the nuclear norm over an affine manifold. We further establish an alternative set of sufficient conditions for uniqueness, based on the interplay of the subdifferential of the nuclear norm and the range of the problem-defining linear operator. Finally, using convex duality, we show how to transfer the uniqueness results for the original problem to a whole class of nuclear norm-regularized minimization problems with a strictly convex fidelity term.

Preprint, 2022, arXiv

Homogenization of SGD in high-dimensions: Exact dynamics and generalization properties

We develop a stochastic differential equation, called homogenized SGD, for analyzing the dynamics of stochastic gradient descent (SGD) on a high-dimensional random least squares problem with $\ell^2$-regularization. We show that homogenized SGD is the high-dimensional equivalent of SGD: for any quadratic statistic (e.g., population risk with quadratic loss), the statistic under the iterates of SGD converges to the statistic under homogenized SGD when the number of samples $n$ and number of features $d$ are polynomially related ($d^c < n < d^{1/c}$ for some $c > 0$). By analyzing homogenized SGD, we provide exact non-asymptotic high-dimensional expressions for the generalization performance of SGD in terms of a solution of a Volterra integral equation. Further, we provide the exact value of the limiting excess risk in the case of quadratic losses when trained by SGD. The analysis is formulated for data matrices and target vectors that satisfy a family of resolvent conditions, which can roughly be viewed as a weak (non-quantitative) form of delocalization of sample-side singular vectors of the data. Several motivating applications are provided, including sample covariance matrices with independent samples and random features with non-generative model targets.

Preprint, 2022, arXiv

Implicit Regularization or Implicit Conditioning? Exact Risk Trajectories of SGD in High Dimensions

Stochastic gradient descent (SGD) is a pillar of modern machine learning, serving as the go-to optimization algorithm for a diverse array of problems. While the empirical success of SGD is often attributed to its computational efficiency and favorable generalization behavior, neither effect is well understood and disentangling them remains an open problem. Even in the simple setting of convex quadratic problems, worst-case analyses give an asymptotic convergence rate for SGD that is no better than full-batch gradient descent (GD), and the purported implicit regularization effects of SGD lack a precise explanation. In this work, we study the dynamics of multi-pass SGD on high-dimensional convex quadratics and establish an asymptotic equivalence to a stochastic differential equation, which we call homogenized stochastic gradient descent (HSGD), whose solutions we characterize explicitly in terms of a Volterra integral equation. These results yield precise formulas for the learning and risk trajectories, which reveal a mechanism of implicit conditioning that explains the efficiency of SGD relative to GD. We also prove that the noise from SGD negatively impacts generalization performance, ruling out the possibility of any type of implicit regularization in this context. Finally, we show how to adapt the HSGD formalism to include streaming SGD, which allows us to produce an exact prediction for the excess risk of multi-pass SGD relative to that of streaming SGD (bootstrap risk).

Preprint, 2022, arXiv

Spectra of Overlapping Wishart Matrices and the Gaussian Free Field

Consider a doubly-infinite array of iid centered variables with moment conditions, from which one can extract a finite number of rectangular, overlapping submatrices, and form the corresponding Wishart matrices. We show that under basic smoothness assumptions, centered linear eigenstatistics of such matrices converge jointly to a Gaussian vector with an interesting covariance structure. This structure, which is similar to those appearing in work of Borodin, Borodin and Gorin, and Johnson and Pal, can be described in terms of the height function, and leads to a connection with the Gaussian Free Field on the upper half-plane. Finally, we generalize our results from univariate polynomials to a special class of planar functions.

Preprint, 2022, arXiv

Strong approximation of Gaussian beta-ensemble characteristic polynomials: the hyperbolic regime

We investigate the characteristic polynomials $φ_N$ of the Gaussian $β$-ensemble for general $β>0$ through its transfer matrix recurrence. Our motivation is to obtain a (probabilistic) approximation for $φ_N$ in terms of a Gaussian log-correlated field in order to ultimately deduce some of its fine asymptotic properties. We distinguish between different types of transfer matrices and analyze completely the hyperbolic regime of the recurrence. As a result, we obtain a new coupling between $φ_N(z)$ and a Gaussian analytic function with an error which is uniform for $z \in \mathbb{C}$ separated from the support of the semicircle law. We use this as input to give the almost sure scaling limit of the characteristic polynomial at the edge in arXiv:2009.05003. This is also required to obtain analogous strong approximations inside of the bulk of the semicircle law. Our analysis relies on moderate deviation estimates for the product of transfer matrices and this approach might also be useful in different contexts.

Preprint, 2022, arXiv

Trajectory of Mini-Batch Momentum: Batch Size Saturation and Convergence in High Dimensions

We analyze the dynamics of large batch stochastic gradient descent with momentum (SGD+M) on the least squares problem when both the number of samples and dimensions are large. In this setting, we show that the dynamics of SGD+M converge to a deterministic discrete Volterra equation as dimension increases, which we analyze. We identify a stability measurement, the implicit conditioning ratio (ICR), which regulates the ability of SGD+M to accelerate the algorithm. When the batch size exceeds this ICR, SGD+M converges linearly at a rate of $\mathcal{O}(1/\sqrt{\kappa})$, matching optimal full-batch momentum (in particular performing as well as a full-batch but with a fraction of the size). For batch sizes smaller than the ICR, in contrast, SGD+M has rates that scale like a multiple of the single batch SGD rate. We give explicit choices for the learning rate and momentum parameter in terms of the Hessian spectra that achieve this performance.

Preprint, 2021, arXiv

SGD in the Large: Average-case Analysis, Asymptotics, and Stepsize Criticality

We propose a new framework, inspired by random matrix theory, for analyzing the dynamics of stochastic gradient descent (SGD) when both number of samples and dimensions are large. This framework applies to any fixed stepsize and the finite sum setting. Using this new framework, we show that the dynamics of SGD on a least squares problem with random data become deterministic in the large sample and dimensional limit. Furthermore, the limiting dynamics are governed by a Volterra integral equation. This model predicts that SGD undergoes a phase transition at an explicitly given critical stepsize that ultimately affects its convergence rate, which we also verify experimentally. Finally, when input data is isotropic, we provide explicit expressions for the dynamics and average-case convergence rates (i.e., the complexity of an algorithm averaged over all possible inputs). These rates show significant improvement over the worst-case complexities.
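The setting described in this abstract (fixed stepsize, finite-sum least squares, random data) can be simulated in a few lines. The sketch below is only an illustration under assumptions of my own choosing: Gaussian data with entries of variance 1/d, noiseless targets, one sample per step, and a hypothetical function name. It records the empirical risk along the run and does not evaluate the Volterra equation or the critical stepsize from the paper.

```python
import random


def sgd_least_squares(n=200, d=20, stepsize=0.5, steps=2000, seed=0):
    """Run single-sample SGD on f(x) = 1/(2n) * sum_i (a_i . x - b_i)^2.

    Assumptions for illustration: rows a_i have i.i.d. N(0, 1/d) entries
    and the targets are noiseless, b = A x_star. Returns the loss at the
    start and after every 100 steps.
    """
    rng = random.Random(seed)
    A = [[rng.gauss(0.0, 1.0) / d ** 0.5 for _ in range(d)] for _ in range(n)]
    x_star = [rng.gauss(0.0, 1.0) for _ in range(d)]
    b = [sum(a * s for a, s in zip(row, x_star)) for row in A]

    def loss(x):
        total = 0.0
        for row, b_i in zip(A, b):
            total += (sum(a * v for a, v in zip(row, x)) - b_i) ** 2
        return total / (2 * n)

    x = [0.0] * d
    history = [loss(x)]
    for t in range(steps):
        i = rng.randrange(n)                      # pick one sample uniformly
        row = A[i]
        residual = sum(a * v for a, v in zip(row, x)) - b[i]
        # Stochastic gradient of the i-th term is residual * a_i.
        x = [v - stepsize * residual * a for v, a in zip(x, row)]
        if (t + 1) % 100 == 0:
            history.append(loss(x))
    return history


history = sgd_least_squares()
print(history[0], history[-1])
```

Rerunning with different seeds at larger n and d is one way to see the concentration the abstract refers to: the recorded loss curves should become harder to tell apart as the problem grows.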

Preprint, 2020, arXiv

Interval fragmentations with choice: equidistribution and the evolution of tagged fragments

We consider a Markovian evolution on point processes, the $Ψ$-process, on the unit interval in which points are added according to a rule that depends only on the spacings of the existing point configuration. Having chosen a spacing, a new point is added uniformly within it. Building on previous work of the authors and of Junge, we show that the empirical distribution of points in such a process is always equidistributed under mild assumptions on the rule, generalizing work of Junge. A major portion of this article is devoted to the study of a particular growth-fragmentation process, or cell process, which is a type of piecewise-deterministic Markov process (PDMP). This process represents a linearized version of a size-biased sampling from the $Ψ$-process. We show that this PDMP is ergodic and develop the semigroup theory of it, to show that it describes a linearized version of the $Ψ$-process. This PDMP has appeared in other contexts, and in some sense we develop its theory under minimal assumptions.

Preprint, 2020, arXiv

Topology of random 2-dimensional cubical complexes

We study a natural model of random 2-dimensional cubical complex which is a subcomplex of an $n$-dimensional cube, and where every possible square $2$-face is included independently with probability $p$. Our main result is to exhibit a sharp threshold $p=1/2$ for homology vanishing as $n \to \infty$. This is a 2-dimensional analogue of the Burtin and Erdős-Spencer theorems characterizing the connectivity threshold for random cubical graphs. Our main result can also be seen as a cubical counterpart to the Linial-Meshulam theorem for random 2-dimensional simplicial complexes. However, the models exhibit strikingly different behaviors. We show that if $p > 1 - \sqrt{1/2} \approx 0.2929$, then with high probability the fundamental group is a free group with one generator for every maximal $1$-dimensional face. As a corollary, homology vanishing and simple connectivity have the same threshold, even in the strong "hitting time" sense. This is in contrast with the simplicial case, where the thresholds are far apart. The proof depends on an iterative algorithm for contracting cycles: we show that with high probability the algorithm rapidly and dramatically simplifies the fundamental group, converging after only a few steps.

Preprint, 2020, arXiv

Universality for the conjugate gradient and MINRES algorithms on sample covariance matrices

We present a probabilistic analysis of two Krylov subspace methods for solving linear systems. We prove a central limit theorem for norms of the residual vectors that are produced by the conjugate gradient and MINRES algorithms when applied to a wide class of sample covariance matrices satisfying some standard moment conditions. The proof involves establishing a four moment theorem for the so-called spectral measure, implying, in particular, universality for the matrix produced by the Lanczos iteration. The central limit theorem then implies an almost-deterministic iteration count for the iterative methods in question.
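The quantity studied in this abstract, the residual norm of the conjugate gradient iteration on a sample covariance system, is easy to compute directly. The sketch below is a minimal standard-library illustration: it assumes Gaussian data, a Gaussian right-hand side, and small dimensions chosen for speed, and the function names are hypothetical. It records the residual norms only and does not attempt to reproduce the central limit theorem or the MINRES variant.

```python
import random


def sample_covariance(n, d, rng):
    """Return W = X^T X / n for an n-by-d matrix X of i.i.d. N(0, 1) entries."""
    X = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    return [[sum(X[k][i] * X[k][j] for k in range(n)) / n
             for j in range(d)] for i in range(d)]


def conjugate_gradient_residuals(W, b, iters):
    """Run standard CG on W x = b from x = 0.

    Returns the final iterate and the list of residual norms ||b - W x_k||
    for k = 0, ..., iters.
    """
    x = [0.0] * len(b)
    r = list(b)                       # residual at x = 0
    p = list(r)                       # first search direction
    rs_old = sum(v * v for v in r)
    norms = [rs_old ** 0.5]
    for _ in range(iters):
        Wp = [sum(w * q for w, q in zip(row, p)) for row in W]
        alpha = rs_old / sum(q * w for q, w in zip(p, Wp))
        x = [xi + alpha * q for xi, q in zip(x, p)]
        r = [ri - alpha * w for ri, w in zip(r, Wp)]
        rs_new = sum(v * v for v in r)
        norms.append(rs_new ** 0.5)
        p = [ri + (rs_new / rs_old) * q for ri, q in zip(r, p)]
        rs_old = rs_new
    return x, norms


rng = random.Random(0)
W = sample_covariance(n=80, d=20, rng=rng)
b = [rng.gauss(0.0, 1.0) for _ in range(20)]
x, norms = conjugate_gradient_residuals(W, b, iters=15)
print(norms)
```

Repeating the experiment over many seeds and comparing the residual norm at a fixed iteration is a simple way to look at the fluctuations the abstract describes.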

Preprint, 2019, arXiv

On the speed of distance stationary sequences

We prove a formula for the speed of distance stationary random sequences generalizing the law of large numbers of Karlsson and Ledrappier. A particular case is the classical formula for the largest Lyapunov exponent of i.i.d. matrix products, but our result has applications in various different contexts. In many situations it gives a method to estimate the speed, and in others it allows one to obtain results of dimension drop for escape measures related to random walks. We show applications to stationary reversible random trees with conductances, Bernoulli bond percolation of Cayley graphs, and random walks on cocompact Fuchsian groups.