Researcher profile

Philippe Rigollet

Philippe Rigollet contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
15works
0followers
12topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

15 published item(s)

preprint2026arXiv

Propagation of Chaos in Contextual Flow Maps

We develop a quantitative statistical theory of transformers in the large-context regime by adopting the abstraction of contextual flow maps (CFMs): dynamical systems that evolve a distinguished token in the presence of a contextual measure across a stack of attention blocks. Within this framework, the finite-context model approximates an idealized infinite-context system in which the contextual measure is replaced by its underlying population, so that the context length $n$ becomes a statistical resource. Exploiting the McKean--Vlasov structure of the dynamics and the classical machinery of propagation of chaos, we establish a forward bound controlling the deviation between the finite- and infinite-context CFMs uniformly along depth, and a backward bound controlling the deviation between the corresponding training trajectories uniformly across iterations of online gradient descent. Both bounds achieve the optimal Wasserstein rate $n^{-1/d}$ for general CFMs and parametric rate $n^{-1/2}$ for a restricted class of CFMs that includes transformers as a special case. The analysis rests on a new Eulerian adjoint formulation of the loss gradient and stability estimates for the resulting forward--adjoint system, both of which may be of independent interest.

preprint2026arXiv

Scaling Limits of Long-Context Transformers

We study the long-context limit of softmax self-attention with a fixed query and a random context of $n$ i.i.d. keys on the sphere, viewing the inverse temperature $β_n$ as the scaling parameter that decides whether attention degenerates into uniform averaging or collapses onto the single closest key. We show that the critical scale at which selectivity emerges is determined by the local exponent of the distance-to-query distribution near zero rather than by global features of the context, and scales like $β_n^\ast \asymp n^{2/(d-1)}$ for uniform keys on $\mathbb{S}^{d-1}$. Furthermore, we characterize the limiting laws of the ordered attention weights and of the attention output across all regimes of $β_n$: a subcritical regime in which the output reduces to a local average around $q$ with explicit deterministic bias and Gaussian fluctuations; a critical regime in which a finite collection of nearest keys retains macroscopic mass without single-key collapse; and a supercritical regime in which all mass concentrates on the closest key. Of notable interest is the subcritical case with identity value matrix where the attention map approximately implements a backward heat equation.

preprint2024arXiv

On the Approximation Accuracy of Gaussian Variational Inference

The main computational challenge in Bayesian inference is to compute integrals against a high-dimensional posterior distribution. In the past decades, variational inference (VI) has emerged as a tractable approximation to these integrals, and a viable alternative to the more established paradigm of Markov Chain Monte Carlo. However, little is known about the approximation accuracy of VI. In this work, we bound the TV error and the mean and covariance approximation error of Gaussian VI in terms of dimension and sample size. Our error analysis relies on a Hermite series expansion of the log posterior whose first terms are precisely cancelled out by the first order optimality conditions associated to the Gaussian VI optimization problem.

preprint2023arXiv

Learning Gaussian Mixtures Using the Wasserstein-Fisher-Rao Gradient Flow

Gaussian mixture models form a flexible and expressive parametric family of distributions that has found applications in a wide variety of applications. Unfortunately, fitting these models to data is a notoriously hard problem from a computational perspective. Currently, only moment-based methods enjoy theoretical guarantees while likelihood-based methods are dominated by heuristics such as Expectation-Maximization that are known to fail in simple examples. In this work, we propose a new algorithm to compute the nonparametric maximum likelihood estimator (NPMLE) in a Gaussian mixture model. Our method is based on gradient descent over the space of probability measures equipped with the Wasserstein-Fisher-Rao geometry for which we establish convergence guarantees. In practice, it can be approximated using an interacting particle system where the weight and location of particles are updated alternately. We conduct extensive numerical experiments to confirm the effectiveness of the proposed algorithm compared not only to classical benchmarks but also to similar gradient descent algorithms with respect to simpler geometries. In particular, these simulations illustrate the benefit of updating both weight and location of the interacting particles.

preprint2022arXiv

An algorithmic solution to the Blotto game using multi-marginal couplings

We describe an efficient algorithm to compute solutions for the general two-player Blotto game on n battlefields with heterogeneous values. While explicit constructions for such solutions have been limited to specific, largely symmetric or homogeneous, setups, this algorithmic resolution covers the most general situation to date: value-asymmetric game with asymmetric budget. The proposed algorithm rests on recent theoretical advances regarding Sinkhorn iterations for matrix and tensor scaling. An important case which had been out of reach of previous attempts is that of heterogeneous but symmetric battlefield values with asymmetric budget. In this case, the Blotto game is constant-sum so optimal solutions exist, and our algorithm samples from an $\varepsilon$-optimal solution in time $\tilde{\mathcal{O}}(n^2 + \varepsilon^{-4})$, independently of budgets and battlefield values. In the case of asymmetric values where optimal solutions need not exist but Nash equilibria do, our algorithm samples from an $\varepsilon$-Nash equilibrium with similar complexity but where implicit constants depend on various parameters of the game such as battlefield values.

preprint2022arXiv

Gaussian discrepancy: a probabilistic relaxation of vector balancing

We introduce a novel relaxation of combinatorial discrepancy called Gaussian discrepancy, whereby binary signings are replaced with correlated standard Gaussian random variables. This relaxation effectively reformulates an optimization problem over the Boolean hypercube into one over the space of correlation matrices. We show that Gaussian discrepancy is a tighter relaxation than the previously studied vector and spherical discrepancy problems, and we construct a fast online algorithm that achieves a version of the Banaszczyk bound for Gaussian discrepancy. This work also raises new questions such as the Komlós conjecture for Gaussian discrepancy, which may shed light on classical discrepancy problems.

preprint2022arXiv

On the sample complexity of entropic optimal transport

We study the sample complexity of entropic optimal transport in high dimensions using computationally efficient plug-in estimators. We significantly advance the state of the art by establishing dimension-free, parametric rates for estimating various quantities of interest, including the entropic regression function which is a natural analog to the optimal transport map. As an application, we propose a practical model for transfer learning based on entropic optimal transport and establish parametric rates of convergence for nonparametric regression and classification.

preprint2022arXiv

Sparse Multi-Reference Alignment : Phase Retrieval, Uniform Uncertainty Principles and the Beltway Problem

Motivated by cutting-edge applications like cryo-electron microscopy (cryo-EM), the Multi-Reference Alignment (MRA) model entails the learning of an unknown signal from repeated measurements of its images under the latent action of a group of isometries and additive noise of magnitude $σ$. Despite significant interest, a clear picture for understanding rates of estimation in this model has emerged only recently, particularly in the high-noise regime $σ\gg 1$ that is highly relevant in applications. Recent investigations have revealed a remarkable asymptotic sample complexity of order $σ^6$ for certain signals whose Fourier transforms have full support, in stark contrast to the traditional $σ^2$ that arise in regular models. Often prohibitively large in practice, these results have prompted the investigation of variations around the MRA model where better sample complexity may be achieved. In this paper, we show that sparse signals exhibit an intermediate $σ^4$ sample complexity even in the classical MRA model. Further, we characterise the dependence of the estimation rate on the support size $s$ as $O_p(1)$ and $O_p(s^{3.5})$ in the dilute and moderate regimes of sparsity respectively. Our techniques have implications for the problem of crystallographic phase retrieval, indicating a certain local uniqueness for the recovery of sparse signals from their power spectrum. Our results explore and exploit connections of the MRA estimation problem with two classical topics in applied mathematics: the beltway problem from combinatorial optimization, and uniform uncertainty principles from harmonic analysis. Our techniques include a certain enhanced form of the probabilistic method, which might be of general interest in its own right.

preprint2020arXiv

Exponential ergodicity of mirror-Langevin diffusions

Motivated by the problem of sampling from ill-conditioned log-concave distributions, we give a clean non-asymptotic convergence analysis of mirror-Langevin diffusions as introduced in Zhang et al. (2020). As a special case of this framework, we propose a class of diffusions called Newton-Langevin diffusions and prove that they converge to stationarity exponentially fast with a rate which not only is dimension-free, but also has no dependence on the target distribution. We give an application of this result to the problem of sampling from the uniform distribution on a convex body using a strategy inspired by interior-point methods. Our general approach follows the recent trend of linking sampling and optimization and highlights the role of the chi-squared divergence. In particular, it yields new results on the convergence of the vanilla Langevin diffusion in Wasserstein distance.

preprint2020arXiv

Gradient descent algorithms for Bures-Wasserstein barycenters

We study first order methods to compute the barycenter of a probability distribution $P$ over the space of probability measures with finite second moment. We develop a framework to derive global rates of convergence for both gradient descent and stochastic gradient descent despite the fact that the barycenter functional is not geodesically convex. Our analysis overcomes this technical hurdle by employing a Polyak-Lojasiewicz (PL) inequality and relies on tools from optimal transport and metric geometry. In turn, we establish a PL inequality when $P$ is supported on the Bures-Wasserstein manifold of Gaussian probability measures. It leads to the first global rates of convergence for first order methods in this context.

preprint2020arXiv

Minimax estimation of smooth optimal transport maps

Brenier's theorem is a cornerstone of optimal transport that guarantees the existence of an optimal transport map $T$ between two probability distributions $P$ and $Q$ over $\mathbb{R}^d$ under certain regularity conditions. The main goal of this work is to establish the minimax estimation rates for such a transport map from data sampled from $P$ and $Q$ under additional smoothness assumptions on $T$. To achieve this goal, we develop an estimator based on the minimization of an empirical version of the semi-dual optimal transport problem, restricted to truncated wavelet expansions. This estimator is shown to achieve near minimax optimality using new stability arguments for the semi-dual and a complementary minimax lower bound. Furthermore, we provide numerical experiments on synthetic data supporting our theoretical findings and highlighting the practical benefits of smoothness regularization. These are the first minimax estimation rates for transport maps in general dimension.

preprint2020arXiv

Optimal Rates for Estimation of Two-Dimensional Totally Positive Distributions

We study minimax estimation of two-dimensional totally positive distributions. Such distributions pertain to pairs of strongly positively dependent random variables and appear frequently in statistics and probability. In particular, for distributions with $β$-Hölder smooth densities where $β\in (0, 2)$, we observe polynomially faster minimax rates of estimation when, additionally, the total positivity condition is imposed. Moreover, we demonstrate fast algorithms to compute the proposed estimators and corroborate the theoretical rates of estimation by simulation studies.

preprint2020arXiv

Power analysis of knockoff filters for correlated designs

The knockoff filter introduced by Barber and Candès 2016 is an elegant framework for controlling the false discovery rate in variable selection. While empirical results indicate that this methodology is not too conservative, there is no conclusive theoretical result on its power. When the predictors are i.i.d. Gaussian, it is known that as the signal to noise ratio tend to infinity, the knockoff filter is consistent in the sense that one can make FDR go to 0 and power go to 1 simultaneously. In this work we study the case where the predictors have a general covariance matrix $Σ$. We introduce a simple functional called effective signal deficiency (ESD) of the covariance matrix $Σ$ that predicts consistency of various variable selection methods. In particular, ESD reveals that the structure of the precision matrix $Σ^{-1}$ plays a central role in consistency and therefore, so does the conditional independence structure of the predictors. To leverage this connection, we introduce Conditional Independence knockoff, a simple procedure that is able to compete with the more sophisticated knockoff filters and that is defined when the predictors obey a Gaussian tree graphical models (or when the graph is sufficiently sparse). Our theoretical results are supported by numerical evidence on synthetic data.

preprint2020arXiv

Projection to Fairness in Statistical Learning

In the context of regression, we consider the fundamental question of making an estimator fair while preserving its prediction accuracy as much as possible. To that end, we define its projection to fairness as its closest fair estimator in a sense that reflects prediction accuracy. Our methodology leverages tools from optimal transport to construct efficiently the projection to fairness of any given estimator as a simple post-processing step. Moreover, our approach precisely quantifies the cost of fairness, measured in terms of prediction accuracy.

preprint2020arXiv

SVGD as a kernelized Wasserstein gradient flow of the chi-squared divergence

Stein Variational Gradient Descent (SVGD), a popular sampling algorithm, is often described as the kernelized gradient flow for the Kullback-Leibler divergence in the geometry of optimal transport. We introduce a new perspective on SVGD that instead views SVGD as the (kernelized) gradient flow of the chi-squared divergence which, we show, exhibits a strong form of uniform exponential ergodicity under conditions as weak as a Poincaré inequality. This perspective leads us to propose an alternative to SVGD, called Laplacian Adjusted Wasserstein Gradient Descent (LAWGD), that can be implemented from the spectral decomposition of the Laplacian operator associated with the target density. We show that LAWGD exhibits strong convergence guarantees and good practical performance.