Source author record

Jonathan Niles-Weed

Jonathan Niles-Weed appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.


Topics: math.ST, Statistics Theory, math.PR, Machine Learning, math.OC, Computation, Data Structures and Algorithms, Discrete Mathematics, Information Theory, math.CA, math.CO, math.FA, math.IT, Methodology

Catalog footprint: 13 works, 14 topics, 4 close collaborators


preprint, 2024, arXiv

Optimal transport map estimation in general function spaces

We study the problem of estimating a function $T$ given independent samples from a distribution $P$ and from the pushforward distribution $T_\sharp P$. This setting is motivated by applications in the sciences, where $T$ represents the evolution of a physical system over time, and in machine learning, where, for example, $T$ may represent a transformation learned by a deep neural network trained for a generative modeling task. To ensure identifiability, we assume that $T = \nabla \varphi_0$ is the gradient of a convex function, in which case $T$ is known as an \emph{optimal transport map}. Prior work has studied the estimation of $T$ under the assumption that it lies in a Hölder class, but general theory is lacking. We present a unified methodology for obtaining rates of estimation of optimal transport maps in general function spaces. Our assumptions are significantly weaker than those appearing in the literature: we require only that the source measure $P$ satisfy a Poincaré inequality and that the optimal map be the gradient of a smooth convex function that lies in a space whose metric entropy can be controlled. As a special case, we recover known estimation rates for Hölder transport maps, but also obtain nearly sharp results in many settings not covered by prior work. For example, we provide the first statistical rates of estimation when $P$ is the normal distribution and the transport map is given by an infinite-width shallow neural network.

preprint, 2022, arXiv

An improved central limit theorem and fast convergence rates for entropic transportation costs

We prove a central limit theorem for the entropic transportation cost between subgaussian probability measures, centered at the population cost. This is the first result which allows for asymptotically valid inference for entropic optimal transport between measures which are not necessarily discrete. In the compactly supported case, we complement these results with new, faster, convergence rates for the expected entropic transportation cost between empirical measures. Our proof is based on strengthening convergence results for dual solutions to the entropic optimal transport problem.

preprint, 2022, arXiv

Debiaser Beware: Pitfalls of Centering Regularized Transport Maps

Estimating optimal transport (OT) maps (a.k.a. Monge maps) between two measures $P$ and $Q$ is a problem fraught with computational and statistical challenges. A promising approach lies in using the dual potential functions obtained when solving an entropy-regularized OT problem between samples $P_n$ and $Q_n$, which can be used to recover an approximately optimal map. The negentropy penalization in that scheme introduces, however, an estimation bias that grows with the regularization strength. A well-known remedy to debias such estimates, which has gained wide popularity among practitioners of regularized OT, is to center them, by subtracting auxiliary problems involving $P_n$ and itself, as well as $Q_n$ and itself. We do prove that, under favorable conditions on $P$ and $Q$, debiasing can yield better approximations to the Monge map. However, and perhaps surprisingly, we present a few cases in which debiasing is provably detrimental in a statistical sense, notably when the regularization strength is large or the number of samples is small. These claims are validated experimentally on synthetic and real datasets, and should reopen the debate on whether debiasing is needed when using entropic optimal transport.

preprint, 2022, arXiv

Distributional Convergence of the Sliced Wasserstein Process

Motivated by the statistical and computational challenges of computing Wasserstein distances in high-dimensional contexts, machine learning researchers have defined modified Wasserstein distances based on computing distances between one-dimensional projections of the measures. Different choices of how to aggregate these projected distances (averaging, random sampling, maximizing) give rise to different distances, requiring different statistical analyses. We define the \emph{Sliced Wasserstein Process}, a stochastic process defined by the empirical Wasserstein distance between projections of empirical probability measures to all one-dimensional subspaces, and prove a uniform distributional limit theorem for this process. As a result, we obtain a unified framework in which to prove distributional limit results for all Wasserstein distances based on one-dimensional projections. We illustrate these results on a number of examples where no distributional limits were previously known.
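
The projection-based distances described in this abstract can be illustrated with a short sketch. This is not the authors' code: the function names, the use of random unit directions, and the equal-sample-size assumption are choices made here for illustration. Taking the maximum over a finite set of random directions only gives a lower bound on the true max-sliced distance.

```python
import numpy as np

def wasserstein_1d(x, y, p=2):
    """p-Wasserstein distance between two 1-D empirical measures
    with the same number of points (computed by matching sorted samples)."""
    x_sorted = np.sort(x)
    y_sorted = np.sort(y)
    return np.mean(np.abs(x_sorted - y_sorted) ** p) ** (1.0 / p)

def projected_distances(X, Y, n_directions=200, p=2, seed=0):
    """1-D Wasserstein distances between projections of X and Y
    onto random unit directions (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    directions = rng.normal(size=(n_directions, d))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    return np.array([wasserstein_1d(X @ u, Y @ u, p) for u in directions])

def sliced_wasserstein(X, Y, p=2, **kwargs):
    """Aggregate projected distances by averaging their p-th powers."""
    dists = projected_distances(X, Y, p=p, **kwargs)
    return np.mean(dists ** p) ** (1.0 / p)

def max_sliced_wasserstein(X, Y, p=2, **kwargs):
    """Aggregate projected distances by maximizing over the sampled
    directions (a lower bound on the supremum over all directions)."""
    return projected_distances(X, Y, p=p, **kwargs).max()
```

Both aggregations reuse the same collection of projected distances, which mirrors the abstract's point that the different distances are functionals of one underlying process indexed by direction.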

preprint, 2022, arXiv

On the Second Kahn--Kalai Conjecture

For any given graph $H$, we are interested in $p_\mathrm{crit}(H)$, the minimal $p$ such that the Erdős-Rényi graph $G(n,p)$ contains a copy of $H$ with probability at least $1/2$. Kahn and Kalai (2007) conjectured that $p_\mathrm{crit}(H)$ is given up to a logarithmic factor by a simpler "subgraph expectation threshold" $p_\mathrm{E}(H)$, which is the minimal $p$ such that for every subgraph $H'\subseteq H$, the Erdős-Rényi graph $G(n,p)$ contains \emph{in expectation} at least $1/2$ copies of $H'$. It is trivial that $p_\mathrm{E}(H) \le p_\mathrm{crit}(H)$, and the so-called "second Kahn-Kalai conjecture" states that $p_\mathrm{crit}(H) \lesssim p_\mathrm{E}(H) \log e(H)$ where $e(H)$ is the number of edges in $H$. In this article, we present a natural modification $p_\mathrm{E, new}(H)$ of the Kahn--Kalai subgraph expectation threshold, which we show is sandwiched between $p_\mathrm{E}(H)$ and $p_\mathrm{crit}(H)$. The new definition $p_\mathrm{E, new}(H)$ is based on the simple observation that if $G(n,p)$ contains a copy of $H$ and $H$ contains \emph{many} copies of $H'$, then $G(n,p)$ must also contain \emph{many} copies of $H'$. We then show that $p_\mathrm{crit}(H) \lesssim p_\mathrm{E, new}(H) \log e(H)$, thus proving a modification of the second Kahn--Kalai conjecture. The bound follows by a direct application of the set-theoretic "spread" property, which led to recent breakthroughs in the sunflower conjecture by Alweiss, Lovett, Wu and Zhang and the first fractional Kahn--Kalai conjecture by Frankston, Kahn, Narayanan and Park.

preprint, 2021, arXiv

Asymptotics for semi-discrete entropic optimal transport

We compute exact second-order asymptotics for the cost of an optimal solution to the entropic optimal transport problem in the continuous-to-discrete, or semi-discrete, setting. In contrast to the discrete-discrete or continuous-continuous case, we show that the first-order term in this expansion vanishes but the second-order term does not, so that in the semi-discrete setting the difference in cost between the unregularized and regularized solution is quadratic in the inverse regularization parameter, with a leading constant that depends explicitly on the value of the density at the points of discontinuity of the optimal unregularized map between the measures. We develop these results by proving new pointwise convergence rates of the solutions to the dual problem, which may be of independent interest.

preprint, 2021, arXiv

Dimension-free log-Sobolev inequalities for mixture distributions

We prove that if ${(P_x)}_{x\in \mathscr X}$ is a family of probability measures which satisfy the log-Sobolev inequality and whose pairwise chi-squared divergences are uniformly bounded, and $\mu$ is any mixing distribution on $\mathscr X$, then the mixture $\int P_x \, \mathrm{d} \mu(x)$ satisfies a log-Sobolev inequality. In various settings of interest, the resulting log-Sobolev constant is dimension-free. In particular, our result implies a conjecture of Zimmermann and Bardet et al. that Gaussian convolutions of measures with bounded support enjoy dimension-free log-Sobolev inequalities.

preprint, 2021, arXiv

Streaming k-PCA: Efficient guarantees for Oja's algorithm, beyond rank-one updates

We analyze Oja's algorithm for streaming $k$-PCA and prove that it achieves performance nearly matching that of an optimal offline algorithm. Given access to a sequence of i.i.d. $d \times d$ symmetric matrices, we show that Oja's algorithm can obtain an accurate approximation to the subspace of the top $k$ eigenvectors of their expectation using a number of samples that scales polylogarithmically with $d$. Previously, such a result was only known in the case where the updates have rank one. Our analysis is based on recently developed matrix concentration tools, which allow us to prove strong bounds on the tails of the random matrices which arise in the course of the algorithm's execution.
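
A minimal sketch of the streaming update analyzed in this abstract may help fix ideas. It assumes a common formulation of Oja's algorithm in which the current basis is multiplied by $(I + \eta A_t)$ and re-orthonormalized by a QR step; the function names, the step size, and the synthetic stream generator are hypothetical and are not taken from the paper.

```python
import numpy as np

def oja_streaming_k_pca(matrix_stream, d, k, step_size, seed=0):
    """Oja-style streaming k-PCA sketch.

    matrix_stream yields symmetric d x d matrices A_t whose expectation
    is the matrix whose top-k eigenspace we want to approximate.
    """
    rng = np.random.default_rng(seed)
    # Random orthonormal starting basis (d x k).
    Q, _ = np.linalg.qr(rng.normal(size=(d, k)))
    for A in matrix_stream:
        # Multiplicative update Q <- (I + step_size * A) Q, then
        # re-orthonormalize so the columns stay a well-conditioned basis.
        Q, _ = np.linalg.qr(Q + step_size * (A @ Q))
    return Q

def sample_covariance_stream(cov_sqrt, n_steps, batch=3, seed=1):
    """Hypothetical test stream: each update is an average of `batch`
    rank-one matrices, so updates have rank greater than one."""
    rng = np.random.default_rng(seed)
    d = cov_sqrt.shape[0]
    for _ in range(n_steps):
        X = rng.normal(size=(batch, d)) @ cov_sqrt
        yield X.T @ X / batch
```

The algorithm never stores the stream: each matrix is used once and discarded, so memory use is governed by the $d \times k$ basis plus the current update.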

preprint, 2020, arXiv

Matrix Concentration for Products

This paper develops nonasymptotic growth and concentration bounds for a product of independent random matrices. These results sharpen and generalize recent work of Henriksen-Ward, and they are similar in spirit to the results of Ahlswede-Winter and of Tropp for a sum of independent random matrices. The argument relies on the uniform smoothness properties of the Schatten trace classes.

preprint, 2020, arXiv

Minimax estimation of smooth densities in Wasserstein distance

We study nonparametric density estimation problems where error is measured in the Wasserstein distance, a metric on probability distributions popular in many areas of statistics and machine learning. We give the first minimax-optimal rates for this problem for general Wasserstein distances, and show that, unlike classical nonparametric density estimation, these rates depend on whether the densities in question are bounded below. Motivated by variational problems involving the Wasserstein distance, we also show how to construct discretely supported measures, suitable for computational purposes, which achieve the minimax rates. Our main technical tool is an inequality giving a nearly tight dual characterization of the Wasserstein distances in terms of Besov norms.

preprint, 2020, arXiv

Sinkhorn EM: An Expectation-Maximization algorithm based on entropic optimal transport

We study Sinkhorn EM (sEM), a variant of the expectation maximization (EM) algorithm for mixtures based on entropic optimal transport. sEM differs from the classic EM algorithm in the way responsibilities are computed during the expectation step: rather than assign data points to clusters independently, sEM uses optimal transport to compute responsibilities by incorporating prior information about mixing weights. Like EM, sEM has a natural interpretation as a coordinate ascent procedure, which iteratively constructs and optimizes a lower bound on the log-likelihood. However, we show theoretically and empirically that sEM has better behavior than EM: it possesses better global convergence guarantees and is less prone to getting stuck in bad local optima. We complement these findings with experiments on simulated data as well as in an inference task involving C. elegans neurons and show that sEM learns cell labels significantly better than other approaches.
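
The difference between the two expectation steps can be sketched as follows. This is one reading of the abstract, not the authors' implementation: it assumes the sEM responsibilities come from an entropic optimal transport plan (regularization parameter 1) between the uniform measure on the data points and the mixing weights, with cost equal to the negative log-likelihood. The function names and iteration count are hypothetical.

```python
import numpy as np

def em_responsibilities(log_lik, weights):
    """Classic E-step: each data point (row) is normalized independently.

    log_lik[i, j] is the log-likelihood of point i under component j.
    """
    log_r = log_lik + np.log(weights)
    return np.exp(log_r - np.logaddexp.reduce(log_r, axis=1)[:, None])

def sinkhorn_responsibilities(log_lik, weights, n_iter=200):
    """Sinkhorn-style E-step sketch: responsibilities are n times a
    coupling whose rows sum to 1/n and whose columns sum to the weights."""
    n, k = log_lik.shape
    log_a = np.full(n, -np.log(n))   # uniform mass on data points
    log_b = np.log(weights)          # prior mixing weights
    f = np.zeros(n)
    g = np.zeros(k)
    for _ in range(n_iter):
        # Alternate log-domain scaling updates to match each marginal.
        f = log_a - np.logaddexp.reduce(log_lik + g[None, :], axis=1)
        g = log_b - np.logaddexp.reduce(log_lik + f[:, None], axis=0)
    plan = np.exp(log_lik + f[:, None] + g[None, :])
    return n * plan
```

In the classic step only the rows are constrained, so the average responsibility per cluster can drift away from the mixing weights. In the transport-based step the column constraint forces the average responsibility of each cluster to match its weight, which is how the prior information enters.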

preprint, 2020, arXiv

Supervised Quantile Normalization for Low-rank Matrix Approximation

Low-rank matrix factorization is a fundamental building block in machine learning, used for instance to summarize gene expression profile data or word-document counts. To be robust to outliers and differences in scale across features, a matrix factorization step is usually preceded by ad-hoc feature normalization steps, such as \texttt{tf-idf} scaling or data whitening. We propose in this work to learn these normalization operators jointly with the factorization itself. More precisely, given a $d\times n$ matrix $X$ of $d$ features measured on $n$ individuals, we propose to learn the parameters of quantile normalization operators that can operate row-wise on the values of $X$ and/or of its factorization $UV$ to improve the quality of the low-rank representation of $X$ itself. This optimization is facilitated by the introduction of a new differentiable quantile normalization operator built using optimal transport, providing new results on top of existing work by Cuturi et al. (2019). We demonstrate the applicability of these techniques on synthetic and genomics datasets.

preprint, 2020, arXiv

The All-or-Nothing Phenomenon in Sparse Tensor PCA

We study the statistical problem of estimating a rank-one sparse tensor corrupted by additive Gaussian noise, a model also known as sparse tensor PCA. We show that for Bernoulli and Bernoulli-Rademacher distributed signals and \emph{for all} sparsity levels which are sublinear in the dimension of the signal, the sparse tensor PCA model exhibits a phase transition called the \emph{all-or-nothing phenomenon}. This is the property that for some signal-to-noise ratio (SNR) $\mathrm{SNR_c}$ and any fixed $\varepsilon>0$, if the SNR of the model is below $\left(1-\varepsilon\right)\mathrm{SNR_c}$, then it is impossible to achieve any arbitrarily small constant correlation with the hidden signal, while if the SNR is above $\left(1+\varepsilon\right)\mathrm{SNR_c}$, then it is possible to achieve almost perfect correlation with the hidden signal. The all-or-nothing phenomenon was initially established in the context of sparse linear regression, and over the last year also in the context of sparse 2-tensor (matrix) PCA, Bernoulli group testing, and generalized linear models. Our results follow from a more general result showing that for any Gaussian additive model with a discrete uniform prior, the all-or-nothing phenomenon follows as a direct outcome of an appropriately defined "near-orthogonality" property of the support of the prior distribution.

