Researcher profile

Yury Polyanskiy

Yury Polyanskiy contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
18works
0followers
8topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

18 published item(s)

preprint2026arXiv

Continuous First, Discrete Later: VQ-VAEs Without Dimensional Collapse

While many approaches to improve VQ-VAE performance focus on codebook size and utilization, the effect of dimensional collapse, where trained VQ-VAE representations live in an extremely low-dimensional subspace (1-2% of full rank), remains unaddressed. We show theoretically and empirically that dimension collapse causes a hard loss lower bound that various codebook improvement techniques fail to surpass. Our analytic framework extends the sequential learning effect of Saxe et al. [2014] by introducing ideas from rate-distortion theory and explains how the latent collapse is caused by the VQ suppressing lower-variance directions. Our theory justifies a simple solution: a "warm-up phase" that trains the model as an (unquantized) autoencoder before introducing VQ. On both synthetic experiments and large-scale image (VQGAN) and audio (WavTokenizer) VQ-VAEs, we show that AE Warm-Up successfully restores representation dimension, leading to lower reconstruction and perceptual loss at the same training budget. Across codebook sizes $K \in$ {$2^{10}, 2^{14}, 2^{16}$}, AE warm-up raises VQGAN codebook effective dimension from 3-5 to 17-19 and reduces rFID by 17-35%; on WavTokenizer at $K \in$ {$2^{13}, 2^{14}$}, it raises codebook dimension from 4 to 17-19 and improves PESQ by 11-14%. We empirically characterize how warm-up duration governs the achievable final loss. In agreement with experiment, our theoretical analysis predicts downstream performance as a function of warm-up length, enabling an adaptive criterion for switching from AE Warm-up to VQ-VAE training.

preprint2026arXiv

High-Rate Quantized Matrix Multiplication II

This is the second part of the work investigating quantized matrix multiplication (MatMul). In part I we considered the case of calibration-free quantization, whereas here we discuss the setting where covariance matrix $Σ_X$ of the columns of the second factor is available. This setting arises in the ubiquitous task of weight-only post-training quantization of LLMs. Weight-only quantization is related to the problem of weighted mean squared error (WMSE) source coding, whose classical (reverse) waterfilling solution dictates how one should distribute rate between coordinates of the vector. We show how waterfilling can be used to improve practical LLM quantization algorithms (GPTQ), which at present allocate rate equally. A recent scheme (known as ``WaterSIC'') that only uses scalar INT quantizers is analyzed and its high-rate performance is shown to be (a) basis free (i.e., characterized by the determinant of $Σ_X$ and, thus, unlike existing schemes, is immune to applying random rotations); and (b) within a multiplicative factor of $\frac{2πe}{12}$ (or 0.25 bit/entry) of the information-theoretic distortion limit. GPTQ's performance, in turn, is affected by the choice of basis, but for a random rotation and actual $Σ_X$ from Llama-3-8B we find it to be within 0.1 bit (depending on the layer type) of WaterSIC, suggesting that GPTQ with random rotation is also near optimal, at least in the high-rate regime.

preprint2026arXiv

Scaling Limits of Long-Context Transformers

We study the long-context limit of softmax self-attention with a fixed query and a random context of $n$ i.i.d. keys on the sphere, viewing the inverse temperature $β_n$ as the scaling parameter that decides whether attention degenerates into uniform averaging or collapses onto the single closest key. We show that the critical scale at which selectivity emerges is determined by the local exponent of the distance-to-query distribution near zero rather than by global features of the context, and scales like $β_n^\ast \asymp n^{2/(d-1)}$ for uniform keys on $\mathbb{S}^{d-1}$. Furthermore, we characterize the limiting laws of the ordered attention weights and of the attention output across all regimes of $β_n$: a subcritical regime in which the output reduces to a local average around $q$ with explicit deterministic bias and Gaussian fluctuations; a critical regime in which a finite collection of nearest keys retains macroscopic mass without single-key collapse; and a supercritical regime in which all mass concentrates on the closest key. Of notable interest is the subcritical case with identity value matrix where the attention map approximately implements a backward heat equation.

preprint2022arXiv

Intrinsic Dimension Estimation Using Wasserstein Distances

It has long been thought that high-dimensional data encountered in many practical machine learning tasks have low-dimensional structure, i.e., the manifold hypothesis holds. A natural question, thus, is to estimate the intrinsic dimension of a given population distribution from a finite sample. We introduce a new estimator of the intrinsic dimension and provide finite sample, non-asymptotic guarantees. We then apply our techniques to get new sample complexity bounds for Generative Adversarial Networks (GANs) depending only on the intrinsic dimension of the data.

preprint2021arXiv

Dualizing Le Cam's method for functional estimation, with applications to estimating the unseens

Le Cam's method (or the two-point method) is a commonly used tool for obtaining statistical lower bound and especially popular for functional estimation problems. This work aims to explain and give conditions for the tightness of Le Cam's lower bound in functional estimation from the perspective of convex duality. Under a variety of settings it is shown that the maximization problem that searches for the best two-point lower bound, upon dualizing, becomes a minimization problem that optimizes the bias-variance tradeoff among a family of estimators. For estimating linear functionals of a distribution our work strengthens prior results of Donoho-Liu \cite{DL91} (for quadratic loss) by dropping the Hölderian assumption on the modulus of continuity. For exponential families our results extend those of Juditsky-Nemirovski \cite{JN09} by characterizing the minimax risk for the quadratic loss under weaker assumptions on the exponential family. We also provide an extension to the high-dimensional setting for estimating separable functionals. Notably, coupled with tools from complex analysis, this method is particularly effective for characterizing the ``elbow effect'' -- the phase transition from parametric to nonparametric rates. As the main application we derive sharp minimax rates in the Distinct elements problem (given a fraction $p$ of colored balls from an urn containing $d$ balls, the optimal error of estimating the number of distinct colors is $\tilde Θ(d^{-\frac{1}{2}\min\{\frac{p}{1-p},1\}})$) and the Fisher's species problem (given $n$ iid observations from an unknown distribution, the optimal prediction error of the number of unseen symbols in the next (unobserved) $r \cdot n$ observations is $\tilde Θ(n^{-\min\{\frac{1}{r+1},\frac{1}{2}\}})$).

preprint2021arXiv

Stochastic block model entropy and broadcasting on trees with survey

The limit of the entropy in the stochastic block model (SBM) has been characterized in the sparse regime for the special case of disassortative communities [COKPZ17] and for the classical case of assortative communities but in the dense regime [DAM16]. The problem has not been closed in the classical sparse and assortative case. This paper establishes the result in this case for any SNR besides for the interval (1, 3.513). It further gives an approximation to the limit in this window. The result is obtained by expressing the global SBM entropy as an integral of local tree entropies in a broadcasting on tree model with erasure side-information. The main technical advancement then relies on showing the irrelevance of the boundary in such a model, also studied with variants in [KMS16], [MNS16] and [MX15]. In particular, we establish the uniqueness of the BP fixed point in the survey model for any SNR above 3.513 or below 1. This only leaves a narrow region in the plane between SNR and survey strength where the uniqueness of BP conjectured in these papers remains unproved.

preprint2020arXiv

A Note on the Probability of Rectangles for Correlated Binary Strings

Consider two sequences of $n$ independent and identically distributed fair coin tosses, $X=(X_1,\ldots,X_n)$ and $Y=(Y_1,\ldots,Y_n)$, which are $ρ$-correlated for each $j$, i.e. $\mathbb{P}[X_j=Y_j] = {1+ρ\over 2}$. We study the question of how large (small) the probability $\mathbb{P}[X \in A, Y\in B]$ can be among all sets $A,B\subset\{0,1\}^n$ of a given cardinality. For sets $|A|,|B| = Θ(2^n)$ it is well known that the largest (smallest) probability is approximately attained by concentric (anti-concentric) Hamming balls, and this can be proved via the hypercontractive inequality (reverse hypercontractivity). Here we consider the case of $|A|,|B| = 2^{Θ(n)}$. By applying a recent extension of the hypercontractive inequality of Polyanskiy-Samorodnitsky (J. Functional Analysis, 2019), we show that Hamming balls of the same size approximately maximize $\mathbb{P}[X \in A, Y\in B]$ in the regime of $ρ\to 1$. We also prove a similar tight lower bound, i.e. show that for $ρ\to 0$ the pair of opposite Hamming balls approximately minimizes the probability $\mathbb{P}[X \in A, Y\in B]$.

preprint2020arXiv

Application of information-percolation method to reconstruction problems on graphs

In this paper we propose a method of proving impossibility results based on applying strong data-processing inequalities to estimate mutual information between sets of variables forming certain Markov random fields. The end result is that mutual information between two "far away" (as measured by the graph distance) variables is bounded by the probability of the existence of an open path in a bond-percolation problem on the same graph. Furthermore, stronger bounds can be obtained by establishing mutual information comparison results with an erasure model on the same graph, with erasure probabilities given by the contraction coefficients. As applications, we show that our method gives sharp threshold for partially recovering a rank-one perturbation of a random Gaussian matrix (spiked Wigner model), yields the best known upper bound on the noise level for group synchronization (obtained concurrently by Abbe and Boix), and establishes new impossibility result for community detection on the stochastic block model with $k$ communities.

preprint2020arXiv

Attracting Random Walks

This paper introduces the Attracting Random Walks model, which describes the dynamics of a system of particles on a graph with $n$ vertices. At each step, a single particle moves to an adjacent vertex (or stays at the current one) with probability proportional to the exponent of the number of other particles at a vertex. From an applied standpoint, the model captures the rich get richer phenomenon. We show that the Markov chain exhibits a phase transition in mixing time, as the parameter governing the attraction is varied. Namely, mixing time is $O(n\log n)$ when the temperature is sufficiently high and $\exp(Ω(n))$ when temperature is sufficiently low. When $\mathcal{G}$ is the complete graph, the model is a projection of the Potts model, whose mixing properties and the critical temperature have been known previously. However, for any other graph our model is non-reversible and does not seem to admit a simple Gibbsian description of a stationary distribution. Notably, we demonstrate existence of the dynamic phase transition without decomposing the stationary distribution into phases.

preprint2020arXiv

Broadcasting on Random Directed Acyclic Graphs

We study a generalization of the well-known model of broadcasting on trees. Consider a directed acyclic graph (DAG) with a unique source vertex $X$, and suppose all other vertices have indegree $d\geq 2$. Let the vertices at distance $k$ from $X$ be called layer $k$. At layer $0$, $X$ is given a random bit. At layer $k\geq 1$, each vertex receives $d$ bits from its parents in layer $k-1$, which are transmitted along independent binary symmetric channel edges, and combines them using a $d$-ary Boolean processing function. The goal is to reconstruct $X$ with probability of error bounded away from $1/2$ using the values of all vertices at an arbitrarily deep layer. This question is closely related to models of reliable computation and storage, and information flow in biological networks. In this paper, we analyze randomly constructed DAGs, for which we show that broadcasting is only possible if the noise level is below a certain degree and function dependent critical threshold. For $d\geq 3$, and random DAGs with layer sizes $Ω(\log k)$ and majority processing functions, we identify the critical threshold. For $d=2$, we establish a similar result for NAND processing functions. We also prove a partial converse for odd $d\geq 3$ illustrating that the identified thresholds are impossible to improve by selecting different processing functions if the decoder is restricted to using a single vertex. Finally, for any noise level, we construct explicit DAGs (using expander graphs) with bounded degree and layer sizes $Θ(\log k)$ admitting reconstruction. In particular, we show that such DAGs can be generated in deterministic quasi-polynomial time or randomized polylogarithmic time in the depth. These results portray a doubly-exponential advantage for storing a bit in DAGs compared to trees, where $d=1$ but layer sizes must grow exponentially with depth in order to enable broadcasting.

preprint2020arXiv

Broadcasting on trees near criticality

We revisit the problem of broadcasting on $d$-ary trees: starting from a Bernoulli$(1/2)$ random variable $X_0$ at a root vertex, each vertex forwards its value across binary symmetric channels $\mathrm{BSC}_δ$ to $d$ descendants. The goal is to reconstruct $X_0$ given the vector $X_{L_h}$ of values of all variables at depth $h$. It is well known that reconstruction (better than a random guess) is possible as $h\to \infty$ if and only if $δ< δ_c(d)$. In this paper, we study the behavior of the mutual information and the probability of error when $δ$ is slightly subcritical. The innovation of our work is application of the recently introduced &#34;less-noisy&#34; channel comparison techniques. For example, we are able to derive the positive part of the phase transition (reconstructability when $δ<δ_c$) using purely information-theoretic ideas. This is in contrast with previous derivations, which explicitly analyze distribution of the Hamming weight of $X_{L_h}$ (a so-called Kesten-Stigum bound).

preprint2020arXiv

Convergence of Smoothed Empirical Measures with Applications to Entropy Estimation

This paper studies convergence of empirical measures smoothed by a Gaussian kernel. Specifically, consider approximating $P\ast\mathcal{N}_σ$, for $\mathcal{N}_σ\triangleq\mathcal{N}(0,σ^2 \mathrm{I}_d)$, by $\hat{P}_n\ast\mathcal{N}_σ$, where $\hat{P}_n$ is the empirical measure, under different statistical distances. The convergence is examined in terms of the Wasserstein distance, total variation (TV), Kullback-Leibler (KL) divergence, and $χ^2$-divergence. We show that the approximation error under the TV distance and 1-Wasserstein distance ($\mathsf{W}_1$) converges at rate $e^{O(d)}n^{-\frac{1}{2}}$ in remarkable contrast to a typical $n^{-\frac{1}{d}}$ rate for unsmoothed $\mathsf{W}_1$ (and $d\ge 3$). For the KL divergence, squared 2-Wasserstein distance ($\mathsf{W}_2^2$), and $χ^2$-divergence, the convergence rate is $e^{O(d)}n^{-1}$, but only if $P$ achieves finite input-output $χ^2$ mutual information across the additive white Gaussian noise channel. If the latter condition is not met, the rate changes to $ω(n^{-1})$ for the KL divergence and $\mathsf{W}_2^2$, while the $χ^2$-divergence becomes infinite - a curious dichotomy. As a main application we consider estimating the differential entropy $h(P\ast\mathcal{N}_σ)$ in the high-dimensional regime. The distribution $P$ is unknown but $n$ i.i.d samples from it are available. We first show that any good estimator of $h(P\ast\mathcal{N}_σ)$ must have sample complexity that is exponential in $d$. Using the empirical approximation results we then show that the absolute-error risk of the plug-in estimator converges at the parametric rate $e^{O(d)}n^{-\frac{1}{2}}$, thus establishing the minimax rate-optimality of the plug-in. Numerical results that demonstrate a significant empirical superiority of the plug-in approach to general-purpose differential entropy estimators are provided.

preprint2020arXiv

Extrapolating the profile of a finite population

We study a prototypical problem in empirical Bayes. Namely, consider a population consisting of $k$ individuals each belonging to one of $k$ types (some types can be empty). Without any structural restrictions, it is impossible to learn the composition of the full population having observed only a small (random) subsample of size $m = o(k)$. Nevertheless, we show that in the sublinear regime of $m =ω(k/\log k)$, it is possible to consistently estimate in total variation the \emph{profile} of the population, defined as the empirical distribution of the sizes of each type, which determines many symmetric properties of the population. We also prove that in the linear regime of $m=c k$ for any constant $c$ the optimal rate is $Θ(1/\log k)$. Our estimator is based on Wolfowitz&#39;s minimum distance method, which entails solving a linear program (LP) of size $k$. We show that there is a single infinite-dimensional LP whose value simultaneously characterizes the risk of the minimum distance estimator and certifies its minimax optimality. The sharp convergence rate is obtained by evaluating this LP using complex-analytic techniques.

preprint2020arXiv

Massive Access for Future Wireless Communication Systems

Multiple access technology played an important role in wireless communication in the last decades: it increases the capacity of the channel and allows different users to access the system simultaneously. However, the conventional multiple access technology, as originally designed for current human-centric wireless networks, is not scalable for future machine-centric wireless networks. Massive access (studied in the literature under such names as massive-device multiple access, unsourced massive random access, massive connectivity, massive machine-type communication, and many-access channels) exhibits a clean break with current networks by potentially supporting millions of devices in each cellular network. The tremendous growth in the number of connected devices requires a fundamental rethinking of the conventional multiple access technologies in favor of new schemes suited for massive random access. Among the many new challenges arising in this setting, the most relevant are: the fundamental limits of communication from a massive number of bursty devices transmitting simultaneously with short packets, the design of low complexity and energy-efficient massive access coding and communication schemes, efficient methods for the detection of a relatively small number of active users among a large number of potential user devices with sporadic transmission pattern, and the integration of massive access with massive MIMO and other important wireless communication technologies. This paper presents an overview of the concept of massive access wireless communication and of the contemporary research on this important topic.

preprint2020arXiv

Note on approximating the Laplace transform of a Gaussian on a complex disk

In this short note we study how well a Gaussian distribution can be approximated by distributions supported on $[-a,a]$. Perhaps, the natural conjecture is that for large $a$ the almost optimal choice is given by truncating the Gaussian to $[-a,a]$. Indeed, such approximation achieves the optimal rate of $e^{-Θ(a^2)}$ in terms of the $L_\infty$-distance between characteristic functions. However, if we consider the $L_\infty$-distance between Laplace transforms on a complex disk, the optimal rate is $e^{-Θ(a^2 \log a)}$, while truncation still only attains $e^{-Θ(a^2)}$. The optimal rate can be attained by the Gauss-Hermite quadrature. As corollary, we also construct a ``super-flat&#39;&#39; Gaussian mixture of $Θ(a^2)$ components with means in $[-a,a]$ and whose density has all derivatives bounded by $e^{-Ω(a^2 \log(a))}$ in the $O(1)$-neighborhood of the origin.

preprint2020arXiv

Sample complexity of population recovery

The problem of population recovery refers to estimating a distribution based on incomplete or corrupted samples. Consider a random poll of sample size $n$ conducted on a population of individuals, where each pollee is asked to answer $d$ binary questions. We consider one of the two polling impediments: (a) in lossy population recovery, a pollee may skip each question with probability $ε$, (b) in noisy population recovery, a pollee may lie on each question with probability $ε$. Given $n$ lossy or noisy samples, the goal is to estimate the probabilities of all $2^d$ binary vectors simultaneously within accuracy $δ$ with high probability. This paper settles the sample complexity of population recovery. For lossy model, the optimal sample complexity is $\tildeΘ(δ^{-2\max\{\fracε{1-ε},1\}})$, improving the state of the art by Moitra and Saks in several ways: a lower bound is established, the upper bound is improved and the result depends at most on the logarithm of the dimension. Surprisingly, the sample complexity undergoes a phase transition from parametric to nonparametric rate when $ε$ exceeds $1/2$. For noisy population recovery, the sharp sample complexity turns out to be more sensitive to dimension and scales as $\exp(Θ(d^{1/3} \log^{2/3}(1/δ)))$ except for the trivial cases of $ε=0,1/2$ or $1$. For both models, our estimators simply compute the empirical mean of a certain function, which is found by pre-solving a linear program (LP). Curiously, the dual LP can be understood as Le Cam&#39;s method for lower-bounding the minimax risk, thus establishing the statistical optimality of the proposed estimators. The value of the LP is determined by complex-analytic methods.

preprint2020arXiv

Self-regularizing Property of Nonparametric Maximum Likelihood Estimator in Mixture Models

Introduced by Kiefer and Wolfowitz \cite{KW56}, the nonparametric maximum likelihood estimator (NPMLE) is a widely used methodology for learning mixture odels and empirical Bayes estimation. Sidestepping the non-convexity in mixture likelihood, the NPMLE estimates the mixing distribution by maximizing the total likelihood over the space of probability measures, which can be viewed as an extreme form of overparameterization. In this paper we discover a surprising property of the NPMLE solution. Consider, for example, a Gaussian mixture model on the real line with a subgaussian mixing distribution. Leveraging complex-analytic techniques, we show that with high probability the NPMLE based on a sample of size $n$ has $O(\log n)$ atoms (mass points), significantly improving the deterministic upper bound of $n$ due to Lindsay \cite{lindsay1983geometry1}. Notably, any such Gaussian mixture is statistically indistinguishable from a finite one with $O(\log n)$ components (and this is tight for certain mixtures). Thus, absent any explicit form of model selection, NPMLE automatically chooses the right model complexity, a property we term \emph{self-regularization}. Extensions to other exponential families are given. As a statistical application, we show that this structural property can be harnessed to bootstrap existing Hellinger risk bound of the (parametric) MLE for finite Gaussian mixtures to the NPMLE for general Gaussian mixtures, recovering a result of Zhang \cite{zhang2009generalized}.

preprint2020arXiv

The Information Bottleneck Problem and Its Applications in Machine Learning

Inference capabilities of machine learning (ML) systems skyrocketed in recent years, now playing a pivotal role in various aspect of society. The goal in statistical learning is to use data to obtain simple algorithms for predicting a random variable $Y$ from a correlated observation $X$. Since the dimension of $X$ is typically huge, computationally feasible solutions should summarize it into a lower-dimensional feature vector $T$, from which $Y$ is predicted. The algorithm will successfully make the prediction if $T$ is a good proxy of $Y$, despite the said dimensionality-reduction. A myriad of ML algorithms (mostly employing deep learning (DL)) for finding such representations $T$ based on real-world data are now available. While these methods are often effective in practice, their success is hindered by the lack of a comprehensive theory to explain it. The information bottleneck (IB) theory recently emerged as a bold information-theoretic paradigm for analyzing DL systems. Adopting mutual information as the figure of merit, it suggests that the best representation $T$ should be maximally informative about $Y$ while minimizing the mutual information with $X$. In this tutorial we survey the information-theoretic origins of this abstract principle, and its recent impact on DL. For the latter, we cover implications of the IB problem on DL theory, as well as practical algorithms inspired by it. Our goal is to provide a unified and cohesive description. A clear view of current knowledge is particularly important for further leveraging IB and other information-theoretic ideas to study DL models.