Source author record

Yury Polyanskiy

Yury Polyanskiy appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.


Information Theory math.IT math.ST Statistics Theory Machine Learning math.PR math.CO Artificial Intelligence Data Structures and Algorithms Discrete Mathematics

Catalog footprint

What is connected

39 works

10 topics

4 close collaborators

preprint, 2026, arXiv

Continuous First, Discrete Later: VQ-VAEs Without Dimensional Collapse

While many approaches to improve VQ-VAE performance focus on codebook size and utilization, the effect of dimensional collapse, where trained VQ-VAE representations live in an extremely low-dimensional subspace (1-2% of full rank), remains unaddressed. We show theoretically and empirically that dimensional collapse causes a hard loss lower bound that various codebook improvement techniques fail to surpass. Our analytic framework extends the sequential learning effect of Saxe et al. [2014] by introducing ideas from rate-distortion theory and explains how the latent collapse is caused by the VQ suppressing lower-variance directions. Our theory justifies a simple solution: a "warm-up phase" that trains the model as an (unquantized) autoencoder before introducing VQ. On both synthetic experiments and large-scale image (VQGAN) and audio (WavTokenizer) VQ-VAEs, we show that AE warm-up successfully restores representation dimension, leading to lower reconstruction and perceptual loss at the same training budget. Across codebook sizes $K \in \{2^{10}, 2^{14}, 2^{16}\}$, AE warm-up raises VQGAN codebook effective dimension from 3-5 to 17-19 and reduces rFID by 17-35%; on WavTokenizer at $K \in \{2^{13}, 2^{14}\}$, it raises codebook dimension from 4 to 17-19 and improves PESQ by 11-14%. We empirically characterize how warm-up duration governs the achievable final loss. In agreement with experiment, our theoretical analysis predicts downstream performance as a function of warm-up length, enabling an adaptive criterion for switching from AE warm-up to VQ-VAE training.

preprint, 2026, arXiv

High-Rate Quantized Matrix Multiplication II

This is the second part of the work investigating quantized matrix multiplication (MatMul). In part I we considered the case of calibration-free quantization, whereas here we discuss the setting where the covariance matrix $Σ_X$ of the columns of the second factor is available. This setting arises in the ubiquitous task of weight-only post-training quantization of LLMs. Weight-only quantization is related to the problem of weighted mean squared error (WMSE) source coding, whose classical (reverse) waterfilling solution dictates how one should distribute rate between coordinates of the vector. We show how waterfilling can be used to improve practical LLM quantization algorithms (GPTQ), which at present allocate rate equally. A recent scheme (known as "WaterSIC") that only uses scalar INT quantizers is analyzed and its high-rate performance is shown to be (a) basis free (i.e., characterized by the determinant of $Σ_X$ and, thus, unlike existing schemes, is immune to applying random rotations); and (b) within a multiplicative factor of $\frac{2πe}{12}$ (or 0.25 bit/entry) of the information-theoretic distortion limit. GPTQ's performance, in turn, is affected by the choice of basis, but for a random rotation and actual $Σ_X$ from Llama-3-8B we find it to be within 0.1 bit (depending on the layer type) of WaterSIC, suggesting that GPTQ with random rotation is also near optimal, at least in the high-rate regime.
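As an illustration of the waterfilling idea mentioned in this abstract, the following is a minimal sketch of textbook reverse waterfilling. It assumes independent Gaussian coordinates with known variances and a plain (unweighted) squared-error budget; the function name `reverse_waterfilling`, the bisection search and the constant `SHAPING_GAP_BITS` are illustrative choices and not the scheme analyzed in the paper. The constant evaluates the rate equivalent of the $\frac{2πe}{12}$ factor quoted above.

```python
import math


def reverse_waterfilling(variances, total_distortion):
    """Textbook reverse waterfilling for independent Gaussian coordinates.

    Returns (water_level, per-coordinate distortions, per-coordinate rates in bits).
    Assumes all variances are positive and 0 < total_distortion <= sum(variances).
    """
    if not 0 < total_distortion <= sum(variances):
        raise ValueError("total_distortion must lie in (0, sum(variances)]")
    lo, hi = 0.0, max(variances)
    # Bisection on the water level theta: sum_i min(theta, var_i) = total_distortion.
    for _ in range(200):
        mid = (lo + hi) / 2
        if sum(min(mid, v) for v in variances) < total_distortion:
            lo = mid
        else:
            hi = mid
    theta = hi
    distortions = [min(theta, v) for v in variances]
    # Coordinates below the water level get zero rate (log2(1) = 0).
    rates = [0.5 * math.log2(v / d) for v, d in zip(variances, distortions)]
    return theta, distortions, rates


# Rate equivalent of the 2*pi*e/12 distortion factor, in bits per entry.
SHAPING_GAP_BITS = 0.5 * math.log2(2 * math.pi * math.e / 12)
```

With variances 4, 1 and 0.25 and a distortion budget of 1.25, the water level is 0.5, so the weakest coordinate receives no rate at all, which is the unequal allocation the abstract contrasts with equal-rate schemes.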

preprint, 2026, arXiv

Scaling Limits of Long-Context Transformers

We study the long-context limit of softmax self-attention with a fixed query and a random context of $n$ i.i.d. keys on the sphere, viewing the inverse temperature $β_n$ as the scaling parameter that decides whether attention degenerates into uniform averaging or collapses onto the single closest key. We show that the critical scale at which selectivity emerges is determined by the local exponent of the distance-to-query distribution near zero rather than by global features of the context, and scales like $β_n^\ast \asymp n^{2/(d-1)}$ for uniform keys on $\mathbb{S}^{d-1}$. Furthermore, we characterize the limiting laws of the ordered attention weights and of the attention output across all regimes of $β_n$: a subcritical regime in which the output reduces to a local average around $q$ with explicit deterministic bias and Gaussian fluctuations; a critical regime in which a finite collection of nearest keys retains macroscopic mass without single-key collapse; and a supercritical regime in which all mass concentrates on the closest key. Of notable interest is the subcritical case with identity value matrix where the attention map approximately implements a backward heat equation.
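The two degenerate regimes described in this abstract can be seen in a small sketch. It assumes the common form of softmax attention in which the weight of key $k_i$ is proportional to $\exp(β \langle q, k_i \rangle)$; the helper names are hypothetical, and `critical_scale` only reproduces the stated order $n^{2/(d-1)}$ with the unknown constant set to 1.

```python
import math
import random


def random_unit_vector(d, rng):
    """Sample a point uniformly on the sphere S^{d-1} by normalizing a Gaussian."""
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def attention_weights(query, keys, beta):
    """Softmax weights proportional to exp(beta * <query, key>) (assumed form)."""
    scores = [beta * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def critical_scale(n, d):
    """Order of the critical inverse temperature for uniform keys (constant omitted)."""
    return n ** (2.0 / (d - 1))


if __name__ == "__main__":
    rng = random.Random(0)
    n, d = 1000, 3
    query = [1.0, 0.0, 0.0]
    keys = [random_unit_vector(d, rng) for _ in range(n)]
    for beta in (0.0, critical_scale(n, d), 1e6):
        print(beta, max(attention_weights(query, keys, beta)))
```

At $β = 0$ every key receives weight $1/n$ (uniform averaging), while for very large $β$ the largest weight approaches 1 (collapse onto the closest key).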

preprint, 2022, arXiv

Intrinsic Dimension Estimation Using Wasserstein Distances

It has long been thought that high-dimensional data encountered in many practical machine learning tasks have low-dimensional structure, i.e., the manifold hypothesis holds. A natural question, thus, is to estimate the intrinsic dimension of a given population distribution from a finite sample. We introduce a new estimator of the intrinsic dimension and provide finite sample, non-asymptotic guarantees. We then apply our techniques to get new sample complexity bounds for Generative Adversarial Networks (GANs) depending only on the intrinsic dimension of the data.

preprint, 2021, arXiv

Dualizing Le Cam's method for functional estimation, with applications to estimating the unseens

Le Cam's method (or the two-point method) is a commonly used tool for obtaining statistical lower bounds and is especially popular for functional estimation problems. This work aims to explain and give conditions for the tightness of Le Cam's lower bound in functional estimation from the perspective of convex duality. Under a variety of settings it is shown that the maximization problem that searches for the best two-point lower bound, upon dualizing, becomes a minimization problem that optimizes the bias-variance tradeoff among a family of estimators. For estimating linear functionals of a distribution our work strengthens prior results of Donoho-Liu [DL91] (for quadratic loss) by dropping the Hölderian assumption on the modulus of continuity. For exponential families our results extend those of Juditsky-Nemirovski [JN09] by characterizing the minimax risk for the quadratic loss under weaker assumptions on the exponential family. We also provide an extension to the high-dimensional setting for estimating separable functionals. Notably, coupled with tools from complex analysis, this method is particularly effective for characterizing the "elbow effect", the phase transition from parametric to nonparametric rates. As the main application we derive sharp minimax rates in the Distinct elements problem (given a fraction $p$ of colored balls from an urn containing $d$ balls, the optimal error of estimating the number of distinct colors is $\tilde Θ(d^{-\frac{1}{2}\min\{\frac{p}{1-p},1\}})$) and Fisher's species problem (given $n$ iid observations from an unknown distribution, the optimal prediction error of the number of unseen symbols in the next (unobserved) $r \cdot n$ observations is $\tilde Θ(n^{-\min\{\frac{1}{r+1},\frac{1}{2}\}})$).

preprint, 2021, arXiv

Stochastic block model entropy and broadcasting on trees with survey

The limit of the entropy in the stochastic block model (SBM) has been characterized in the sparse regime for the special case of disassortative communities [COKPZ17] and for the classical case of assortative communities but in the dense regime [DAM16]. The problem has not been closed in the classical sparse and assortative case. This paper establishes the result in this case for any SNR outside the interval (1, 3.513). It further gives an approximation to the limit in this window. The result is obtained by expressing the global SBM entropy as an integral of local tree entropies in a broadcasting on tree model with erasure side-information. The main technical advancement then relies on showing the irrelevance of the boundary in such a model, also studied with variants in [KMS16], [MNS16] and [MX15]. In particular, we establish the uniqueness of the BP fixed point in the survey model for any SNR above 3.513 or below 1. This only leaves a narrow region in the plane between SNR and survey strength where the uniqueness of BP conjectured in these papers remains unproved.

preprint, 2020, arXiv

A Note on the Probability of Rectangles for Correlated Binary Strings

Consider two sequences of $n$ independent and identically distributed fair coin tosses, $X=(X_1,\ldots,X_n)$ and $Y=(Y_1,\ldots,Y_n)$, which are $ρ$-correlated for each $j$, i.e. $\mathbb{P}[X_j=Y_j] = {1+ρ\over 2}$. We study the question of how large (small) the probability $\mathbb{P}[X \in A, Y\in B]$ can be among all sets $A,B\subset\{0,1\}^n$ of a given cardinality. For sets $|A|,|B| = Θ(2^n)$ it is well known that the largest (smallest) probability is approximately attained by concentric (anti-concentric) Hamming balls, and this can be proved via the hypercontractive inequality (reverse hypercontractivity). Here we consider the case of $|A|,|B| = 2^{Θ(n)}$. By applying a recent extension of the hypercontractive inequality of Polyanskiy-Samorodnitsky (J. Functional Analysis, 2019), we show that Hamming balls of the same size approximately maximize $\mathbb{P}[X \in A, Y\in B]$ in the regime of $ρ\to 1$. We also prove a similar tight lower bound, i.e. show that for $ρ\to 0$ the pair of opposite Hamming balls approximately minimizes the probability $\mathbb{P}[X \in A, Y\in B]$.
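For small $n$ the rectangle probability in this abstract can be computed exactly by enumeration. The sketch below uses only the stated model, in which each pair $(X_j, Y_j)$ agrees with probability ${1+ρ\over 2}$ independently, so that $\mathbb{P}[X=x, Y=y] = 2^{-n} ({1+ρ\over 2})^{n-d(x,y)} ({1-ρ\over 2})^{d(x,y)}$ with $d$ the Hamming distance. The function names are hypothetical and the brute-force loop is only practical for small $n$.

```python
from itertools import product


def hamming_ball(center, radius):
    """All binary strings within Hamming distance `radius` of `center`."""
    n = len(center)
    return [x for x in product((0, 1), repeat=n)
            if sum(a != b for a, b in zip(x, center)) <= radius]


def rectangle_probability(set_a, set_b, rho):
    """Exact P[X in A, Y in B] for rho-correlated uniform binary strings."""
    n = len(set_a[0])
    agree = (1 + rho) / 2
    differ = (1 - rho) / 2
    total = 0.0
    for x in set_a:
        for y in set_b:
            dist = sum(a != b for a, b in zip(x, y))
            total += agree ** (n - dist) * differ ** dist
    return total / 2 ** n


if __name__ == "__main__":
    n = 6
    ball_zero = hamming_ball((0,) * n, 1)
    ball_one = hamming_ball((1,) * n, 1)
    print("concentric:", rectangle_probability(ball_zero, ball_zero, 0.5))
    print("opposite:  ", rectangle_probability(ball_zero, ball_one, 0.5))
```

For positive $ρ$ the concentric pair gives the larger value and the opposite pair the smaller one, in line with the extremal roles the abstract assigns to these configurations.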

preprint, 2020, arXiv

Application of information-percolation method to reconstruction problems on graphs

In this paper we propose a method of proving impossibility results based on applying strong data-processing inequalities to estimate mutual information between sets of variables forming certain Markov random fields. The end result is that mutual information between two "far away" (as measured by the graph distance) variables is bounded by the probability of the existence of an open path in a bond-percolation problem on the same graph. Furthermore, stronger bounds can be obtained by establishing mutual information comparison results with an erasure model on the same graph, with erasure probabilities given by the contraction coefficients. As applications, we show that our method gives a sharp threshold for partially recovering a rank-one perturbation of a random Gaussian matrix (spiked Wigner model), yields the best known upper bound on the noise level for group synchronization (obtained concurrently by Abbe and Boix), and establishes a new impossibility result for community detection on the stochastic block model with $k$ communities.

preprint, 2020, arXiv

Attracting Random Walks

This paper introduces the Attracting Random Walks model, which describes the dynamics of a system of particles on a graph with $n$ vertices. At each step, a single particle moves to an adjacent vertex (or stays at the current one) with probability proportional to the exponential of the number of other particles at a vertex. From an applied standpoint, the model captures the rich-get-richer phenomenon. We show that the Markov chain exhibits a phase transition in mixing time, as the parameter governing the attraction is varied. Namely, mixing time is $O(n\log n)$ when the temperature is sufficiently high and $\exp(Ω(n))$ when temperature is sufficiently low. When the graph $\mathcal{G}$ is the complete graph, the model is a projection of the Potts model, whose mixing properties and the critical temperature have been known previously. However, for any other graph our model is non-reversible and does not seem to admit a simple Gibbsian description of a stationary distribution. Notably, we demonstrate existence of the dynamic phase transition without decomposing the stationary distribution into phases.
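One step of the dynamics described in this abstract might be sketched as follows. Several details are assumptions not fixed by the abstract: the moving particle is chosen uniformly at random, the weight of a candidate vertex is $\exp(β \cdot \text{count})$ with `beta` a hypothetical attraction parameter, and the graph is passed as an adjacency dictionary. The function name `arw_step` is likewise illustrative.

```python
import math
import random


def arw_step(positions, neighbors, beta, rng):
    """Move one particle; candidate vertices are weighted by exp(beta * others there).

    positions: list with the current vertex of each particle (modified in place).
    neighbors: dict mapping each vertex to an iterable of adjacent vertices.
    beta:      assumed attraction strength (inverse temperature).
    """
    i = rng.randrange(len(positions))  # assumption: uniform choice of particle
    current = positions[i]
    candidates = [current] + list(neighbors[current])  # stay or move to a neighbor
    # Count the OTHER particles at each vertex.
    counts = {}
    for j, v in enumerate(positions):
        if j != i:
            counts[v] = counts.get(v, 0) + 1
    weights = [math.exp(beta * counts.get(v, 0)) for v in candidates]
    positions[i] = rng.choices(candidates, weights=weights, k=1)[0]
    return positions


if __name__ == "__main__":
    rng = random.Random(0)
    cycle = {v: [(v - 1) % 6, (v + 1) % 6] for v in range(6)}
    state = [rng.randrange(6) for _ in range(12)]
    for _ in range(2000):
        arw_step(state, cycle, beta=1.0, rng=rng)
    print(sorted(state))
```

Larger values of `beta` make crowded vertices more attractive, which is the rich-get-richer effect the abstract refers to.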

preprint, 2020, arXiv

Broadcasting on Random Directed Acyclic Graphs

We study a generalization of the well-known model of broadcasting on trees. Consider a directed acyclic graph (DAG) with a unique source vertex $X$, and suppose all other vertices have indegree $d\geq 2$. Let the vertices at distance $k$ from $X$ be called layer $k$. At layer $0$, $X$ is given a random bit. At layer $k\geq 1$, each vertex receives $d$ bits from its parents in layer $k-1$, which are transmitted along independent binary symmetric channel edges, and combines them using a $d$-ary Boolean processing function. The goal is to reconstruct $X$ with probability of error bounded away from $1/2$ using the values of all vertices at an arbitrarily deep layer. This question is closely related to models of reliable computation and storage, and information flow in biological networks. In this paper, we analyze randomly constructed DAGs, for which we show that broadcasting is only possible if the noise level is below a certain degree and function dependent critical threshold. For $d\geq 3$, and random DAGs with layer sizes $Ω(\log k)$ and majority processing functions, we identify the critical threshold. For $d=2$, we establish a similar result for NAND processing functions. We also prove a partial converse for odd $d\geq 3$ illustrating that the identified thresholds are impossible to improve by selecting different processing functions if the decoder is restricted to using a single vertex. Finally, for any noise level, we construct explicit DAGs (using expander graphs) with bounded degree and layer sizes $Θ(\log k)$ admitting reconstruction. In particular, we show that such DAGs can be generated in deterministic quasi-polynomial time or randomized polylogarithmic time in the depth. These results portray a doubly-exponential advantage for storing a bit in DAGs compared to trees, where $d=1$ but layer sizes must grow exponentially with depth in order to enable broadcasting.

preprint, 2020, arXiv

Broadcasting on trees near criticality

We revisit the problem of broadcasting on $d$-ary trees: starting from a Bernoulli$(1/2)$ random variable $X_0$ at a root vertex, each vertex forwards its value across binary symmetric channels $\mathrm{BSC}_δ$ to $d$ descendants. The goal is to reconstruct $X_0$ given the vector $X_{L_h}$ of values of all variables at depth $h$. It is well known that reconstruction (better than a random guess) is possible as $h\to \infty$ if and only if $δ< δ_c(d)$. In this paper, we study the behavior of the mutual information and the probability of error when $δ$ is slightly subcritical. The innovation of our work is application of the recently introduced "less-noisy" channel comparison techniques. For example, we are able to derive the positive part of the phase transition (reconstructability when $δ<δ_c$) using purely information-theoretic ideas. This is in contrast with previous derivations, which explicitly analyze distribution of the Hamming weight of $X_{L_h}$ (a so-called Kesten-Stigum bound).
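The abstract does not spell out $δ_c(d)$. For broadcasting a bit on a $d$-ary tree through $\mathrm{BSC}_δ$ edges, the classical threshold is usually stated as $d(1-2δ)^2 = 1$, which gives $δ_c(d) = \frac{1}{2}(1 - d^{-1/2})$. The sketch below simply evaluates that formula for $δ \in [0, 1/2]$; the function names are hypothetical.

```python
import math


def critical_noise(d):
    """Critical crossover probability delta_c(d) solving d * (1 - 2*delta)^2 = 1."""
    return 0.5 * (1.0 - 1.0 / math.sqrt(d))


def is_reconstructible(d, delta):
    """True when d * (1 - 2*delta)^2 > 1, i.e. delta is below the threshold."""
    return d * (1.0 - 2.0 * delta) ** 2 > 1.0


if __name__ == "__main__":
    for d in (2, 3, 4, 9):
        print(d, round(critical_noise(d), 4))
```

For example, on a 4-ary tree the threshold is $δ_c = 0.25$, so $δ = 0.2$ allows reconstruction while $δ = 0.3$ does not.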

preprint, 2020, arXiv

Convergence of Smoothed Empirical Measures with Applications to Entropy Estimation

This paper studies convergence of empirical measures smoothed by a Gaussian kernel. Specifically, consider approximating $P\ast\mathcal{N}_σ$, for $\mathcal{N}_σ\triangleq\mathcal{N}(0,σ^2 \mathrm{I}_d)$, by $\hat{P}_n\ast\mathcal{N}_σ$, where $\hat{P}_n$ is the empirical measure, under different statistical distances. The convergence is examined in terms of the Wasserstein distance, total variation (TV), Kullback-Leibler (KL) divergence, and $χ^2$-divergence. We show that the approximation error under the TV distance and 1-Wasserstein distance ($\mathsf{W}_1$) converges at rate $e^{O(d)}n^{-\frac{1}{2}}$ in remarkable contrast to a typical $n^{-\frac{1}{d}}$ rate for unsmoothed $\mathsf{W}_1$ (and $d\ge 3$). For the KL divergence, squared 2-Wasserstein distance ($\mathsf{W}_2^2$), and $χ^2$-divergence, the convergence rate is $e^{O(d)}n^{-1}$, but only if $P$ achieves finite input-output $χ^2$ mutual information across the additive white Gaussian noise channel. If the latter condition is not met, the rate changes to $ω(n^{-1})$ for the KL divergence and $\mathsf{W}_2^2$, while the $χ^2$-divergence becomes infinite, a curious dichotomy. As a main application we consider estimating the differential entropy $h(P\ast\mathcal{N}_σ)$ in the high-dimensional regime. The distribution $P$ is unknown but $n$ i.i.d. samples from it are available. We first show that any good estimator of $h(P\ast\mathcal{N}_σ)$ must have sample complexity that is exponential in $d$. Using the empirical approximation results we then show that the absolute-error risk of the plug-in estimator converges at the parametric rate $e^{O(d)}n^{-\frac{1}{2}}$, thus establishing the minimax rate-optimality of the plug-in. Numerical results that demonstrate a significant empirical superiority of the plug-in approach to general-purpose differential entropy estimators are provided.

preprint, 2020, arXiv

Extrapolating the profile of a finite population

We study a prototypical problem in empirical Bayes. Namely, consider a population consisting of $k$ individuals each belonging to one of $k$ types (some types can be empty). Without any structural restrictions, it is impossible to learn the composition of the full population having observed only a small (random) subsample of size $m = o(k)$. Nevertheless, we show that in the sublinear regime of $m =ω(k/\log k)$, it is possible to consistently estimate in total variation the profile of the population, defined as the empirical distribution of the sizes of each type, which determines many symmetric properties of the population. We also prove that in the linear regime of $m=c k$ for any constant $c$ the optimal rate is $Θ(1/\log k)$. Our estimator is based on Wolfowitz's minimum distance method, which entails solving a linear program (LP) of size $k$. We show that there is a single infinite-dimensional LP whose value simultaneously characterizes the risk of the minimum distance estimator and certifies its minimax optimality. The sharp convergence rate is obtained by evaluating this LP using complex-analytic techniques.

preprint, 2020, arXiv

Massive Access for Future Wireless Communication Systems

Multiple access technology played an important role in wireless communication in the last decades: it increases the capacity of the channel and allows different users to access the system simultaneously. However, the conventional multiple access technology, as originally designed for current human-centric wireless networks, is not scalable for future machine-centric wireless networks. Massive access (studied in the literature under such names as massive-device multiple access, unsourced massive random access, massive connectivity, massive machine-type communication, and many-access channels) exhibits a clean break with current networks by potentially supporting millions of devices in each cellular network. The tremendous growth in the number of connected devices requires a fundamental rethinking of the conventional multiple access technologies in favor of new schemes suited for massive random access. Among the many new challenges arising in this setting, the most relevant are: the fundamental limits of communication from a massive number of bursty devices transmitting simultaneously with short packets, the design of low complexity and energy-efficient massive access coding and communication schemes, efficient methods for the detection of a relatively small number of active users among a large number of potential user devices with sporadic transmission pattern, and the integration of massive access with massive MIMO and other important wireless communication technologies. This paper presents an overview of the concept of massive access wireless communication and of the contemporary research on this important topic.

preprint, 2020, arXiv

Note on approximating the Laplace transform of a Gaussian on a complex disk

In this short note we study how well a Gaussian distribution can be approximated by distributions supported on $[-a,a]$. Perhaps the natural conjecture is that for large $a$ the almost optimal choice is given by truncating the Gaussian to $[-a,a]$. Indeed, such approximation achieves the optimal rate of $e^{-Θ(a^2)}$ in terms of the $L_\infty$-distance between characteristic functions. However, if we consider the $L_\infty$-distance between Laplace transforms on a complex disk, the optimal rate is $e^{-Θ(a^2 \log a)}$, while truncation still only attains $e^{-Θ(a^2)}$. The optimal rate can be attained by the Gauss-Hermite quadrature. As a corollary, we also construct a "super-flat" Gaussian mixture of $Θ(a^2)$ components with means in $[-a,a]$ and whose density has all derivatives bounded by $e^{-Ω(a^2 \log(a))}$ in the $O(1)$-neighborhood of the origin.

preprint, 2020, arXiv

Sample complexity of population recovery

The problem of population recovery refers to estimating a distribution based on incomplete or corrupted samples. Consider a random poll of sample size $n$ conducted on a population of individuals, where each pollee is asked to answer $d$ binary questions. We consider one of the two polling impediments: (a) in lossy population recovery, a pollee may skip each question with probability $ε$, (b) in noisy population recovery, a pollee may lie on each question with probability $ε$. Given $n$ lossy or noisy samples, the goal is to estimate the probabilities of all $2^d$ binary vectors simultaneously within accuracy $δ$ with high probability. This paper settles the sample complexity of population recovery. For the lossy model, the optimal sample complexity is $\tilde{Θ}(δ^{-2\max\{\frac{ε}{1-ε},1\}})$, improving the state of the art by Moitra and Saks in several ways: a lower bound is established, the upper bound is improved and the result depends at most on the logarithm of the dimension. Surprisingly, the sample complexity undergoes a phase transition from parametric to nonparametric rate when $ε$ exceeds $1/2$. For noisy population recovery, the sharp sample complexity turns out to be more sensitive to dimension and scales as $\exp(Θ(d^{1/3} \log^{2/3}(1/δ)))$ except for the trivial cases of $ε=0,1/2$ or $1$. For both models, our estimators simply compute the empirical mean of a certain function, which is found by pre-solving a linear program (LP). Curiously, the dual LP can be understood as Le Cam's method for lower-bounding the minimax risk, thus establishing the statistical optimality of the proposed estimators. The value of the LP is determined by complex-analytic methods.
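The phase transition at $ε = 1/2$ in the lossy model can be read directly off the exponent $2\max\{\frac{ε}{1-ε},1\}$ of $1/δ$. The sketch below evaluates only that exponent and ignores the logarithmic factors hidden in the tilde notation; the function name is hypothetical.

```python
def lossy_sample_exponent(eps):
    """Exponent a such that the lossy sample complexity scales like (1/delta)^a.

    Uses the stated rate with exponent 2 * max(eps / (1 - eps), 1); valid for 0 <= eps < 1.
    Logarithmic factors are ignored.
    """
    if not 0.0 <= eps < 1.0:
        raise ValueError("eps must lie in [0, 1)")
    return 2.0 * max(eps / (1.0 - eps), 1.0)


if __name__ == "__main__":
    for eps in (0.1, 0.5, 0.6, 0.75, 0.9):
        print(eps, lossy_sample_exponent(eps))
```

For $ε \le 1/2$ the exponent stays at the parametric value 2, while for $ε = 0.75$ it is already 6.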

preprint, 2020, arXiv

Self-regularizing Property of Nonparametric Maximum Likelihood Estimator in Mixture Models

Introduced by Kiefer and Wolfowitz [KW56], the nonparametric maximum likelihood estimator (NPMLE) is a widely used methodology for learning mixture models and empirical Bayes estimation. Sidestepping the non-convexity in mixture likelihood, the NPMLE estimates the mixing distribution by maximizing the total likelihood over the space of probability measures, which can be viewed as an extreme form of overparameterization. In this paper we discover a surprising property of the NPMLE solution. Consider, for example, a Gaussian mixture model on the real line with a subgaussian mixing distribution. Leveraging complex-analytic techniques, we show that with high probability the NPMLE based on a sample of size $n$ has $O(\log n)$ atoms (mass points), significantly improving the deterministic upper bound of $n$ due to Lindsay [lindsay1983geometry1]. Notably, any such Gaussian mixture is statistically indistinguishable from a finite one with $O(\log n)$ components (and this is tight for certain mixtures). Thus, absent any explicit form of model selection, NPMLE automatically chooses the right model complexity, a property we term self-regularization. Extensions to other exponential families are given. As a statistical application, we show that this structural property can be harnessed to bootstrap existing Hellinger risk bounds of the (parametric) MLE for finite Gaussian mixtures to the NPMLE for general Gaussian mixtures, recovering a result of Zhang [zhang2009generalized].

preprint, 2020, arXiv

The Information Bottleneck Problem and Its Applications in Machine Learning

Inference capabilities of machine learning (ML) systems skyrocketed in recent years, now playing a pivotal role in various aspects of society. The goal in statistical learning is to use data to obtain simple algorithms for predicting a random variable $Y$ from a correlated observation $X$. Since the dimension of $X$ is typically huge, computationally feasible solutions should summarize it into a lower-dimensional feature vector $T$, from which $Y$ is predicted. The algorithm will successfully make the prediction if $T$ is a good proxy of $Y$, despite the said dimensionality-reduction. A myriad of ML algorithms (mostly employing deep learning (DL)) for finding such representations $T$ based on real-world data are now available. While these methods are often effective in practice, their success is hindered by the lack of a comprehensive theory to explain it. The information bottleneck (IB) theory recently emerged as a bold information-theoretic paradigm for analyzing DL systems. Adopting mutual information as the figure of merit, it suggests that the best representation $T$ should be maximally informative about $Y$ while minimizing the mutual information with $X$. In this tutorial we survey the information-theoretic origins of this abstract principle, and its recent impact on DL. For the latter, we cover implications of the IB problem on DL theory, as well as practical algorithms inspired by it. Our goal is to provide a unified and cohesive description. A clear view of current knowledge is particularly important for further leveraging IB and other information-theoretic ideas to study DL models.

preprint, 2016, arXiv

A Beta-Beta Achievability Bound with Applications

A channel coding achievability bound expressed in terms of the ratio between two Neyman-Pearson $β$ functions is proposed. This bound is the dual of a converse bound established earlier by Polyanskiy and Verdú (2014). The new bound turns out to simplify considerably the analysis in situations where the channel output distribution is not a product distribution, for example due to a cost constraint or a structural constraint (such as orthogonality or constant composition) on the channel inputs. Connections to existing bounds in the literature are discussed. The bound is then used to derive 1) an achievability bound on the channel dispersion of additive non-Gaussian noise channels with random Gaussian codebooks, 2) the channel dispersion of the exponential-noise channel, 3) a second-order expansion for the minimum energy per bit of an AWGN channel, and 4) a lower bound on the maximum coding rate of a multiple-input multiple-output Rayleigh-fading channel with perfect channel state information at the receiver, which is the tightest known achievability result.

preprint, 2016, arXiv

Bounds on the Reliability of a Typewriter Channel

We give new bounds on the reliability function of a typewriter channel with 5 inputs and crossover probability $1/2$. The lower bound is more of theoretical than practical importance; it improves very marginally the expurgated bound, providing a counterexample to a conjecture on its tightness by Shannon, Gallager and Berlekamp which does not need the construction of algebraic-geometric codes previously used by Katsman, Tsfasman and Vlăduţ. The upper bound is derived by using an adaptation of the linear programming bound and it is essentially useful as a low-rate anchor for the straight line bound.

preprint, 2016, arXiv

Minimum Energy to Send $k$ Bits Over Multiple-Antenna Fading Channels

This paper investigates the minimum energy required to transmit $k$ information bits with a given reliability over a multiple-antenna Rayleigh block-fading channel, with and without channel state information (CSI) at the receiver. No feedback is assumed. It is well known that the ratio between the minimum energy per bit and the noise level converges to $-1.59$ dB as $k$ goes to infinity, regardless of whether CSI is available at the receiver or not. This paper shows that lack of CSI at the receiver causes a slowdown in the speed of convergence to $-1.59$ dB as $k\to\infty$ compared to the case of perfect receiver CSI. Specifically, we show that, in the no-CSI case, the gap to $-1.59$ dB is proportional to $((\log k) /k)^{1/3}$, whereas when perfect CSI is available at the receiver, this gap is proportional to $1/\sqrt{k}$. In both cases, the gap to $-1.59$ dB is independent of the number of transmit antennas and of the channel's coherence time. Numerically, we observe that, when the receiver is equipped with a single antenna, to achieve an energy per bit of $-1.5$ dB in the no-CSI case, one needs to transmit at least $7\times 10^7$ information bits, whereas $6\times 10^4$ bits suffice for the case of perfect CSI at the receiver.
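The $-1.59$ dB figure in this abstract is the classical limit $10\log_{10}(\ln 2)$ on energy per bit relative to the noise level. The sketch below checks that number and compares the two stated gap scalings; the proportionality constants are not given in the abstract, so `gap_scaling` sets them to 1 as a loudly labeled assumption and only the relative decay is meaningful.

```python
import math


def shannon_limit_db():
    """Asymptotic minimum energy per bit over noise level, in dB: 10*log10(ln 2)."""
    return 10.0 * math.log10(math.log(2.0))


def gap_scaling(k, receiver_csi):
    """Stated order of the gap to the limit (unknown constants set to 1 by assumption)."""
    if receiver_csi:
        return 1.0 / math.sqrt(k)
    return (math.log(k) / k) ** (1.0 / 3.0)


if __name__ == "__main__":
    print(round(shannon_limit_db(), 2))
    for k in (10 ** 4, 10 ** 6, 10 ** 8):
        print(k, gap_scaling(k, True), gap_scaling(k, False))
```

The much slower decay of $((\log k)/k)^{1/3}$ is consistent with the abstract's observation that far more bits are needed without CSI to get close to the limit.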

preprint, 2016, arXiv

On metric properties of maps between Hamming spaces and related graph homomorphisms

A mapping of $k$-bit strings into $n$-bit strings is called an $(α,β)$-map if $k$-bit strings which are more than $αk$ apart are mapped to $n$-bit strings that are more than $βn$ apart. This is a relaxation of the classical problem of constructing error-correcting codes, which corresponds to $α=0$. Existence of an $(α,β)$-map is equivalent to existence of a graph homomorphism $\bar H(k,αk)\to \bar H(n,βn)$, where $H(n,d)$ is a Hamming graph with vertex set $\{0,1\}^n$ and edges connecting vertices differing in $d$ or fewer entries. This paper proves impossibility results on achievable parameters $(α,β)$ in the regime of $n,k\to\infty$ with a fixed ratio ${n\over k}= ρ$. This is done by developing a general criterion for existence of graph-homomorphism based on the semi-definite relaxation of the independence number of a graph (known as Schrijver's $θ$-function). The criterion is then evaluated using some known and some new results from coding theory concerning the $θ$-function of Hamming graphs. As an example, it is shown that if $β>1/2$ and ${n\over k}$ is an integer, the ${n\over k}$-fold repetition map achieving $α=β$ is asymptotically optimal. Finally, constraints on configurations of points and hyperplanes in projective spaces over $\mathbb{F}_2$ are derived.

preprint, 2016, arXiv

Rate-distance tradeoff for codes above graph capacity

The capacity of a graph is defined as the rate of exponential growth of independent sets in the strong powers of the graph. In the strong power an edge connects two sequences if at each position their letters are equal or adjacent. We consider a variation of the problem where edges in the power graphs are removed between sequences which differ in more than a fraction $δ$ of coordinates. The proposed generalization can be interpreted as the problem of determining the highest rate of zero undetected-error communication over a link with adversarial noise, where only a fraction $δ$ of symbols can be perturbed and only some substitutions are allowed. We derive lower bounds on achievable rates by combining graph homomorphisms with a graph-theoretic generalization of the Gilbert-Varshamov bound. We then give an upper bound, based on Delsarte's linear programming approach, which combines Lovász' theta function with the construction used by McEliece et al. for bounding the minimum distance of codes in Hamming spaces.

preprint 2016 arXiv

Strong data-processing inequalities for channels and Bayesian networks

The data-processing inequality, that is, $I(U;Y) \le I(U;X)$ for a Markov chain $U \to X \to Y$, has been the method of choice for proving impossibility (converse) results in information theory and many other disciplines. Various channel-dependent improvements (called strong data-processing inequalities, or SDPIs) of this inequality have been proposed both classically and more recently. In this note we first survey known results relating various notions of contraction for a single channel. Then we consider the basic extension: given an SDPI for each constituent channel in a Bayesian network, how to produce an end-to-end SDPI? Our approach is based on the (extract of the) Evans-Schulman method, which is demonstrated for three different kinds of SDPIs, namely, the usual Ahlswede-Gács type contraction coefficients (mutual information), Dobrushin's contraction coefficients (total variation), and finally the $F_I$-curve (the best possible non-linear SDPI for a given channel). Resulting bounds on the contraction coefficients are interpreted as probability of site percolation. As an example, we demonstrate how to obtain an SDPI for an $n$-letter memoryless channel with feedback given an SDPI for $n=1$. Finally, we discuss a simple observation on the equivalence of a linear SDPI and comparison to an erasure channel (in the sense of "less noisy" order). This leads to a simple proof of a curious inequality of Samorodnitsky (2015), and sheds light on how information spreads in the subsets of inputs of a memoryless channel.
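As a small numerical illustration of the difference between the plain data-processing inequality and a linear SDPI, the sketch below uses a chain of two binary symmetric channels. The setup is an assumption made for illustration and is not taken from the paper: $U$ is a uniform bit, $X$ is $U$ through a BSC with crossover $a$, and $Y$ is $X$ through a BSC with crossover $δ$. It uses the standard fact that the mutual-information contraction coefficient of a BSC with crossover $δ$ is $(1-2δ)^2$.

```python
from math import log2

def h2(p):
    """Binary entropy in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

def conv(a, b):
    """Crossover probability of two binary symmetric channels in series."""
    return a * (1 - b) + b * (1 - a)

def mutual_info_bsc(crossover):
    """I(input; output) in bits for a uniform bit sent through a BSC."""
    return 1.0 - h2(crossover)

a, delta = 0.1, 0.2                      # illustrative crossover probabilities
i_ux = mutual_info_bsc(a)                # I(U;X)
i_uy = mutual_info_bsc(conv(a, delta))   # I(U;Y), U -> X -> Y
eta = (1 - 2 * delta) ** 2               # contraction coefficient of BSC(delta)

print(f"I(U;X) = {i_ux:.4f} bits, I(U;Y) = {i_uy:.4f} bits")
print("DPI holds: ", i_uy <= i_ux)
print("SDPI holds:", i_uy <= eta * i_ux)
```

The second check is strictly stronger than the first whenever $δ$ is strictly between 0 and 1, which is the sense in which an SDPI quantifies how much information the channel must lose.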

preprint 2016 arXiv

Wasserstein continuity of entropy and outer bounds for interference channels

It is shown that under suitable regularity conditions, differential entropy is a Lipschitz functional on the space of distributions on $n$-dimensional Euclidean space with respect to the quadratic Wasserstein distance. Under similar conditions, (discrete) Shannon entropy is shown to be Lipschitz continuous in distributions over the product space with respect to Ornstein's $\bar d$-distance (Wasserstein distance corresponding to the Hamming distance). These results together with Talagrand's and Marton's transportation-information inequalities allow one to replace the unknown multi-user interference with its i.i.d. approximations. As an application, a new outer bound for the two-user Gaussian interference channel is proved, which, in particular, settles the "missing corner point" problem of Costa (1985).

preprint 2015 arXiv

Bounds for codes on pentagon and other cycles

The capacity of a graph is defined as the rate of exponential growth of independent sets in the strong powers of the graph. In the strong power, an edge connects two sequences if at each position the letters are equal or adjacent. We consider a variation of the problem where edges in the power graphs are removed among sequences which differ in more than a fraction $δ$ of coordinates. For odd cycles, we derive an upper bound on the corresponding rate which combines Lovász' bound on the capacity with Delsarte's linear programming bounds on the minimum distance of codes in Hamming spaces. For the pentagon, this shows that for $δ\ge {1-{1\over\sqrt{5}}}$ the Lovász rate is the best possible, while we prove by a Gilbert-Varshamov-type bound that a higher rate is achievable for $δ< {2\over 5}$. The communication interpretation of this question is the problem of sending quinary symbols subject to $\pm 1\mod 5$ disturbance. The maximal communication rate subject to zero undetected error equals the capacity of a pentagon. The question addressed here is how much this rate can be increased if only a fraction $δ$ of symbols is allowed to be disturbed.
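For readers unfamiliar with strong powers, the following sketch checks the classical five-codeword independent set in the second strong power of the pentagon (codewords $(i, 2i \bmod 5)$). This is a textbook construction used here only to illustrate the definitions. It is not the $δ$-constrained variant studied in the paper, and the helper names are hypothetical.

```python
from itertools import combinations
from math import log2, sqrt

def adjacent_or_equal(a, b):
    """Two pentagon letters are confusable if equal or neighbours mod 5."""
    return (a - b) % 5 in (0, 1, 4)

def confusable(u, v):
    """Edge in the strong power: every position is equal or adjacent."""
    return all(adjacent_or_equal(a, b) for a, b in zip(u, v))

# Classical length-2 code on the pentagon: (i, 2i mod 5) for i = 0..4.
code = [(i, (2 * i) % 5) for i in range(5)]

is_independent = all(not confusable(u, v) for u, v in combinations(code, 2))
rate = log2(len(code)) / 2   # bits per symbol

print("independent set:", is_independent)
print("rate equals log2(sqrt(5)):", abs(rate - log2(sqrt(5))) < 1e-12)
```

Five codewords of length 2 give rate $\log_2\sqrt{5}$ bits per symbol, which matches the Lovász value referred to in the abstract.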

preprint 2015 arXiv

Dissipation of information in channels with input constraints

One of the basic tenets in information theory, the data processing inequality states that output divergence does not exceed the input divergence for any channel. For channels without input constraints, various estimates on the amount of such contraction are known, Dobrushin's coefficient for the total variation being perhaps the most well-known. This work investigates channels with an average input cost constraint. It is found that while the contraction coefficient typically equals one (no contraction), the information nevertheless dissipates. A certain non-linear function, the \emph{Dobrushin curve} of the channel, is proposed to quantify the amount of dissipation. Tools for evaluating the Dobrushin curve of additive-noise channels are developed based on coupling arguments. Some basic applications in stochastic control, uniqueness of Gibbs measures and fundamental limits of noisy circuits are discussed. As an application, it is shown that in the chain of $n$ power-constrained relays and Gaussian channels the end-to-end mutual information and maximal squared correlation decay as $Θ(\frac{\log\log n}{\log n})$, which is in stark contrast with the exponential decay in chains of discrete channels. Similarly, the behavior of noisy circuits (composed of gates with bounded fan-in) and broadcasting of information on trees (of bounded degree) does not experience threshold behavior in the signal-to-noise ratio (SNR). Namely, unlike the case of discrete channels, the probability of bit error stays bounded away from $1\over 2$ regardless of the SNR.

preprint 2015 arXiv

Optimum Power Control at Finite Blocklength

This paper investigates the maximal channel coding rate achievable at a given blocklength $n$ and error probability $ε$, when the codewords are subject to a long-term (i.e., averaged-over-all-codeword) power constraint. The second-order term in the large-$n$ expansion of the maximal channel coding rate is characterized both for additive white Gaussian noise (AWGN) channels and for quasi-static fading channels with perfect channel state information available at both the transmitter and the receiver. It is shown that in both cases the second-order term is proportional to $\sqrt{n^{-1}\ln n}$. For the quasi-static fading case, this second-order term is achieved by truncated channel inversion, namely, by concatenating a dispersion-optimal code for an AWGN channel subject to a short-term power constraint, with a power controller that inverts the channel whenever the fading gain is above a certain threshold. Easy-to-evaluate approximations of the maximal channel coding rate are developed for both the AWGN and the quasi-static fading case.

preprint 2015 arXiv

Short-Packet Communications over Multiple-Antenna Rayleigh-Fading Channels

Motivated by the current interest in ultra-reliable, low-latency, machine-type communication systems, we investigate the tradeoff between reliability, throughput, and latency in the transmission of information over multiple-antenna Rayleigh block-fading channels. Specifically, we obtain finite-blocklength, finite-SNR upper and lower bounds on the maximum coding rate achievable over such channels for a given constraint on the packet error probability. Numerical evidence suggests that our bounds delimit tightly the maximum coding rate already for short blocklengths (packets of about 100 symbols). Furthermore, our bounds reveal the existence of a tradeoff between the rate gain obtainable by spreading each codeword over all available time-frequency-spatial degrees of freedom, and the rate loss caused by the need of estimating the fading coefficients over these degrees of freedom. In particular, our bounds allow us to determine the optimal number of transmit antennas and the optimal number of time-frequency diversity branches that maximize the rate. Finally, we show that infinite-blocklength performance metrics such as the ergodic capacity and the outage capacity yield inaccurate throughput estimates.

preprint 2015 arXiv

Upper bound on list-decoding radius of binary codes

Consider the problem of packing Hamming balls of a given relative radius subject to the constraint that they cover any point of the ambient Hamming space with multiplicity at most $L$. For odd $L\ge 3$ an asymptotic upper bound on the rate of any such packing is proven. The resulting bound improves the best known bound (due to Blinovsky, 1986) for rates below a certain threshold. The method is a superposition of the linear-programming idea of Ashikhmin, Barg and Litsyn (that was used previously to improve the estimates of Blinovsky for $L=2$) and a Ramsey-theoretic technique of Blinovsky. As an application it is shown that for all odd $L$ the slope of the rate-radius tradeoff is zero at zero rate.

preprint 2015 arXiv

Variable-length compression allowing errors

This paper studies the fundamental limits of the minimum average length of lossless and lossy variable-length compression, allowing a nonzero error probability $ε$, for lossless compression. We give non-asymptotic bounds on the minimum average length in terms of Erokhin's rate-distortion function and we use those bounds to obtain a Gaussian approximation on the speed of approach to the limit which is quite accurate for all but small blocklengths: $$(1 - ε) k H(\mathsf S) - \sqrt{\frac{k V(\mathsf S)}{2 π} } e^{- \frac {(Q^{-1}(ε))^2} 2 }$$ where $Q^{-1}(\cdot)$ is the functional inverse of the standard Gaussian complementary cdf, and $V(\mathsf S)$ is the source dispersion. A nonzero error probability thus not only reduces the asymptotically achievable rate by a factor of $1 - ε$, but this asymptotic limit is approached from below, i.e. larger source dispersions and shorter blocklengths are beneficial. Variable-length lossy compression under an excess distortion constraint is shown to exhibit similar properties.
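The Gaussian approximation quoted above is easy to evaluate numerically. The sketch below does so for a memoryless Bernoulli($p$) source, which is an assumption chosen for illustration. The function names are hypothetical, entropy and dispersion are measured in bits, and the dispersion is taken to be the variance of the information $\log_2 \frac{1}{P(\mathsf S)}$.

```python
from math import exp, log2, pi, sqrt
from statistics import NormalDist

def q_inv(eps):
    """Inverse of the standard Gaussian complementary cdf Q, for 0 < eps < 1."""
    return NormalDist().inv_cdf(1.0 - eps)

def bernoulli_entropy_and_dispersion(p):
    """Entropy H (bits) and dispersion V (bits^2) of a Bernoulli(p) source,
    with V taken as the variance of log2(1/P(S))."""
    H = -p * log2(p) - (1 - p) * log2(1 - p)
    V = p * (1 - p) * log2((1 - p) / p) ** 2
    return H, V

def approx_min_avg_length(k, eps, H, V):
    """Evaluate (1 - eps) k H - sqrt(k V / (2 pi)) exp(-Q^{-1}(eps)^2 / 2)."""
    return (1 - eps) * k * H - sqrt(k * V / (2 * pi)) * exp(-q_inv(eps) ** 2 / 2)

k, eps, p = 1000, 0.1, 0.11          # illustrative values only
H, V = bernoulli_entropy_and_dispersion(p)
length = approx_min_avg_length(k, eps, H, V)
print(f"approximate minimum average length: {length:.1f} bits")
print(f"first-order term (1 - eps) k H:     {(1 - eps) * k * H:.1f} bits")
```

Because the correction term is subtracted, the approximation lies below $(1-ε)kH(\mathsf S)$ whenever $V(\mathsf S) > 0$, in line with the abstract's remark that the limit is approached from below.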

preprint 2014 arXiv

Algebraic Methods of Classifying Directed Graphical Models

Directed acyclic graphical models (DAGs) are often used to describe common structural properties in a family of probability distributions. This paper addresses the question of classifying DAGs up to an isomorphism. By considering Gaussian densities, the question reduces to verifying equality of certain algebraic varieties. The question of computing equations for these varieties has been previously raised in the literature. Here it is shown that the most natural method adds spurious components with singular principal minors, proving a conjecture of Sullivant. This characterization is used to establish an algebraic criterion for isomorphism, and to provide a randomized algorithm for checking that criterion. Results are applied to produce a list of the isomorphism classes of tree models on 4, 5, and 6 nodes. Finally, some evidence is provided to show that projectivized DAG varieties contain useful information in the sense that their relative embedding is closely related to efficient inference.

preprint 2014 arXiv

Peak-to-average power ratio of good codes for Gaussian channel

Consider a problem of forward error-correction for the additive white Gaussian noise (AWGN) channel. For finite blocklength codes the backoff from the channel capacity is inversely proportional to the square root of the blocklength. In this paper it is shown that codes achieving this tradeoff must necessarily have peak-to-average power ratio (PAPR) proportional to logarithm of the blocklength. This is extended to codes approaching capacity slower, and to PAPR measured at the output of an OFDM modulator. As a by-product the convergence of (Smith's) amplitude-constrained AWGN capacity to Shannon's classical formula is characterized in the regime of large amplitudes. This converse-type result builds upon recent contributions in the study of empirical output distributions of good channel codes.

preprint 2014 arXiv

Quasi-Static Multiple-Antenna Fading Channels at Finite Blocklength

This paper investigates the maximal achievable rate for a given blocklength and error probability over quasi-static multiple-input multiple-output (MIMO) fading channels, with and without channel state information (CSI) at the transmitter and/or the receiver. The principal finding is that outage capacity, despite being an asymptotic quantity, is a sharp proxy for the finite-blocklength fundamental limits of slow-fading channels. Specifically, the channel dispersion is shown to be zero regardless of whether the fading realizations are available at both transmitter and receiver, at only one of them, or at neither of them. These results follow from analytically tractable converse and achievability bounds. Numerical evaluation of these bounds verifies that zero dispersion may indeed imply fast convergence to the outage capacity as the blocklength increases. In the example of a particular $1 \times 2$ single-input multiple-output (SIMO) Rician fading channel, the blocklength required to achieve $90\%$ of capacity is about an order of magnitude smaller compared to the blocklength required for an AWGN channel with the same capacity. For this specific scenario, the coding/decoding schemes adopted in the LTE-Advanced standard are benchmarked against the finite-blocklength achievability and converse bounds.

preprint 2013 arXiv

Empirical distribution of good channel codes with non-vanishing error probability (extended version)

This paper studies several properties of channel codes that approach the fundamental limits of a given (discrete or Gaussian) memoryless channel with a non-vanishing probability of error. The output distribution induced by an $ε$-capacity-achieving code is shown to be close in a strong sense to the capacity achieving output distribution. Relying on the concentration of measure (isoperimetry) property enjoyed by the latter, it is shown that regular (Lipschitz) functions of channel outputs can be precisely estimated and turn out to be essentially non-random and independent of the actual code. It is also shown that the output distribution of a good code and the capacity achieving one cannot be distinguished with exponential reliability. The random process produced at the output of the channel is shown to satisfy the asymptotic equipartition property. Using related methods it is shown that quadratic forms and sums of $q$-th powers when evaluated at codewords of good AWGN codes approach the values obtained from a randomly generated Gaussian codeword.

preprint 2013 arXiv

On Locally Decodable Source Coding

Locally decodable channel codes form a special class of error-correcting codes with the property that the decoder is able to reconstruct any bit of the input message from querying only a few bits of a noisy codeword. It is well known that such codes require significantly more redundancy (in particular have vanishing rate) compared to their non-local counterparts. In this paper, we define a dual problem, i.e. locally decodable source codes (LDSC). We consider both almost lossless (block error) and lossy (bit error) cases. In the almost lossless case, we show that optimal compression (to entropy) is possible with $O(\log n)$ queries to the compressed string by the decompressor. We also show the following converse bounds: 1) linear LDSC cannot achieve any rate below one with a bounded number of queries, 2) the rate of any source coding with a linear decoder (not necessarily local) is one, 3) for 2 queries, any code construction cannot have a rate below one. In the lossy case, we show that any rate above rate distortion is achievable with a bounded number of queries. We also show that rate distortion is achievable with any scaling number of queries. We provide an achievability bound in the finite block-length regime and compare it with the existing bounds in the succinct data structures literature.

preprint 2013 arXiv

Quasi-Static SIMO Fading Channels at Finite Blocklength

We investigate the maximal achievable rate for a given blocklength and error probability over quasi-static single-input multiple-output (SIMO) fading channels. Under mild conditions on the channel gains, it is shown that the channel dispersion is zero regardless of whether the fading realizations are available at the transmitter and/or the receiver. The result follows from computationally and analytically tractable converse and achievability bounds. Through numerical evaluation, we verify that, in some scenarios, zero dispersion indeed entails fast convergence to outage capacity as the blocklength increases. In the example of a particular $1 \times 2$ SIMO Rician channel, the blocklength required to achieve 90% of capacity is about an order of magnitude smaller compared to the blocklength required for an AWGN channel with the same capacity.

preprint 2013 arXiv

Tight Lower Bound for Linear Sketches of Moments

The problem of estimating frequency moments of a data stream has attracted a lot of attention since the onset of streaming algorithms [AMS99]. While the space complexity for approximately computing the $p^{\rm th}$ moment, for $p\in(0,2]$ has been settled [KNW10], for $p>2$ the exact complexity remains open. For $p>2$ the current best algorithm uses $O(n^{1-2/p}\log n)$ words of space [AKO11,BO10], whereas the lower bound is of $Ω(n^{1-2/p})$ [BJKS04]. In this paper, we show a tight lower bound of $Ω(n^{1-2/p}\log n)$ words for the class of algorithms based on linear sketches, which store only a sketch $Ax$ of input vector $x$ and some (possibly randomized) matrix $A$. We note that all known algorithms for this problem are linear sketches.
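To make the notion of a linear sketch concrete, the sketch below maintains $Ax$ under a stream of coordinate updates and reads an estimate off the sketch alone. It is an illustration under stated assumptions, not the construction analysed in the paper: $A$ is a random $\pm 1$ matrix, the estimator is the classical AMS-style estimate of the second moment ($p=2$, not the $p>2$ regime of the lower bound), and all names and sizes are hypothetical.

```python
import random

def make_sign_matrix(m, n, seed=0):
    """Random m x n matrix with independent +/-1 entries (fixed seed)."""
    rng = random.Random(seed)
    return [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(m)]

def apply_update(sketch, A, i, delta):
    """Stream update x[i] += delta, applied to the sketch via linearity."""
    for row, a_row in enumerate(A):
        sketch[row] += delta * a_row[i]

def sketch_of(A, x):
    """Directly compute the sketch Ax of a fully known vector x."""
    return [sum(a * xi for a, xi in zip(a_row, x)) for a_row in A]

n, m = 8, 64
A = make_sign_matrix(m, n)
stream = [(0, 3), (5, 1), (0, -1), (2, 4), (5, 2)]   # (index, increment) pairs

sketch = [0] * m
x = [0] * n            # kept only to verify the sketch; a real algorithm would not store x
for i, delta in stream:
    apply_update(sketch, A, i, delta)
    x[i] += delta

f2_exact = sum(v * v for v in x)
f2_estimate = sum(y * y for y in sketch) / m   # each y^2 has expectation F_2

print("sketch equals Ax:", sketch == sketch_of(A, x))
print("exact F_2:", f2_exact, " estimate from sketch:", f2_estimate)
```

The point of the example is the first check: because the sketch is linear in $x$, increments and decrements can be folded into it one at a time, and the algorithm's memory is only the $m$ sketch entries (plus whatever is needed to represent $A$).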

preprint 2012 arXiv

Diversity versus Channel Knowledge at Finite Block-Length

We study the maximal achievable rate R*(n, ε) for a given block-length n and block error probability ε over Rayleigh block-fading channels in the noncoherent setting and in the finite block-length regime. Our results show that for a given block-length and error probability, R*(n, ε) is not monotonic in the channel's coherence time, but there exists a rate-maximizing coherence time that optimally trades between diversity and the cost of estimating the channel.

Institution

Affiliation not imported yet

This author record came from a source that does not expose affiliation metadata. Once the author claims the profile or we enrich the record from another provider, this section will link to the concrete institution.

Source provenance

Where this author record came from

arxiv, confidence 95%

external id: arxiv:2605.06870:author:6:yury-polyanskiy

Imported May 20, 2026. Synced May 21, 2026.

arxiv, confidence 95%

external id: arxiv:2605.08505:author:4:yury-polyanskiy

Imported May 20, 2026. Synced May 21, 2026.

arxiv, confidence 95%

external id: arxiv:2605.13768:author:2:yury-polyanskiy

Imported May 20, 2026. Synced May 21, 2026.

11 works

Yihong Wu

8 works

Wei Yang

7 works

Giuseppe Durisi

4 works

Tobias Koch

39 published items

Continuous First, Discrete Later: VQ-VAEs Without Dimensional Collapse

High-Rate Quantized Matrix Multiplication II

Scaling Limits of Long-Context Transformers

Intrinsic Dimension Estimation Using Wasserstein Distances

Dualizing Le Cam's method for functional estimation, with applications to estimating the unseens

Stochastic block model entropy and broadcasting on trees with survey

A Note on the Probability of Rectangles for Correlated Binary Strings

Application of information-percolation method to reconstruction problems on graphs

Attracting Random Walks

Broadcasting on Random Directed Acyclic Graphs

Broadcasting on trees near criticality

Convergence of Smoothed Empirical Measures with Applications to Entropy Estimation

Extrapolating the profile of a finite population

Massive Access for Future Wireless Communication Systems

Note on approximating the Laplace transform of a Gaussian on a complex disk

Sample complexity of population recovery

Self-regularizing Property of Nonparametric Maximum Likelihood Estimator in Mixture Models

The Information Bottleneck Problem and Its Applications in Machine Learning

A Beta-Beta Achievability Bound with Applications

Bounds on the Reliability of a Typewriter Channel

Minimum Energy to Send $k$ Bits Over Multiple-Antenna Fading Channels

On metric properties of maps between Hamming spaces and related graph homomorphisms

Rate-distance tradeoff for codes above graph capacity

Strong data-processing inequalities for channels and Bayesian networks

Wasserstein continuity of entropy and outer bounds for interference channels

Bounds for codes on pentagon and other cycles

Dissipation of information in channels with input constraints

Optimum Power Control at Finite Blocklength

Short-Packet Communications over Multiple-Antenna Rayleigh-Fading Channels

Upper bound on list-decoding radius of binary codes

Variable-length compression allowing errors

Algebraic Methods of Classifying Directed Graphical Models

Peak-to-average power ratio of good codes for Gaussian channel

Quasi-Static Multiple-Antenna Fading Channels at Finite Blocklength

Empirical distribution of good channel codes with non-vanishing error probability (extended version)

On Locally Decodable Source Coding

Quasi-Static SIMO Fading Channels at Finite Blocklength

Tight Lower Bound for Linear Sketches of Moments

Diversity versus Channel Knowledge at Finite Block-Length