Source author record

Marco Mondelli

Marco Mondelli appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Information Theory math.IT Machine Learning math.ST Statistics Theory Artificial Intelligence Distributed, Parallel, and Cluster Computing

Catalog footprint

What is connected

21works

7topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2026arXiv

Optimal Representation Size: High-Dimensional Analysis of Pretraining and Linear Probing

Learning to generalise from limited data is a fundamental challenge for both artificial and biological systems. A common strategy is to extract reusable structure from abundant unlabelled data, enabling efficient adaptation to new tasks from limited labelled data. This two-stage paradigm is now standard in modern training pipelines, where pretraining is followed by fine-tuning or linear probing. We provide an analytical model of this process: structure extraction is formalized as principal component analysis on unlabelled data, and downstream learning as linear regression on a separate labelled dataset. In the high-dimensional regime, we derive exact expressions for training and generalisation error showcasing their dependence on representation dimensionality, unlabelled and labelled sample sizes, and task alignment. Our results show that pretrained representations strongly influence downstream generalisation, and we characterize the optimal representation size as a function of task parameters: with abundant pretraining data but scarce downstream data, maximally compressed representations are optimal, whereas with limited pretraining data, higher-dimensional representations generalise better. Furthermore, we establish an exact trade-off between pretraining and supervision, quantifying how much unlabelled data is required to replace a single labelled sample. Beyond our idealised model, we observe similar phenomenology in autoencoders and pretrained LLMs. Altogether, we highlight that optimising representation size is critical, giving conditions for when compression during pretraining improves generalisation.

preprint2026arXiv

Precise Asymptotics for Spectral Methods in Mixed Generalized Linear Models

In a mixed generalized linear model, the goal is to learn multiple signals from unlabeled observations: each sample comes from exactly one signal, but it is not known which one. We consider the prototypical problem of estimating two statistically independent signals in a mixed generalized linear model with Gaussian covariates. Spectral methods are a popular class of estimators which output the top two eigenvectors of a suitable data-dependent matrix. However, despite the wide applicability, their design is still obtained via heuristic considerations, and the number of samples $n$ needed to guarantee recovery is super-linear in the signal dimension $d$. In this paper, we develop exact asymptotics on spectral methods in the challenging proportional regime in which $n, d$ grow large and their ratio converges to a finite constant. This allows us optimize the design of the spectral method, and combine it with a simple linear estimator, to minimize the estimation error. Our characterization exploits a mix of tools from random matrices, free probability and the theory of approximate message passing algorithms. Numerical simulations for mixed linear regression and phase retrieval demonstrate the advantage enabled by our analysis over existing designs of spectral methods.

preprint2026arXiv

Sink vs. diagonal patterns as mechanisms for attention switch and oversmoothing prevention

This paper studies the role of sinks and diagonal patterns as attention switch and anti-oversmoothing mechanisms. We analyze geometric conditions under which sinks can be represented, showing a necessary alignment between the embedding of the sink and all other embeddings. Next, we refine the current understanding of the role of sinks in oversmoothing prevention: we specify the conditions under which dense attention provably smooths more than sparse attention, and empirically verify that such conditions are often satisfied in practice. We further prove an equivalence between sinks and hard attention switch, in which the output of the attention is identically 0. Finally, we relax the hard attention switch by allowing token self-communication: we provide a quantitative comparison of the costs of representing sinks vs.\ diagonal patterns, showing why sinks are favored in pretrained transformers. The introduction and analysis of diagonal patterns and the generalization of the attention switch close the gap between what oversmoothing prevention requires and what sinks provide, while also establishing when and why attention layers act like MLPs if token communication is not necessary.

preprint2022arXiv

Estimation in Rotationally Invariant Generalized Linear Models via Approximate Message Passing

We consider the problem of signal estimation in generalized linear models defined via rotationally invariant design matrices. Since these matrices can have an arbitrary spectral distribution, this model is well suited for capturing complex correlation structures which often arise in applications. We propose a novel family of approximate message passing (AMP) algorithms for signal estimation, and rigorously characterize their performance in the high-dimensional limit via a state evolution recursion. Our rotationally invariant AMP has complexity of the same order as the existing AMP derived under the restrictive assumption of a Gaussian design; our algorithm also recovers this existing AMP as a special case. Numerical results showcase a performance close to Vector AMP (which is conjectured to be Bayes-optimal in some settings), but obtained with a much lower complexity, as the proposed algorithm does not require a computationally expensive singular value decomposition.

preprint2022arXiv

Fundamental Limits of Two-layer Autoencoders, and Achieving Them with Gradient Methods

Autoencoders are a popular model in many branches of machine learning and lossy data compression. However, their fundamental limits, the performance of gradient methods and the features learnt during optimization remain poorly understood, even in the two-layer setting. In fact, earlier work has considered either linear autoencoders or specific training regimes (leading to vanishing or diverging compression rates). Our paper addresses this gap by focusing on non-linear two-layer autoencoders trained in the challenging proportional regime in which the input dimension scales linearly with the size of the representation. Our results characterize the minimizers of the population risk, and show that such minimizers are achieved by gradient methods; their structure is also unveiled, thus leading to a concise description of the features obtained via training. For the special case of a sign activation function, our analysis establishes the fundamental limits for the lossy compression of Gaussian sources via (shallow) autoencoders. Finally, while the results are proved for Gaussian data, numerical simulations on standard datasets display the universality of the theoretical predictions.

preprint2022arXiv

Mean-field Analysis of Piecewise Linear Solutions for Wide ReLU Networks

Understanding the properties of neural networks trained via stochastic gradient descent (SGD) is at the heart of the theory of deep learning. In this work, we take a mean-field view, and consider a two-layer ReLU network trained via SGD for a univariate regularized regression problem. Our main result is that SGD is biased towards a simple solution: at convergence, the ReLU network implements a piecewise linear map of the inputs, and the number of "knot" points - i.e., points where the tangent of the ReLU network estimator changes - between two consecutive training inputs is at most three. In particular, as the number of neurons of the network grows, the SGD dynamics is captured by the solution of a gradient flow and, at convergence, the distribution of the weights approaches the unique minimizer of a related free energy, which has a Gibbs form. Our key technical contribution consists in the analysis of the estimator resulting from this minimizer: we show that its second derivative vanishes everywhere, except at some specific locations which represent the "knot" points. We also provide empirical evidence that knots at locations distinct from the data points might occur, as predicted by our theory.

preprint2022arXiv

Polar Coded Computing: The Role of the Scaling Exponent

We consider the problem of coded distributed computing using polar codes. The average execution time of a coded computing system is related to the error probability for transmission over the binary erasure channel in recent work by Soleymani, Jamali and Mahdavifar, where the performance of binary linear codes is investigated. In this paper, we focus on polar codes and unveil a connection between the average execution time and the scaling exponent $μ$ of the family of codes. The scaling exponent has emerged as a central object in the finite-length characterization of polar codes, and it captures the speed of convergence to capacity. In particular, we show that (i) the gap between the normalized average execution time of polar codes and that of optimal MDS codes is $O(n^{-1/μ})$, and (ii) this upper bound can be improved to roughly $O(n^{-1/2})$ by considering polar codes with large kernels. We conjecture that these bounds could be improved to $O(n^{-2/μ})$ and $O(n^{-1})$, respectively, and provide a heuristic argument as well as numerical evidence supporting this view.

preprint2022arXiv

Sharp asymptotics on the compression of two-layer neural networks

In this paper, we study the compression of a target two-layer neural network with N nodes into a compressed network with M<N nodes. More precisely, we consider the setting in which the weights of the target network are i.i.d. sub-Gaussian, and we minimize the population L_2 loss between the outputs of the target and of the compressed network, under the assumption of Gaussian inputs. By using tools from high-dimensional probability, we show that this non-convex problem can be simplified when the target network is sufficiently over-parameterized, and provide the error rate of this approximation as a function of the input dimension and N. In this mean-field limit, the simplified objective, as well as the optimal weights of the compressed network, does not depend on the realization of the target network, but only on expected scaling factors. Furthermore, for networks with ReLU activation, we conjecture that the optimum of the simplified optimization problem is achieved by taking weights on the Equiangular Tight Frame (ETF), while the scaling of the weights and the orientation of the ETF depend on the parameters of the target network. Numerical evidence is provided to support this conjecture.

preprint2022arXiv

The price of ignorance: how much does it cost to forget noise structure in low-rank matrix estimation?

We consider the problem of estimating a rank-1 signal corrupted by structured rotationally invariant noise, and address the following question: how well do inference algorithms perform when the noise statistics is unknown and hence Gaussian noise is assumed? While the matched Bayes-optimal setting with unstructured noise is well understood, the analysis of this mismatched problem is only at its premises. In this paper, we make a step towards understanding the effect of the strong source of mismatch which is the noise statistics. Our main technical contribution is the rigorous analysis of a Bayes estimator and of an approximate message passing (AMP) algorithm, both of which incorrectly assume a Gaussian setup. The first result exploits the theory of spherical integrals and of low-rank matrix perturbations; the idea behind the second one is to design and analyze an artificial AMP which, by taking advantage of the flexibility in the denoisers, is able to "correct" the mismatch. Armed with these sharp asymptotic characterizations, we unveil a rich and often unexpected phenomenology. For example, despite AMP is in principle designed to efficiently compute the Bayes estimator, the former is outperformed by the latter in terms of mean-square error. We show that this performance gap is due to an incorrect estimation of the signal norm. In fact, when the SNR is large enough, the overlaps of the AMP and the Bayes estimator coincide, and they even match those of optimal estimators taking into account the structure of the noise.

preprint2022arXiv

Tight Bounds on the Smallest Eigenvalue of the Neural Tangent Kernel for Deep ReLU Networks

A recent line of work has analyzed the theoretical properties of deep neural networks via the Neural Tangent Kernel (NTK). In particular, the smallest eigenvalue of the NTK has been related to the memorization capacity, the global convergence of gradient descent algorithms and the generalization of deep nets. However, existing results either provide bounds in the two-layer setting or assume that the spectrum of the NTK matrices is bounded away from 0 for multi-layer networks. In this paper, we provide tight bounds on the smallest eigenvalue of NTK matrices for deep ReLU nets, both in the limiting case of infinite widths and for finite widths. In the finite-width setting, the network architectures we consider are fairly general: we require the existence of a wide layer with roughly order of $N$ neurons, $N$ being the number of data samples; and the scaling of the remaining layer widths is arbitrary (up to logarithmic factors). To obtain our results, we analyze various quantities of independent interest: we give lower bounds on the smallest singular value of hidden feature matrices, and upper bounds on the Lipschitz constant of input-output feature maps.

preprint2021arXiv

Approximate Message Passing with Spectral Initialization for Generalized Linear Models

We consider the problem of estimating a signal from measurements obtained via a generalized linear model. We focus on estimators based on approximate message passing (AMP), a family of iterative algorithms with many appealing features: the performance of AMP in the high-dimensional limit can be succinctly characterized under suitable model assumptions; AMP can also be tailored to the empirical distribution of the signal entries, and for a wide class of estimation problems, AMP is conjectured to be optimal among all polynomial-time algorithms. However, a major issue of AMP is that in many models (such as phase retrieval), it requires an initialization correlated with the ground-truth signal and independent from the measurement matrix. Assuming that such an initialization is available is typically not realistic. In this paper, we solve this problem by proposing an AMP algorithm initialized with a spectral estimator. With such an initialization, the standard AMP analysis fails since the spectral estimator depends in a complicated way on the design matrix. Our main contribution is a rigorous characterization of the performance of AMP with spectral initialization in the high-dimensional limit. The key technical idea is to define and analyze a two-phase artificial AMP algorithm that first produces the spectral estimator, and then closely approximates the iterates of the true AMP. We also provide numerical results that demonstrate the validity of the proposed approach.

preprint2020arXiv

Landscape Connectivity and Dropout Stability of SGD Solutions for Over-parameterized Neural Networks

The optimization of multilayer neural networks typically leads to a solution with zero training error, yet the landscape can exhibit spurious local minima and the minima can be disconnected. In this paper, we shed light on this phenomenon: we show that the combination of stochastic gradient descent (SGD) and over-parameterization makes the landscape of multilayer neural networks approximately connected and thus more favorable to optimization. More specifically, we prove that SGD solutions are connected via a piecewise linear path, and the increase in loss along this path vanishes as the number of neurons grows large. This result is a consequence of the fact that the parameters found by SGD are increasingly dropout stable as the network becomes wider. We show that, if we remove part of the neurons (and suitably rescale the remaining ones), the change in loss is independent of the total number of neurons, and it depends only on how many neurons are left. Our results exhibit a mild dependence on the input dimension: they are dimension-free for two-layer networks and depend linearly on the dimension for multilayer networks. We validate our theoretical findings with numerical experiments for different architectures and classification tasks.

preprint2020arXiv

Sublinear Latency for Simplified Successive Cancellation Decoding of Polar Codes

This work analyzes the latency of the simplified successive cancellation (SSC) decoding scheme for polar codes proposed by Alamdar-Yazdi and Kschischang. It is shown that, unlike conventional successive cancellation decoding, where latency is linear in the block length, the latency of SSC decoding is sublinear. More specifically, the latency of SSC decoding is $O(N^{1-1/μ})$, where $N$ is the block length and $μ$ is the scaling exponent of the channel, which captures the speed of convergence of the rate to capacity. Numerical results demonstrate the tightness of the bound and show that most of the latency reduction arises from the parallel decoding of subcodes of rate $0$ or $1$.

preprint2019arXiv

Rate-Flexible Fast Polar Decoders

Polar codes have gained extensive attention during the past few years and recently they have been selected for the next generation of wireless communications standards (5G). Successive-cancellation-based (SC-based) decoders, such as SC list (SCL) and SC flip (SCF), provide a reasonable error performance for polar codes at the cost of low decoding speed. Fast SC-based decoders, such as Fast-SSC, Fast-SSCL, and Fast-SSCF, identify the special constituent codes in a polar code graph off-line, produce a list of operations, store the list in memory, and feed the list to the decoder to decode the constituent codes in order efficiently, thus increasing the decoding speed. However, the list of operations is dependent on the code rate and as the rate changes, a new list is produced, making fast SC-based decoders not rate-flexible. In this paper, we propose a completely rate-flexible fast SC-based decoder by creating the list of operations directly in hardware, with low implementation complexity. We further propose a hardware architecture implementing the proposed method and show that the area occupation of the rate-flexible fast SC-based decoder in this paper is only $38\%$ of the total area of the memory-based base-line decoder when 5G code rates are supported.

preprint2016arXiv

Comparing the Bit-MAP and Block-MAP Decoding Thresholds of Reed-Muller Codes on BMS Channels

The question whether RM codes are capacity-achieving is a long-standing open problem in coding theory that was recently answered in the affirmative for transmission over erasure channels [1], [2]. Remarkably, the proof does not rely on specific properties of RM codes, apart from their symmetry. Indeed, the main technical result consists in showing that any sequence of linear codes, with doubly-transitive permutation groups, achieves capacity on the memoryless erasure channel under bit-MAP decoding. Thus, a natural question is what happens under block-MAP decoding. In [1], [2], by exploiting further symmetries of the code, the bit-MAP threshold was shown to be sharp enough so that the block erasure probability also converges to 0. However, this technique relies heavily on the fact that the transmission is over an erasure channel. We present an alternative approach to strengthen results regarding the bit-MAP threshold to block-MAP thresholds. This approach is based on a careful analysis of the weight distribution of RM codes. In particular, the flavor of the main result is the following: assume that the bit-MAP error probability decays as $N^{-δ}$, for some $δ>0$. Then, the block-MAP error probability also converges to 0. This technique applies to transmission over any binary memoryless symmetric channel. Thus, it can be thought of as a first step in extending the proof that RM codes are capacity-achieving to the general case.

preprint2016arXiv

Reed-Muller Codes Achieve Capacity on Erasure Channels

We introduce a new approach to proving that a sequence of deterministic linear codes achieves capacity on an erasure channel under maximum a posteriori decoding. Rather than relying on the precise structure of the codes our method exploits code symmetry. In particular, the technique applies to any sequence of linear codes where the blocklengths are strictly increasing, the code rates converge, and the permutation group of each code is doubly transitive. In other words, we show that symmetry alone implies near-optimal performance. An important consequence of this result is that a sequence of Reed-Muller codes with increasing blocklength and converging rate achieves capacity. This possibility has been suggested previously in the literature but it has only been proven for cases where the limiting code rate is 0 or 1. Moreover, these results extend naturally to all affine-invariant codes and, thus, to extended primitive narrow-sense BCH codes. This also resolves, in the affirmative, the existence question for capacity-achieving sequences of binary cyclic codes. The primary tools used in the proof are the sharp threshold property for symmetric monotone boolean functions and the area theorem for extrinsic information transfer functions.

preprint2016arXiv

Unified Scaling of Polar Codes: Error Exponent, Scaling Exponent, Moderate Deviations, and Error Floors

Consider the transmission of a polar code of block length $N$ and rate $R$ over a binary memoryless symmetric channel $W$ and let $P_e$ be the block error probability under successive cancellation decoding. In this paper, we develop new bounds that characterize the relationship of the parameters $R$, $N$, $P_e$, and the quality of the channel $W$ quantified by its capacity $I(W)$ and its Bhattacharyya parameter $Z(W)$. In previous work, two main regimes were studied. In the error exponent regime, the channel $W$ and the rate $R<I(W)$ are fixed, and it was proved that the error probability $P_e$ scales roughly as $2^{-\sqrt{N}}$. In the scaling exponent approach, the channel $W$ and the error probability $P_e$ are fixed and it was proved that the gap to capacity $I(W)-R$ scales as $N^{-1/μ}$. Here, $μ$ is called scaling exponent and this scaling exponent depends on the channel $W$. A heuristic computation for the binary erasure channel (BEC) gives $μ=3.627$ and it was shown that, for any channel $W$, $3.579 \le μ\le 5.702$. Our contributions are as follows. First, we provide the tighter upper bound $μ\le 4.714$ valid for any $W$. With the same technique, we obtain $μ\le 3.639$ for the case of the BEC, which approaches very closely its heuristically derived value. Second, we develop a trade-off between the gap to capacity $I(W)-R$ and the error probability $P_e$ as functions of the block length $N$. In other words, we consider a moderate deviations regime in which we study how fast both quantities, as functions of the block length $N$, simultaneously go to $0$. Third, we prove that polar codes are not affected by error floors. To do so, we fix a polar code of block length $N$ and rate $R$. Then, we vary the channel $W$ and we show that the error probability $P_e$ scales as the Bhattacharyya parameter $Z(W)$ raised to a power that scales roughly like $\sqrt{N}$.

preprint2015arXiv

Reed-Muller Codes Achieve Capacity on the Binary Erasure Channel under MAP Decoding

We show that Reed-Muller codes achieve capacity under maximum a posteriori bit decoding for transmission over the binary erasure channel for all rates $0 < R < 1$. The proof is generic and applies to other codes with sufficient amount of symmetry as well. The main idea is to combine the following observations: (i) monotone functions experience a sharp threshold behavior, (ii) the extrinsic information transfer (EXIT) functions are monotone, (iii) Reed--Muller codes are 2-transitive and thus the EXIT functions associated with their codeword bits are all equal, and (iv) therefore the Area Theorem for the average EXIT functions implies that RM codes' threshold is at channel capacity.

preprint2014arXiv

Achieving Marton's Region for Broadcast Channels Using Polar Codes

This paper presents polar coding schemes for the 2-user discrete memoryless broadcast channel (DM-BC) which achieve Marton's region with both common and private messages. This is the best achievable rate region known to date, and it is tight for all classes of 2-user DM-BCs whose capacity regions are known. To accomplish this task, we first construct polar codes for both the superposition as well as the binning strategy. By combining these two schemes, we obtain Marton's region with private messages only. Finally, we show how to handle the case of common information. The proposed coding schemes possess the usual advantages of polar codes, i.e., they have low encoding and decoding complexity and a super-polynomial decay rate of the error probability. We follow the lead of Goela, Abbe, and Gastpar, who recently introduced polar codes emulating the superposition and binning schemes. In order to align the polar indices, for both schemes, their solution involves some degradedness constraints that are assumed to hold between the auxiliary random variables and the channel outputs. To remove these constraints, we consider the transmission of $k$ blocks and employ a chaining construction that guarantees the proper alignment of the polarized indices. The techniques described in this work are quite general, and they can be adopted to many other multi-terminal scenarios whenever there polar indices need to be aligned.

preprint2014arXiv

From Polar to Reed-Muller Codes: a Technique to Improve the Finite-Length Performance

We explore the relationship between polar and RM codes and we describe a coding scheme which improves upon the performance of the standard polar code at practical block lengths. Our starting point is the experimental observation that RM codes have a smaller error probability than polar codes under MAP decoding. This motivates us to introduce a family of codes that "interpolates" between RM and polar codes, call this family ${\mathcal C}_{\rm inter} = \{C_α : α\in [0, 1]\}$, where $C_α \big |_{α= 1}$ is the original polar code, and $C_α \big |_{α= 0}$ is an RM code. Based on numerical observations, we remark that the error probability under MAP decoding is an increasing function of $α$. MAP decoding has in general exponential complexity, but empirically the performance of polar codes at finite block lengths is boosted by moving along the family ${\mathcal C}_{\rm inter}$ even under low-complexity decoding schemes such as, for instance, belief propagation or successive cancellation list decoder. We demonstrate the performance gain via numerical simulations for transmission over the erasure channel as well as the Gaussian channel.

preprint2014arXiv

Scaling Exponent of List Decoders with Applications to Polar Codes

Motivated by the significant performance gains which polar codes experience under successive cancellation list decoding, their scaling exponent is studied as a function of the list size. In particular, the error probability is fixed and the trade-off between block length and back-off from capacity is analyzed. A lower bound is provided on the error probability under $\rm MAP$ decoding with list size $L$ for any binary-input memoryless output-symmetric channel and for any class of linear codes such that their minimum distance is unbounded as the block length grows large. Then, it is shown that under $\rm MAP$ decoding, although the introduction of a list can significantly improve the involved constants, the scaling exponent itself, i.e., the speed at which capacity is approached, stays unaffected for any finite list size. In particular, this result applies to polar codes, since their minimum distance tends to infinity as the block length increases. A similar result is proved for genie-aided successive cancellation decoding when transmission takes place over the binary erasure channel, namely, the scaling exponent remains constant for any fixed number of helps from the genie. Note that since genie-aided successive cancellation decoding might be strictly worse than successive cancellation list decoding, the problem of establishing the scaling exponent of the latter remains open.

Marco Mondelli

What is connected

Connect this record

See the researcher in context

Building this map preview

21 published item(s)

Optimal Representation Size: High-Dimensional Analysis of Pretraining and Linear Probing

Precise Asymptotics for Spectral Methods in Mixed Generalized Linear Models

Sink vs. diagonal patterns as mechanisms for attention switch and oversmoothing prevention

Estimation in Rotationally Invariant Generalized Linear Models via Approximate Message Passing

Fundamental Limits of Two-layer Autoencoders, and Achieving Them with Gradient Methods

Mean-field Analysis of Piecewise Linear Solutions for Wide ReLU Networks

Polar Coded Computing: The Role of the Scaling Exponent

Sharp asymptotics on the compression of two-layer neural networks

The price of ignorance: how much does it cost to forget noise structure in low-rank matrix estimation?

Tight Bounds on the Smallest Eigenvalue of the Neural Tangent Kernel for Deep ReLU Networks

Approximate Message Passing with Spectral Initialization for Generalized Linear Models

Landscape Connectivity and Dropout Stability of SGD Solutions for Over-parameterized Neural Networks

Sublinear Latency for Simplified Successive Cancellation Decoding of Polar Codes

Rate-Flexible Fast Polar Decoders

Comparing the Bit-MAP and Block-MAP Decoding Thresholds of Reed-Muller Codes on BMS Channels

Reed-Muller Codes Achieve Capacity on Erasure Channels

Unified Scaling of Polar Codes: Error Exponent, Scaling Exponent, Moderate Deviations, and Error Floors

Reed-Muller Codes Achieve Capacity on the Binary Erasure Channel under MAP Decoding

Achieving Marton's Region for Broadcast Channels Using Polar Codes

From Polar to Reed-Muller Codes: a Technique to Improve the Finite-Length Performance

Scaling Exponent of List Decoders with Applications to Polar Codes