Researcher profile

Giulio Biroli

Giulio Biroli contributes to research discovery and scholarly infrastructure.

Researcher · Affiliation not imported · Open to collaborate

Trust snapshot

Quick read

Trust 21 (Emerging) · Verification L1 · Unclaimed author
28 works
0 followers
14 topics
4 close collaborators

Published work

28 published items

preprint · 2026 · arXiv

High-Dimensional Analysis of Gradient Flow for Extensive-Width Quadratic Neural Networks

We study the high-dimensional training dynamics of a shallow neural network with quadratic activation in a teacher-student setup. We focus on the extensive-width regime, where the teacher and student network widths scale proportionally with the input dimension, and the sample size grows quadratically. This scaling aims to describe overparameterized neural networks in which feature learning still plays a central role. In the high-dimensional limit, we derive a dynamical characterization of the gradient flow, in the spirit of dynamical mean-field theory (DMFT). Under l2-regularization, we analyze these equations at long times and characterize the performance and spectral properties of the resulting estimator. This result provides a quantitative understanding of the effect of overparameterization on learning and generalization, and reveals a double descent phenomenon in the presence of label noise, where generalization improves beyond interpolation. In the small regularization limit, we obtain an exact expression for the perfect recovery threshold as a function of the network widths, providing a precise characterization of how overparameterization influences recovery.

preprint · 2026 · arXiv

The critical slowing down in diffusion models

Computational sampling has been central to the sciences since the mid-20th century. While machine-learning-based approaches have recently enabled major advances, their behavior remains poorly understood, with limited theoretical control over when and why they succeed. Here we provide such insight for diffusion models (a class of generative schemes highly effective in practice) by analyzing their application to the $O(n)$ model of statistical field theory in the Gaussian limit $n \to \infty$. In this analytically tractable setting, we show that training a score model with a one-layer network architecture matching the exact solution exhibits a form of critical slowing down in parameter learning. This slowing down also impacts the generation process, indicating that the well-known difficulties of sampling near criticality persist even for learned generative models. To overcome this bottleneck, we demonstrate the power of combining architectural depth with physical locality. We find that using a two-layer architecture drastically reduces the critical slowing down, with the training time scaling logarithmically rather than quadratically with system size. By introducing a local score approximation we show that this acceleration in training time can be achieved without increasing the number of neural network parameters. Taken together, these results demonstrate that diffusion models can overcome the critical slowing down through appropriate architectural design, and establish a controlled framework for understanding and improving learned sampling methods in statistical physics and beyond.

preprint · 2023 · arXiv

Landscape Complexity for the Empirical Risk of Generalized Linear Models

We present a method to obtain the average and the typical value of the number of critical points of the empirical risk landscape for generalized linear estimation problems and variants. This represents a substantial extension of previous applications of the Kac-Rice method since it allows one to analyze the critical points of high-dimensional non-Gaussian random functions. Under a technical hypothesis, we obtain a rigorous explicit variational formula for the annealed complexity, which is the logarithm of the average number of critical points at fixed value of the empirical risk. This result is simplified, and extended, using the non-rigorous Kac-Rice replicated method from theoretical physics. In this way we find an explicit variational formula for the quenched complexity, which is generally different from its annealed counterpart, and allows one to obtain the number of critical points for typical instances up to exponential accuracy.

preprint · 2022 · arXiv

Equilibrium Fluctuations in Mean-field Disordered Models

Mean-field models of glasses that present a random first order transition exhibit highly non-trivial fluctuations. Building on previous studies that focused on the critical scaling regime, we here obtain a fully quantitative framework for all equilibrium conditions. By means of the replica method we evaluate Gaussian fluctuations of the overlaps around the thermodynamic limit, decomposing them in thermal fluctuations inside each state and heterogeneous fluctuations between different states. We first test and compare our analytical results with numerical simulation results for the p-spin spherical model and the random orthogonal model, and then analyze the random Lorentz gas. In all cases, a strong quantitative agreement is obtained. Our analysis thus provides a robust scheme for identifying the key finite-size (or finite-dimensional) corrections to the mean-field treatment of these paradigmatic glass models.

preprint · 2022 · arXiv

Local dynamical heterogeneity in glass formers

We study the local dynamical fluctuations in glass-forming models of particles embedded in $d$-dimensional space, in the mean-field limit of $d\to\infty$. Our analytical calculation reveals that single-particle observables, such as squared particle displacements, display divergent fluctuations around the dynamical (or mode-coupling) transition, due to the emergence of nontrivial correlations between displacements along different directions. This effect notably gives rise to a divergent non-Gaussian parameter, $α_2$. The $d\to\infty$ local dynamics therefore becomes quite rich upon approaching the glass transition. The finite-$d$ remnant of this phenomenon further provides a long sought-after, first-principle explanation for the growth of $α_2$ around the glass transition that is not based on multi-particle correlations.

preprint · 2022 · arXiv

Optimal learning rate schedules in high-dimensional non-convex optimization problems

Learning rate schedules are ubiquitously used to speed up and improve optimisation. Many different policies have been introduced on an empirical basis, and theoretical analyses have been developed for convex settings. However, in many realistic problems the loss-landscape is high-dimensional and non-convex, a case for which results are scarce. In this paper we present a first analytical study of the role of learning rate scheduling in this setting, focusing on Langevin optimization with a learning rate decaying as $η(t)=t^{-β}$. We begin by considering models where the loss is a Gaussian random function on the $N$-dimensional sphere ($N\rightarrow \infty$), featuring an extensive number of critical points. We find that to speed up optimization without getting stuck in saddles, one must choose a decay rate $β<1$, contrary to convex setups where $β=1$ is generally optimal. We then add to the problem a signal to be recovered. In this setting, the dynamics decompose into two phases: an exploration phase where the dynamics navigates through rough parts of the landscape, followed by a convergence phase where the signal is detected and the dynamics enter a convex basin. In this case, it is optimal to keep a large learning rate during the exploration phase to escape the non-convex region as quickly as possible, then use the convex criterion $β=1$ to converge rapidly to the solution. Finally, we demonstrate that our conclusions hold in a common regression task involving neural networks.

preprint · 2022 · arXiv

Rare events and disorder control the brittle yielding of well-annealed amorphous solids

We use atomistic computer simulations to provide a microscopic description of the brittle failure of amorphous materials, and we assess the role of rare events and quenched disorder. We argue that brittle yielding originates at rare soft regions, similarly to Griffiths effects in disordered systems. We numerically demonstrate how localized plastic events in such soft regions trigger macroscopic failure via the propagation of a shear band. This physical picture, which no longer holds in poorly annealed ductile materials, allows us to discuss the role of finite size effects in brittle yielding and reinforces the similarities between yielding and other disorder-controlled nonequilibrium phase transitions.

preprint · 2022 · arXiv

The RFOT Theory of Glasses: Recent Progress and Open Issues

The Random First Order Transition (RFOT) theory started with the pioneering work of Kirkpatrick, Thirumalai and Wolynes. It leverages the methods and advances of the theory of disordered systems. It fares remarkably well at reproducing the salient experimental facts of super-cooled liquids. Yet, direct and indisputable experimental validations are missing. In this short survey, we will review recent investigations that broadly support all static aspects of RFOT, but also those for which the standard dynamical extension of the theory appears to be struggling, in particular in relation with facilitation effects. We discuss possible solutions and open issues.

preprint · 2021 · arXiv

Amorphous Order & Non-linear Susceptibilities in Glassy Materials

We review 15 years of theoretical and experimental work on the non-linear response of glassy systems. We argue that an anomalous growth of the peak value of non-linear susceptibilities is a signature of growing "amorphous order" in the system, with spin-glasses as a case in point. Experimental results on supercooled liquids are fully compatible with the RFOT prediction of compact "glassites" of increasing volume as temperature is decreased, or as the system ages. We clarify why such a behaviour is hard to explain within purely kinetic theories of glass formation, despite recent claims to the contrary.

preprint · 2021 · arXiv

Critical behavior of the Anderson model on the Bethe lattice via a large-deviation approach

We present a new large-deviation approach to investigate the critical properties of the Anderson model on the Bethe lattice close to the localization transition in the thermodynamic limit. Our method allows us to study accurately the distribution of the local density of states (LDoS) down to very small probability tails as small as $10^{-50}$ which are completely out of reach for standard numerical techniques. We perform a thorough analysis of the functional form and of the tails of the probability distribution of the LDoS which yields for the first time a direct, transparent, and precise estimation of the correlation volume close to the Anderson transition. Such correlation volume is found to diverge exponentially when the localization is approached from the delocalized regime, in a singular way that is in agreement with the analytic predictions of the supersymmetric treatment.

preprint · 2021 · arXiv

Effects of intraspecific cooperative interactions in large ecosystems

We analyze the role of the Allee effect, a positive correlation between population density and mean individual fitness, for ecological communities formed by a large number of species. Our study is performed using the generalized Lotka-Volterra model with random interactions between species. We obtain the phase diagram and analyze the nature of the multiple equilibria phase. Remarkable differences emerge with respect to the case of the logistic growth case, thus revealing the major role played by the functional response in determining aggregate behaviours of ecosystems.

preprint · 2020 · arXiv

An analytic theory of shallow networks dynamics for hinge loss classification

Neural networks have been shown to perform incredibly well in classification tasks over structured high-dimensional datasets. However, the learning dynamics of such networks is still poorly understood. In this paper we study in detail the training dynamics of a simple type of neural network: a single hidden layer trained to perform a classification task. We show that in a suitable mean-field limit this case maps to a single-node learning problem with a time-dependent dataset determined self-consistently from the average nodes population. We specialize our theory to the prototypical case of a linearly separable dataset and a linear hinge loss, for which the dynamics can be explicitly solved. This allows us to address in a simple setting several phenomena appearing in modern networks such as slowing down of training dynamics, crossover between rich and lazy learning, and overfitting. Finally, we assess the limitations of mean-field theory by studying the case of large but finite number of nodes and of training samples.

preprint · 2020 · arXiv

Anomalous dynamics in the ergodic side of the Many-Body Localization transition and the glassy phase of Directed Polymers in Random Media

Using the non-interacting Anderson tight-binding model on the Bethe lattice as a toy model for the many-body quantum dynamics, we propose a novel and transparent theoretical explanation of the anomalously slow dynamics that emerges in the bad metal phase preceding the Many-Body Localization transition. By mapping the time-decorrelation of many-body wave-functions onto Directed Polymers in Random Media, we show the existence of a glass transition within the extended regime separating a metallic-like phase at small disorder, where delocalization occurs on an exponential number of paths, from a bad metal-like phase at intermediate disorder, where resonances are formed on rare, specific, disorder-dependent site orbitals on very distant generations. The physical interpretation of subdiffusion and non-exponential relaxation emerging from this picture is complementary to the Griffiths one, although both scenarios rely on the presence of heavy-tailed distribution of the escape times. We relate the dynamical evolution in the glassy phase to the depinning transition of Directed Polymers, which results in macroscopic and abrupt jumps of the preferred delocalizing paths when a parameter like the energy is varied, and produces a singular behavior of the overlap correlation function between eigenstates at different energies. By comparing the quantum dynamics on loop-less Cayley trees and Random Regular Graphs we discuss the effect of loops, showing that in the latter slow dynamics and apparent power-laws extend on a very large time-window but are eventually cut-off on a time-scale that diverges at the MBL transition.

preprint · 2020 · arXiv

Attractive versus truncated repulsive supercooled liquids: The dynamics is encoded in the pair correlation function

We compare glassy dynamics in two liquids that differ in the form of their interaction potentials. Both systems have the same repulsive interactions but one has also an attractive part in the potential. These two systems exhibit very different dynamics despite having nearly identical pair correlation functions. We demonstrate that a properly weighted integral of the pair correlation function, which amplifies the subtle differences between the two systems, correctly captures their dynamical differences. The weights are obtained from a standard machine learning algorithm.

preprint · 2020 · arXiv

Complex Dynamics in Simple Neural Networks: Understanding Gradient Flow in Phase Retrieval

Despite the widespread use of gradient-based algorithms for optimizing high-dimensional non-convex functions, understanding their ability of finding good minima instead of being trapped in spurious ones remains to a large extent an open problem. Here we focus on gradient flow dynamics for phase retrieval from random measurements. When the ratio of the number of measurements over the input dimension is small the dynamics remains trapped in spurious minima with large basins of attraction. We find analytically that above a critical ratio those critical points become unstable developing a negative direction toward the signal. By numerical experiments we show that in this regime the gradient flow algorithm is not trapped; it drifts away from the spurious critical points along the unstable direction and succeeds in finding the global minimum. Using tools from statistical physics we characterize this phenomenon, which is related to a BBP-type transition in the Hessian of the spurious minima.

preprint · 2020 · arXiv

Double Trouble in Double Descent : Bias and Variance(s) in the Lazy Regime

Deep neural networks can achieve remarkable generalization performances while interpolating the training data perfectly. Rather than the U-curve emblematic of the bias-variance trade-off, their test error often follows a "double descent", a mark of the beneficial role of overparametrization. In this work, we develop a quantitative theory for this phenomenon in the so-called lazy learning regime of neural networks, by considering the problem of learning a high-dimensional function with random features regression. We obtain a precise asymptotic expression for the bias-variance decomposition of the test error, and show that the bias displays a phase transition at the interpolation threshold, beyond which it remains constant. We disentangle the variances stemming from the sampling of the dataset, from the additive noise corrupting the labels, and from the initialization of the weights. Following up on Geiger et al. 2019, we first show that the latter two contributions are the crux of the double descent: they lead to the overfitting peak at the interpolation threshold and to the decay of the test error upon overparametrization. We then quantify how they are suppressed by ensemble averaging the outputs of K independently initialized estimators. When K is sent to infinity, the test error remains constant beyond the interpolation threshold. We further compare the effects of overparametrizing, ensembling and regularizing. Finally, we present numerical experiments on classic deep learning setups to show that our results hold qualitatively in realistic lazy learning scenarios.

preprint · 2020 · arXiv

Dynamical Mean-Field Theory and Aging Dynamics

Dynamical Mean-Field Theory (DMFT) replaces the many-body dynamical problem with one for a single degree of freedom in a thermal bath whose features are determined self-consistently. By focusing on models with soft disordered $p$-spin interactions, we show how to incorporate the mean-field theory of aging within dynamical mean-field theory. We study cases with only one slow time-scale, corresponding statically to the one-step replica symmetry breaking (1RSB) phase, and cases with an infinite number of slow time-scales, corresponding statically to the full replica symmetry breaking (FRSB) phase. For the former, we show that the effective temperature of the slow degrees of freedom is fixed by requiring critical dynamical behavior on short time-scales, i.e. marginality. For the latter, we find that aging on an infinite number of slow time-scales is governed by a stochastic equation where the clock for dynamical evolution is fixed by the change of effective temperature, hence obtaining a dynamical derivation of the stochastic equation at the basis of the FRSB phase. Our results extend the realm of the mean-field theory of aging to all situations where DMFT holds.

preprint · 2020 · arXiv

Finding the Needle in the Haystack with Convolutions: on the benefits of architectural bias

Despite the phenomenal success of deep neural networks in a broad range of learning tasks, there is a lack of theory to understand the way they work. In particular, Convolutional Neural Networks (CNNs) are known to perform much better than Fully-Connected Networks (FCNs) on spatially structured data: the architectural structure of CNNs benefits from prior knowledge on the features of the data, for instance their translation invariance. The aim of this work is to understand this fact through the lens of dynamics in the loss landscape. We introduce a method that maps a CNN to its equivalent FCN (denoted as eFCN). Such an embedding enables the comparison of CNN and FCN training dynamics directly in the FCN space. We use this method to test a new training protocol, which consists in training a CNN, embedding it to FCN space at a certain "relax time", then resuming the training in FCN space. We observe that for all relax times, the deviation from the CNN subspace is small, and the final performance reached by the eFCN is higher than that reachable by a standard FCN of same architecture. More surprisingly, for some intermediate relax times, the eFCN outperforms the CNN it stemmed from, by combining the prior information of the CNN and the expressivity of the FCN in a complementary way. The practical interest of our protocol is limited by the very large size of the highly sparse eFCN. However, it offers interesting insights into the persistence of architectural bias under stochastic gradient dynamics. It shows the existence of some rare basins in the FCN loss landscape associated with very good generalization. These can only be accessed thanks to the CNN prior, which helps navigate the landscape during the early stages of optimization.

preprint · 2020 · arXiv

How to iron out rough landscapes and get optimal performances: Averaged Gradient Descent and its application to tensor PCA

In many high-dimensional estimation problems the main task consists in minimizing a cost function, which is often strongly non-convex when scanned in the space of parameters to be estimated. A standard solution to flatten the corresponding rough landscape consists in summing the losses associated to different data points and obtain a smoother empirical risk. Here we propose a complementary method that works for a single data point. The main idea is that a large amount of the roughness is uncorrelated in different parts of the landscape. One can then substantially reduce the noise by evaluating an empirical average of the gradient obtained as a sum over many random independent positions in the space of parameters to be optimized. We present an algorithm, called Averaged Gradient Descent, based on this idea and we apply it to tensor PCA, which is a very hard estimation problem. We show that Averaged Gradient Descent over-performs physical algorithms such as gradient descent and approximate message passing and matches the best algorithmic thresholds known so far, obtained by tensor unfolding and methods based on sum-of-squares.

preprint · 2020 · arXiv

Marvels and Pitfalls of the Langevin Algorithm in Noisy High-dimensional Inference

Gradient-descent-based algorithms and their stochastic versions have widespread applications in machine learning and statistical inference. In this work we perform an analytic study of the performances of one of them, the Langevin algorithm, in the context of noisy high-dimensional inference. We employ the Langevin algorithm to sample the posterior probability measure for the spiked matrix-tensor model. The typical behaviour of this algorithm is described by a system of integro-differential equations that we call the Langevin state evolution, whose solution is compared with the one of the state evolution of approximate message passing (AMP). Our results show that, remarkably, the algorithmic threshold of the Langevin algorithm is sub-optimal with respect to the one given by AMP. We conjecture this phenomenon to be due to the residual glassiness present in that region of parameters. Finally we show how a landscape-annealing protocol, which uses the Langevin algorithm but violates the Bayes-optimality condition, can approach the performance of AMP.

preprint · 2020 · arXiv

Role of fluctuations in the yielding transition of two-dimensional glasses

We numerically study yielding in two-dimensional glasses which are generated with a very wide range of stabilities by swap Monte-Carlo simulations and then slowly deformed at zero temperature. We provide strong numerical evidence that stable glasses yield via a nonequilibrium discontinuous transition in the thermodynamic limit. A critical point separates this brittle yielding from the ductile one observed in less stable glasses. We find that two-dimensional glasses yield similarly to their three-dimensional counterparts but display larger sample-to-sample disorder-induced fluctuations, stronger finite-size effects, and rougher spatial wandering of the observed shear bands. These findings strongly constrain effective theories of yielding.

preprint · 2020 · arXiv

Searching for the Gardner transition in glassy glycerol

We search for a Gardner transition in glassy glycerol, a standard molecular glass, measuring the third harmonics cubic susceptibility $χ_3^{(3)}$ from slightly below the usual glass transition temperature down to $10K$. According to the mean field picture, if local motion within the glass were becoming highly correlated due to the emergence of a Gardner phase then $χ_3^{(3)}$, which is analogous to the dynamical spin-glass susceptibility, should increase and diverge at the Gardner transition temperature $T_G$. We find instead that upon cooling $| χ_3^{(3)} |$ decreases by several orders of magnitude and becomes roughly constant in the regime $100K-10K$. We rationalize our findings by assuming that the low temperature physics is described by localized excitations weakly interacting via a spin-glass dipolar pairwise interaction in a random magnetic field. Our quantitative estimations show that the spin-glass interaction is twenty to fifty times smaller than the local random field contribution, thus rationalizing the absence of the spin-glass Gardner phase. This hints at the fact that a Gardner phase may be suppressed in standard molecular glasses, but it also suggests ways to favor its existence in other amorphous solids and by changing the preparation protocol.

preprint · 2020 · arXiv

Triple descent and the two kinds of overfitting: Where & why do they appear?

A recent line of research has highlighted the existence of a "double descent" phenomenon in deep learning, whereby increasing the number of training examples $N$ causes the generalization error of neural networks to peak when $N$ is of the same order as the number of parameters $P$. In earlier works, a similar phenomenon was shown to exist in simpler models such as linear regression, where the peak instead occurs when $N$ is equal to the input dimension $D$. Since both peaks coincide with the interpolation threshold, they are often conflated in the literature. In this paper, we show that despite their apparent similarity, these two scenarios are inherently different. In fact, both peaks can co-exist when neural networks are applied to noisy regression tasks. The relative size of the peaks is then governed by the degree of nonlinearity of the activation function. Building on recent developments in the analysis of random feature models, we provide a theoretical ground for this sample-wise triple descent. As shown previously, the nonlinear peak at $N\!=\!P$ is a true divergence caused by the extreme sensitivity of the output function to both the noise corrupting the labels and the initialization of the random features (or the weights in neural networks). This peak survives in the absence of noise, but can be suppressed by regularization. In contrast, the linear peak at $N\!=\!D$ is solely due to overfitting the noise in the labels, and forms earlier during training. We show that this peak is implicitly regularized by the nonlinearity, which is why it only becomes salient at high noise and is weakly affected by explicit regularization. Throughout the paper, we compare analytical results obtained in the random feature model with the outcomes of numerical experiments involving deep neural networks.

preprint · 2020 · arXiv

Who is Afraid of Big Bad Minima? Analysis of Gradient-Flow in a Spiked Matrix-Tensor Model

Gradient-based algorithms are effective for many machine learning tasks, but despite ample recent effort and some progress, it often remains unclear why they work in practice in optimising high-dimensional non-convex functions and why they find good minima instead of being trapped in spurious ones. Here we present a quantitative theory explaining this behaviour in a spiked matrix-tensor model. Our framework is based on the Kac-Rice analysis of stationary points and a closed-form analysis of gradient-flow originating from statistical physics. We show that there is a well defined region of parameters where the gradient-flow algorithm finds a good global minimum despite the presence of exponentially many spurious local minima. We show that this is achieved by surfing on saddles that have strong negative direction towards the global minima, a phenomenon that is connected to a BBP-type threshold in the Hessian describing the critical points of the landscapes.

preprint · 2019 · arXiv

A jamming transition from under- to over-parametrization affects loss landscape and generalization

We argue that in fully-connected networks a phase transition delimits the over- and under-parametrized regimes where fitting can or cannot be achieved. Under some general conditions, we show that this transition is sharp for the hinge loss. In the whole over-parametrized regime, poor minima of the loss are not encountered during training since the number of constraints to satisfy is too small to hamper minimization. Our findings support a link between this transition and the generalization properties of the network: as we increase the number of parameters of a given model, starting from an under-parametrized network, we observe that the generalization error displays three phases: (i) initial decay, (ii) increase until the transition point, where it displays a cusp, and (iii) slow decay toward a constant for the rest of the over-parametrized regime. Thereby we identify the region where the classical phenomenon of over-fitting takes place, and the region where the model keeps improving, in line with previous empirical observations for modern neural networks.

preprint · 2019 · arXiv

Numerical implementation of dynamical mean field theory for disordered systems: application to the Lotka-Volterra model of ecosystems

Dynamical mean field theory (DMFT) is a tool that allows one to analyze the stochastic dynamics of $N$ interacting degrees of freedom in terms of a self-consistent $1$-body problem. In this work, focusing on models of ecosystems, we present the derivation of DMFT through the dynamical cavity method, and we develop a method for solving it numerically. Our numerical procedure can be applied to a large variety of systems for which DMFT holds. We implement and test it for the generalized random Lotka-Volterra model, and show that complex dynamical regimes characterized by chaos and aging can be captured and studied by this framework.

preprint · 2019 · arXiv

Scaling description of generalization with number of parameters in deep learning

Supervised deep learning involves the training of neural networks with a large number $N$ of parameters. For large enough $N$, in the so-called over-parametrized regime, one can essentially fit the training data points. Sparsity-based arguments would suggest that the generalization error increases as $N$ grows past a certain threshold $N^{*}$. Instead, empirical studies have shown that in the over-parametrized regime, generalization error keeps decreasing with $N$. We resolve this paradox through a new framework. We rely on the so-called Neural Tangent Kernel, which connects large neural nets to kernel methods, to show that the initialization causes finite-size random fluctuations $\|f_{N}-\bar{f}_{N}\|\sim N^{-1/4}$ of the neural net output function $f_{N}$ around its expectation $\bar{f}_{N}$. These affect the generalization error $ε_{N}$ for classification: under natural assumptions, it decays to a plateau value $ε_{\infty}$ in a power-law fashion $\sim N^{-1/2}$. This description breaks down at a so-called jamming transition $N=N^{*}$. At this threshold, we argue that $\|f_{N}\|$ diverges. This result leads to a plausible explanation for the cusp in test error known to occur at $N^{*}$. Our results are confirmed by extensive empirical observations on the MNIST and CIFAR image datasets. Our analysis finally suggests that, given a computational envelope, the smallest generalization error is obtained using several networks of intermediate sizes, just beyond $N^{*}$, and averaging their outputs.