Source author record

Boris Hanin

Boris Hanin appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

math-ph, math.MP, math.PR, math.SP, Machine Learning, math.DG, math.AP, math.CV, Artificial Intelligence, hep-th, math.FA, math.NA, Numerical Analysis

Catalog footprint

What is connected

17 works

13 topics

4 close collaborators


preprint · 2026 · arXiv

Hyperparameter Transfer for Dense Associative Memories

Dense Associative Memory (DenseAM) is a promising family of AI architectures that is represented by a neural network performing temporal dynamics on an energy landscape. While hyperparameter transfer methods are well-studied for feed-forward networks, these methods have not been developed for settings in which weights are shared across layers and within the layer, which is common in DenseAMs. Additionally, DenseAMs utilize rapidly peaking activation functions that are rarely used in feed-forward architectures. The confluence of these aspects makes DenseAM a challenging framework for using existing methods for hyperparameter transfer. Our work initiates the development of hyperparameter transfer methods for this class of models. We derive explicit prescriptions for how the hyperparameters tuned on small models can be transferred to models trained at scale. We demonstrate excellent agreement between these theoretical findings and empirical results.

preprint · 2026 · arXiv

Learning Rate Transfer in Normalized Transformers

The Normalized Transformer, or nGPT (arXiv:2410.01131), achieves impressive training speedups and does not require weight decay or learning rate warmup. However, despite having hyperparameters that explicitly scale with model size, we observe that nGPT does not exhibit learning rate transfer across model dimension and token horizon. To rectify this, we combine numerical experiments with a principled use of alignment exponents (arXiv:2407.05872) to revisit and modify the $μ$P approach to hyperparameter transfer (arXiv:2011.14522). The result is a novel nGPT parameterization we call $ν$GPT. Through extensive empirical validation, we find $ν$GPT exhibits learning rate transfer across width, depth, and token horizon.

preprint · 2023 · arXiv

Random Fully Connected Neural Networks as Perturbatively Solvable Hierarchies

This article considers fully connected neural networks with Gaussian random weights and biases as well as $L$ hidden layers, each of width proportional to a large parameter $n$. For polynomially bounded non-linearities we give sharp estimates in powers of $1/n$ for the joint cumulants of the network output and its derivatives. Moreover, we show that network cumulants form a perturbatively solvable hierarchy in powers of $1/n$ in that $k$-th order cumulants in one layer have recursions that depend to leading order in $1/n$ only on $j$-th order cumulants at the previous layer with $j\leq k$. By solving a variety of such recursions, however, we find that the depth-to-width ratio $L/n$ plays the role of an effective network depth, controlling both the scale of fluctuations at individual neurons and the size of inter-neuron correlations. Thus, while the cumulant recursions we derive form a hierarchy in powers of $1/n$, contributions of order $1/n^k$ often grow like $L^k$ and are hence non-negligible at positive $L/n$. We use this to study a somewhat simplified version of the exploding and vanishing gradient problem, proving that this particular variant occurs if and only if $L/n$ is large. Several key ideas in this article were first developed at a physics level of rigor in a recent monograph of Daniel A. Roberts, Sho Yaida, and the author. This article not only makes these ideas mathematically precise but also significantly extends them, opening the way to obtaining corrections to all orders in $1/n$.

preprint · 2021 · arXiv

The Principles of Deep Learning Theory

This book develops an effective theory approach to understanding deep neural networks of practical relevance. Beginning from a first-principles component-level picture of networks, we explain how to determine an accurate description of the output of trained networks by solving layer-to-layer iteration equations and nonlinear learning dynamics. A main result is that the predictions of networks are described by nearly-Gaussian distributions, with the depth-to-width aspect ratio of the network controlling the deviations from the infinite-width Gaussian description. We explain how these effectively-deep networks learn nontrivial representations from training and more broadly analyze the mechanism of representation learning for nonlinear models. From a nearly-kernel-methods perspective, we find that the dependence of such models' predictions on the underlying learning algorithm can be expressed in a simple and universal way. To obtain these results, we develop the notion of representation group flow (RG flow) to characterize the propagation of signals through the network. By tuning networks to criticality, we give a practical solution to the exploding and vanishing gradient problem. We further explain how RG flow leads to near-universal behavior and lets us categorize networks built from different activation functions into universality classes. Altogether, we show that the depth-to-width ratio governs the effective model complexity of the ensemble of trained networks. By using information-theoretic techniques, we estimate the optimal aspect ratio at which we expect the network to be practically most useful and show how residual connections can be used to push this scale to arbitrary depths. With these tools, we can learn in detail about the inductive bias of architectures, hyperparameters, and optimizers.

preprint · 2020 · arXiv

Local Universality for Zeros and Critical Points of Monochromatic Random Waves

This paper concerns the asymptotic behavior of zeros and critical points for monochromatic random waves $ϕ_λ$ of frequency $λ$ on a compact, smooth, Riemannian manifold $(M,g)$ as $λ\rightarrow \infty$. We prove that the measure of integration over the zero set of $ϕ_λ$ restricted to balls of radius $\approx λ^{-1}$ converges in distribution to the measure of integration over the zero set of a frequency $1$ random wave on $\mathbb R^n$, where $n$ is the dimension of $M$. We also prove convergence of finite moments for the counting measure of the critical points of $ϕ_λ$, again restricted to balls of radius $\approx λ^{-1}$, to the corresponding moments for frequency $1$ random waves. We then patch together these local results to obtain new global variance estimates on the volume of the zero set and numbers of critical points of $ϕ_λ$ on all of $M.$ Our local results hold under conditions about the structure of geodesics on $M$ that are generic in the space of all metrics on $M$, while our global results hold whenever $(M,g)$ has no conjugate points (e.g. is negatively curved).

preprint · 2020 · arXiv

Neural Network Approximation

Neural Networks (NNs) are the method of choice for building learning algorithms. Their popularity stems from their empirical success on several challenging learning problems. However, most scholars agree that a convincing theoretical explanation for this success is still lacking. This article surveys the known approximation properties of the outputs of NNs with the aim of uncovering the properties that are not present in the more traditional methods of approximation used in numerical analysis. Comparisons are made with traditional approximation methods from the viewpoint of rate distortion. Another major component in the analysis of numerical approximation is the computational time needed to construct the approximation and this in turn is intimately connected with the stability of the approximation algorithm. So the stability of numerical approximation using NNs is a large part of the analysis put forward. The survey, for the most part, is concerned with NNs using the popular ReLU activation function. In this case, the outputs of the NNs are piecewise linear functions on rather complicated partitions of the domain of $f$ into cells that are convex polytopes. When the architecture of the NN is fixed and the parameters are allowed to vary, the set of output functions of the NN is a parameterized nonlinear manifold. It is shown that this manifold has certain space filling properties leading to an increased ability to approximate (better rate distortion) but at the expense of numerical stability. The space filling creates a challenge to the numerical method in finding best or good parameter choices when trying to approximate.

preprint · 2018 · arXiv

Products of Many Large Random Matrices and Gradients in Deep Neural Networks

We study products of random matrices in the regime where the number of terms and the size of the matrices simultaneously tend to infinity. Our main theorem is that the logarithm of the $\ell_2$ norm of such a product applied to any fixed vector is asymptotically Gaussian. The fluctuations we find can be thought of as a finite temperature correction to the limit in which first the size and then the number of matrices tend to infinity. Depending on the scaling limit considered, the mean and variance of the limiting Gaussian depend only on either the first two or the first four moments of the measure from which matrix entries are drawn. We also obtain explicit error bounds on the moments of the norm and the Kolmogorov-Smirnov distance to a Gaussian. Finally, we apply our result to obtain precise information about the stability of gradients in randomly initialized deep neural networks with ReLU activations. This provides a quantitative measure of the extent to which the exploding and vanishing gradient problem occurs in a fully connected neural network with ReLU activations and a given architecture.
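The Gaussian behavior of the log-norm described in this abstract can be explored numerically. The following is a minimal sketch, not the paper's method: it assumes i.i.d. Gaussian entries of variance $1/n$, and the function names `log_norm_of_product` and `sample_log_norms`, the matrix size, the number of factors, and the seed are all illustrative choices. For Gaussian entries a chi-squared calculation suggests a mean near $-L/(2n)$ and a variance near $L/(2n)$ for $L$ factors of size $n$, which the sample statistics can be compared against.

```python
import math
import random
import statistics


def log_norm_of_product(n, depth, rng):
    """Return log ||W_depth ... W_1 v||_2 for i.i.d. N(0, 1/n) matrix entries.

    Illustrative helper (hypothetical name): v is a fixed unit vector, and the
    vector is renormalized after every factor so nothing under- or overflows.
    """
    v = [1.0 / math.sqrt(n)] * n   # fixed unit vector
    sigma = 1.0 / math.sqrt(n)     # entry standard deviation (assumed scaling)
    log_norm = 0.0
    for _ in range(depth):
        # Apply a fresh random n x n matrix, generated one row at a time.
        v = [sum(rng.gauss(0.0, sigma) * x for x in v) for _ in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        log_norm += math.log(norm)   # accumulate the log of the growth factor
        v = [x / norm for x in v]    # renormalize to unit length
    return log_norm


def sample_log_norms(n=16, depth=16, trials=300, seed=0):
    """Draw independent samples of the log-norm of a random matrix product."""
    rng = random.Random(seed)
    return [log_norm_of_product(n, depth, rng) for _ in range(trials)]


samples = sample_log_norms()
mean = statistics.fmean(samples)
var = statistics.pvariance(samples)
print(f"sample mean = {mean:.3f}, sample variance = {var:.3f}")
```

Plotting a histogram of `samples` for several values of `n` and `depth` with `depth / n` held fixed is one way to see the approximately Gaussian shape; other entry distributions can be swapped in for `rng.gauss` to probe the dependence on higher moments that the abstract mentions.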

preprint · 2016 · arXiv

C-infinity Scaling Asymptotics for the Spectral Function of the Laplacian

This article concerns new off-diagonal estimates on the remainder and its derivatives in the pointwise Weyl law on a compact n-dimensional Riemannian manifold. As an application, we prove that near any non self-focal point, the scaling limit of the spectral projector of the Laplacian onto frequency windows of constant size is a normalized Bessel function depending only on n.

preprint · 2016 · arXiv

Nodal Sets of Smooth Functions with Finite Vanishing Order and p-Sweepouts

We show that on a compact Riemannian manifold $(M,g)$, nodal sets of linear combinations of any $p+1$ smooth functions form an admissible $p$-sweepout provided these linear combinations have uniformly bounded vanishing order. This applies in particular to finite linear combinations of Laplace eigenfunctions. As a result, we obtain a new proof of the Gromov, Guth, Marques--Neves upper bounds on the min-max $p$-widths of $M.$ We also prove that close to a point at which a smooth function on $\mathbb{R}^{n+1}$ vanishes to order $k$, its nodal set is contained in the union of $k$ $W^{1,p}$ graphs for some $p > 1$. This implies that the nodal set is locally countably $n$-rectifiable and has locally finite $\mathcal{H}^n$ measure, facts which also follow from a previous result of Bär. Finally, we prove the continuity of the Hausdorff measure of nodal sets under heat flow.

preprint · 2016 · arXiv

Pairing of Zeros and Critical Points for Random Polynomials

Let p_N be a random degree N polynomial in one complex variable whose zeros are chosen independently from a fixed probability measure mu on the Riemann sphere S^2. This article proves that if we condition p_N to have a zero at some fixed point xi in S^2, then, with high probability, there will be a critical point w_xi a distance 1/N away from xi. This 1/N distance is much smaller than the one over root N typical spacing between nearest neighbors for N i.i.d. points on S^2. Moreover, with the same high probability, the argument of w_xi relative to xi is a deterministic function of mu plus fluctuations on the order of 1/N.
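A heuristic for the 1/N scale, offered as a sketch rather than as the paper's argument: critical points away from the zeros solve $p_N'/p_N = 0$, and the logarithmic derivative splits into the contribution of the conditioned zero and the remaining $N-1$ zeros $z_j$. Assuming a law-of-large-numbers approximation for the sum, and writing $\phi_\mu$ (a shorthand introduced here, assumed finite and nonzero at $\xi$) for the Cauchy transform of $\mu$ in an affine coordinate, one gets:

```latex
\frac{p_N'(w)}{p_N(w)}
  = \frac{1}{w-\xi} + \sum_{j=1}^{N-1} \frac{1}{w-z_j} = 0
\quad\Longrightarrow\quad
w - \xi
  = -\Bigl(\sum_{j=1}^{N-1} \frac{1}{w-z_j}\Bigr)^{-1}
  \approx -\frac{1}{(N-1)\,\phi_\mu(\xi)},
\qquad
\phi_\mu(\xi) := \int \frac{d\mu(z)}{\xi - z}.
```

Under these assumptions the distance $|w-\xi|$ is of order $1/N$, and the direction of $w-\xi$ is fixed by the argument of $\phi_\mu(\xi)$, which matches the abstract's statement that the argument is a deterministic function of mu up to small fluctuations.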

preprint · 2016 · arXiv

Scaling of Harmonic Oscillator Eigenfunctions and Their Nodal Sets Around the Caustic

We study the scaling asymptotics of the eigenspace projection kernels $Π_{\hbar, E}(x,y)$ of the isotropic Harmonic Oscillator $-\hbar^2 Δ + |x|^2$ of eigenvalue $E = \hbar(N + \frac{d}{2})$ in the semi-classical limit $\hbar \to 0$. The principal result is an explicit formula for the scaling asymptotics of $Π_{\hbar, E}(x,y)$ for $x,y$ in a $\hbar^{2/3}$ neighborhood of the caustic $\mathcal C_E$ as $\hbar \to 0.$ The scaling asymptotics are applied to the distribution of nodal sets of Gaussian random eigenfunctions around the caustic as $\hbar \to 0$. In previous work we proved that the density of zeros of Gaussian random eigenfunctions of $\hat{H}_{\hbar}$ has different orders in the Planck constant $\hbar$ in the allowed and forbidden regions: in the allowed region the density is of order $\hbar^{-1}$ while it is $\hbar^{-1/2}$ in the forbidden region. Our main result on nodal sets is that the density of zeros is of order $\hbar^{-\frac{2}{3}}$ in an $\hbar^{\frac{2}{3}}$-tube around the caustic. This tube radius is the 'critical radius'. For annuli of larger inner and outer radii $\hbar^α$ with $0< α< \frac{2}{3}$ we obtain density results which interpolate between this critical radius result and our prior ones in the allowed and forbidden region. We also show that the Hausdorff $(d-2)$-dimensional measure of the intersection of the nodal set with the caustic is of order $\hbar^{- \frac{2}{3}}$.

preprint · 2015 · arXiv

Scaling Limit for the Kernel of the Spectral Projector and Remainder Estimates in the Pointwise Weyl Law

Let (M, g) be a compact smooth Riemannian manifold. We obtain new off-diagonal estimates as λ tends to infinity for the remainder in the pointwise Weyl Law for the kernel of the spectral projector of the Laplacian onto functions with frequency at most λ. A corollary is that, when rescaled around a non self-focal point, the kernel of the spectral projector onto the frequency interval (λ, λ+ 1] has a universal scaling limit as λ goes to infinity (depending only on the dimension of M). Our results also imply that if M has no conjugate points, then immersions of M into Euclidean space by an orthonormal basis of eigenfunctions with frequencies in (λ, λ+ 1] are embeddings for all λ sufficiently large.

preprint · 2014 · arXiv

High Frequency Eigenfunction Immersions and Supremum Norms of Random Waves

A compact Riemannian manifold may be immersed into Euclidean space by using high frequency Laplace eigenfunctions. We study the geometry of the manifold viewed as a metric space endowed with the distance function from the ambient Euclidean space. As an application we give a new proof of a result of Burq-Lebeau and others on upper bounds for the sup-norms of random linear combinations of high frequency eigenfunctions.

preprint · 2014 · arXiv

Mean of the $L^\infty$-norm for $L^2$-normalized random waves on compact aperiodic Riemannian manifolds

This article concerns upper bounds for $L^\infty$-norms of random approximate eigenfunctions of the Laplace operator on a compact aperiodic Riemannian manifold $(M,g).$ We study $f_λ$ chosen uniformly at random from the space of $L^2$-normalized linear combinations of Laplace eigenfunctions with eigenvalues in the interval $(λ^2, (λ+1)^2].$ Our main result is that the expected value of $\|f_λ\|_\infty$ grows at most like $C \sqrt{\log λ}$ as $λ\to \infty$, where $C$ is an explicit constant depending only on the dimension and volume of $(M,g).$ In addition, we obtain concentration of the $L^\infty$-norm around its mean and median and study the analogous problems for Gaussian random waves on $(M,g).$

preprint · 2013 · arXiv

Nodal Sets of Random Eigenfunctions for the Isotropic Harmonic Oscillator

We consider Gaussian random eigenfunctions (Hermite functions) of fixed energy level of the isotropic semi-classical Harmonic Oscillator on ${\bf R}^n$. We calculate the expected density of zeros of a random eigenfunction in the semi-classical limit $h \to 0.$ In the allowed region the density is of order $h^{-1},$ while in the forbidden region the density is of order $h^{-\frac{1}{2}}$. The computer graphics due to E.J. Heller illustrate this difference in "frequency" between the allowed and forbidden nodal sets.

preprint · 2013 · arXiv

Pairing of Zeros and Critical Points for Random Meromorphic Functions on Riemann Surfaces

We prove that zeros and critical points of a random polynomial $p_N$ of degree $N$ in one complex variable appear in pairs. More precisely, if $p_N$ is conditioned to have $p_N(ξ)=0$ for a fixed $ξ\in \mathbb{C}\setminus\{0\},$ we prove that there is a unique critical point $z$ in the annulus $N^{-1-\epsilon}<|z-ξ|< N^{-1+\epsilon}$ and no critical points closer to $ξ$ with probability at least $1-O(N^{-3/2+3\epsilon}).$ We also prove an analogous statement in the more general setting of random meromorphic functions on a closed Riemann surface.

preprint · 2012 · arXiv

Correlations and Pairing Between Zeros and Critical Points of Gaussian Random Polynomials

We study the asymptotics of correlations and nearest neighbor spacings between zeros and holomorphic critical points of $p_N$, a degree $N$ Hermitian Gaussian random polynomial in the sense of Shiffman and Zelditch, as $N$ goes to infinity. By holomorphic critical point we mean a solution to the equation $\frac{d}{dz}p_N(z)=0.$ Our principal result is an explicit asymptotic formula for the local scaling limit of $\mathbb{E}[Z_{p_N}\wedge C_{p_N}],$ the expected joint intensity of zeros and critical points, around any point on the Riemann sphere. Here $Z_{p_N}$ and $C_{p_N}$ are the currents of integration (i.e. counting measures) over the zeros and critical points of $p_N$, respectively. We prove that correlations between zeros and critical points are short range, decaying like $e^{-N|z-w|^2}.$ With $|z-w|$ on the order of $N^{-1/2},$ however, $\mathbb{E}[Z_{p_N}\wedge C_{p_N}](z,w)$ is sharply peaked near $z=w,$ causing zeros and critical points to appear in rigid pairs. We compute tight bounds on the expected distance and angular dependence between a critical point and its paired zero.
