Source author record

Jeffrey Pennington

Jeffrey Pennington appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher (unclaimed source record)

Machine Learning, hep-th, hep-ph, math.OC, Neural and Evolutionary Computing, math.PR, math.ST, Statistics Theory, Artificial Intelligence

Catalog footprint

What is connected

18 works

9 topics

4 close collaborators

preprint, 2022, arXiv

Homogenization of SGD in high-dimensions: Exact dynamics and generalization properties

We develop a stochastic differential equation, called homogenized SGD, for analyzing the dynamics of stochastic gradient descent (SGD) on a high-dimensional random least squares problem with $\ell^2$-regularization. We show that homogenized SGD is the high-dimensional equivalence of SGD -- for any quadratic statistic (e.g., population risk with quadratic loss), the statistic under the iterates of SGD converges to the statistic under homogenized SGD when the number of samples $n$ and number of features $d$ are polynomially related ($d^c < n < d^{1/c}$ for some $c > 0$). By analyzing homogenized SGD, we provide exact non-asymptotic high-dimensional expressions for the generalization performance of SGD in terms of a solution of a Volterra integral equation. Further we provide the exact value of the limiting excess risk in the case of quadratic losses when trained by SGD. The analysis is formulated for data matrices and target vectors that satisfy a family of resolvent conditions, which can roughly be viewed as a weak (non-quantitative) form of delocalization of sample-side singular vectors of the data. Several motivating applications are provided including sample covariance matrices with independent samples and random features with non-generative model targets.
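As a rough illustration of the setting in the abstract above, the following sketch runs plain multi-pass SGD on a small random least squares problem with $\ell^2$-regularization. It is not the homogenized SGD construction from the paper; the problem sizes, step size, ridge strength and the helper names (`make_problem`, `ridge_risk`, `sgd`) are assumptions chosen only for illustration.

```python
import random


def make_problem(n, d, seed=0):
    """Random data matrix with Gaussian entries and noiseless linear targets."""
    rng = random.Random(seed)
    w_star = [rng.gauss(0.0, 1.0) for _ in range(d)]
    X = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    y = [sum(xi * wi for xi, wi in zip(row, w_star)) for row in X]
    return X, y


def ridge_risk(w, X, y, delta):
    """Empirical risk: mean of 0.5 * residual^2 plus 0.5 * delta * ||w||^2."""
    n = len(X)
    fit = sum((sum(xi * wi for xi, wi in zip(row, w)) - yi) ** 2
              for row, yi in zip(X, y)) / (2.0 * n)
    return fit + 0.5 * delta * sum(wi * wi for wi in w)


def sgd(X, y, delta, lr, steps, seed=1):
    """Single-sample SGD from zero initialization, sampling rows with replacement."""
    rng = random.Random(seed)
    d = len(X[0])
    w = [0.0] * d
    risks = [ridge_risk(w, X, y, delta)]
    for _ in range(steps):
        i = rng.randrange(len(X))
        resid = sum(xi * wi for xi, wi in zip(X[i], w)) - y[i]
        # Stochastic gradient of the regularized objective at sample i.
        w = [wi - lr * (resid * xi + delta * wi) for wi, xi in zip(w, X[i])]
        risks.append(ridge_risk(w, X, y, delta))
    return w, risks


X, y = make_problem(n=50, d=5)
w, risks = sgd(X, y, delta=0.01, lr=0.01, steps=2000)
print(f"initial risk {risks[0]:.4f}, final risk {risks[-1]:.6f}")
```

The `risks` list is the kind of quadratic statistic whose trajectory the paper characterizes exactly in high dimensions; here it is simply measured on one small random instance.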

preprint, 2022, arXiv

Implicit Regularization or Implicit Conditioning? Exact Risk Trajectories of SGD in High Dimensions

Stochastic gradient descent (SGD) is a pillar of modern machine learning, serving as the go-to optimization algorithm for a diverse array of problems. While the empirical success of SGD is often attributed to its computational efficiency and favorable generalization behavior, neither effect is well understood and disentangling them remains an open problem. Even in the simple setting of convex quadratic problems, worst-case analyses give an asymptotic convergence rate for SGD that is no better than full-batch gradient descent (GD), and the purported implicit regularization effects of SGD lack a precise explanation. In this work, we study the dynamics of multi-pass SGD on high-dimensional convex quadratics and establish an asymptotic equivalence to a stochastic differential equation, which we call homogenized stochastic gradient descent (HSGD), whose solutions we characterize explicitly in terms of a Volterra integral equation. These results yield precise formulas for the learning and risk trajectories, which reveal a mechanism of implicit conditioning that explains the efficiency of SGD relative to GD. We also prove that the noise from SGD negatively impacts generalization performance, ruling out the possibility of any type of implicit regularization in this context. Finally, we show how to adapt the HSGD formalism to include streaming SGD, which allows us to produce an exact prediction for the excess risk of multi-pass SGD relative to that of streaming SGD (bootstrap risk).

preprint, 2022, arXiv

Synergy and Symmetry in Deep Learning: Interactions between the Data, Model, and Inference Algorithm

Although learning in high dimensions is commonly believed to suffer from the curse of dimensionality, modern machine learning methods often exhibit an astonishing power to tackle a wide range of challenging real-world learning problems without using abundant amounts of data. How exactly these methods break this curse remains a fundamental open question in the theory of deep learning. While previous efforts have investigated this question by studying the data (D), model (M), and inference algorithm (I) as independent modules, in this paper, we analyze the triplet (D, M, I) as an integrated system and identify important synergies that help mitigate the curse of dimensionality. We first study the basic symmetries associated with various learning algorithms (M, I), focusing on four prototypical architectures in deep learning: fully-connected networks (FCN), locally-connected networks (LCN), and convolutional networks with and without pooling (GAP/VEC). We find that learning is most efficient when these symmetries are compatible with those of the data distribution and that performance significantly deteriorates when any member of the (D, M, I) triplet is inconsistent or suboptimal.

preprint, 2022, arXiv

Wide Bayesian neural networks have a simple weight posterior: theory and accelerated sampling

We introduce repriorisation, a data-dependent reparameterisation which transforms a Bayesian neural network (BNN) posterior to a distribution whose KL divergence to the BNN prior vanishes as layer widths grow. The repriorisation map acts directly on parameters, and its analytic simplicity complements the known neural network Gaussian process (NNGP) behaviour of wide BNNs in function space. Exploiting the repriorisation, we develop a Markov chain Monte Carlo (MCMC) posterior sampling algorithm which mixes faster the wider the BNN. This contrasts with the typically poor performance of MCMC in high dimensions. We observe up to 50x higher effective sample size relative to no reparametrisation for both fully-connected and residual networks. Improvements are achieved at all widths, with the margin between reparametrised and standard BNNs growing with layer width.

preprint, 2020, arXiv

Bayesian Deep Convolutional Networks with Many Channels are Gaussian Processes

There is a previously identified equivalence between wide fully connected neural networks (FCNs) and Gaussian processes (GPs). This equivalence enables, for instance, test set predictions that would have resulted from a fully Bayesian, infinitely wide trained FCN to be computed without ever instantiating the FCN, but by instead evaluating the corresponding GP. In this work, we derive an analogous equivalence for multi-layer convolutional neural networks (CNNs) both with and without pooling layers, and achieve state of the art results on CIFAR10 for GPs without trainable kernels. We also introduce a Monte Carlo method to estimate the GP corresponding to a given neural network architecture, even in cases where the analytic form has too many terms to be computationally feasible. Surprisingly, in the absence of pooling layers, the GPs corresponding to CNNs with and without weight sharing are identical. As a consequence, translation equivariance, beneficial in finite channel CNNs trained with stochastic gradient descent (SGD), is guaranteed to play no role in the Bayesian treatment of the infinite channel limit, a qualitative difference between the two regimes that is not present in the FCN case. We confirm experimentally that, while in some scenarios the performance of SGD-trained finite CNNs approaches that of the corresponding GPs as the channel count increases, with careful tuning SGD-trained CNNs can significantly outperform their corresponding GPs, suggesting advantages from SGD training compared to fully Bayesian parameter estimation.
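The Monte Carlo idea mentioned above can be illustrated on a much simpler case than the CNNs treated in the paper. The sketch below assumes a single hidden layer of ReLU units with standard Gaussian weights and no biases, estimates the kernel entry $\mathbb{E}_w[\mathrm{relu}(w \cdot x)\,\mathrm{relu}(w \cdot x')]$ by sampling, and compares it to the known closed form (the degree-one arc-cosine kernel, up to a factor of 1/2). Function names and the sample count are illustrative assumptions.

```python
import math
import random


def relu(z):
    return z if z > 0.0 else 0.0


def mc_relu_kernel(x1, x2, num_samples=20000, seed=0):
    """Monte Carlo estimate of E_w[relu(w.x1) * relu(w.x2)], w ~ N(0, I)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        w = [rng.gauss(0.0, 1.0) for _ in x1]
        a = sum(wi * xi for wi, xi in zip(w, x1))
        b = sum(wi * xi for wi, xi in zip(w, x2))
        total += relu(a) * relu(b)
    return total / num_samples


def analytic_relu_kernel(x1, x2):
    """Closed form: |x1||x2| (sin t + (pi - t) cos t) / (2 pi), t = angle(x1, x2)."""
    n1 = math.sqrt(sum(v * v for v in x1))
    n2 = math.sqrt(sum(v * v for v in x2))
    cos_t = sum(a * b for a, b in zip(x1, x2)) / (n1 * n2)
    cos_t = max(-1.0, min(1.0, cos_t))  # guard against rounding outside [-1, 1]
    t = math.acos(cos_t)
    return n1 * n2 * (math.sin(t) + (math.pi - t) * cos_t) / (2.0 * math.pi)


x1 = [1.0, 0.0, 0.0]
x2 = [0.6, 0.8, 0.0]
print(mc_relu_kernel(x1, x2), analytic_relu_kernel(x1, x2))
```

Sampling is only useful when no closed form is available, which is the situation the abstract describes; the closed form is included here purely as a check on the estimator.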

preprint, 2020, arXiv

Disentangling Trainability and Generalization in Deep Neural Networks

A longstanding goal in the theory of deep learning is to characterize the conditions under which a given neural network architecture will be trainable, and if so, how well it might generalize to unseen data. In this work, we provide such a characterization in the limit of very wide and very deep networks, for which the analysis simplifies considerably. For wide networks, the trajectory under gradient descent is governed by the Neural Tangent Kernel (NTK), and for deep networks the NTK itself maintains only weak data dependence. By analyzing the spectrum of the NTK, we formulate necessary conditions for trainability and generalization across a range of architectures, including Fully Connected Networks (FCNs) and Convolutional Neural Networks (CNNs). We identify large regions of hyperparameter space for which networks can memorize the training set but completely fail to generalize. We find that CNNs without global average pooling behave almost identically to FCNs, but that CNNs with pooling have markedly different and often better generalization performance. These theoretical results are corroborated experimentally on CIFAR10 for a variety of network architectures and we include a colab notebook that reproduces the essential results of the paper.

preprint, 2020, arXiv

Finite Versus Infinite Neural Networks: an Empirical Study

We perform a careful, thorough, and large scale empirical study of the correspondence between wide neural networks and kernel methods. By doing so, we resolve a variety of open questions related to the study of infinitely wide neural networks. Our experimental results include: kernel methods outperform fully-connected finite-width networks, but underperform convolutional finite width networks; neural network Gaussian process (NNGP) kernels frequently outperform neural tangent (NT) kernels; centered and ensembled finite networks have reduced posterior variance and behave more similarly to infinite networks; weight decay and the use of a large learning rate break the correspondence between finite and infinite networks; the NTK parameterization outperforms the standard parameterization for finite width networks; diagonal regularization of kernels acts similarly to early stopping; floating point precision limits kernel performance beyond a critical dataset size; regularized ZCA whitening improves accuracy; finite network performance depends non-monotonically on width in ways not captured by double descent phenomena; equivariance of CNNs is only beneficial for narrow networks far from the kernel regime. Our experiments additionally motivate an improved layer-wise scaling for weight decay which improves generalization in finite-width networks. Finally, we develop improved best practices for using NNGP and NT kernels for prediction, including a novel ensembling technique. Using these best practices we achieve state-of-the-art results on CIFAR-10 classification for kernels corresponding to each architecture class we consider.

preprint, 2020, arXiv

Provable Benefit of Orthogonal Initialization in Optimizing Deep Linear Networks

The selection of initial parameter values for gradient-based optimization of deep neural networks is one of the most impactful hyperparameter choices in deep learning systems, affecting both convergence times and model performance. Yet despite significant empirical and theoretical analysis, relatively little has been proved about the concrete effects of different initialization schemes. In this work, we analyze the effect of initialization in deep linear networks, and provide for the first time a rigorous proof that drawing the initial weights from the orthogonal group speeds up convergence relative to the standard Gaussian initialization with iid weights. We show that for deep networks, the width needed for efficient convergence to a global minimum with orthogonal initializations is independent of the depth, whereas the width needed for efficient convergence with Gaussian initializations scales linearly in the depth. Our results demonstrate how the benefits of a good initialization can persist throughout learning, suggesting an explanation for the recent empirical successes found by initializing very deep non-linear networks according to the principle of dynamical isometry.
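For concreteness, the two initialization schemes compared above can be sketched as follows. The orthogonal draw here applies Gram-Schmidt to a Gaussian matrix, which is one standard way to sample an orthogonal matrix; the function names and the $1/\sqrt{n}$ scaling of the Gaussian baseline are assumptions for illustration rather than details taken from the paper.

```python
import math
import random


def gaussian_init(n, rng):
    """n x n matrix with iid N(0, 1/n) entries (an assumed scaling)."""
    s = 1.0 / math.sqrt(n)
    return [[rng.gauss(0.0, s) for _ in range(n)] for _ in range(n)]


def orthogonal_init(n, rng):
    """Orthogonal matrix from Gram-Schmidt applied to Gaussian rows."""
    rows = []
    for _ in range(n):
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # Remove components along the rows already chosen (modified Gram-Schmidt).
        for q in rows:
            proj = sum(vi * qi for vi, qi in zip(v, q))
            v = [vi - proj * qi for vi, qi in zip(v, q)]
        norm = math.sqrt(sum(vi * vi for vi in v))
        rows.append([vi / norm for vi in v])
    return rows


def gram_error(W):
    """Largest absolute entry of W W^T - I."""
    n = len(W)
    err = 0.0
    for i in range(n):
        for j in range(n):
            g = sum(a * b for a, b in zip(W[i], W[j]))
            err = max(err, abs(g - (1.0 if i == j else 0.0)))
    return err


rng = random.Random(0)
W_orth = orthogonal_init(8, rng)
W_gauss = gaussian_init(8, rng)
print(gram_error(W_orth), gram_error(W_gauss))
```

The orthogonal matrix satisfies $W W^\top = I$ up to rounding, so a product of such layers preserves norms exactly, whereas the Gaussian matrix does so only on average.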

preprint, 2020, arXiv

The Neural Tangent Kernel in High Dimensions: Triple Descent and a Multi-Scale Theory of Generalization

Modern deep learning models employ considerably more parameters than required to fit the training data. Whereas conventional statistical wisdom suggests such models should drastically overfit, in practice these models generalize remarkably well. An emerging paradigm for describing this unexpected behavior is in terms of a double descent curve, in which increasing a model's capacity causes its test error to first decrease, then increase to a maximum near the interpolation threshold, and then decrease again in the overparameterized regime. Recent efforts to explain this phenomenon theoretically have focused on simple settings, such as linear regression or kernel regression with unstructured random features, which we argue are too coarse to reveal important nuances of actual neural networks. We provide a precise high-dimensional asymptotic analysis of generalization under kernel regression with the Neural Tangent Kernel, which characterizes the behavior of wide neural networks optimized with gradient descent. Our results reveal that the test error has non-monotonic behavior deep in the overparameterized regime and can even exhibit additional peaks and descents when the number of parameters scales quadratically with the dataset size.

preprint, 2020, arXiv

The Surprising Simplicity of the Early-Time Learning Dynamics of Neural Networks

Modern neural networks are often regarded as complex black-box functions whose behavior is difficult to understand owing to their nonlinear dependence on the data and the nonconvexity in their loss landscapes. In this work, we show that these common perceptions can be completely false in the early phase of learning. In particular, we formally prove that, for a class of well-behaved input distributions, the early-time learning dynamics of a two-layer fully-connected neural network can be mimicked by training a simple linear model on the inputs. We additionally argue that this surprising simplicity can persist in networks with more layers and with convolutional architecture, which we verify empirically. Key to our analysis is to bound the spectral norm of the difference between the Neural Tangent Kernel (NTK) at initialization and an affine transform of the data kernel; however, unlike many previous results utilizing the NTK, we do not require the network to have disproportionately large width, and the network is allowed to escape the kernel regime later in training.

preprint, 2019, arXiv

Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent

A longstanding goal in deep learning research has been to precisely characterize training and generalization. However, the often complex loss landscapes of neural networks have made a theory of learning dynamics elusive. In this work, we show that for wide neural networks the learning dynamics simplify considerably and that, in the infinite width limit, they are governed by a linear model obtained from the first-order Taylor expansion of the network around its initial parameters. Furthermore, mirroring the correspondence between wide Bayesian neural networks and Gaussian processes, gradient-based training of wide neural networks with a squared loss produces test set predictions drawn from a Gaussian process with a particular compositional kernel. While these theoretical results are only exact in the infinite width limit, we nevertheless find excellent empirical agreement between the predictions of the original network and those of the linearized version even for finite practically-sized networks. This agreement is robust across different architectures, optimization methods, and loss functions.
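The linearization described above can be written out for a toy model. The sketch below assumes a two-layer tanh network with scalar input and output, and the names `init_params`, `f`, `grad_f` and `f_lin` are hypothetical. It builds the first-order Taylor expansion $f_{\mathrm{lin}}(x;\theta) = f(x;\theta_0) + \nabla_\theta f(x;\theta_0) \cdot (\theta - \theta_0)$ and compares it with the network after a small parameter perturbation.

```python
import math
import random


def init_params(width, seed=0):
    rng = random.Random(seed)
    return {"w": [rng.gauss(0.0, 1.0) for _ in range(width)],
            "a": [rng.gauss(0.0, 1.0) for _ in range(width)]}


def f(params, x):
    """Two-layer network with scalar input: sum_j a_j tanh(w_j x) / sqrt(width)."""
    m = len(params["w"])
    return sum(a * math.tanh(w * x)
               for a, w in zip(params["a"], params["w"])) / math.sqrt(m)


def grad_f(params, x):
    """Analytic gradient of f with respect to w and a."""
    m = len(params["w"])
    s = math.sqrt(m)
    gw = [a * x * (1.0 - math.tanh(w * x) ** 2) / s
          for a, w in zip(params["a"], params["w"])]
    ga = [math.tanh(w * x) / s for w in params["w"]]
    return {"w": gw, "a": ga}


def f_lin(params, params0, x):
    """First-order Taylor expansion of f around params0."""
    g = grad_f(params0, x)
    out = f(params0, x)
    for key in ("w", "a"):
        out += sum(gi * (p - p0)
                   for gi, p, p0 in zip(g[key], params[key], params0[key]))
    return out


theta0 = init_params(width=64)
eps = 1e-3
theta = {k: [v + eps for v in vals] for k, vals in theta0.items()}
x = 0.7
print(f(theta, x), f_lin(theta, theta0, x))
```

This only checks that the two models agree near initialization; the paper's claim concerns agreement along the whole training trajectory as the width grows, which this sketch does not test.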

preprint, 2014, arXiv

Bootstrapping six-gluon scattering in planar ${\cal N}=4$ super-Yang-Mills theory

We describe the hexagon function bootstrap for solving for six-gluon scattering amplitudes in the large $N_c$ limit of ${\cal N}=4$ super-Yang-Mills theory. In this method, an ansatz for the finite part of these amplitudes is constrained at the level of amplitudes, not integrands, using boundary information. In the near-collinear limit, the dual picture of the amplitudes as Wilson loops leads to an operator product expansion which has been solved using integrability by Basso, Sever and Vieira. Factorization of the amplitudes in the multi-Regge limit provides additional boundary data. This bootstrap has been applied successfully through four loops for the maximally helicity violating (MHV) configuration of gluon helicities, and through three loops for the non-MHV case.

preprint, 2014, arXiv

The BFKL equation, Mueller-Navelet jets and single-valued harmonic polylogarithms

We introduce a generating function for the coefficients of the leading logarithmic BFKL Green's function in transverse-momentum space, order by order in $\alpha_s$, in terms of single-valued harmonic polylogarithms. As an application, we exhibit fully analytic azimuthal-angle and transverse-momentum distributions for Mueller-Navelet jet cross sections at each order in $\alpha_s$. We also provide a generating function for the total cross section valid to any number of loops.

preprint, 2014, arXiv

The four-loop remainder function and multi-Regge behavior at NNLLA in planar N=4 super-Yang-Mills theory

We present the four-loop remainder function for six-gluon scattering with maximal helicity violation in planar N=4 super-Yang-Mills theory, as an analytic function of three dual-conformal cross ratios. The function is constructed entirely from its analytic properties, without ever inspecting any multi-loop integrand. We employ the same approach used at three loops, writing an ansatz in terms of hexagon functions, and fixing coefficients in the ansatz using the multi-Regge limit and the operator product expansion in the near-collinear limit. We express the result in terms of multiple polylogarithms, and in terms of the coproduct for the associated Hopf algebra. From the remainder function, we extract the BFKL eigenvalue at next-to-next-to-leading logarithmic accuracy (NNLLA), and the impact factor at NNNLLA. We plot the remainder function along various lines and on one surface, studying ratios of successive loop orders. As seen previously through three loops, these ratios are surprisingly constant over large regions in the space of cross ratios, and they are not far from the value expected at asymptotically large orders of perturbation theory.

preprint, 2013, arXiv

Hexagon functions and the three-loop remainder function

We present the three-loop remainder function, which describes the scattering of six gluons in the maximally-helicity-violating configuration in planar N=4 super-Yang-Mills theory, as a function of the three dual conformal cross ratios. The result can be expressed in terms of multiple Goncharov polylogarithms. We also employ a more restricted class of "hexagon functions" which have the correct branch cuts and certain other restrictions on their symbols. We classify all the hexagon functions through transcendental weight five, using the coproduct for their Hopf algebra iteratively, which amounts to a set of first-order differential equations. The three-loop remainder function is a particular weight-six hexagon function, whose symbol was determined previously. The differential equations can be integrated numerically for generic values of the cross ratios, or analytically in certain kinematic limits, including the near-collinear and multi-Regge limits. These limits allow us to impose constraints from the operator product expansion and multi-Regge factorization directly at the function level, and thereby to fix uniquely a set of Riemann-zeta-valued constants that could not be fixed at the level of the symbol. The near-collinear limits agree precisely with recent predictions by Basso, Sever and Vieira based on integrability. The multi-Regge limits agree with the factorization formula of Fadin and Lipatov, and determine three constants entering the impact factor at this order. We plot the three-loop remainder function for various slices of the Euclidean region of positive cross ratios, and compare it to the two-loop one. For large ranges of the cross ratios, the ratio of the three-loop to the two-loop remainder function is relatively constant, and close to -7.

preprint, 2013, arXiv

Leading singularities and off-shell conformal integrals

The three-loop four-point function of stress-tensor multiplets in N=4 super Yang-Mills theory contains two so far unknown, off-shell, conformal integrals, in addition to the known, ladder-type integrals. In this paper we evaluate the unknown integrals, thus obtaining the three-loop correlation function analytically. The integrals have the generic structure of rational functions multiplied by (multiple) polylogarithms. We use the idea of leading singularities to obtain the rational coefficients, the symbol - with an appropriate ansatz for its structure - as a means of characterising multiple polylogarithms, and the technique of asymptotic expansion of Feynman integrals to obtain the integrals in certain limits. The limiting behaviour uniquely fixes the symbols of the integrals, which we then lift to find the corresponding polylogarithmic functions. The final formulae are numerically confirmed. The techniques we develop can be applied more generally, and we illustrate this by analytically evaluating one of the integrals contributing to the same four-point function at four loops. This example shows a connection between the leading singularities and the entries of the symbol.

preprint, 2012, arXiv

Single-valued harmonic polylogarithms and the multi-Regge limit

We argue that the natural functions for describing the multi-Regge limit of six-gluon scattering in planar N=4 super Yang-Mills theory are the single-valued harmonic polylogarithmic functions introduced by Brown. These functions depend on a single complex variable and its conjugate, (w,w*). Using these functions, and formulas due to Fadin, Lipatov and Prygarin, we determine the six-gluon MHV remainder function in the leading-logarithmic approximation (LLA) in this limit through ten loops, and the next-to-LLA (NLLA) terms through nine loops. In separate work, we have determined the symbol of the four-loop remainder function for general kinematics, up to 113 constants. Taking its multi-Regge limit and matching to our four-loop LLA and NLLA results, we fix all but one of the constants that survive in this limit. The multi-Regge limit factorizes in the variables (ν,n) which are related to (w,w*) by a Fourier-Mellin transform. We can transform the single-valued harmonic polylogarithms to functions of (ν,n) that incorporate harmonic sums, systematically through transcendental weight six. Combining this information with the four-loop results, we determine the eigenvalues of the BFKL kernel in the adjoint representation to NNLLA accuracy, and the MHV product of impact factors to NNNLLA accuracy, up to constants representing beyond-the-symbol terms and the one symbol-level constant. Remarkably, only derivatives of the polygamma function enter these results. Finally, the LLA approximation to the six-gluon NMHV amplitude is evaluated through ten loops.

preprint, 2012, arXiv

The six-point remainder function to all loop orders in the multi-Regge limit

We present an all-orders formula for the six-point amplitude of planar maximally supersymmetric N=4 Yang-Mills theory in the leading-logarithmic approximation of multi-Regge kinematics. In the MHV helicity configuration, our results agree with an integral formula of Lipatov and Prygarin through at least 14 loops. A differential equation linking the MHV and NMHV helicity configurations has a natural action in the space of functions relevant to this problem: the single-valued harmonic polylogarithms introduced by Brown. These functions depend on a single complex variable and its conjugate, w and w*, which are quadratically related to the original kinematic variables. We investigate the all-orders formula in the near-collinear limit, which is approached as $|w| \to 0$. Up to power-suppressed terms, the resulting expansion may be organized by powers of log|w|. The leading term of this expansion agrees with the all-orders double-leading-logarithmic approximation of Bartels, Lipatov, and Prygarin. The explicit form for the sub-leading powers of log|w| is given in terms of modified Bessel functions.
