Source author record

Samuel S. Schoenholz

Samuel S. Schoenholz appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher (unclaimed source record)

cond-mat.soft, Machine Learning, cond-mat.mtrl-sci, cond-mat.stat-mech, Artificial Intelligence, cond-mat.dis-nn, hep-lat, hep-th, physics.chem-ph

Catalog footprint

16 works, 9 topics, 4 close collaborators

preprint, 2022, arXiv

Deep equilibrium networks are sensitive to initialization statistics

Deep equilibrium networks (DEQs) are a promising way to construct models which trade off memory for compute. However, theoretical understanding of these models is still lacking compared to traditional networks, in part because of the repeated application of a single set of weights. We show that DEQs are sensitive to the higher order statistics of the matrix families from which they are initialized. In particular, initializing with orthogonal or symmetric matrices allows for greater stability in training. This gives us a practical prescription for initializations which allow for training with a broader range of initial weight scales.
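As a rough illustration of the setup, the sketch below runs a DEQ-style forward pass with an orthogonally initialized weight matrix. Everything in it is an assumption made for the example: the fixed-point form z = tanh(Wz + x), the plain fixed-point solver, the weight scale of 0.8, and the helper names `random_orthogonal` and `deq_forward`. It does not reproduce the paper's experiments; it only shows what "repeated application of a single set of weights" looks like in code.

```python
import math
import random

def random_orthogonal(n, rng):
    """Return an n x n matrix with orthonormal rows (Gram-Schmidt on Gaussian rows)."""
    rows = []
    for _ in range(n):
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        for q in rows:
            dot = sum(a * b for a, b in zip(v, q))
            v = [a - dot * b for a, b in zip(v, q)]  # remove the component along q
        norm = math.sqrt(sum(a * a for a in v))
        rows.append([a / norm for a in v])
    return rows

def deq_forward(W, x, tol=1e-10, max_iter=1000):
    """Solve z = tanh(W z + x) by fixed-point iteration; return (z, iterations used)."""
    z = [0.0] * len(x)
    for it in range(max_iter):
        z_new = [math.tanh(sum(wij * zj for wij, zj in zip(row, z)) + xi)
                 for row, xi in zip(W, x)]
        err = max(abs(a - b) for a, b in zip(z_new, z))
        z = z_new
        if err < tol:
            return z, it + 1
    return z, max_iter

rng = random.Random(0)
n, scale = 6, 0.8            # hypothetical size and initial weight scale
Q = random_orthogonal(n, rng)
W = [[scale * q for q in row] for row in Q]
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
z_star, iters = deq_forward(W, x)
```

With an orthogonal Q, every singular value of W equals `scale`, so for `scale` below 1 the update is a contraction and this simple solver converges. A Gaussian matrix at the same nominal scale has a spread of singular values, which is one way to see why the matrix family, and not only the scale, can matter.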

preprint, 2022, arXiv

Fast Finite Width Neural Tangent Kernel

The Neural Tangent Kernel (NTK), defined as $Θ_θ^f(x_1, x_2) = \left[\partial f(θ, x_1)\big/\partial θ\right] \left[\partial f(θ, x_2)\big/\partial θ\right]^T$ where $\left[\partial f(θ, \cdot)\big/\partial θ\right]$ is a neural network (NN) Jacobian, has emerged as a central object of study in deep learning. In the infinite width limit, the NTK can sometimes be computed analytically and is useful for understanding training and generalization of NN architectures. At finite widths, the NTK is also used to better initialize NNs, compare the conditioning across models, perform architecture search, and do meta-learning. Unfortunately, the finite width NTK is notoriously expensive to compute, which severely limits its practical utility. We perform the first in-depth analysis of the compute and memory requirements for NTK computation in finite width networks. Leveraging the structure of neural networks, we further propose two novel algorithms that change the exponent of the compute and memory requirements of the finite width NTK, dramatically improving efficiency. Our algorithms can be applied in a black box fashion to any differentiable function, including those implementing neural networks. We open-source our implementations within the Neural Tangents package (arXiv:1912.02803) at https://github.com/google/neural-tangents.
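To make the definition concrete, here is a minimal sketch that evaluates the finite-width NTK by direct Jacobian contraction for a tiny one-hidden-layer tanh network with scalar input and output. The network, its width, and the names `network`, `jacobian` and `empirical_ntk` are assumptions for this example. This is the naive baseline whose cost motivates the paper, not the faster algorithms it proposes, and it does not use the Neural Tangents API.

```python
import math
import random

def network(params, x):
    """f(theta, x) = sum_j v_j * tanh(w_j * x + b_j), with theta = (w, b, v)."""
    w, b, v = params
    return sum(vj * math.tanh(wj * x + bj) for wj, bj, vj in zip(w, b, v))

def jacobian(params, x):
    """Analytic gradient of f with respect to theta, ordered per neuron as (w_j, b_j, v_j)."""
    w, b, v = params
    grad = []
    for wj, bj, vj in zip(w, b, v):
        t = math.tanh(wj * x + bj)
        dt = 1.0 - t * t                      # derivative of tanh
        grad.extend([vj * dt * x, vj * dt, t])
    return grad

def empirical_ntk(params, xs1, xs2):
    """Theta(x1, x2) = J(x1) . J(x2) for every pair of inputs."""
    j1 = [jacobian(params, x) for x in xs1]
    j2 = [jacobian(params, x) for x in xs2]
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in j2] for r1 in j1]

rng = random.Random(0)
width = 8                                     # hypothetical hidden width
params = [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(3)]
xs = [-1.0, 0.0, 0.5]
ntk = empirical_ntk(params, xs, xs)           # 3 x 3 kernel matrix
```

The kernel matrix is symmetric with a positive diagonal, since each diagonal entry is the squared norm of a Jacobian row. The cost of this direct approach grows with the number of parameters times the number of input pairs, which is the bottleneck the abstract refers to.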

preprint, 2022, arXiv

Gradients are Not All You Need

Differentiable programming techniques are widely used in the community and are responsible for the machine learning renaissance of the past several decades. While these methods are powerful, they have limits. In this short report, we discuss a common chaos-based failure mode which appears in a variety of differentiable circumstances, ranging from recurrent neural networks and numerical physics simulation to training learned optimizers. We trace this failure to the spectrum of the Jacobian of the system under study, and provide criteria for when a practitioner might expect this failure to spoil their differentiation-based optimization algorithms.
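The failure mode can be seen in a one-dimensional toy system. The sketch below differentiates through repeated application of the logistic map x -> r x (1 - x), accumulating the product of one-step Jacobians by the chain rule. The choice of the logistic map, the parameter values, and the function name `iterate_with_gradient` are assumptions for illustration and are not taken from the report.

```python
def iterate_with_gradient(r, x0, steps):
    """Iterate x -> r*x*(1-x) and return (x_T, d x_T / d x_0)."""
    x, dx = x0, 1.0
    for _ in range(steps):
        dx *= r * (1.0 - 2.0 * x)   # chain rule: multiply by the one-step Jacobian
        x = r * x * (1.0 - x)
    return x, dx

# r = 2.5: trajectories settle onto a stable fixed point, |Jacobian| < 1 there.
_, grad_stable = iterate_with_gradient(2.5, 0.3, 50)

# r = 4.0: chaotic regime, the Jacobian product grows exponentially on average.
_, grad_chaotic = iterate_with_gradient(4.0, 0.3, 50)
```

In the stable case the gradient with respect to the initial condition decays toward zero, while in the chaotic case it becomes astronomically large after only 50 steps. In higher dimensions the same role is played by the largest singular values of the product of Jacobians.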

preprint, 2020, arXiv

Disentangling Trainability and Generalization in Deep Neural Networks

A longstanding goal in the theory of deep learning is to characterize the conditions under which a given neural network architecture will be trainable, and if so, how well it might generalize to unseen data. In this work, we provide such a characterization in the limit of very wide and very deep networks, for which the analysis simplifies considerably. For wide networks, the trajectory under gradient descent is governed by the Neural Tangent Kernel (NTK), and for deep networks the NTK itself maintains only weak data dependence. By analyzing the spectrum of the NTK, we formulate necessary conditions for trainability and generalization across a range of architectures, including Fully Connected Networks (FCNs) and Convolutional Neural Networks (CNNs). We identify large regions of hyperparameter space for which networks can memorize the training set but completely fail to generalize. We find that CNNs without global average pooling behave almost identically to FCNs, but that CNNs with pooling have markedly different and often better generalization performance. These theoretical results are corroborated experimentally on CIFAR10 for a variety of network architectures and we include a colab notebook that reproduces the essential results of the paper.

preprint, 2020, arXiv

Finite Versus Infinite Neural Networks: an Empirical Study

We perform a careful, thorough, and large scale empirical study of the correspondence between wide neural networks and kernel methods. By doing so, we resolve a variety of open questions related to the study of infinitely wide neural networks. Our experimental results include: kernel methods outperform fully-connected finite-width networks, but underperform convolutional finite width networks; neural network Gaussian process (NNGP) kernels frequently outperform neural tangent (NT) kernels; centered and ensembled finite networks have reduced posterior variance and behave more similarly to infinite networks; weight decay and the use of a large learning rate break the correspondence between finite and infinite networks; the NTK parameterization outperforms the standard parameterization for finite width networks; diagonal regularization of kernels acts similarly to early stopping; floating point precision limits kernel performance beyond a critical dataset size; regularized ZCA whitening improves accuracy; finite network performance depends non-monotonically on width in ways not captured by double descent phenomena; equivariance of CNNs is only beneficial for narrow networks far from the kernel regime. Our experiments additionally motivate an improved layer-wise scaling for weight decay which improves generalization in finite-width networks. Finally, we develop improved best practices for using NNGP and NT kernels for prediction, including a novel ensembling technique. Using these best practices we achieve state-of-the-art results on CIFAR-10 classification for kernels corresponding to each architecture class we consider.

preprint, 2020, arXiv

On the infinite width limit of neural networks with a standard parameterization

There are currently two parameterizations used to derive fixed kernels corresponding to infinite width neural networks, the NTK (Neural Tangent Kernel) parameterization and the naive standard parameterization. However, the extrapolation of both of these parameterizations to infinite width is problematic. The standard parameterization leads to a divergent neural tangent kernel while the NTK parameterization fails to capture crucial aspects of finite width networks such as: the dependence of training dynamics on relative layer widths, the relative training dynamics of weights and biases, and overall learning rate scale. Here we propose an improved extrapolation of the standard parameterization that preserves all of these properties as width is taken to infinity and yields a well-defined neural tangent kernel. We show experimentally that the resulting kernels typically achieve similar accuracy to those resulting from an NTK parameterization, but with better correspondence to the parameterization of typical finite width networks. Additionally, with careful tuning of width parameters, the improved standard parameterization kernels can outperform those stemming from an NTK parameterization. We release code implementing this improved standard parameterization as part of the Neural Tangents library at https://github.com/google/neural-tangents.

preprint, 2020, arXiv

Unifying framework for strong and fragile liquids via machine learning: a study of liquid silica

The fragility of a glassforming liquid characterizes how rapidly its relaxation dynamics slow down with cooling. The viscosity of strong liquids follows an Arrhenius law with a temperature-independent barrier height to rearrangements responsible for relaxation, whereas fragile liquids experience a much faster increase in their dynamics, suggesting a barrier height that increases with decreasing temperature. Strong glassformers are typically network glasses, while fragile glassformers are typically molecular or hard-sphere-like. As a result of these differences at the microscopic level, strong and fragile glassformers are usually treated separately from a theoretical point of view. Silica is the archetypal strong glassformer at low temperatures, but also exhibits a mysterious strong-to-fragile crossover at higher temperatures. Here we show that softness, a structure-based machine-learned parameter that has previously been applied to fragile glassformers, provides a useful description of model liquid silica in the strong and fragile regimes, and through the strong-to-fragile crossover. Just as for fragile glassformers, the relationship between softness and dynamics is invariant and Arrhenius in all regimes, but the average softness changes with temperature. The strong-to-fragile crossover in silica is not due to a sudden, qualitative change in structure, but can be explained by a simple Arrhenius form with a continuously and linearly changing local structure. Our results unify the study of liquid silica under a single simple conceptual picture.

preprint, 2019, arXiv

Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent

A longstanding goal in deep learning research has been to precisely characterize training and generalization. However, the often complex loss landscapes of neural networks have made a theory of learning dynamics elusive. In this work, we show that for wide neural networks the learning dynamics simplify considerably and that, in the infinite width limit, they are governed by a linear model obtained from the first-order Taylor expansion of the network around its initial parameters. Furthermore, mirroring the correspondence between wide Bayesian neural networks and Gaussian processes, gradient-based training of wide neural networks with a squared loss produces test set predictions drawn from a Gaussian process with a particular compositional kernel. While these theoretical results are only exact in the infinite width limit, we nevertheless find excellent empirical agreement between the predictions of the original network and those of the linearized version even for finite practically-sized networks. This agreement is robust across different architectures, optimization methods, and loss functions.
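The linear model referred to above is the first-order Taylor expansion f_lin(θ, x) = f(θ_0, x) + ∇_θ f(θ_0, x) · (θ − θ_0). The sketch below builds it for a tiny tanh network and checks how quickly it departs from the full network as the parameters move away from θ_0. The network, its width, the perturbation sizes, and the names `f`, `grad_f`, `f_lin` and `linearization_error` are assumptions for this example, and a width of 8 is far from the wide regime the abstract analyzes.

```python
import math
import random

def f(theta, x):
    """Tiny network: sum_j v_j * tanh(w_j * x + b_j); theta is flat, (w_j, b_j, v_j) per neuron."""
    n = len(theta) // 3
    return sum(theta[3 * j + 2] * math.tanh(theta[3 * j] * x + theta[3 * j + 1])
               for j in range(n))

def grad_f(theta, x):
    """Analytic gradient of f with respect to theta, in the same flat ordering."""
    g = []
    for j in range(len(theta) // 3):
        w, b, v = theta[3 * j:3 * j + 3]
        t = math.tanh(w * x + b)
        dt = 1.0 - t * t
        g.extend([v * dt * x, v * dt, t])
    return g

def f_lin(theta, theta0, x):
    """First-order Taylor expansion of f around theta0."""
    g = grad_f(theta0, x)
    return f(theta0, x) + sum(gi * (a - b) for gi, a, b in zip(g, theta, theta0))

rng = random.Random(0)
theta0 = [rng.gauss(0.0, 1.0) for _ in range(24)]      # 8 hidden units
direction = [rng.gauss(0.0, 1.0) for _ in range(24)]   # hypothetical update direction
x = 0.5

def linearization_error(eps):
    theta = [a + eps * d for a, d in zip(theta0, direction)]
    return abs(f(theta, x) - f_lin(theta, theta0, x))

err_big = linearization_error(1e-3)
err_small = linearization_error(1e-4)
```

The gap between the network and its linearization is second order in the size of the parameter change, so it shrinks much faster than the change itself. The paper's result is that for sufficiently wide networks the parameters move so little during training that this approximation stays accurate throughout.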

preprint, 2017, arXiv

Machine learning prediction errors better than DFT accuracy

We investigate the impact of choosing regressors and molecular representations for the construction of fast machine learning (ML) models of thirteen electronic ground-state properties of organic molecules. The performance of each regressor/representation/property combination is assessed using learning curves which report out-of-sample errors as a function of training set size with up to $\sim$117k distinct molecules. Molecular structures and properties at hybrid density functional theory (DFT) level of theory used for training and testing come from the QM9 database [Ramakrishnan et al., Scientific Data 1, 140022 (2014)] and include dipole moment, polarizability, HOMO/LUMO energies and gap, electronic spatial extent, zero point vibrational energy, enthalpies and free energies of atomization, heat capacity and the highest fundamental vibrational frequency. Various representations from the literature have been studied (Coulomb matrix, bag of bonds, BAML and ECFP4, molecular graphs (MG)), as well as newly developed distribution based variants including histograms of distances (HD), and angles (HDA/MARAD), and dihedrals (HDAD). Regressors include linear models (Bayesian ridge regression (BR) and linear regression with elastic net regularization (EN)), random forest (RF), kernel ridge regression (KRR) and two types of neural networks, graph convolutions (GC) and gated graph networks (GG). We present numerical evidence that ML model predictions deviate from DFT less than DFT deviates from experiment for all properties. Furthermore, our out-of-sample prediction errors with respect to hybrid DFT reference are on par with, or close to, chemical accuracy. Our findings suggest that ML models could be more accurate than hybrid DFT if explicitly electron correlated quantum (or experimental) data was available.

preprint, 2015, arXiv

A structural approach to relaxation in glassy liquids

When a liquid freezes, a change in the local atomic structure marks the transition to the crystal. When a liquid is cooled to form a glass, however, no noticeable structural change marks the glass transition. Indeed, characteristic features of glassy dynamics that appear below an onset temperature, $T_0$, are qualitatively captured by mean field theory, which assumes uniform local structure at all temperatures. Even studies of more realistic systems have found only weak correlations between structure and dynamics. This raises the question: is structure important to glassy dynamics in three dimensions? Here, we answer this question affirmatively by using machine learning methods to identify a new field, which we call softness, that characterizes local structure and is strongly correlated with rearrangement dynamics. We find that the onset of glassy dynamics at $T_0$ is marked by the onset of correlations between softness (i.e. structure) and dynamics. Moreover, we use softness to construct a simple model of slow glassy relaxation that is in excellent agreement with our simulation results, showing that a theory of the evolution of softness in time would constitute a theory of glassy dynamics.

preprint, 2015, arXiv

Nonlinear Sigma Models with Compact Hyperbolic Target Spaces

We explore the phase structure of nonlinear sigma models with target spaces corresponding to compact quotients of hyperbolic space, focusing on the case of a hyperbolic genus-2 Riemann surface. The continuum theory of these models can be approximated by a lattice spin system which we simulate using Monte Carlo methods. The target space possesses interesting geometric and topological properties which are reflected in novel features of the sigma model. In particular, we observe a topological phase transition at a critical temperature, above which vortices proliferate, reminiscent of the Kosterlitz-Thouless phase transition in the $O(2)$ model. Unlike in the $O(2)$ case, there are many different types of vortices, suggesting a possible analogy to the Hagedorn treatment of statistical mechanics of a proliferating number of hadron species. Below the critical temperature the spins cluster around six special points in the target space known as Weierstrass points. The diversity of compact hyperbolic manifolds suggests that our model is only the simplest example of a broad class of statistical mechanical models whose main features can be understood essentially in geometric terms.

preprint, 2015, arXiv

Strain fluctuations and elastic moduli in disordered solids

Recently there has been a surge in interest in using video-microscopy techniques to infer the local mechanical properties of disordered solids. One common approach is to minimize the difference between particle vibrational displacements in a local coarse-graining volume and the displacements that would result from a best-fit affine deformation. Effective moduli are then inferred under the assumption that the components of this best-fit affine deformation tensor have a Boltzmann distribution. In this paper, we combine theoretical arguments with experimental and simulation data to demonstrate that the above does not reveal information about the true elastic moduli of jammed packings and colloidal glasses.
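The fitting step described above is an ordinary least-squares problem. The sketch below, for two dimensions, finds the tensor A that minimizes the sum over particles of |u_i − A r_i|², where r_i is a particle position relative to the center of the coarse-graining volume and u_i is its displacement. The function name `best_fit_affine`, the sample positions, and the tensor `A_true` are assumptions for illustration. The sketch covers only the affine fit, not the inference of moduli that the paper argues against.

```python
def best_fit_affine(positions, displacements):
    """Least-squares 2x2 tensor A minimizing sum_i |u_i - A r_i|^2, plus the residual."""
    X = [[0.0, 0.0], [0.0, 0.0]]   # sum_i u_i r_i^T
    Y = [[0.0, 0.0], [0.0, 0.0]]   # sum_i r_i r_i^T
    for r, u in zip(positions, displacements):
        for a in range(2):
            for b in range(2):
                X[a][b] += u[a] * r[b]
                Y[a][b] += r[a] * r[b]
    det = Y[0][0] * Y[1][1] - Y[0][1] * Y[1][0]
    if abs(det) < 1e-12:
        raise ValueError("positions are collinear; the affine fit is ill-defined")
    Y_inv = [[Y[1][1] / det, -Y[0][1] / det],
             [-Y[1][0] / det, Y[0][0] / det]]
    # Normal equations: A = X Y^{-1}
    A = [[sum(X[a][k] * Y_inv[k][b] for k in range(2)) for b in range(2)]
         for a in range(2)]
    # Non-affine residual left over after removing the best-fit deformation.
    residual = sum((u[a] - sum(A[a][b] * r[b] for b in range(2))) ** 2
                   for r, u in zip(positions, displacements) for a in range(2))
    return A, residual

# Hypothetical local neighborhood and a purely affine displacement field.
positions = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0), (1.0, 1.0)]
A_true = [[0.01, 0.02], [0.0, -0.01]]
displacements = [(A_true[0][0] * x + A_true[0][1] * y,
                  A_true[1][0] * x + A_true[1][1] * y) for x, y in positions]
A_fit, residual = best_fit_affine(positions, displacements)
```

For a purely affine displacement field the fit recovers the imposed tensor and the residual vanishes up to rounding. In real vibrational data the residual is nonzero, and the fluctuations of the fitted components are the quantity from which effective moduli have been inferred.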

preprint, 2014, arXiv

Identifying structural flow defects in disordered solids using machine learning methods

We use machine learning methods on local structure to identify flow defects, or regions susceptible to rearrangement, in jammed and glassy systems. We apply this method successfully to two disparate systems: a two-dimensional experimental realization of a granular pillar under compression, and a Lennard-Jones glass in both two and three dimensions above and below its glass transition temperature. We also identify characteristics of flow defects that differentiate them from the rest of the sample. Our results show it is possible to discern subtle structural features responsible for heterogeneous dynamics observed across a broad range of disordered materials.

preprint, 2014, arXiv

Predicting plasticity with soft vibrational modes: from dislocations to glasses

We show that quasi-localized low-frequency modes in the vibrational spectrum can be used to construct soft spots, or regions vulnerable to rearrangement, which serve as a universal tool for the identification of flow defects in solids. We show that soft spots not only encode spatial information, via their location, but also directional information, via directors for particles within each soft spot. Single crystals with isolated dislocations exhibit low-frequency phonon modes that localize at the core, and their polarization pattern predicts the motion of atoms during elementary dislocation glide in exquisite detail. Even in polycrystals and disordered solids, we find that the directors associated with particles in soft spots are highly correlated with the direction of particle displacements in rearrangements.

preprint, 2014, arXiv

Understanding plastic deformation in thermal glasses from single-soft-spot dynamics

By considering the low-frequency vibrational modes of amorphous solids, Manning and Liu [Phys. Rev. Lett. 107, 108302 (2011)] showed that a population of "soft spots" can be identified that are intimately related to plasticity at zero temperature under quasistatic shear. In this work we track individual soft spots with time in a two-dimensional sheared thermal Lennard-Jones glass at temperatures ranging from deep in the glassy regime to above the glass transition temperature. We show that the lifetimes of individual soft spots are correlated with the timescale for structural relaxation. We additionally calculate the number of rearrangements required to destroy soft spots, and show that most soft spots can survive many rearrangements. Finally, we show that soft spots are robust predictors of rearrangements at temperatures well into the supercooled regime. Altogether, these results pave the way for mesoscopic theories of plasticity of amorphous solids based on dynamical behavior of individual soft spots.

preprint, 2013, arXiv

Stability of jammed packings II: the transverse length scale

As a function of packing fraction at zero temperature and applied stress, an amorphous packing of spheres exhibits a jamming transition where the system is sensitive to boundary conditions even in the thermodynamic limit. Upon further compression, the system should become insensitive to boundary conditions provided it is sufficiently large. Here we explore the linear response to a large class of boundary perturbations in 2 and 3 dimensions. We consider each finite packing with periodic-boundary conditions as the basis of an infinite square or cubic lattice and study properties of vibrational modes at arbitrary wave vector. We find that the stability of such modes can be understood in terms of a competition between plane waves and the anomalous vibrational modes associated with the jamming transition; infinitesimal boundary perturbations become irrelevant for systems that are larger than a length scale that characterizes the transverse excitations. This previously identified length diverges at the jamming transition.
