Source author record

Samuel L. Smith

Samuel L. Smith appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Machine Learning Computer Vision quant-ph cond-mat.mtrl-sci cond-mat.other Cryptography and Security

Catalog footprint

What is connected

9works

6topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2022arXiv

Drawing Multiple Augmentation Samples Per Image During Training Efficiently Decreases Test Error

In computer vision, it is standard practice to draw a single sample from the data augmentation procedure for each unique image in the mini-batch. However recent work has suggested drawing multiple samples can achieve higher test accuracies. In this work, we provide a detailed empirical evaluation of how the number of augmentation samples per unique image influences model performance on held out data when training deep ResNets. We demonstrate drawing multiple samples per image consistently enhances the test accuracy achieved for both small and large batch training. Crucially, this benefit arises even if different numbers of augmentations per image perform the same number of parameter updates and gradient evaluations (requiring the same total compute). Although prior work has found variance in the gradient estimate arising from subsampling the dataset has an implicit regularization benefit, our experiments suggest variance which arises from the data augmentation process harms generalization. We apply these insights to the highly performant NFNet-F5, achieving 86.8$\%$ top-1 w/o extra data on ImageNet.

preprint2022arXiv

Unlocking High-Accuracy Differentially Private Image Classification through Scale

Differential Privacy (DP) provides a formal privacy guarantee preventing adversaries with access to a machine learning model from extracting information about individual training points. Differentially Private Stochastic Gradient Descent (DP-SGD), the most popular DP training method for deep learning, realizes this protection by injecting noise during training. However previous works have found that DP-SGD often leads to a significant degradation in performance on standard image classification benchmarks. Furthermore, some authors have postulated that DP-SGD inherently performs poorly on large models, since the norm of the noise required to preserve privacy is proportional to the model dimension. In contrast, we demonstrate that DP-SGD on over-parameterized models can perform significantly better than previously thought. Combining careful hyper-parameter tuning with simple techniques to ensure signal propagation and improve the convergence rate, we obtain a new SOTA without extra data on CIFAR-10 of 81.4% under (8, 10^{-5})-DP using a 40-layer Wide-ResNet, improving over the previous SOTA of 71.7%. When fine-tuning a pre-trained NFNet-F3, we achieve a remarkable 83.8% top-1 accuracy on ImageNet under (0.5, 8*10^{-7})-DP. Additionally, we also achieve 86.7% top-1 accuracy under (8, 8 \cdot 10^{-7})-DP, which is just 4.3% below the current non-private SOTA for this task. We believe our results are a significant step towards closing the accuracy gap between private and non-private image classification.

preprint2021arXiv

Characterizing signal propagation to close the performance gap in unnormalized ResNets

Batch Normalization is a key component in almost all state-of-the-art image classifiers, but it also introduces practical challenges: it breaks the independence between training examples within a batch, can incur compute and memory overhead, and often results in unexpected bugs. Building on recent theoretical analyses of deep ResNets at initialization, we propose a simple set of analysis tools to characterize signal propagation on the forward pass, and leverage these tools to design highly performant ResNets without activation normalization layers. Crucial to our success is an adapted version of the recently proposed Weight Standardization. Our analysis tools show how this technique preserves the signal in networks with ReLU or Swish activation functions by ensuring that the per-channel activation means do not grow with depth. Across a range of FLOP budgets, our networks attain performance competitive with the state-of-the-art EfficientNets on ImageNet.

preprint2021arXiv

High-Performance Large-Scale Image Recognition Without Normalization

Batch normalization is a key component of most image classification models, but it has many undesirable properties stemming from its dependence on the batch size and interactions between examples. Although recent work has succeeded in training deep ResNets without normalization layers, these models do not match the test accuracies of the best batch-normalized networks, and are often unstable for large learning rates or strong data augmentations. In this work, we develop an adaptive gradient clipping technique which overcomes these instabilities, and design a significantly improved class of Normalizer-Free ResNets. Our smaller models match the test accuracy of an EfficientNet-B7 on ImageNet while being up to 8.7x faster to train, and our largest models attain a new state-of-the-art top-1 accuracy of 86.5%. In addition, Normalizer-Free models attain significantly better performance than their batch-normalized counterparts when finetuning on ImageNet after large-scale pre-training on a dataset of 300 million labeled images, with our best models obtaining an accuracy of 89.2%. Our code is available at https://github.com/deepmind/ deepmind-research/tree/master/nfnets

preprint2021arXiv

On the Origin of Implicit Regularization in Stochastic Gradient Descent

For infinitesimal learning rates, stochastic gradient descent (SGD) follows the path of gradient flow on the full batch loss function. However moderately large learning rates can achieve higher test accuracies, and this generalization benefit is not explained by convergence bounds, since the learning rate which maximizes test accuracy is often larger than the learning rate which minimizes training loss. To interpret this phenomenon we prove that for SGD with random shuffling, the mean SGD iterate also stays close to the path of gradient flow if the learning rate is small and finite, but on a modified loss. This modified loss is composed of the original loss function and an implicit regularizer, which penalizes the norms of the minibatch gradients. Under mild assumptions, when the batch size is small the scale of the implicit regularization term is proportional to the ratio of the learning rate to the batch size. We verify empirically that explicitly including the implicit regularizer in the loss can enhance the test accuracy when the learning rate is small.

preprint2020arXiv

Cold Posteriors and Aleatoric Uncertainty

Recent work has observed that one can outperform exact inference in Bayesian neural networks by tuning the "temperature" of the posterior on a validation set (the "cold posterior" effect). To help interpret this phenomenon, we argue that commonly used priors in Bayesian neural networks can significantly overestimate the aleatoric uncertainty in the labels on many classification datasets. This problem is particularly pronounced in academic benchmarks like MNIST or CIFAR, for which the quality of the labels is high. For the special case of Gaussian process regression, any positive temperature corresponds to a valid posterior under a modified prior, and tuning this temperature is directly analogous to empirical Bayes. On classification tasks, there is no direct equivalence between modifying the prior and tuning the temperature, however reducing the temperature can lead to models which better reflect our belief that one gains little information by relabeling existing examples in the training set. Therefore although cold posteriors do not always correspond to an exact inference procedure, we believe they may often better reflect our true prior beliefs.

preprint2020arXiv

On the Generalization Benefit of Noise in Stochastic Gradient Descent

It has long been argued that minibatch stochastic gradient descent can generalize better than large batch gradient descent in deep neural networks. However recent papers have questioned this claim, arguing that this effect is simply a consequence of suboptimal hyperparameter tuning or insufficient compute budgets when the batch size is large. In this paper, we perform carefully designed experiments and rigorous hyperparameter sweeps on a range of popular models, which verify that small or moderately large batch sizes can substantially outperform very large batches on the test set. This occurs even when both models are trained for the same number of iterations and large batches achieve smaller training losses. Our results confirm that the noise in stochastic gradients can enhance generalization. We study how the optimal learning rate schedule changes as the epoch budget grows, and we provide a theoretical account of our observations based on the stochastic differential equation perspective of SGD dynamics.

preprint2015arXiv

Disorder in the spectral function of a qubit ensemble

The impact of disorder in the vibrational bath of an ensemble of open quantum systems is explored, arising either from variation in the overall coupling strength or from uncertainty in the shape of the environment spectral function. Such disorder leads to an additional source of decoherence of subsystem ensembles, due to variation in the reorganization energy. Additionally, vibrational disorder induces a shift in the oscillation frequency of the coherence between ground and excited states. This shift is temperature dependent, and in the high temperature limit $δω\propto T$. This latter finding could be of particular relevance to biological systems exhibiting long lived coherences in highly disordered environments, where temperature-dependent quantum beats have been observed.

preprint2014arXiv

Phonon-Assisted Ultrafast Charge Separation in a Realistic PCBM Aggregate

Organic solar cells must separate strongly bound electron-hole pairs into free charges. This is achieved at interfaces between electron donor and acceptor organic semiconductors. The most popular electron acceptor is the fullerene derivative PCBM. Electron-hole separation has been observed on femtosecond timescales, which is incompatible with conventional Marcus theories of organic transport. In this work we show that ultrafast charge transport in PCBM arises from its broad range of electronic eigenstates, provided by the presence of three closely spaced delocalised bands near the LUMO level. Vibrational fluctuations enable rapid transitions between these bands, which drives an electron transport of $\sim$3 nm within 100 fs. All this is demonstrated within a realistic tight binding Hamiltonian containing transfer integrals no larger than 8 meV.

Samuel L. Smith

What is connected

Connect this record

See the researcher in context

Building this map preview

9 published item(s)

Drawing Multiple Augmentation Samples Per Image During Training Efficiently Decreases Test Error

Unlocking High-Accuracy Differentially Private Image Classification through Scale

Characterizing signal propagation to close the performance gap in unnormalized ResNets

High-Performance Large-Scale Image Recognition Without Normalization

On the Origin of Implicit Regularization in Stochastic Gradient Descent

Cold Posteriors and Aleatoric Uncertainty

On the Generalization Benefit of Noise in Stochastic Gradient Descent

Disorder in the spectral function of a qubit ensemble

Phonon-Assisted Ultrafast Charge Separation in a Realistic PCBM Aggregate