Source author record

Lechao Xiao

Lechao Xiao appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.


Topics: Machine Learning, math.CA, Neural and Evolutionary Computing, Artificial Intelligence, math.OC

Catalog footprint

15 works, 5 topics, 4 close collaborators

preprint 2022 arXiv

Dataset Distillation with Infinitely Wide Convolutional Networks

The effectiveness of machine learning algorithms arises from being able to extract useful features from large amounts of data. As model and dataset sizes increase, dataset distillation methods that compress large datasets into significantly smaller yet highly performant ones will become valuable in terms of training efficiency and useful feature extraction. To that end, we apply a novel distributed kernel-based meta-learning framework to achieve state-of-the-art results for dataset distillation using infinitely wide convolutional neural networks. For instance, using only 10 datapoints (0.02% of the original dataset), we obtain over 65% test accuracy on the CIFAR-10 image classification task, a dramatic improvement over the previous best test accuracy of 40%. Our state-of-the-art results extend across many other settings for MNIST, Fashion-MNIST, CIFAR-10, CIFAR-100, and SVHN. Furthermore, we perform some preliminary analyses of our distilled datasets to shed light on how they differ from naturally occurring data.

preprint 2022 arXiv

Fast Neural Kernel Embeddings for General Activations

The infinite-width limit has shed light on generalization and optimization aspects of deep learning by establishing connections between neural networks and kernel methods. Despite their importance, the utility of these kernel methods was limited in large-scale learning settings due to their (super-)quadratic runtime and memory complexities. Moreover, most prior works on neural kernels have focused on the ReLU activation, mainly due to its popularity but also due to the difficulty of computing such kernels for general activations. In this work, we overcome such difficulties by providing methods to work with general activations. First, we compile and expand the list of activation functions admitting exact dual activation expressions to compute neural kernels. When the exact computation is unknown, we present methods to effectively approximate them. We propose a fast sketching method that approximates any multi-layered Neural Network Gaussian Process (NNGP) kernel and Neural Tangent Kernel (NTK) matrices for a wide range of activation functions, going beyond the commonly analyzed ReLU activation. This is done by showing how to approximate the neural kernels using the truncated Hermite expansion of any desired activation functions. While most prior works require data points on the unit sphere, our methods do not suffer from such limitations and are applicable to any dataset of points in $\mathbb{R}^d$. Furthermore, we provide a subspace embedding for NNGP and NTK matrices with near input-sparsity runtime and near-optimal target dimension which applies to any homogeneous dual activation functions with rapidly convergent Taylor expansion. Empirically, with respect to exact convolutional NTK (CNTK) computation, our method achieves $106\times$ speedup for approximate CNTK of a 5-layer Myrtle network on the CIFAR-10 dataset.
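The truncated Hermite expansion mentioned in this abstract can be illustrated with a small sketch. It is not the paper's sketching algorithm. It only relies on the standard identity that, for unit-variance Gaussians $u, v$ with correlation $\rho$, the dual activation is $\mathbb{E}[\sigma(u)\sigma(v)] = \sum_k a_k^2 \rho^k$, where $a_k$ are the normalized Hermite coefficients of $\sigma$. The function names, the trapezoid-rule quadrature, and the truncation degree are assumptions made for illustration; ReLU is used because its closed form is known and gives a reference value.

```python
import math


def hermite_coefficients(activation, max_degree, half_width=10.0, steps=20001):
    """Approximate a_k = E[activation(z) * h_k(z)] for z ~ N(0, 1), k = 0..max_degree.

    h_k = He_k / sqrt(k!) are the orthonormal probabilists' Hermite polynomials.
    The expectation is computed with the trapezoid rule on [-half_width, half_width].
    """
    dz = 2.0 * half_width / (steps - 1)
    coeffs = [0.0] * (max_degree + 1)
    for i in range(steps):
        z = -half_width + i * dz
        weight = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi) * dz
        if i == 0 or i == steps - 1:
            weight *= 0.5  # trapezoid end points count half
        fz = activation(z)
        # Recurrence: He_0 = 1, He_1 = z, He_{k+1} = z * He_k - k * He_{k-1}.
        he = [1.0, z]
        for k in range(1, max_degree):
            he.append(z * he[k] - k * he[k - 1])
        for k in range(max_degree + 1):
            coeffs[k] += weight * fz * he[k] / math.sqrt(math.factorial(k))
    return coeffs


def truncated_dual_activation(coeffs, rho):
    """Truncated series sum_k a_k^2 * rho^k for the dual activation."""
    return sum(a * a * rho ** k for k, a in enumerate(coeffs))


def relu(z):
    return max(z, 0.0)


def relu_dual_exact(rho):
    """Closed-form E[relu(u) * relu(v)] for unit-variance u, v with correlation rho."""
    return (math.sqrt(1.0 - rho * rho) + (math.pi - math.acos(rho)) * rho) / (2.0 * math.pi)


if __name__ == "__main__":
    coeffs = hermite_coefficients(relu, max_degree=10)
    print(truncated_dual_activation(coeffs, 0.5), relu_dual_exact(0.5))
```

For smooth activations without a known closed form, the same two functions apply unchanged; only the reference value is missing.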

preprint 2022 arXiv

Synergy and Symmetry in Deep Learning: Interactions between the Data, Model, and Inference Algorithm

Although learning in high dimensions is commonly believed to suffer from the curse of dimensionality, modern machine learning methods often exhibit an astonishing power to tackle a wide range of challenging real-world learning problems without using abundant amounts of data. How exactly these methods break this curse remains a fundamental open question in the theory of deep learning. While previous efforts have investigated this question by studying the data (D), model (M), and inference algorithm (I) as independent modules, in this paper, we analyze the triplet (D, M, I) as an integrated system and identify important synergies that help mitigate the curse of dimensionality. We first study the basic symmetries associated with various learning algorithms (M, I), focusing on four prototypical architectures in deep learning: fully-connected networks (FCN), locally-connected networks (LCN), and convolutional networks with and without pooling (GAP/VEC). We find that learning is most efficient when these symmetries are compatible with those of the data distribution and that performance significantly deteriorates when any member of the (D, M, I) triplet is inconsistent or suboptimal.

preprint 2020 arXiv

Bayesian Deep Convolutional Networks with Many Channels are Gaussian Processes

There is a previously identified equivalence between wide fully connected neural networks (FCNs) and Gaussian processes (GPs). This equivalence enables, for instance, test set predictions that would have resulted from a fully Bayesian, infinitely wide trained FCN to be computed without ever instantiating the FCN, but by instead evaluating the corresponding GP. In this work, we derive an analogous equivalence for multi-layer convolutional neural networks (CNNs) both with and without pooling layers, and achieve state-of-the-art results on CIFAR10 for GPs without trainable kernels. We also introduce a Monte Carlo method to estimate the GP corresponding to a given neural network architecture, even in cases where the analytic form has too many terms to be computationally feasible. Surprisingly, in the absence of pooling layers, the GPs corresponding to CNNs with and without weight sharing are identical. As a consequence, translation equivariance, beneficial in finite channel CNNs trained with stochastic gradient descent (SGD), is guaranteed to play no role in the Bayesian treatment of the infinite channel limit - a qualitative difference between the two regimes that is not present in the FCN case. We confirm experimentally that, while in some scenarios the performance of SGD-trained finite CNNs approaches that of the corresponding GPs as the channel count increases, with careful tuning SGD-trained CNNs can significantly outperform their corresponding GPs, suggesting advantages from SGD training compared to fully Bayesian parameter estimation.
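The idea of estimating a network's GP kernel by Monte Carlo can be sketched in a much simpler setting than the one treated in this abstract. The example below is an assumption-laden toy: a single fully connected ReLU layer rather than a CNN, with weights drawn from a standard Gaussian. It averages the product of hidden activations over random weight draws and compares the estimate with the known closed form for this one-layer case (the degree-1 arc-cosine kernel). All function names are hypothetical.

```python
import math
import random


def relu(z):
    return max(z, 0.0)


def monte_carlo_kernel(x1, x2, num_samples=20000, seed=0):
    """Estimate E_w[relu(w . x1) * relu(w . x2)] for w ~ N(0, I) by sampling weights."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        w = [rng.gauss(0.0, 1.0) for _ in x1]
        pre1 = sum(wi * xi for wi, xi in zip(w, x1))
        pre2 = sum(wi * xi for wi, xi in zip(w, x2))
        total += relu(pre1) * relu(pre2)
    return total / num_samples


def analytic_kernel(x1, x2):
    """Closed form of the same expectation (degree-1 arc-cosine kernel)."""
    n1 = math.sqrt(sum(v * v for v in x1))
    n2 = math.sqrt(sum(v * v for v in x2))
    dot = sum(a * b for a, b in zip(x1, x2))
    cos_theta = max(-1.0, min(1.0, dot / (n1 * n2)))  # clamp against rounding
    theta = math.acos(cos_theta)
    return n1 * n2 * (math.sin(theta) + (math.pi - theta) * cos_theta) / (2.0 * math.pi)


if __name__ == "__main__":
    a, b = [1.0, 0.0, 0.0], [0.6, 0.8, 0.0]
    print(monte_carlo_kernel(a, b), analytic_kernel(a, b))
```

The sampling estimate needs no closed form, which is the property that makes a Monte Carlo approach attractive when the analytic expression is unwieldy.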

preprint 2020 arXiv

Disentangling Trainability and Generalization in Deep Neural Networks

A longstanding goal in the theory of deep learning is to characterize the conditions under which a given neural network architecture will be trainable, and if so, how well it might generalize to unseen data. In this work, we provide such a characterization in the limit of very wide and very deep networks, for which the analysis simplifies considerably. For wide networks, the trajectory under gradient descent is governed by the Neural Tangent Kernel (NTK), and for deep networks the NTK itself maintains only weak data dependence. By analyzing the spectrum of the NTK, we formulate necessary conditions for trainability and generalization across a range of architectures, including Fully Connected Networks (FCNs) and Convolutional Neural Networks (CNNs). We identify large regions of hyperparameter space for which networks can memorize the training set but completely fail to generalize. We find that CNNs without global average pooling behave almost identically to FCNs, but that CNNs with pooling have markedly different and often better generalization performance. These theoretical results are corroborated experimentally on CIFAR10 for a variety of network architectures and we include a colab notebook that reproduces the essential results of the paper.

preprint 2020 arXiv

Finite Versus Infinite Neural Networks: an Empirical Study

We perform a careful, thorough, and large scale empirical study of the correspondence between wide neural networks and kernel methods. By doing so, we resolve a variety of open questions related to the study of infinitely wide neural networks. Our experimental results include: kernel methods outperform fully-connected finite-width networks, but underperform convolutional finite width networks; neural network Gaussian process (NNGP) kernels frequently outperform neural tangent (NT) kernels; centered and ensembled finite networks have reduced posterior variance and behave more similarly to infinite networks; weight decay and the use of a large learning rate break the correspondence between finite and infinite networks; the NTK parameterization outperforms the standard parameterization for finite width networks; diagonal regularization of kernels acts similarly to early stopping; floating point precision limits kernel performance beyond a critical dataset size; regularized ZCA whitening improves accuracy; finite network performance depends non-monotonically on width in ways not captured by double descent phenomena; equivariance of CNNs is only beneficial for narrow networks far from the kernel regime. Our experiments additionally motivate an improved layer-wise scaling for weight decay which improves generalization in finite-width networks. Finally, we develop improved best practices for using NNGP and NT kernels for prediction, including a novel ensembling technique. Using these best practices we achieve state-of-the-art results on CIFAR-10 classification for kernels corresponding to each architecture class we consider.

preprint 2020 arXiv

Provable Benefit of Orthogonal Initialization in Optimizing Deep Linear Networks

The selection of initial parameter values for gradient-based optimization of deep neural networks is one of the most impactful hyperparameter choices in deep learning systems, affecting both convergence times and model performance. Yet despite significant empirical and theoretical analysis, relatively little has been proved about the concrete effects of different initialization schemes. In this work, we analyze the effect of initialization in deep linear networks, and provide for the first time a rigorous proof that drawing the initial weights from the orthogonal group speeds up convergence relative to the standard Gaussian initialization with iid weights. We show that for deep networks, the width needed for efficient convergence to a global minimum with orthogonal initializations is independent of the depth, whereas the width needed for efficient convergence with Gaussian initializations scales linearly in the depth. Our results demonstrate how the benefits of a good initialization can persist throughout learning, suggesting an explanation for the recent empirical successes found by initializing very deep non-linear networks according to the principle of dynamical isometry.
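A small sketch can make the contrast between the two initializations concrete. It does not reproduce the convergence result described in this abstract. It only shows a related, elementary fact: in a deep linear network, a product of orthogonal weight matrices preserves the norm of the input exactly, while i.i.d. Gaussian weights with variance $1/n$ preserve the squared norm only in expectation. The Gram-Schmidt construction, the function names, and the width and depth values are assumptions chosen for illustration.

```python
import math
import random


def random_orthogonal(n, rng):
    """Build an n x n orthogonal matrix by Gram-Schmidt on Gaussian vectors (rows)."""
    rows = []
    while len(rows) < n:
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        for q in rows:
            proj = sum(a * b for a, b in zip(q, v))
            v = [a - proj * b for a, b in zip(v, q)]
        norm = math.sqrt(sum(a * a for a in v))
        rows.append([a / norm for a in v])
    return rows


def random_gaussian(n, rng):
    """Matrix with i.i.d. N(0, 1/n) entries; preserves squared norm only on average."""
    scale = 1.0 / math.sqrt(n)
    return [[rng.gauss(0.0, scale) for _ in range(n)] for _ in range(n)]


def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def norm_after_depth(make_layer, width, depth, seed):
    """Push a unit vector through `depth` random linear layers and return its norm."""
    rng = random.Random(seed)
    v = [1.0] + [0.0] * (width - 1)
    for _ in range(depth):
        v = matvec(make_layer(width, rng), v)
    return math.sqrt(sum(a * a for a in v))


if __name__ == "__main__":
    print("orthogonal:", norm_after_depth(random_orthogonal, 8, 20, seed=0))
    print("gaussian:  ", norm_after_depth(random_gaussian, 8, 20, seed=0))
```

With orthogonal layers the printed norm stays at 1 up to rounding; with Gaussian layers it varies from seed to seed.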

preprint 2020 arXiv

The Surprising Simplicity of the Early-Time Learning Dynamics of Neural Networks

Modern neural networks are often regarded as complex black-box functions whose behavior is difficult to understand owing to their nonlinear dependence on the data and the nonconvexity in their loss landscapes. In this work, we show that these common perceptions can be completely false in the early phase of learning. In particular, we formally prove that, for a class of well-behaved input distributions, the early-time learning dynamics of a two-layer fully-connected neural network can be mimicked by training a simple linear model on the inputs. We additionally argue that this surprising simplicity can persist in networks with more layers and with convolutional architecture, which we verify empirically. Key to our analysis is to bound the spectral norm of the difference between the Neural Tangent Kernel (NTK) at initialization and an affine transform of the data kernel; however, unlike many previous results utilizing the NTK, we do not require the network to have disproportionately large width, and the network is allowed to escape the kernel regime later in training.

preprint 2019 arXiv

Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent

A longstanding goal in deep learning research has been to precisely characterize training and generalization. However, the often complex loss landscapes of neural networks have made a theory of learning dynamics elusive. In this work, we show that for wide neural networks the learning dynamics simplify considerably and that, in the infinite width limit, they are governed by a linear model obtained from the first-order Taylor expansion of the network around its initial parameters. Furthermore, mirroring the correspondence between wide Bayesian neural networks and Gaussian processes, gradient-based training of wide neural networks with a squared loss produces test set predictions drawn from a Gaussian process with a particular compositional kernel. While these theoretical results are only exact in the infinite width limit, we nevertheless find excellent empirical agreement between the predictions of the original network and those of the linearized version even for finite practically-sized networks. This agreement is robust across different architectures, optimization methods, and loss functions.
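The linear model referred to in this abstract is the first-order Taylor expansion $f_{\mathrm{lin}}(\theta) = f(\theta_0) + \nabla_\theta f(\theta_0)^\top (\theta - \theta_0)$. The sketch below builds that object for a tiny, narrow two-layer tanh network with scalar input. It says nothing about the infinite-width regime the abstract is concerned with; it only shows what the linearized model is and that it agrees with the network near $\theta_0$. The architecture, parameter layout, and function names are assumptions made for illustration.

```python
import math


def network(params, x):
    """f(x) = sum_j v_j * tanh(w_j * x); params = [w_1..w_h, v_1..v_h]."""
    h = len(params) // 2
    w, v = params[:h], params[h:]
    return sum(vj * math.tanh(wj * x) for wj, vj in zip(w, v))


def gradient(params, x):
    """Analytic gradient of the network output with respect to all parameters."""
    h = len(params) // 2
    w, v = params[:h], params[h:]
    dw = [vj * (1.0 - math.tanh(wj * x) ** 2) * x for wj, vj in zip(w, v)]
    dv = [math.tanh(wj * x) for wj in w]
    return dw + dv


def linearized(params, params0, x):
    """First-order Taylor expansion of the network around params0."""
    grad0 = gradient(params0, x)
    shift = sum(g * (p - p0) for g, p, p0 in zip(grad0, params, params0))
    return network(params0, x) + shift


if __name__ == "__main__":
    theta0 = [0.5, -0.3, 0.8, -1.0]
    theta = [p + 1e-3 for p in theta0]
    print(network(theta, 0.7), linearized(theta, theta0, 0.7))
```

The gap between the two printed values is second order in the parameter change, which is why it is small for parameters close to the initialization.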

preprint 2016 arXiv

Endpoint estimates for one-dimensional oscillatory integral operator

The one-dimensional oscillatory integral operator associated to a real analytic phase $S$ is given by $$ T_\lambda f(x) = \int_{-\infty}^\infty e^{i\lambda S(x,y)} \chi(x,y) f(y)\, dy. $$ In this paper, we obtain a complete characterization of the mapping properties of $T_\lambda$ on $L^p(\mathbb R)$ spaces: we prove that $\|T_\lambda f\|_p \lesssim |\lambda|^{-\alpha}\|f\|_p$ for some $\alpha>0$ if and only if the point $(\frac{1}{\alpha p}, \frac{1}{\alpha p'})$ lies in the reduced Newton polygon of $S$, and this estimate is sharp if and only if it lies on the reduced Newton diagram.

preprint 2016 arXiv

Higher decay inequalities for multilinear oscillatory integrals

In this paper we establish sharp estimates (up to logarithmic losses) for the multilinear oscillatory integral operator studied by Phong, Stein, and Sturm and by Carbery and Wright on any product $\prod_{j=1}^d L^{p_j}(\mathbb R)$ with each $p_j \geq 2$, expanding the known results for this operator well outside the previous range $\sum_{j=1}^d p_j^{-1} = d-1$. Our theorem assumes a second-order nondegeneracy condition of Varchenko type, and as a corollary reproduces Varchenko's theorem and implies Fourier decay estimates for measures of smooth density on degenerate hypersurfaces in $\mathbb R^d$.

preprint 2016 arXiv

Sharp estimates for trilinear oscillatory integrals and an algorithm of two-dimensional resolution of singularities

We obtain sharp estimates for certain trilinear oscillatory integrals. In particular, we extend Phong and Stein's seminal result to a trilinear setting. This result partially answers a question raised by Christ, Li, Tao and Thiele concerning the sharp estimates for certain multilinear oscillatory integrals. The method in this paper relies on a self-contained algorithm of resolution of singularities in $\mathbb R^2$, which may be of independent interest.

preprint 2015 arXiv

Maximal Decay Inequalities for Trilinear oscillatory integrals of convolution type

In this paper we prove sharp $L^\infty$-$L^\infty$-$L^\infty$ decay for certain trilinear oscillatory integral forms of convolution type on $\mathbb R^2$. These estimates imply earlier $L^2$-$L^2$-$L^2$ results obtained by the second author as well as corresponding sharp, stable sublevel set estimates of the form studied by Christ and Christ, Li, Tao, and Thiele. New connections to the multilinear results of Phong, Stein, and Sturm are also considered.

preprint 2014 arXiv

Bilinear Hilbert transforms associated to plane curves

We prove that the bilinear Hilbert transforms and maximal functions along certain general plane curves are bounded from $L^2(\mathbb{R})\times L^2(\mathbb{R})$ to $L^1(\mathbb{R})$.

preprint 2013 arXiv

Uniform estimates for bilinear Hilbert transform and bilinear maximal functions associated to polynomials

We study the bilinear Hilbert transform and bilinear maximal functions associated to polynomial curves and obtain uniform $L^r$ estimates for $r>\frac{d-1}{d}$; this index is sharp up to the endpoint.
