Researcher profile

Tengyuan Liang

Tengyuan Liang contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
17works
0followers
10topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

17 published item(s)

preprint2023arXiv

Universal Prediction Band via Semi-Definite Programming

We propose a computationally efficient method to construct nonparametric, heteroscedastic prediction bands for uncertainty quantification, with or without any user-specified predictive model. Our approach provides an alternative to the now-standard conformal prediction for uncertainty quantification, with novel theoretical insights and computational advantages. The data-adaptive prediction band is universally applicable with minimal distributional assumptions, has strong non-asymptotic coverage properties, and is easy to implement using standard convex programs. Our approach can be viewed as a novel variance interpolation with confidence and further leverages techniques from semi-definite programming and sum-of-squares optimization. Theoretical and numerical performances for the proposed approach for uncertainty quantification are analyzed.

preprint2020arXiv

Mehler's Formula, Branching Process, and Compositional Kernels of Deep Neural Networks

We utilize a connection between compositional kernels and branching processes via Mehler's formula to study deep neural networks. This new probabilistic insight provides us a novel perspective on the mathematical role of activation functions in compositional neural networks. We study the unscaled and rescaled limits of the compositional kernels and explore the different phases of the limiting behavior, as the compositional depth increases. We investigate the memorization capacity of the compositional kernels and neural networks by characterizing the interplay among compositional depth, sample size, dimensionality, and non-linearity of the activation. Explicit formulas on the eigenvalues of the compositional kernel are provided, which quantify the complexity of the corresponding reproducing kernel Hilbert space. On the methodological front, we propose a new random features algorithm, which compresses the compositional layers by devising a new activation function.

preprint2020arXiv

On the Multiple Descent of Minimum-Norm Interpolants and Restricted Lower Isometry of Kernels

We study the risk of minimum-norm interpolants of data in Reproducing Kernel Hilbert Spaces. Our upper bounds on the risk are of a multiple-descent shape for the various scalings of $d = n^α$, $α\in(0,1)$, for the input dimension $d$ and sample size $n$. Empirical evidence supports our finding that minimum-norm interpolants in RKHS can exhibit this unusual non-monotonicity in sample size; furthermore, locations of the peaks in our experiments match our theoretical predictions. Since gradient flow on appropriately initialized wide neural networks converges to a minimum-norm interpolant with respect to a certain kernel, our analysis also yields novel estimation and generalization guarantees for these over-parametrized models. At the heart of our analysis is a study of spectral properties of the random kernel matrix restricted to a filtration of eigen-spaces of the population covariance operator, and may be of independent interest.

preprint2019arXiv

Deep Neural Networks for Estimation and Inference

We study deep neural networks and their use in semiparametric inference. We establish novel rates of convergence for deep feedforward neural nets. Our new rates are sufficiently fast (in some cases minimax optimal) to allow us to establish valid second-step inference after first-step estimation with deep learning, a result also new to the literature. Our estimation rates and semiparametric inference results handle the current standard architecture: fully connected feedforward neural networks (multi-layer perceptrons), with the now-common rectified linear unit activation function and a depth explicitly diverging with the sample size. We discuss other architectures as well, including fixed-width, very deep networks. We establish nonasymptotic bounds for these deep nets for a general class of nonparametric regression-type loss functions, which includes as special cases least squares, logistic regression, and other generalized linear models. We then apply our theory to develop semiparametric inference, focusing on causal parameters for concreteness, such as treatment effects, expected welfare, and decomposition effects. Inference in many other semiparametric contexts can be readily obtained. We demonstrate the effectiveness of deep learning with a Monte Carlo analysis and an empirical application to direct mail marketing.

preprint2019arXiv

Fisher-Rao Metric, Geometry, and Complexity of Neural Networks

We study the relationship between geometry and capacity measures for deep neural networks from an invariance viewpoint. We introduce a new notion of capacity --- the Fisher-Rao norm --- that possesses desirable invariance properties and is motivated by Information Geometry. We discover an analytical characterization of the new capacity measure, through which we establish norm-comparison inequalities and further show that the new measure serves as an umbrella for several existing norm-based complexity measures. We discuss upper bounds on the generalization error induced by the proposed measure. Extensive numerical experiments on CIFAR-10 support our theoretical findings. Our theoretical analysis rests on a key structural lemma about partial derivatives of multi-layer rectifier networks.

preprint2019arXiv

Interaction Matters: A Note on Non-asymptotic Local Convergence of Generative Adversarial Networks

Motivated by the pursuit of a systematic computational and algorithmic understanding of Generative Adversarial Networks (GANs), we present a simple yet unified non-asymptotic local convergence theory for smooth two-player games, which subsumes several discrete-time gradient-based saddle point dynamics. The analysis reveals the surprising nature of the off-diagonal interaction term as both a blessing and a curse. On the one hand, this interaction term explains the origin of the slow-down effect in the convergence of Simultaneous Gradient Ascent (SGA) to stable Nash equilibria. On the other hand, for the unstable equilibria, exponential convergence can be proved thanks to the interaction term, for four modified dynamics proposed to stabilize GAN training: Optimistic Mirror Descent (OMD), Consensus Optimization (CO), Implicit Updates (IU) and Predictive Method (PM). The analysis uncovers the intimate connections among these stabilizing techniques, and provides detailed characterization on the choice of learning rate. As a by-product, we present a new analysis for OMD proposed in Daskalakis, Ilyas, Syrgkanis, and Zeng [2017] with improved rates.

preprint2019arXiv

Just Interpolate: Kernel "Ridgeless" Regression Can Generalize

In the absence of explicit regularization, Kernel "Ridgeless" Regression with nonlinear kernels has the potential to fit the training data perfectly. It has been observed empirically, however, that such interpolated solutions can still generalize well on test data. We isolate a phenomenon of implicit regularization for minimum-norm interpolated solutions which is due to a combination of high dimensionality of the input data, curvature of the kernel function, and favorable geometric properties of the data such as an eigenvalue decay of the empirical covariance and kernel matrices. In addition to deriving a data-dependent upper bound on the out-of-sample error, we present experimental evidence suggesting that the phenomenon occurs in the MNIST dataset.

preprint2019arXiv

Statistical Inference for the Population Landscape via Moment Adjusted Stochastic Gradients

Modern statistical inference tasks often require iterative optimization methods to compute the solution. Convergence analysis from an optimization viewpoint only informs us how well the solution is approximated numerically but overlooks the sampling nature of the data. In contrast, recognizing the randomness in the data, statisticians are keen to provide uncertainty quantification, or confidence, for the solution obtained using iterative optimization methods. This paper makes progress along this direction by introducing the moment-adjusted stochastic gradient descents, a new stochastic optimization method for statistical inference. We establish non-asymptotic theory that characterizes the statistical distribution for certain iterative methods with optimization guarantees. On the statistical front, the theory allows for model mis-specification, with very mild conditions on the data. For optimization, the theory is flexible for both convex and non-convex cases. Remarkably, the moment-adjusting idea motivated from "error standardization" in statistics achieves a similar effect as acceleration in first-order optimization methods used to fit generalized linear models. We also demonstrate this acceleration effect in the non-convex setting through numerical experiments.

preprint2019arXiv

Training Neural Networks as Learning Data-adaptive Kernels: Provable Representation and Approximation Benefits

Consider the problem: given the data pair $(\mathbf{x}, \mathbf{y})$ drawn from a population with $f_*(x) = \mathbf{E}[\mathbf{y} | \mathbf{x} = x]$, specify a neural network model and run gradient flow on the weights over time until reaching any stationarity. How does $f_t$, the function computed by the neural network at time $t$, relate to $f_*$, in terms of approximation and representation? What are the provable benefits of the adaptive representation by neural networks compared to the pre-specified fixed basis representation in the classical nonparametric literature? We answer the above questions via a dynamic reproducing kernel Hilbert space (RKHS) approach indexed by the training process of neural networks. Firstly, we show that when reaching any local stationarity, gradient flow learns an adaptive RKHS representation and performs the global least-squares projection onto the adaptive RKHS, simultaneously. Secondly, we prove that as the RKHS is data-adaptive and task-specific, the residual for $f_*$ lies in a subspace that is potentially much smaller than the orthogonal complement of the RKHS. The result formalizes the representation and approximation benefits of neural networks. Lastly, we show that the neural network function computed by gradient flow converges to the kernel ridgeless regression with an adaptive kernel, in the limit of vanishing regularization. The adaptive kernel viewpoint provides new angles of studying the approximation, representation, generalization, and optimization advantages of neural networks.

preprint2018arXiv

Local Optimality and Generalization Guarantees for the Langevin Algorithm via Empirical Metastability

We study the detailed path-wise behavior of the discrete-time Langevin algorithm for non-convex Empirical Risk Minimization (ERM) through the lens of metastability, adopting some techniques from Berglund and Gentz (2003. For a particular local optimum of the empirical risk, with an arbitrary initialization, we show that, with high probability, at least one of the following two events will occur: (1) the Langevin trajectory ends up somewhere outside the $\varepsilon$-neighborhood of this particular optimum within a short recurrence time; (2) it enters this $\varepsilon$-neighborhood by the recurrence time and stays there until a potentially exponentially long escape time. We call this phenomenon empirical metastability. This two-timescale characterization aligns nicely with the existing literature in the following two senses. First, the effective recurrence time (i.e., number of iterations multiplied by stepsize) is dimension-independent, and resembles the convergence time of continuous-time deterministic Gradient Descent (GD). However unlike GD, the Langevin algorithm does not require strong conditions on local initialization, and has the possibility of eventually visiting all optima. Second, the scaling of the escape time is consistent with the Eyring-Kramers law, which states that the Langevin scheme will eventually visit all local minima, but it will take an exponentially long time to transit among them. We apply this path-wise concentration result in the context of statistical learning to examine local notions of generalization and optimality.

preprint2017arXiv

Adaptive Feature Selection: Computationally Efficient Online Sparse Linear Regression under RIP

Online sparse linear regression is an online problem where an algorithm repeatedly chooses a subset of coordinates to observe in an adversarially chosen feature vector, makes a real-valued prediction, receives the true label, and incurs the squared loss. The goal is to design an online learning algorithm with sublinear regret to the best sparse linear predictor in hindsight. Without any assumptions, this problem is known to be computationally intractable. In this paper, we make the assumption that data matrix satisfies restricted isometry property, and show that this assumption leads to computationally efficient algorithms with sublinear regret for two variants of the problem. In the first variant, the true label is generated according to a sparse linear model with additive Gaussian noise. In the second, the true label is chosen adversarially.

preprint2017arXiv

Weighted Message Passing and Minimum Energy Flow for Heterogeneous Stochastic Block Models with Side Information

We study the misclassification error for community detection in general heterogeneous stochastic block models (SBM) with noisy or partial label information. We establish a connection between the misclassification rate and the notion of minimum energy on the local neighborhood of the SBM. We develop an optimally weighted message passing algorithm to reconstruct labels for SBM based on the minimum energy flow and the eigenvectors of a certain Markov transition matrix. The general SBM considered in this paper allows for unequal-size communities, degree heterogeneity, and different connection probabilities among blocks. We focus on how to optimally weigh the message passing to improve misclassification.

preprint2016arXiv

On Detection and Structural Reconstruction of Small-World Random Networks

In this paper, we study detection and fast reconstruction of the celebrated Watts-Strogatz (WS) small-world random graph model \citep{watts1998collective} which aims to describe real-world complex networks that exhibit both high clustering and short average length properties. The WS model with neighborhood size $k$ and rewiring probability probability $β$ can be viewed as a continuous interpolation between a deterministic ring lattice graph and the Erdős-Rényi random graph. We study both the computational and statistical aspects of detecting the deterministic ring lattice structure (or local geographical links, strong ties) in the presence of random connections (or long range links, weak ties), and for its recovery. The phase diagram in terms of $(k,β)$ is partitioned into several regions according to the difficulty of the problem. We propose distinct methods for the various regions.

preprint2015arXiv

Computational and Statistical Boundaries for Submatrix Localization in a Large Noisy Matrix

The interplay between computational efficiency and statistical accuracy in high-dimensional inference has drawn increasing attention in the literature. In this paper, we study computational and statistical boundaries for submatrix localization. Given one observation of (one or multiple non-overlapping) signal submatrix (of magnitude $λ$ and size $k_m \times k_n$) contaminated with a noise matrix (of size $m \times n$), we establish two transition thresholds for the signal to noise $λ/σ$ ratio in terms of $m$, $n$, $k_m$, and $k_n$. The first threshold, $\sf SNR_c$, corresponds to the computational boundary. Below this threshold, it is shown that no polynomial time algorithm can succeed in identifying the submatrix, under the \textit{hidden clique hypothesis}. We introduce adaptive linear time spectral algorithms that identify the submatrix with high probability when the signal strength is above the threshold $\sf SNR_c$. The second threshold, $\sf SNR_s$, captures the statistical boundary, below which no method can succeed with probability going to one in the minimax sense. The exhaustive search method successfully finds the submatrix above this threshold. The results show an interesting phenomenon that $\sf SNR_c$ is always significantly larger than $\sf SNR_s$, which implies an essential gap between statistical optimality and computational efficiency for submatrix localization.

preprint2015arXiv

Escaping the Local Minima via Simulated Annealing: Optimization of Approximately Convex Functions

We consider the problem of optimizing an approximately convex function over a bounded convex set in $\mathbb{R}^n$ using only function evaluations. The problem is reduced to sampling from an \emph{approximately} log-concave distribution using the Hit-and-Run method, which is shown to have the same $\mathcal{O}^*$ complexity as sampling from log-concave distributions. In addition to extend the analysis for log-concave distributions to approximate log-concave distributions, the implementation of the 1-dimensional sampler of the Hit-and-Run walk requires new methods and analysis. The algorithm then is based on simulated annealing which does not relies on first order conditions which makes it essentially immune to local minima. We then apply the method to different motivating problems. In the context of zeroth order stochastic convex optimization, the proposed method produces an $ε$-minimizer after $\mathcal{O}^*(n^{7.5}ε^{-2})$ noisy function evaluations by inducing a $\mathcal{O}(ε/n)$-approximately log concave distribution. We also consider in detail the case when the "amount of non-convexity" decays towards the optimum of the function. Other applications of the method discussed in this work include private computation of empirical risk minimizers, two-stage stochastic programming, and approximate dynamic programming for online learning.

preprint2015arXiv

Geometric Inference for General High-Dimensional Linear Inverse Problems

This paper presents a unified geometric framework for the statistical analysis of a general ill-posed linear inverse model which includes as special cases noisy compressed sensing, sign vector recovery, trace regression, orthogonal matrix estimation, and noisy matrix completion. We propose computationally feasible convex programs for statistical inference including estimation, confidence intervals and hypothesis testing. A theoretical framework is developed to characterize the local estimation rate of convergence and to provide statistical inference guarantees. Our results are built based on the local conic geometry and duality. The difficulty of statistical inference is captured by the geometric characterization of the local tangent cone through the Gaussian width and Sudakov minoration estimate.

preprint2015arXiv

Learning with Square Loss: Localization through Offset Rademacher Complexity

We consider regression with square loss and general classes of functions without the boundedness assumption. We introduce a notion of offset Rademacher complexity that provides a transparent way to study localization both in expectation and in high probability. For any (possibly non-convex) class, the excess loss of a two-step estimator is shown to be upper bounded by this offset complexity through a novel geometric inequality. In the convex case, the estimator reduces to an empirical risk minimizer. The method recovers the results of \citep{RakSriTsy15} for the bounded case while also providing guarantees without the boundedness assumption.