Source author record

Phan-Minh Nguyen

Phan-Minh Nguyen appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher, unclaimed source record

Machine Learning, math.ST, Statistics Theory, Information Theory, math.IT, cond-mat.dis-nn, cond-mat.stat-mech, math.OC

Catalog footprint

What is connected

5 works

8 topics

4 close collaborators

Preprint, 2021, arXiv

Analysis of feature learning in weight-tied autoencoders via the mean field lens

Autoencoders are among the earliest introduced nonlinear models for unsupervised learning. Although they are widely adopted beyond research, it has been a longstanding open problem to understand mathematically the feature extraction mechanism that trained nonlinear autoencoders provide. In this work, we make progress on this problem by analyzing a class of two-layer weight-tied nonlinear autoencoders in the mean field framework. Upon a suitable scaling, in the regime of a large number of neurons, the models trained with stochastic gradient descent are shown to admit a mean field limiting dynamics. This limiting description reveals an asymptotically precise picture of feature learning by these models: their training dynamics exhibit different phases that correspond to the learning of different principal subspaces of the data, with varying degrees of nonlinear shrinkage dependent on the $\ell_{2}$-regularization and stopping time. While we prove these results under an idealized assumption of (correlated) Gaussian data, experiments on real-life data demonstrate an interesting match with the theory. The autoencoder setup of interest poses a nontrivial mathematical challenge to proving these results. In this setup, the "Lipschitz" constants of the models grow with the data dimension $d$. Consequently, an adaptation of previous analyses requires a number of neurons $N$ that is at least exponential in $d$. Our main technical contribution is a new argument which proves that the required $N$ is only polynomial in $d$. We conjecture that $N\gg d$ is sufficient and that $N$ is necessarily larger than a data-dependent intrinsic dimension, a behavior that is fundamentally different from previously studied setups.
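
To make the setup concrete, here is a minimal sketch of a two-layer weight-tied autoencoder with an $\ell_{2}$ penalty. The abstract does not spell out the model, so the specifics are assumptions made for illustration: the reconstruction is taken to be $\hat{x} = \frac{1}{N}\sum_{i=1}^{N}\theta_i\,\sigma(\langle\theta_i, x\rangle)$, the activation is ReLU, the penalty is $\frac{\lambda}{N}\sum_i\|\theta_i\|^2$, and the names `reconstruct` and `regularized_loss` are hypothetical.

```python
def relu(z):
    """ReLU activation (an assumed choice; the abstract does not fix one)."""
    return z if z > 0.0 else 0.0


def reconstruct(theta, x, act=relu):
    """Weight-tied reconstruction: (1/N) * sum_i theta_i * act(<theta_i, x>).

    theta is a list of N weight vectors (one per neuron), each of length d.
    The same vector theta_i is used for encoding and decoding (weight tying).
    The 1/N factor is the assumed mean field scaling.
    """
    n = len(theta)
    d = len(x)
    out = [0.0] * d
    for row in theta:
        pre = sum(w * xi for w, xi in zip(row, x))  # encoder: <theta_i, x>
        h = act(pre)
        for j in range(d):
            out[j] += row[j] * h  # decoder reuses the same theta_i
    return [v / n for v in out]


def regularized_loss(theta, x, lam, act=relu):
    """Squared reconstruction error plus an assumed l2 penalty (lam/N) * sum ||theta_i||^2."""
    x_hat = reconstruct(theta, x, act)
    fit = 0.5 * sum((xi - xh) ** 2 for xi, xh in zip(x, x_hat))
    penalty = lam * sum(w * w for row in theta for w in row) / len(theta)
    return fit + penalty
```

With `theta = [[1.0, 0.0], [0.0, 1.0]]` and `x = [2.0, 3.0]`, `reconstruct` returns `[1.0, 1.5]`: each neuron contributes its own coordinate, and the average over the two neurons halves it.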

Preprint, 2020, arXiv

A Note on the Global Convergence of Multilayer Neural Networks in the Mean Field Regime

In a recent work, we introduced a rigorous framework to describe the mean field limit of the gradient-based learning dynamics of multilayer neural networks, based on the idea of a neuronal embedding. There we also proved a global convergence guarantee for three-layer (as well as two-layer) networks using this framework. In this companion note, we point out that the insights in our previous work can be readily extended to prove a global convergence guarantee for multilayer networks of any depth. Unlike our previous three-layer global convergence guarantee, which assumes i.i.d. initializations, our present result applies to a type of correlated initialization. This initialization allows a certain universal approximation property to be propagated through the depth of the neural network at any finite training time. To achieve this effect, we introduce a bidirectional diversity condition.

Preprint, 2018, arXiv

A Mean Field View of the Landscape of Two-Layers Neural Networks

Multi-layer neural networks are among the most powerful models in machine learning, yet the fundamental reasons for this success defy mathematical understanding. Learning a neural network requires optimizing a non-convex high-dimensional objective (risk function), a problem which is usually attacked using stochastic gradient descent (SGD). Does SGD converge to a global optimum of the risk or only to a local optimum? In the first case, does this happen because local minima are absent, or because SGD somehow avoids them? In the second, why do local minima reached by SGD have good generalization properties? In this paper we consider a simple case, namely two-layers neural networks, and prove that, in a suitable scaling limit, SGD dynamics is captured by a certain non-linear partial differential equation (PDE) that we call distributional dynamics (DD). We then consider several specific examples, and show how DD can be used to prove convergence of SGD to networks with nearly ideal generalization error. This description allows one to 'average out' some of the complexities of the landscape of neural networks, and can be used to prove a general convergence result for noisy SGD.
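
The finite-$N$ system that such a scaling limit describes can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's exact setting: the network output is taken to be an average over $N$ neurons, each neuron is $a\tanh(wx)$ with scalar input, the loss is squared error, and the target function and all names (`predict`, `sgd_step`, `train`) are hypothetical.

```python
import math
import random


def predict(theta, x):
    """Two-layer network in mean field scaling: the average of N neuron outputs.

    theta is a list of (a, w) pairs; each neuron computes a * tanh(w * x).
    """
    return sum(a * math.tanh(w * x) for a, w in theta) / len(theta)


def sgd_step(theta, x, y, step):
    """One SGD step on squared loss for a single sample (x, y).

    Each neuron moves along the gradient of its own output, weighted by the
    shared prediction error. The 1/N factor of the loss gradient is absorbed
    into the step size, which is the assumed mean field time scaling.
    """
    err = y - predict(theta, x)
    new_theta = []
    for a, w in theta:
        t = math.tanh(w * x)
        grad_a = t                       # derivative of a*tanh(w*x) in a
        grad_w = a * (1.0 - t * t) * x   # derivative of a*tanh(w*x) in w
        new_theta.append((a + 2.0 * step * err * grad_a,
                          w + 2.0 * step * err * grad_w))
    return new_theta


def train(n_neurons=100, n_steps=500, step=0.02, seed=0):
    """Run one-pass SGD on fresh Gaussian samples with a hypothetical target."""
    rng = random.Random(seed)
    theta = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(n_neurons)]
    for _ in range(n_steps):
        x = rng.gauss(0.0, 1.0)
        y = math.tanh(2.0 * x)  # hypothetical target function, for illustration only
        theta = sgd_step(theta, x, y, step)
    return theta
```

The list `theta` plays the role of an empirical distribution over neuron parameters. The abstract's claim is that, as $N$ grows, the evolution of this distribution is captured by a PDE; the sketch only simulates the particle system and does not implement the PDE itself.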

Preprint, 2016, arXiv

Capacity of the Energy Harvesting Channel with a Finite Battery

We consider an energy harvesting channel, in which the transmitter is powered by an exogenous stochastic energy harvesting process $E_t$, such that $0\leq E_t\leq\bar{E}$, which can be stored in a battery of finite size $\bar{B}$. We provide a simple and insightful formula for the approximate capacity of this channel with a bounded guarantee on the approximation gap independent of system parameters. This approximate characterization of the capacity identifies two qualitatively different operating regimes for this channel: in the large battery regime, when $\bar{B}\geq \bar{E}$, the capacity is approximately equal to that of an AWGN channel with an average power constraint equal to the average energy harvesting rate, i.e. it depends only on the mean of $E_t$ and is (almost) independent of the distribution of $E_t$ and the exact value of $\bar{B}$. In particular, this suggests that a battery size $\bar{B}\approx\bar{E}$ is approximately sufficient to extract the infinite battery capacity of the system. In the small battery regime, when $\bar{B}<\bar{E}$, we clarify the dependence of the capacity on the distribution of $E_t$ and the value of $\bar{B}$. There are three steps to proving this result which can be of interest in their own right: 1) we characterize the capacity of this channel as an $n$-letter mutual information rate under various assumptions on the availability of energy arrival information; 2) we characterize the approximately optimal online power control policy that maximizes the long-term average throughput of the system; 3) we show that the information-theoretic capacity of this channel is equal, within a constant gap, to its long-term average throughput. This last result provides a connection between the information- and communication-theoretic formulations of the energy-harvesting communication problem that have so far been studied in isolation.
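
The large battery regime can be illustrated with the standard AWGN capacity formula $C = \frac{1}{2}\log_2(1 + P/\sigma^2)$, with $P$ set to the mean of $E_t$. This is a sketch of the approximation the abstract describes, not the paper's formula: the noise variance `noise_var`, the sample-mean estimate of the harvesting rate, and both function names are assumptions, and the constant approximation gap is not modeled.

```python
import math


def awgn_capacity(avg_power, noise_var=1.0):
    """Capacity of a real AWGN channel in bits per channel use."""
    return 0.5 * math.log2(1.0 + avg_power / noise_var)


def large_battery_capacity_estimate(harvest_samples, noise_var=1.0):
    """Approximate capacity when the battery is at least as large as the peak arrival.

    Only the mean of the harvested energy enters; the shape of the
    distribution is ignored, as the abstract's large battery regime suggests.
    """
    mean_rate = sum(harvest_samples) / len(harvest_samples)
    return awgn_capacity(mean_rate, noise_var)
```

For example, the arrival sequences `[0.0, 6.0]` and `[3.0, 3.0]` have the same mean, so both give the same estimate, `awgn_capacity(3.0)`, which is 1.0 bit per channel use at unit noise variance.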

Preprint, 2015, arXiv

On Capacity Formulation with Stationary Inputs and Application to a Bit-Patterned Media Recording Channel Model

In this correspondence, we illustrate, among other things, the use of the stationarity property of the set of capacity-achieving inputs in capacity calculations. In particular, as a case study, we consider a bit-patterned media recording channel model and formulate new lower and upper bounds on its capacity that yield improvements over existing results. Inspired by the observation that the new bounds are tight at low noise levels, we also characterize the capacity of this model as a series expansion in the low-noise regime. The key to these results is the realization of stationarity in the supremizing input set in the capacity formula. While the property is prevalent in capacity formulations in the ergodic-theoretic literature, we show that this realization is possible in the Shannon-theoretic framework, where a channel is defined as a sequence of finite-dimensional conditional probabilities, by defining a new class of consistent stationary and ergodic channels.