Source author record

Liu Ziyin

Liu Ziyin appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher (unclaimed source record)

Topics: Machine Learning; cond-mat.stat-mech; cond-mat.dis-nn; Distributed, Parallel, and Cluster Computing; Information Theory; math.IT; physics.app-ph; physics.soc-ph; q-fin.TR

Catalog footprint

What is connected

9 works

9 topics

4 close collaborators


Preprint, 2022, arXiv

Exact Phase Transitions in Deep Learning

This work reports first-order and second-order phase transitions that are unique to deep learning and whose phenomenology closely follows that of statistical physics. In particular, we prove that the competition between prediction error and model complexity in the training loss leads to a second-order phase transition for nets with one hidden layer and a first-order phase transition for nets with more than one hidden layer. The proposed theory is directly relevant to the optimization of neural networks and points to an origin of the posterior collapse problem in Bayesian deep learning.
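The competition between prediction error and model complexity can be illustrated with a toy that is not taken from the paper: a scalar linear network whose `depth` weights all share one value `s`, trained on a target of strength `a` with an L2 penalty `gamma` per weight. The loss form, the names `minimizer`, `a`, and `gamma`, and the grid search are all assumptions made for this sketch.

```python
def minimizer(depth, gamma, a=1.0, steps=2000, s_max=2.0):
    """Grid-search the s >= 0 that minimizes the toy regularized loss.

    Assumed toy loss (not the paper's model):
        s^(2*depth) - 2*a*s^depth + depth*gamma*s^2
    depth=2 corresponds to one hidden layer, depth=3 to two hidden layers.
    """
    def loss(s):
        return s ** (2 * depth) - 2.0 * a * s ** depth + depth * gamma * s ** 2

    grid = [s_max * i / steps for i in range(steps + 1)]
    return min(grid, key=loss)


if __name__ == "__main__":
    for gamma in (0.30, 0.39, 0.41, 0.90, 0.99, 1.10):
        print(gamma, minimizer(2, gamma), minimizer(3, gamma))
```

In this toy, the depth-2 minimizer shrinks continuously to zero as `gamma` approaches `a`, while the depth-3 minimizer drops to zero abruptly from a finite value, which mirrors the second-order versus first-order distinction described in the abstract.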

Preprint, 2022, arXiv

Power Laws and Symmetries in a Minimal Model of Financial Market Economy

A financial market is a system resulting from the complex interaction between participants in a closed economy. We propose a minimal microscopic model of the financial market economy based on the real economy's symmetry constraint and minimality requirement. We solve the proposed model analytically in the mean-field regime, which shows that various kinds of universal power-law-like behaviors in the financial market may depend on one another, just like the critical exponents in physics. We then discuss the parameters in the proposed model, and we show that each parameter in our model can be related to measurable quantities in the real market, which enables us to discuss the cause of a few kinds of social and economic phenomena.

Preprint, 2022, arXiv

Power-law escape rate of SGD

Stochastic gradient descent (SGD) undergoes complicated multiplicative noise for the mean-square loss. We use this property of SGD noise to derive a stochastic differential equation (SDE) with simpler additive noise by performing a random time change. Using this formalism, we show that the log loss barrier $\Delta\log L=\log[L(\theta^s)/L(\theta^*)]$ between a local minimum $\theta^*$ and a saddle $\theta^s$ determines the escape rate of SGD from the local minimum, contrary to the previous results borrowing from physics that the linear loss barrier $\Delta L=L(\theta^s)-L(\theta^*)$ decides the escape rate. Our escape-rate formula strongly depends on the typical magnitude $h^*$ and the number $n$ of the outlier eigenvalues of the Hessian. This result explains an empirical fact that SGD prefers flat minima with low effective dimensions, giving an insight into implicit biases of SGD.
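The difference between the two barrier definitions can be made concrete with a small calculation. The sketch below only evaluates the two quantities defined in the abstract; it does not implement the escape-rate formula, which the abstract does not state. The loss values are made-up numbers chosen for illustration.

```python
import math


def barriers(loss_min, loss_saddle):
    """Return (linear barrier, log barrier) between a minimum and a saddle."""
    linear = loss_saddle - loss_min
    log = math.log(loss_saddle / loss_min)
    return linear, log


# Two hypothetical minima with (up to rounding) the same linear barrier of 0.1.
low_loss_minimum = barriers(0.01, 0.11)
high_loss_minimum = barriers(1.0, 1.1)

if __name__ == "__main__":
    print("low-loss minimum :", low_loss_minimum)
    print("high-loss minimum:", high_loss_minimum)
```

Both minima sit behind the same linear barrier, yet the log barrier of the low-loss minimum is far larger, so a criterion based on $\Delta\log L$ ranks them very differently from one based on $\Delta L$.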

Preprint, 2022, arXiv

Stochastic Neural Networks with Infinite Width are Deterministic

This work theoretically studies stochastic neural networks, one of the main types of neural networks in use. We prove that as the width of an optimized stochastic neural network tends to infinity, its predictive variance on the training set decreases to zero. Our theory justifies the common intuition that adding stochasticity to the model can help regularize the model by introducing an averaging effect. Two common examples to which our theory can be relevant are neural networks with dropout and Bayesian latent variable models in a special limit. Our result thus helps us better understand how stochasticity affects the learning of neural networks and may help design better architectures for practical problems.
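The averaging effect mentioned above can be sketched in the simplest possible setting, which is an assumption of this example and much weaker than the paper's theorem: a layer of `width` units with identical fixed activations, independent dropout masks, and an output that averages the rescaled units. The function name and parameters are hypothetical.

```python
def dropout_mean_variance(width, keep_prob=0.5, activation=1.0):
    """Exact variance of (1/width) * sum_i (m_i / keep_prob) * activation,
    where the masks m_i are independent Bernoulli(keep_prob) variables.

    Each rescaled unit has variance activation^2 * (1 - keep_prob) / keep_prob,
    and averaging `width` independent units divides that by `width`.
    """
    per_unit = activation ** 2 * (1.0 - keep_prob) / keep_prob
    return per_unit / width


if __name__ == "__main__":
    for width in (10, 100, 1000, 10000):
        print(width, dropout_mean_variance(width))
```

Under these assumptions the output variance falls as 1/width. The paper's result concerns optimized networks on the training set, which this toy does not model.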

Preprint, 2022, arXiv

Strength of Minibatch Noise in SGD

The noise in stochastic gradient descent (SGD), caused by minibatch sampling, is poorly understood despite its practical importance in deep learning. This work presents the first systematic study of the SGD noise and fluctuations close to a local minimum. We first analyze the SGD noise in linear regression in detail and then derive a general formula for approximating SGD noise in different types of minima. As applications, our results (1) provide insight into the stability of training a neural network, (2) suggest that a large learning rate can help generalization by introducing an implicit regularization, (3) explain why the linear scaling law between learning rate and batch size fails at a large learning rate or a small batch size, and (4) provide an understanding of how the discrete-time nature of SGD affects the recently discovered power-law phenomenon of SGD.
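A minimal sketch of minibatch noise in one-dimensional linear regression follows. It assumes minibatches are drawn with replacement, in which case the variance of the minibatch gradient equals the per-sample gradient variance divided by the batch size. The synthetic data, the parameter value `w`, and all function names are assumptions of this example, not quantities from the paper.

```python
import random
import statistics


def per_sample_grads(w, xs, ys):
    """Gradient of 0.5 * (w*x - y)^2 with respect to w, one value per sample."""
    return [(w * x - y) * x for x, y in zip(xs, ys)]


def minibatch_grad_variance(grads, batch_size, trials, rng):
    """Empirical variance of the minibatch-mean gradient (sampling with replacement)."""
    means = [statistics.fmean(rng.choices(grads, k=batch_size)) for _ in range(trials)]
    return statistics.pvariance(means)


def make_data(n, rng):
    """Synthetic data y = 2x + noise, used only for illustration."""
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [2.0 * x + rng.gauss(0.0, 0.5) for x in xs]
    return xs, ys


if __name__ == "__main__":
    rng = random.Random(0)
    xs, ys = make_data(200, rng)
    grads = per_sample_grads(1.5, xs, ys)
    for batch_size in (1, 4, 16):
        predicted = statistics.pvariance(grads) / batch_size
        measured = minibatch_grad_variance(grads, batch_size, 5000, rng)
        print(batch_size, predicted, measured)
```

The measured variance should track the per-sample variance divided by the batch size, which is the baseline behavior that the paper's analysis refines near different types of minima.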

Preprint, 2022, arXiv

Universal Thermodynamic Uncertainty Relation in Non-Equilibrium Dynamics

We derive a universal thermodynamic uncertainty relation (TUR) that applies to an arbitrary observable in a general Markovian system. The generality of our result allows us to make two findings: (1) for an arbitrary out-of-equilibrium system, both the entropy production and the degree of non-stationarity are required to tightly bound the strength of a thermodynamic current; (2) by removing the antisymmetric constraint on observables, the TUR in physics and a fundamental inequality in theoretical finance can be unified in a single framework.

Preprint, 2020, arXiv

Learning Not to Learn in the Presence of Noisy Labels

Learning in the presence of label noise is a challenging yet important task: it is crucial to design models that are robust in the presence of mislabeled datasets. In this paper, we discover that a new class of loss functions called the gambler's loss provides strong robustness to label noise across various levels of corruption. We show that training with this loss function encourages the model to "abstain" from learning on the data points with noisy labels, resulting in a simple and effective method to improve robustness and generalization. In addition, we propose two practical extensions of the method: 1) an analytical early stopping criterion to approximately stop training before the memorization of noisy labels, and 2) a heuristic for setting hyperparameters that does not require knowledge of the noise corruption rate. We demonstrate the effectiveness of our method by achieving strong results across three image and text classification tasks as compared to existing baselines.
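The abstract does not state the loss itself. The sketch below assumes one commonly cited form of the gambler's loss, in which the model outputs probabilities for the m classes plus one extra abstention output, and a payoff parameter `payoff` (with 1 < payoff <= m) sets how attractive abstaining is. The function name, argument layout, and example numbers are assumptions for illustration.

```python
import math


def gamblers_loss(probs, label, payoff):
    """Assumed gambler's-loss form: -log(p_label + p_abstain / payoff).

    probs  : m class probabilities followed by one abstention probability,
             together summing to 1
    label  : index of the (possibly noisy) target class
    payoff : hypothetical payoff hyperparameter, assumed to satisfy 1 < payoff <= m
    """
    abstain = probs[-1]
    return -math.log(probs[label] + abstain / payoff)


if __name__ == "__main__":
    # The given label is 0; compare abstaining with confidently predicting class 1.
    print("abstain     :", gamblers_loss([0.05, 0.05, 0.90], 0, 2.0))
    print("predict cls1:", gamblers_loss([0.05, 0.90, 0.05], 0, 2.0))
```

With this form, placing mass on the abstention output caps the loss on an example whose label the model disagrees with, which is one way to read the "abstain" behavior described in the abstract.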

Preprint, 2020, arXiv

Think Locally, Act Globally: Federated Learning with Local and Global Representations

Federated learning is a method of training models on private data distributed over multiple devices. To keep device data private, the global model is trained by communicating only parameters and updates, which poses scalability challenges for large models. To this end, we propose a new federated learning algorithm that jointly learns compact local representations on each device and a global model across all devices. As a result, the global model can be smaller since it only operates on local representations, reducing the number of communicated parameters. Theoretically, we provide a generalization analysis which shows that a combination of local and global models reduces both variance in the data and variance across device distributions. Empirically, we demonstrate that local models enable communication-efficient training while retaining performance. We also evaluate on the task of personalized mood prediction from real-world mobile data where privacy is key. Finally, local models handle heterogeneous data from new devices, and learn fair representations that obfuscate protected attributes such as race, age, and gender.
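The split between private local parameters and a shared global model can be sketched as follows. The aggregation rule used here, a plain average of the global parameters across devices, is an assumption of this example; the abstract does not specify how the global model is aggregated. The dictionary layout and function names are hypothetical.

```python
def average_global(devices):
    """Average only the 'global' parameters across devices.

    devices: list of dicts, each with a 'local' and a 'global' list of floats.
    Local parameters stay on their device and are returned unchanged.
    """
    n = len(devices)
    dim = len(devices[0]["global"])
    avg = [sum(d["global"][j] for d in devices) / n for j in range(dim)]
    return [{"local": list(d["local"]), "global": list(avg)} for d in devices]


def communicated_per_round(devices):
    """Number of parameters uploaded per round when only global parts are sent."""
    return sum(len(d["global"]) for d in devices)


if __name__ == "__main__":
    devices = [
        {"local": [0.1, 0.2, 0.3], "global": [1.0, 2.0]},
        {"local": [0.9, 0.8, 0.7], "global": [3.0, 4.0]},
    ]
    print(average_global(devices))
    print(communicated_per_round(devices))
```

Because only the global part is exchanged, the communication cost in this sketch depends on the size of the global model and not on the size of the local representations.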

Preprint, 2020, arXiv

Volumization as a Natural Generalization of Weight Decay

We propose a novel regularization method, called volumization, for neural networks. Inspired by physics, we define a physical volume for the weight parameters in neural networks, and we show that this method is an effective way of regularizing neural networks. Intuitively, this method interpolates between $L_2$ and $L_\infty$ regularization. Therefore, weight decay and weight clipping become special cases of the proposed algorithm. We prove, on a toy example, that the essence of this method is a regularization technique that controls the bias-variance tradeoff. The method is shown to do well in the categories where the standard weight decay method is shown to work well, including improving the generalization of networks and preventing memorization. Moreover, we show that volumization might lead to a simple method for training a neural network whose weights are binary or ternary.
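The abstract does not give the update rule. The sketch below assumes one rule that has both stated special cases, a convex mix of each weight with its clipped value, and is offered only as an illustration; the paper's actual definition may differ. The names `volumize`, `alpha`, and `volume` are hypothetical.

```python
def clip(x, bound):
    """Clamp x to the interval [-bound, bound]."""
    return max(-bound, min(bound, x))


def volumize(weights, alpha, volume):
    """Assumed update: w <- alpha * w + (1 - alpha) * clip(w, volume).

    volume = 0 reduces to multiplicative weight decay (w <- alpha * w);
    alpha  = 0 reduces to weight clipping (w <- clip(w, volume)).
    """
    return [alpha * w + (1.0 - alpha) * clip(w, volume) for w in weights]


if __name__ == "__main__":
    w = [2.0, -3.0, 0.5]
    print("decay limit   :", volumize(w, 0.9, 0.0))
    print("clipping limit:", volumize(w, 0.0, 1.0))
    print("in between    :", volumize(w, 0.5, 1.0))
```

Weights already inside the interval are left untouched by this rule, while weights outside it are pulled toward the boundary at a rate set by `alpha`.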
