Researcher profile

Zhi-Qin John Xu

Zhi-Qin John Xu contributes to research discovery and scholarly infrastructure.

Researcher, affiliation not imported, open to collaborate

Trust snapshot

Quick read

Trust 21 (Emerging), Verification L1, unclaimed author
11 works
0 followers
9 topics
4 close collaborators

Published work

11 published items

preprint, 2026, arXiv

Focus and Dilution: The Multi-stage Learning Process of Attention

Transformer-based models have achieved remarkable success across a wide range of domains, yet our understanding of their training dynamics remains limited. In this work, we identify a recurrent focus-dilution cycle in attention learning and provide a rigorous explanation in a one-layer Transformer setting for Markovian data via gradient-flow analysis. Using stage-wise linearization around critical points, we show that a single focus-dilution cycle can be decomposed into a sequence of distinct stages. First, embedding and projection rapidly condense to a rank-one structure, while attention parameters remain effectively frozen. Then, the attention parameters begin to increase, inducing a frequency-driven focus toward high-frequency tokens. As attention continues to evolve, it generates next-order perturbations in embeddings, leading to a mass-redistribution mechanism that progressively dilutes this focus. Finally, small asymmetries among low-frequency tokens lift a degenerate critical point, opening new embedding directions and initiating the next cycle. Experiments on synthetic Markovian data as well as WikiText and TinyStories corroborate the predicted stages and cyclical dynamics.

preprint, 2022, arXiv

A deep learning-based model reduction (DeePMR) method for simplifying chemical kinetics

A deep learning-based model reduction (DeePMR) method for simplifying chemical kinetics is proposed and validated using high-temperature auto-ignitions, perfectly stirred reactors (PSR), and one-dimensional freely propagating flames of n-heptane/air mixtures. The mechanism reduction is modeled as an optimization problem on Boolean space, where a Boolean vector, each entry corresponding to a species, represents a reduced mechanism. The optimization goal is to minimize the reduced mechanism size given the error tolerance of a group of pre-selected benchmark quantities. The key idea of the DeePMR is to employ a deep neural network (DNN) to formulate the objective function in the optimization problem. In order to explore the high-dimensional Boolean space efficiently, an iterative procedure of DNN-assisted data sampling and DNN training is implemented. The results show that DNN assistance improves sampling efficiency significantly, selecting only $10^5$ samples out of $10^{34}$ possible samples for the DNN to achieve sufficient accuracy. The results demonstrate the capability of the DNN to recognize key species and reasonably predict reduced mechanism performance. The well-trained DNN guarantees the optimal reduced mechanism by solving an inverse optimization problem. Judged by ignition delay times, laminar flame speeds, and temperatures in PSRs, the resulting skeletal mechanism has fewer species (45 species) but the same level of accuracy as the skeletal mechanism (56 species) obtained by the Path Flux Analysis (PFA) method. In addition, the skeletal mechanism can be further reduced to 28 species if only atmospheric, near-stoichiometric conditions (equivalence ratio between 0.6 and 1.2) are considered. The DeePMR provides an innovative way to perform model reduction and demonstrates the great potential of data-driven methods in the combustion area.

preprint, 2022, arXiv

A multi-scale sampling method for accurate and robust deep neural network to predict combustion chemical kinetics

Machine learning has long been considered a black box for predicting combustion chemical kinetics due to the extremely large number of parameters and the lack of evaluation standards and reproducibility. The current work aims to understand two basic questions regarding the deep neural network (DNN) method: what data the DNN needs and how general the DNN method can be. Sampling and preprocessing determine the DNN training dataset and thereby affect the DNN's prediction ability. The current work proposes using the Box-Cox transformation (BCT) to preprocess the combustion data. In addition, this work compares different sampling methods with or without preprocessing, including the Monte Carlo method, manifold sampling, a generative neural network method (cycle-GAN), and the newly proposed multi-scale sampling. Our results reveal that the DNN trained on the manifold data can capture the chemical kinetics in limited configurations but cannot remain robust toward perturbation, which is inevitable for a DNN coupled with the flow field. The Monte Carlo and cycle-GAN samplings can cover a wider phase space but fail to capture small-scale intermediate species, producing poor prediction results. A three-hidden-layer DNN, based on the multi-scale method without specific flame simulation data, can predict chemical kinetics in various scenarios and remains stable during temporal evolution. This single DNN is readily implemented with several CFD codes and validated in various combustors, including (1) zero-dimensional autoignition, (2) one-dimensional freely propagating flame, (3) two-dimensional jet flame with triple-flame structure, and (4) three-dimensional turbulent lifted flames. The results demonstrate the satisfactory accuracy and generalization ability of the pre-trained DNN. The Fortran and Python versions of the DNN and example code are attached in the supplementary material for reproducibility.
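As a rough illustration of the Box-Cox preprocessing mentioned in this abstract, the sketch below applies the standard Box-Cox formula to values spanning many orders of magnitude. The sample values and the parameter `lam = 0.1` are assumptions chosen for illustration; the abstract does not state which parameter the authors use.

```python
import math


def box_cox(x, lam):
    """Standard Box-Cox transform of a positive value x with parameter lam."""
    if x <= 0:
        raise ValueError("Box-Cox requires x > 0")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1.0) / lam


def inverse_box_cox(y, lam):
    """Inverse of box_cox for the same parameter lam."""
    if lam == 0:
        return math.exp(y)
    return (lam * y + 1.0) ** (1.0 / lam)


# Hypothetical species mass fractions spanning many orders of magnitude.
mass_fractions = [1e-12, 1e-8, 1e-4, 1e-1]
lam = 0.1  # assumed value, not taken from the paper

transformed = [box_cox(v, lam) for v in mass_fractions]
recovered = [inverse_box_cox(t, lam) for t in transformed]

for raw, t in zip(mass_fractions, transformed):
    print(f"{raw:.0e} -> {t:.3f}")
```

The transform compresses a range of eleven orders of magnitude into a span of a few units, which is one plausible reason such preprocessing helps a network resolve small-scale species.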

preprint, 2022, arXiv

An Upper Limit of Decaying Rate with Respect to Frequency in Deep Neural Network

A deep neural network (DNN) usually learns the target function from low to high frequency, which is called the frequency principle or spectral bias. This frequency principle sheds light on a high-frequency curse of DNNs: the difficulty of learning high-frequency information. Inspired by the frequency principle, a series of works is devoted to developing algorithms for overcoming the high-frequency curse. A natural question arises: what is the upper limit of the decaying rate w.r.t. frequency when one trains a DNN? In this work, our theory, confirmed by numerical experiments, suggests that there is a critical decaying rate w.r.t. frequency in DNN training. Below the upper limit of the decaying rate, the DNN interpolates the training data by a function with a certain regularity. However, above the upper limit, the DNN interpolates the training data by a trivial function, i.e., a function that is non-zero only at the training data points. Our results indicate that a better way to overcome the high-frequency curse is to design a proper preconditioning approach that shifts high-frequency information to low-frequency information, which coincides with several previously developed algorithms for fast learning of high-frequency information. More importantly, this work rigorously proves that the high-frequency curse is an intrinsic difficulty of DNNs.

preprint, 2022, arXiv

Dropout in Training Neural Networks: Flatness of Solution and Noise Structure

It is important to understand how the popular regularization method dropout helps neural network training find a solution that generalizes well. In this work, we show that training with dropout finds a neural network with a flatter minimum compared with standard gradient descent training. We further find that the variance of the noise induced by dropout is larger in the sharper directions of the loss landscape, and that the Hessian of the loss landscape at the found minima aligns with the noise covariance matrix, by experiments on various datasets, i.e., MNIST, CIFAR-10, CIFAR-100 and Multi30k, and various structures, i.e., fully-connected networks, large residual convolutional networks and transformer. For networks with a piecewise linear activation function and dropout applied only at the last hidden layer, we then theoretically derive the Hessian and the covariance of dropout randomness, where these two quantities are very similar. This similarity may be a key reason accounting for the goodness of dropout.

preprint, 2022, arXiv

Embedding Principle of Loss Landscape of Deep Neural Networks

Understanding the structure of the loss landscape of deep neural networks (DNNs) is important. In this work, we prove an embedding principle that the loss landscape of a DNN "contains" all the critical points of all the narrower DNNs. More precisely, we propose a critical embedding such that any critical point, e.g., local or global minima, of a narrower DNN can be embedded to a critical point/hyperplane of the target DNN with higher degeneracy and preserving the DNN output function. The embedding structure of critical points is independent of loss function and training data, showing a stark difference from other nonconvex problems such as protein-folding. Empirically, we find that a wide DNN is often attracted by highly-degenerate critical points that are embedded from narrow DNNs. The embedding principle provides an explanation for the general easy optimization of wide DNNs and unravels a potential implicit low-complexity regularization during the training. Overall, our work provides a skeleton for the study of loss landscape of DNNs and its implication, by which a more exact and comprehensive understanding can be anticipated in the near
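The output-preserving property of such an embedding can be sketched with a neuron-splitting construction: one hidden neuron of a narrow network is duplicated, and its output weight is shared between the two copies. This is a minimal illustration under assumptions. The scalar tanh network, the weights, and the split ratio `alpha` are hypothetical, and the sketch checks only that the output function is preserved, not that critical points map to critical points.

```python
import math


def two_layer(x, in_weights, out_weights):
    """Scalar two-layer network: sum_k a_k * tanh(w_k * x)."""
    return sum(a * math.tanh(w * x) for w, a in zip(in_weights, out_weights))


def split_neuron(in_weights, out_weights, index, alpha):
    """Widen the network by one neuron by splitting neuron `index`.

    Both copies keep the input weight w; the output weight a is divided
    into alpha * a and (1 - alpha) * a, so their summed contribution is unchanged.
    """
    w, a = in_weights[index], out_weights[index]
    new_in = in_weights[:index] + [w, w] + in_weights[index + 1:]
    new_out = out_weights[:index] + [alpha * a, (1.0 - alpha) * a] + out_weights[index + 1:]
    return new_in, new_out


# Hypothetical narrow network with two hidden neurons.
narrow_in, narrow_out = [0.7, -1.3], [0.5, 2.0]
wide_in, wide_out = split_neuron(narrow_in, narrow_out, index=1, alpha=0.25)

for x in (-2.0, 0.0, 0.4, 3.0):
    print(x, two_layer(x, narrow_in, narrow_out), two_layer(x, wide_in, wide_out))
```

Any value of `alpha` gives the same output function, which hints at why the embedded points form a degenerate family rather than a single isolated point.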

preprint, 2021, arXiv

Frequency Principle in Deep Learning Beyond Gradient-descent-based Training

The frequency perspective has recently made progress in understanding deep learning. It has been widely verified in both empirical and theoretical studies that deep neural networks (DNNs) often fit the target function from low to high frequency, namely the Frequency Principle (F-Principle). The F-Principle sheds light on the strengths and weaknesses of DNNs and inspires a series of subsequent works, including theoretical studies, empirical studies and the design of efficient DNN structures. Previous works examine the F-Principle in gradient-descent-based training. It remains unclear whether gradient-descent-based training is a necessary condition for the F-Principle. In this paper, we show that the F-Principle exists stably in the training process of DNNs with non-gradient-descent-based training, including optimization algorithms with gradient information, such as conjugate gradient and BFGS, and algorithms without gradient information, such as Powell's method and Particle Swarm Optimization. These empirical studies show the universality of the F-Principle and provide hints for further study of the F-Principle.

preprint, 2021, arXiv

MOD-Net: A Machine Learning Approach via Model-Operator-Data Network for Solving PDEs

In this paper, we propose a machine learning approach via model-operator-data network (MOD-Net) for solving PDEs. A MOD-Net is driven by a model to solve PDEs based on operator representation with regularization from data. For linear PDEs, we use a DNN to parameterize the Green's function and obtain the neural operator to approximate the solution according to the Green's method. To train the DNN, the empirical risk consists of the mean squared loss with the least square formulation or the variational formulation of the governing equation and boundary conditions. For complicated problems, the empirical risk also includes a few labels, which are computed on coarse grid points with cheap computation cost and significantly improve the model accuracy. Intuitively, the labeled dataset works as a regularization in addition to the model constraints. The MOD-Net solves a family of PDEs rather than a specific one and is much more efficient than the original neural operator because few expensive labels are required. We numerically show that MOD-Net is very efficient in solving the Poisson equation and the one-dimensional radiative transfer equation. For nonlinear PDEs, the nonlinear MOD-Net can be similarly used as an ansatz for solving nonlinear PDEs, exemplified by solving several nonlinear PDE problems, such as the Burgers equation.
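The Green's method referred to in this abstract writes the solution of a linear PDE as an integral of the Green's function against the source term. The sketch below illustrates that representation for the one-dimensional Poisson problem -u'' = f on [0, 1] with zero boundary values, using the known analytic Green's function in place of a trained DNN. The function names and the quadrature resolution are assumptions for illustration only.

```python
def green_poisson_1d(x, y):
    """Analytic Green's function of -u'' = f on [0, 1] with u(0) = u(1) = 0."""
    return x * (1.0 - y) if x <= y else y * (1.0 - x)


def solve_with_green(f, x, n=2000):
    """Approximate u(x) = integral over [0, 1] of G(x, y) f(y) dy by the midpoint rule."""
    h = 1.0 / n
    total = 0.0
    for i in range(n):
        y = (i + 0.5) * h
        total += green_poisson_1d(x, y) * f(y)
    return total * h


def constant_source(y):
    """Source term f(y) = 1, for which the exact solution is u(x) = x (1 - x) / 2."""
    return 1.0


for x in (0.1, 0.3, 0.5, 0.9):
    approx = solve_with_green(constant_source, x)
    exact = x * (1.0 - x) / 2.0
    print(f"x={x:.1f}  approx={approx:.6f}  exact={exact:.6f}")
```

Because the same Green's function serves every source term f, one representation solves a whole family of problems, which is the property the abstract attributes to the operator formulation.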

preprint, 2020, arXiv

A priori generalization error for two-layer ReLU neural network through minimum norm solution

We focus on estimating the a priori generalization error of two-layer ReLU neural networks (NNs) trained by mean squared error, which depends only on the initial parameters and the target function, through the following line of research. We first estimate the a priori generalization error of a finite-width two-layer ReLU NN under the constraint of the minimal norm solution, which is proved by [zhang2019type] to be an equivalent solution of a linearized (w.r.t. parameters) finite-width two-layer NN. As the width goes to infinity, the linearized NN converges to the NN in the Neural Tangent Kernel (NTK) regime [jacot2018neural]. Thus, we can derive the a priori generalization error of a two-layer ReLU NN in the NTK regime. The distance between an NN in the NTK regime and a finite-width NN with gradient training is estimated by [arora2019exact]. Based on the results in [arora2019exact], our work proves an a priori generalization error bound for two-layer ReLU NNs. This estimate uses the intrinsic implicit bias of the minimum norm solution without requiring extra regularity in the loss function. This a priori estimate also implies that the NN does not suffer from the curse of dimensionality, and a small generalization error can be achieved without requiring an exponentially large number of neurons. In addition, the line of research proposed in this paper can also be used to study other properties of the finite-width network, such as the posterior generalization error.
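The notion of a minimum norm solution for a linearized model can be illustrated in the smallest possible case: one training sample and a linear model with more parameters than data. The sketch below is not the paper's construction. The feature vector, the target, and the helper names are hypothetical, and it shows only that among all interpolating parameter vectors, the one parallel to the feature vector has the smallest Euclidean norm.

```python
import math


def dot(u, v):
    """Euclidean inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    """Euclidean norm of a list of floats."""
    return math.sqrt(dot(u, u))


def min_norm_interpolant(features, target):
    """Smallest-norm theta satisfying dot(theta, features) == target."""
    sq_norm = dot(features, features)
    return [f * target / sq_norm for f in features]


# Hypothetical feature vector of a single training sample and its label.
features = [1.0, 2.0, 2.0]
target = 3.0

theta_min = min_norm_interpolant(features, target)

# Another interpolant: shift by a direction orthogonal to the features.
orthogonal = [2.0, -1.0, 0.0]
theta_other = [t + v for t, v in zip(theta_min, orthogonal)]

print("min-norm fit:", dot(theta_min, features), "norm:", norm(theta_min))
print("other fit:   ", dot(theta_other, features), "norm:", norm(theta_other))
```

Both parameter vectors fit the sample exactly, so the training loss alone cannot distinguish them; the norm constraint is what selects one.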

preprint, 2020, arXiv

A regularized deep matrix factorized model of matrix completion for image restoration

Matrix completion has been an important approach to image restoration. Most previous works on matrix completion focus on the low-rank property by imposing explicit constraints on the recovered matrix, such as a constraint on the nuclear norm or a limit on the dimension of the matrix factorization component. Recently, theoretical works suggest that deep linear neural networks have an implicit bias towards low rank on matrix completion. However, low rank is not adequate to reflect the intrinsic characteristics of a natural image. Thus, algorithms with only the constraint of low rank are insufficient to perform image restoration well. In this work, we propose a Regularized Deep Matrix Factorized (RDMF) model for image restoration, which utilizes the implicit bias of the low rank of deep neural networks and the explicit bias of total variation. We demonstrate the effectiveness of our RDMF model with extensive experiments, in which our method surpasses the state-of-the-art models in common examples, especially for the restoration from very few observations. Our work sheds light on a more general framework for solving other inverse problems by combining the implicit bias of deep learning with explicit regularization.

preprint, 2020, arXiv

A type of generalization error induced by initialization in deep neural networks

How initialization and loss function affect the learning of a deep neural network (DNN), specifically its generalization error, is an important problem in practice. In this work, by exploiting the linearity of DNN training dynamics in the NTK regime [jacot2018neural, lee2019wide], we provide an explicit and quantitative answer to this problem. Focusing on the regression problem, we prove that, in the NTK regime, for any loss in a general class of functions, the DNN finds the same global minimum: the one that is nearest to the initial value in the parameter space, or equivalently, the one that is closest to the initial DNN output in the corresponding reproducing kernel Hilbert space. Using these optimization problems, we quantify the impact of the initial output and prove that a random non-zero one increases the generalization error. We further propose an antisymmetrical initialization (ASI) trick that eliminates this type of error and accelerates the training. To understand whether the above results hold in general, we also perform experiments for DNNs in the non-NTK regime, which demonstrate the effectiveness of our theoretical results and the ASI trick in a qualitative sense. Overall, our work serves as a baseline for the further investigation of the impact of initialization and loss function on the generalization of DNNs, which can potentially guide and improve the training of DNNs in practice.
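One way to read the antisymmetrical initialization idea is as a network built from two copies of the same architecture whose outputs are subtracted, with both copies starting from identical parameters so that the initial output is exactly zero. The sketch below follows that reading. The two-layer ReLU architecture, the width, and the scaling factor of sqrt(2)/2 are assumptions for illustration and are not confirmed by the abstract.

```python
import copy
import math
import random


def init_params(width, rng):
    """Random parameters for a scalar two-layer ReLU network (illustrative)."""
    return {
        "w": [rng.gauss(0.0, 1.0) for _ in range(width)],
        "b": [rng.gauss(0.0, 1.0) for _ in range(width)],
        "a": [rng.gauss(0.0, 1.0) for _ in range(width)],
    }


def relu_net(x, p):
    """Output of the network: sum_k a_k * relu(w_k * x + b_k)."""
    return sum(a * max(0.0, w * x + b) for w, b, a in zip(p["w"], p["b"], p["a"]))


def asi_net(x, p1, p2):
    """Difference of two copies, scaled by sqrt(2)/2 (assumed scaling)."""
    return (math.sqrt(2.0) / 2.0) * (relu_net(x, p1) - relu_net(x, p2))


rng = random.Random(0)
theta = init_params(16, rng)
theta_copy = copy.deepcopy(theta)  # same values, independent storage

inputs = (-1.0, 0.0, 0.5, 2.0)
initial_outputs = [asi_net(x, theta, theta_copy) for x in inputs]
print(initial_outputs)
```

The two parameter sets are stored separately, so training can move them apart after the first step while the starting output stays zero for every input.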