Researcher profile

Lexing Ying

Lexing Ying contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
29works
0followers
12topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

29 published item(s)

preprint2024arXiv

Multimodal Sampling via Approximate Symmetries

Sampling from multimodal distributions is a challenging task in scientific computing. When a distribution has an exact symmetry between the modes, direct jumps among them can accelerate the samplings significantly. However, the distributions from most applications do not have exact symmetries. This paper considers the distributions with approximate symmetries. We first construct an exactly symmetric reference distribution from the target one by averaging over the group orbit associated with the approximate symmetry. Next, we can apply the multilevel Monte Carlo methods by constructing a continuation path between the reference and target distributions. We discuss how to implement these steps with annealed importance sampling and tempered transitions. Compared with traditional multilevel methods, the proposed approach can be more effective since the reference and target distributions are much closer. Numerical results of the Ising models are presented to illustrate the efficiency of the proposed method.

preprint2023arXiv

Variational Actor-Critic Algorithms

We introduce a class of variational actor-critic algorithms based on a variational formulation over both the value function and the policy. The objective function of the variational formulation consists of two parts: one for maximizing the value function and the other for minimizing the Bellman residual. Besides the vanilla gradient descent with both the value function and the policy updates, we propose two variants, the clipping method and the flipping method, in order to speed up the convergence. We also prove that, when the prefactor of the Bellman residual is sufficiently large, the fixed point of the algorithm is close to the optimal policy.

preprint2022arXiv

A semigroup method for high dimensional elliptic PDEs and eigenvalue problems based on neural networks

In this paper, we propose a semigroup method for solving high-dimensional elliptic partial differential equations (PDEs) and the associated eigenvalue problems based on neural networks. For the PDE problems, we reformulate the original equations as variational problems with the help of semigroup operators and then solve the variational problems with neural network (NN) parameterization. The main advantages are that no mixed second-order derivative computation is needed during the stochastic gradient descent training and that the boundary conditions are taken into account automatically by the semigroup operator. Unlike popular methods like PINN \cite{raissi2019physics} and Deep Ritz \cite{weinan2018deep} where the Dirichlet boundary condition is enforced solely through penalty functions and thus changes the true solution, the proposed method is able to address the boundary conditions without penalty functions and it gives the correct true solution even when penalty functions are added, thanks to the semigroup operator. For eigenvalue problems, a primal-dual method is proposed, efficiently resolving the constraint with a simple scalar dual variable and resulting in a faster algorithm compared with the BSDE solver \cite{han2020solving} in certain problems such as the eigenvalue problem associated with the linear Schrödinger operator. Numerical results are provided to demonstrate the performance of the proposed methods.

preprint2022arXiv

Analytic continuation from limited noisy Matsubara data

This note proposes a new algorithm for estimating spectral function from limited noisy Matsubara data. We consider both the molecule and condensed matter cases. In each case, the algorithm constructs an interpolant of the Matsubara data and uses conformal mapping and Prony's method to estimate the spectral function. Numerical results are provided to demonstrate the performance of the algorithm.

preprint2022arXiv

Annealed importance sampling for Ising models with mixed boundary conditions

This note introduces a method for sampling Ising models with mixed boundary conditions. As an application of annealed importance sampling and the Swendsen-Wang algorithm, the method adopts a sequence of intermediate distributions that keeps the temperature fixed but turns on the boundary condition gradually. The numerical results show that the variance of the sample weights is relatively small.

preprint2022arXiv

Beyond the Quadratic Approximation: the Multiscale Structure of Neural Network Loss Landscapes

A quadratic approximation of neural network loss landscapes has been extensively used to study the optimization process of these networks. Though, it usually holds in a very small neighborhood of the minimum, it cannot explain many phenomena observed during the optimization process. In this work, we study the structure of neural network loss functions and its implication on optimization in a region beyond the reach of a good quadratic approximation. Numerically, we observe that neural network loss functions possesses a multiscale structure, manifested in two ways: (1) in a neighborhood of minima, the loss mixes a continuum of scales and grows subquadratically, and (2) in a larger region, the loss shows several separate scales clearly. Using the subquadratic growth, we are able to explain the Edge of Stability phenomenon [5] observed for the gradient descent (GD) method. Using the separate scales, we explain the working mechanism of learning rate decay by simple examples. Finally, we study the origin of the multiscale structure and propose that the non-convexity of the models and the non-uniformity of training data is one of the causes. By constructing a two-layer neural network problem we show that training data with different magnitudes give rise to different scales of the loss function, producing subquadratic growth and multiple separate scales.

preprint2022arXiv

Correcting Convexity Bias in Function and Functional Estimate

A general framework with a series of different methods is proposed to improve the estimate of convex function (or functional) values when only noisy observations of the true input are available. Technically, our methods catch the bias introduced by the convexity and remove this bias from a baseline estimate. Theoretical analysis are conducted to show that the proposed methods can strictly reduce the expected estimate error under mild conditions. When applied, the methods require no specific knowledge about the problem except the convexity and the evaluation of the function. Therefore, they can serve as off-the-shelf tools to obtain good estimate for a wide range of problems, including optimization problems with random objective functions or constraints, and functionals of probability distributions such as the entropy and the Wasserstein distance. Numerical experiments on a wide variety of problems show that our methods can significantly improve the quality of the estimate compared with the baseline method.

preprint2022arXiv

Double Flip Move for Ising Models with Mixed Boundary Conditions

This note introduces the double flip move for accelerating the Swendsen-Wang algorithm for Ising models with mixed boundary conditions below the critical temperature. The double flip move consists of a geometric flip of the spin lattice followed by a spin value flip. Both the symmetric and approximately symmetric models are considered. We prove the detailed balance of the double flip move and demonstrate its empirical efficiency in mixing.

preprint2022arXiv

Enterprise-Scale Search: Accelerating Inference for Sparse Extreme Multi-Label Ranking Trees

Tree-based models underpin many modern semantic search engines and recommender systems due to their sub-linear inference times. In industrial applications, these models operate at extreme scales, where every bit of performance is critical. Memory constraints at extreme scales also require that models be sparse, hence tree-based models are often back-ended by sparse matrix algebra routines. However, there are currently no sparse matrix techniques specifically designed for the sparsity structure one encounters in tree-based models for extreme multi-label ranking/classification (XMR/XMC) problems. To address this issue, we present the masked sparse chunk multiplication (MSCM) technique, a sparse matrix technique specifically tailored to XMR trees. MSCM is easy to implement, embarrassingly parallelizable, and offers a significant performance boost to any existing tree inference pipeline at no cost. We perform a comprehensive study of MSCM applied to several different sparse inference schemes and benchmark our methods on a general purpose extreme multi-label ranking framework. We observe that MSCM gives consistently dramatic speedups across both the online and batch inference settings, single- and multi-threaded settings, and on many different tree models and datasets. To demonstrate its utility in industrial applications, we apply MSCM to an enterprise-scale semantic product search problem with 100 million products and achieve sub-millisecond latency of 0.88 ms per query on a single thread -- an 8x reduction in latency over vanilla inference techniques. The MSCM technique requires absolutely no sacrifices to model accuracy as it gives exactly the same results as standard sparse matrix techniques. Therefore, we believe that MSCM will enable users of XMR trees to save a substantial amount of compute resources in their inference pipelines at very little cost.

preprint2022arXiv

Operator Shifting for General Noisy Matrix Systems

In the computational sciences, one must often estimate model parameters from data subject to noise and uncertainty, leading to inaccurate results. In order to improve the accuracy of models with noisy parameters, we consider the problem of reducing error in a linear system with the operator corrupted by noise. Our contribution in this paper is to extend the elliptic operator shifting framework from Etter, Ying '20 to the general nonsymmetric matrix case. Roughly, the operator shifting technique is a matrix analogue of the James-Stein estimator. The key insight is that a shift of the matrix inverse estimate in an appropriately chosen direction will reduce average error. In our extension, we interrogate a number of questions -- namely, whether or not shifting towards the origin for general matrix inverses always reduces error as it does in the elliptic case. We show that this is usually the case, but that there are three key features of the general nonsingular matrices that allow for adversarial examples not possible in the symmetric case. We prove that when these adversarial possibilities are eliminated by the assumption of noise symmetry and the use of the residual norm as the error metric, the optimal shift is always towards the origin, mirroring results from Etter, Ying '20. We also investigate behavior in the small noise regime and other scenarios. We conclude by presenting numerical experiments (with accompanying source code) inspired by Reinforcement Learning to demonstrate that operator shifting can yield substantial reductions in error.

preprint2022arXiv

Operator Shifting for Noisy Elliptic Systems

In the computational sciences, one must often estimate model parameters from data subject to noise and uncertainty, leading to inaccurate results. In order to improve the accuracy of models with noisy parameters, we consider the problem of reducing error in an elliptic linear system with the operator corrupted by noise. We assume the noise preserves positive definiteness, but otherwise, we make no additional assumptions the structure of the noise. Under these assumptions, we propose the operator shifting framework, a collection of easy-to-implement algorithms that augment a noisy inverse operator by subtracting an additional auxiliary term. In a similar fashion to the James-Stein estimator, this has the effect of drawing the noisy inverse operator closer to the ground truth, and hence reduces error by reducing both bias and variance. We develop bootstrap Monte Carlo algorithms to estimate the required augmentation magnitude for optimal error reduction in the noisy system. To improve the tractability of these algorithms, we propose several approximate polynomial expansions for the operator inverse, and prove desirable convergence and monotonicity properties for these expansions. We also prove theorems that quantify the error reduction obtained by operator augmentation. In addition to theoretical results, we provide a set of numerical experiments on four different graph and grid Laplacian systems that all demonstrate effectiveness of our method.

preprint2022arXiv

Provably convergent quasistatic dynamics for mean-field two-player zero-sum games

In this paper, we study the problem of finding mixed Nash equilibrium for mean-field two-player zero-sum games. Solving this problem requires optimizing over two probability distributions. We consider a quasistatic Wasserstein gradient flow dynamics in which one probability distribution follows the Wasserstein gradient flow, while the other one is always at the equilibrium. Theoretical analysis are conducted on this dynamics, showing its convergence to the mixed Nash equilibrium under mild conditions. Inspired by the continuous dynamics of probability distributions, we derive a quasistatic Langevin gradient descent method with inner-outer iterations, and test the method on different problems, including training mixture of GANs.

preprint2021arXiv

A Simple Multiscale Method for Mean Field Games

This paper proposes a multiscale method for solving the numerical solution of mean field games which accelerates the convergence and addresses the problem of determining the initial guess. Starting from an approximate solution at the coarsest level, the method constructs approximations on successively finer grids via alternating sweeping, which not only allows for the use of classical time marching numerical schemes but also enables applications to both local and nonlocal problems. At each level, numerical relaxation is used to stabilize the iterative process. A second-order discretization scheme is derived for higher-order convergence. Numerical examples are provided to demonstrate the efficiency of the proposed method in both local and nonlocal, 1-dimensional and 2-dimensional cases.

preprint2021arXiv

How to Learn when Data Reacts to Your Model: Performative Gradient Descent

Performative distribution shift captures the setting where the choice of which ML model is deployed changes the data distribution. For example, a bank which uses the number of open credit lines to determine a customer's risk of default on a loan may induce customers to open more credit lines in order to improve their chances of being approved. Because of the interactions between the model and data distribution, finding the optimal model parameters is challenging. Works in this area have focused on finding stable points, which can be far from optimal. Here we introduce performative gradient descent (PerfGD), which is the first algorithm which provably converges to the performatively optimal point. PerfGD explicitly captures how changes in the model affects the data distribution and is simple to use. We support our findings with theory and experiments.

preprint2021arXiv

Multi-Level Fine-Tuning: Closing Generalization Gaps in Approximation of Solution Maps under a Limited Budget for Training Data

In scientific machine learning, regression networks have been recently applied to approximate solution maps (e.g., potential-ground state map of Schrödinger equation). In this paper, we aim to reduce the generalization error without spending more time in generating training samples. However, to reduce the generalization error, the regression network needs to be fit on a large number of training samples (e.g., a collection of potential-ground state pairs). The training samples can be produced by running numerical solvers, which takes much time in many applications. In this paper, we aim to reduce the generalization error without spending more time in generating training samples. Inspired by few-shot learning techniques, we develop the Multi-Level Fine-Tuning algorithm by introducing levels of training: we first train the regression network on samples generated at the coarsest grid and then successively fine-tune the network on samples generated at finer grids. Within the same amount of time, numerical solvers generate more samples on coarse grids than on fine grids. We demonstrate a significant reduction of generalization error in numerical experiments on challenging problems with oscillations, discontinuities, or rough coefficients. Further analysis can be conducted in the Neural Tangent Kernel regime and we provide practical estimators to the generalization error. The number of training samples at different levels can be optimized for the smallest estimated generalization error under the constraint of budget for training data. The optimized distribution of budget over levels provides practical guidance with theoretical insight as in the celebrated Multi-Level Monte Carlo algorithm.

preprint2021arXiv

Top-$k$ eXtreme Contextual Bandits with Arm Hierarchy

Motivated by modern applications, such as online advertisement and recommender systems, we study the top-$k$ extreme contextual bandits problem, where the total number of arms can be enormous, and the learner is allowed to select $k$ arms and observe all or some of the rewards for the chosen arms. We first propose an algorithm for the non-extreme realizable setting, utilizing the Inverse Gap Weighting strategy for selecting multiple arms. We show that our algorithm has a regret guarantee of $O(k\sqrt{(A-k+1)T \log (|\mathcal{F}|T)})$, where $A$ is the total number of arms and $\mathcal{F}$ is the class containing the regression function, while only requiring $\tilde{O}(A)$ computation per time step. In the extreme setting, where the total number of arms can be in the millions, we propose a practically-motivated arm hierarchy model that induces a certain structure in mean rewards to ensure statistical and computational efficiency. The hierarchical structure allows for an exponential reduction in the number of relevant arms for each context, thus resulting in a regret guarantee of $O(k\sqrt{(\log A-k+1)T \log (|\mathcal{F}|T)})$. Finally, we implement our algorithm using a hierarchical linear function class and show superior performance with respect to well-known benchmarks on simulated bandit feedback experiments using extreme multi-label classification datasets. On a dataset with three million arms, our reduction scheme has an average inference time of only 7.9 milliseconds, which is a 100x improvement.

preprint2020arXiv

A heuristic independent particle approximation to determinantal point processes

A determinantal point process is a stochastic point process that is commonly used to capture negative correlations. It has become increasingly popular in machine learning in recent years. Sampling a determinantal point process however remains a computationally intensive task. This note introduces a heuristic independent particle approximation to determinantal point processes. The approximation is based on the physical intuition of fermions and is implemented using standard numerical linear algebra routines. Sampling from this independent particle approximation can be performed at a negligible cost. Numerical results are provided to demonstrate the performance of the proposed algorithm.

preprint2020arXiv

A Mean-field Analysis of Deep ResNet and Beyond: Towards Provable Optimization Via Overparameterization From Depth

Training deep neural networks with stochastic gradient descent (SGD) can often achieve zero training loss on real-world tasks although the optimization landscape is known to be highly non-convex. To understand the success of SGD for training deep neural networks, this work presents a mean-field analysis of deep residual networks, based on a line of works that interpret the continuum limit of the deep residual network as an ordinary differential equation when the network capacity tends to infinity. Specifically, we propose a new continuum limit of deep residual networks, which enjoys a good landscape in the sense that every local minimizer is global. This characterization enables us to derive the first global convergence result for multilayer neural networks in the mean-field regime. Furthermore, without assuming the convexity of the loss landscape, our proof relies on a zero-loss assumption at the global minimizer that can be achieved when the model shares a universal approximation property. Key to our result is the observation that a deep residual network resembles a shallow network ensemble, i.e. a two-layer network. We bound the difference between the shallow network and our ResNet model via the adjoint sensitivity method, which enables us to apply existing mean-field analyses of two-layer networks to deep networks. Furthermore, we propose several novel training schemes based on the new continuous model, including one training procedure that switches the order of the residual blocks and results in strong empirical performance on the benchmark datasets.

preprint2020arXiv

A Sharp Convergence Rate for the Asynchronous Stochastic Gradient Descent

We give a sharp convergence rate for the asynchronous stochastic gradient descent (ASGD) algorithms when the loss function is a perturbed quadratic function based on the stochastic modified equations introduced in [An et al. Stochastic modified equations for the asynchronous stochastic gradient descent, arXiv:1805.08244]. We prove that when the number of local workers is larger than the expected staleness, then ASGD is more efficient than stochastic gradient descent. Our theoretical result also suggests that longer delays result in slower convergence rate. Besides, the learning rate cannot be smaller than a threshold inversely proportional to the expected staleness.

preprint2020arXiv

A simple solver for the fractional Laplacian in multiple dimensions

We present a simple discretization scheme for the hypersingular integral representation of the fractional Laplace operator and solver for the corresponding fractional Laplacian problem. Through singularity subtraction, we obtain a regularized integrand that is amenable to the trapezoidal rule with equispaced nodes, assuming a high degree of regularity in the underlying function (i.e., $u\in C^6(R^d)$). The resulting quadrature scheme gives a discrete operator on a regular grid that is translation-invariant and thus can be applied quickly with the fast Fourier transform. For discretizations of problems related to space-fractional diffusion on bounded domains, we observe that the underlying linear system can be efficiently solved via preconditioned Krylov methods with a preconditioner based on the finite-difference (non-fractional) Laplacian. We show numerical results illustrating the error of our simple scheme as well the efficiency of our preconditioning approach, both for the elliptic (steady-state) fractional diffusion problem and the time-dependent problem.

preprint2020arXiv

Borrowing From the Future: Addressing Double Sampling in Model-free Control

In model-free reinforcement learning, the temporal difference method and its variants become unstable when combined with nonlinear function approximations. Bellman residual minimization with stochastic gradient descent (SGD) is more stable, but it suffers from the double sampling problem: given the current state, two independent samples for the next state are required, but often only one sample is available. Recently, the authors of [Zhu et al, 2020] introduced the borrowing from the future (BFF) algorithm to address this issue for the prediction problem. The main idea is to borrow extra randomness from the future to approximately re-sample the next state when the underlying dynamics of the problem are sufficiently smooth. This paper extends the BFF algorithm to action-value function based model-free control. We prove that BFF is close to unbiased SGD when the underlying dynamics vary slowly with respect to actions. We confirm our theoretical findings with numerical simulations.

preprint2020arXiv

Distributed-memory $\mathcal{H}$-matrix Algebra I: Data Distribution and Matrix-vector Multiplication

We introduce a data distribution scheme for $\mathcal{H}$-matrices and a distributed-memory algorithm for $\mathcal{H}$-matrix-vector multiplication. Our data distribution scheme avoids an expensive $Ω(P^2)$ scheduling procedure used in previous work, where $P$ is the number of processes, while data balancing is well-preserved. Based on the data distribution, our distributed-memory algorithm evenly distributes all computations among $P$ processes and adopts a novel tree-communication algorithm to reduce the latency cost. The overall complexity of our algorithm is $O\Big(\frac{N \log N}{P} + α\log P + β\log^2 P \Big)$ for $\mathcal{H}$-matrices under weak admissibility condition, where $N$ is the matrix size, $α$ denotes the latency, and $β$ denotes the inverse bandwidth. Numerically, our algorithm is applied to address both two- and three-dimensional problems of various sizes among various numbers of processes. On thousands of processes, good parallel efficiency is still observed.

preprint2020arXiv

Meta-learning Pseudo-differential Operators with Deep Neural Networks

This paper introduces a meta-learning approach for parameterized pseudo-differential operators with deep neural networks. With the help of the nonstandard wavelet form, the pseudo-differential operators can be approximated in a compressed form with a collection of vectors. The nonlinear map from the parameter to this collection of vectors and the wavelet transform are learned together from a small number of matrix-vector multiplications of the pseudo-differential operator. Numerical results for Green's functions of elliptic partial differential equations and the radiative transfer equations demonstrate the efficiency and accuracy of the proposed approach.

preprint2020arXiv

Mirror Descent Algorithms for Minimizing Interacting Free Energy

This note considers the problem of minimizing interacting free energy. Motivated by the mirror descent algorithm, for a given interacting free energy, we propose a descent dynamics with a novel metric that takes into consideration the reference measure and the interacting term. This metric naturally suggests a monotone reparameterization of the probability measure. By discretizing the reparameterized descent dynamics with the explicit Euler method, we arrive at a new mirror-descent-type algorithm for minimizing interacting free energy. Numerical results are included to demonstrate the efficiency of the proposed algorithms.

preprint2020arXiv

Natural Gradient for Combined Loss Using Wavelets

Natural gradients have been widely used in optimization of loss functionals over probability space, with important examples such as Fisher-Rao gradient descent for Kullback-Leibler divergence, Wasserstein gradient descent for transport-related functionals, and Mahalanobis gradient descent for quadratic loss functionals. This note considers the situation in which the loss is a convex linear combination of these examples. We propose a new natural gradient algorithm by utilizing compactly supported wavelets to diagonalize approximately the Hessian of the combined loss. Numerical results are included to demonstrate the efficiency of the proposed algorithm.

preprint2020arXiv

Semidefinite relaxation of multi-marginal optimal transport for strictly correlated electrons in second quantization

We consider the strictly correlated electron (SCE) limit of the fermionic quantum many-body problem in the second-quantized formalism. This limit gives rise to a multi-marginal optimal transport (MMOT) problem. Here the marginal state space for our MMOT problem is the binary set $\{0,1\}$, and the number of marginals is the number $L$ of sites in the model. The costs of storing and computing the exact solution of the MMOT problem both scale exponentially with respect to $L$. We propose an efficient convex relaxation which can be solved by semidefinite programming (SDP). In particular, the semidefinite constraint is only of size $2L\times 2L$. Moreover, the SDP-based method yields an approximation of the dual potential needed to the perform self-consistent field iteration in the so-called Kohn-Sham SCE framework, which, once converged, yields a lower bound for the total energy of the system. We demonstrate the effectiveness of our methods on spinless and spinful Hubbard-type models. Numerical results indicate that our relaxation methods yield tight lower bounds for the optimal cost, in the sense that the error due to the semidefinite relaxation is much smaller than the intrinsic modeling error of the Kohn-Sham SCE method. We also describe how our relaxation methods generalize to arbitrary MMOT problems with pairwise cost functions.

preprint2019arXiv

Solving Electrical Impedance Tomography with Deep Learning

This paper introduces a new approach for solving electrical impedance tomography (EIT) problems using deep neural networks. The mathematical problem of EIT is to invert the electrical conductivity from the Dirichlet-to-Neumann (DtN) map. Both the forward map from the electrical conductivity to the DtN map and the inverse map are high-dimensional and nonlinear. Motivated by the linear perturbative analysis of the forward map and based on a numerically low-rank property, we propose compact neural network architectures for the forward and inverse maps for both 2D and 3D problems. Numerical results demonstrate the efficiency of the proposed neural networks.

preprint2019arXiv

Stochastic modified equations for the asynchronous stochastic gradient descent

We propose a stochastic modified equations (SME) for modeling the asynchronous stochastic gradient descent (ASGD) algorithms. The resulting SME of Langevin type extracts more information about the ASGD dynamics and elucidates the relationship between different types of stochastic gradient algorithms. We show the convergence of ASGD to the SME in the continuous time limit, as well as the SME's precise prediction to the trajectories of ASGD with various forcing terms. As an application of the SME, we propose an optimal mini-batching strategy for ASGD via solving the optimal control problem of the associated SME.