Researcher profile

Xin T. Tong

Xin T. Tong contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
10works
0followers
8topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

10 published item(s)

preprint2022arXiv

Can We Do Better Than Random Start? The Power of Data Outsourcing

Many organizations have access to abundant data but lack the computational power to process the data. While they can outsource the computational task to other facilities, there are various constraints on the amount of data that can be shared. It is natural to ask what can data outsourcing accomplish under such constraints. We address this question from a machine learning perspective. When training a model with optimization algorithms, the quality of the results often relies heavily on the points where the algorithms are initialized. Random start is one of the most popular methods to tackle this issue, but it can be computationally expensive and not feasible for organizations lacking computing resources. Based on three different scenarios, we propose simulation-based algorithms that can utilize a small amount of outsourced data to find good initial points accordingly. Under suitable regularity conditions, we provide theoretical guarantees showing the algorithms can find good initial points with high probability. We also conduct numerical experiments to demonstrate that our algorithms perform significantly better than the random start approach.

preprint2022arXiv

Convergence Speed and Approximation Accuracy of Numerical MCMC

When implementing Markov Chain Monte Carlo (MCMC) algorithms, perturbation caused by numerical errors is sometimes inevitable. This paper studies how perturbation of MCMC affects the convergence speed and Monte Carlo estimation accuracy. Our results show that when the original Markov chain converges to stationarity fast enough and the perturbed transition kernel is a good approximation to the original transition kernel, the corresponding perturbed sampler has similar convergence speed and high approximation accuracy as well. We discuss two different analysis frameworks: ergodicity and spectral gap, both are widely used in the literature. Our results can be easily extended to obtain non-asymptotic error bounds for MCMC estimators. We also demonstrate how to apply our convergence and approximation results to the analysis of specific sampling algorithms, including Random walk Metropolis and Metropolis adjusted Langevin algorithm with perturbed target densities, and parallel tempering Monte Carlo with perturbed densities. Finally we present some simple numerical examples to verify our theoretical claims.

preprint2022arXiv

Stochastic Gradient Descent with Dependent Data for Offline Reinforcement Learning

In reinforcement learning (RL), offline learning decoupled learning from data collection and is useful in dealing with exploration-exploitation tradeoff and enables data reuse in many applications. In this work, we study two offline learning tasks: policy evaluation and policy learning. For policy evaluation, we formulate it as a stochastic optimization problem and show that it can be solved using approximate stochastic gradient descent (aSGD) with time-dependent data. We show aSGD achieves $\tilde O(1/t)$ convergence when the loss function is strongly convex and the rate is independent of the discount factor $γ$. This result can be extended to include algorithms making approximately contractive iterations such as TD(0). The policy evaluation algorithm is then combined with the policy iteration algorithm to learn the optimal policy. To achieve an $ε$ accuracy, the complexity of the algorithm is $\tilde O(ε^{-2}(1-γ)^{-5})$, which matches the complexity bound for classic online RL algorithms such as Q-learning.

preprint2021arXiv

Adaptive Tikhonov strategies for stochastic ensemble Kalman inversion

Ensemble Kalman inversion (EKI) is a derivative-free optimizer aimed at solving inverse problems, taking motivation from the celebrated ensemble Kalman filter. The purpose of this article is to consider the introduction of adaptive Tikhonov strategies for EKI. This work builds upon Tikhonov EKI (TEKI) which was proposed for a fixed regularization constant. By adaptively learning the regularization parameter, this procedure is known to improve the recovery of the underlying unknown. For the analysis, we consider a continuous-time setting where we extend known results such as well-posdeness and convergence of various loss functions, but with the addition of noisy observations. Furthermore, we allow a time-varying noise and regularization covariance in our presented convergence result which mimic adaptive regularization schemes. In turn we present three adaptive regularization schemes, which are highlighted from both the deterministic and Bayesian approaches for inverse problems, which include bilevel optimization, the MAP formulation and covariance learning. We numerically test these schemes and the theory on linear and nonlinear partial differential equations, where they outperform the non-adaptive TEKI and EKI.

preprint2021arXiv

Consistency analysis of bilevel data-driven learning in inverse problems

One fundamental problem when solving inverse problems is how to find regularization parameters. This article considers solving this problem using data-driven bilevel optimization, i.e. we consider the adaptive learning of the regularization parameter from data by means of optimization. This approach can be interpreted as solving an empirical risk minimization problem, and we analyze its performance in the large data sample size limit for general nonlinear problems. We demonstrate how to implement our framework on linear inverse problems, where we can further show the inverse accuracy does not depend on the ambient space dimension. To reduce the associated computational cost, online numerical schemes are derived using the stochastic gradient descent method. We prove convergence of these numerical schemes under suitable assumptions on the forward problem. Numerical experiments are presented illustrating the theoretical results and demonstrating the applicability and efficiency of the proposed approaches for various linear and nonlinear inverse problems, including Darcy flow, the eikonal equation, and an image denoising example.

preprint2021arXiv

Dimension Independent Generalization Error by Stochastic Gradient Descent

One classical canon of statistics is that large models are prone to overfitting, and model selection procedures are necessary for high dimensional data. However, many overparameterized models, such as neural networks, perform very well in practice, although they are often trained with simple online methods and regularization. The empirical success of overparameterized models, which is often known as benign overfitting, motivates us to have a new look at the statistical generalization theory for online optimization. In particular, we present a general theory on the generalization error of stochastic gradient descent (SGD) solutions for both convex and locally convex loss functions. We further discuss data and model conditions that lead to a ``low effective dimension". Under these conditions, we show that the generalization error either does not depend on the ambient dimension $p$ or depends on $p$ via a poly-logarithmic factor. We also demonstrate that in several widely used statistical models, the ``low effective dimension'' arises naturally in overparameterized settings. The studied statistical applications include both convex models such as linear regression and logistic regression and non-convex models such as $M$-estimator and two-layer neural networks.

preprint2020arXiv

Accelerating Metropolis-within-Gibbs sampler with localized computations of differential equations

Inverse problem is ubiquitous in science and engineering, and Bayesian methodologies are often used to infer the underlying parameters. For high dimensional temporal-spatial models, classical Markov chain Monte Carlo (MCMC) methods are often slow to converge, and it is necessary to apply Metropolis-within-Gibbs (MwG) sampling on parameter blocks. However, the computation cost of each MwG iteration is typically $O(n^2)$, where $n$ is the model dimension. This can be too expensive in practice. This paper introduces a new reduced computation method to bring down the computation cost to $O(n)$, for the inverse initial value problem of a stochastic differential equation (SDE) with local interactions. The key observation is that each MwG proposal is only different from the original iterate at one parameter block, and this difference will only propagate within a local domain in the SDE computations. Therefore we can approximate the global SDE computation with a surrogate updated only within the local domain for reduced computation cost. Both theoretically and numerically, we show that the approximation errors can be controlled by the local domain size. We discuss how to implement the local computation scheme using Euler--Maruyama and 4th order Runge--Kutta methods. We numerically demonstrate the performance of the proposed method with the Lorenz 96 model and a linear stochastic flow model.

preprint2020arXiv

Analysis of a localised nonlinear Ensemble Kalman Bucy Filter with complete and accurate observations

Concurrent observation technologies have made high-precision real-time data available in large quantities. Data assimilation (DA) is concerned with how to combine this data with physical models to produce accurate predictions. For spatial-temporal models, the Ensemble Kalman Filter with proper localization techniques is considered to be a state-of-the-art DA methodology. This article proposes and investigates a localized Ensemble Kalman Bucy Filter (l-EnKBF) for nonlinear models with short-range interactions. We derive dimension-independent and component-wise error bounds and show the long time path-wise error only has logarithmic dependence on the time range. The theoretical results are verified through some simple numerical tests.

preprint2020arXiv

On Stationary-Point Hitting Time and Ergodicity of Stochastic Gradient Langevin Dynamics

Stochastic gradient Langevin dynamics (SGLD) is a fundamental algorithm in stochastic optimization. Recent work by Zhang et al. [2017] presents an analysis for the hitting time of SGLD for the first and second order stationary points. The proof in Zhang et al. [2017] is a two-stage procedure through bounding the Cheeger's constant, which is rather complicated and leads to loose bounds. In this paper, using intuitions from stochastic differential equations, we provide a direct analysis for the hitting times of SGLD to the first and second order stationary points. Our analysis is straightforward. It only relies on basic linear algebra and probability theory tools. Our direct analysis also leads to tighter bounds comparing to Zhang et al. [2017] and shows the explicit dependence of the hitting time on different factors, including dimensionality, smoothness, noise strength, and step size effects. Under suitable conditions, we show that the hitting time of SGLD to first-order stationary points can be dimension-independent. Moreover, we apply our analysis to study several important online estimation problems in machine learning, including linear regression, matrix factorization, and online PCA.

preprint2020arXiv

Spectral Gap of Replica Exchange Langevin Diffusion on Mixture Distributions

Langevin diffusion (LD) is one of the main workhorses for sampling problems. However, its convergence rate can be significantly reduced if the target distribution is a mixture of multiple densities, especially when each component concentrates around a different mode. Replica exchange Langevin diffusion (ReLD) is a sampling method that can circumvent this issue. In particular, ReLD adds another LD sampling a high-temperature version of the target density, and exchange the locations of two LDs according to a Metropolis-Hasting type of law. This approach can be further extended to multiple replica exchange Langevin diffusion (mReLD), where $K$ additional LDs are added to sample distributions at different temperatures and exchanges take place between neighboring-temperature processes. While ReLD and mReLD have been used extensively in statistical physics, molecular dynamics, and other applications, there is little existing analysis on its convergence rate and choices of temperatures. This paper closes these gaps assuming the target distribution is a mixture of log-concave densities. We show ReLD can obtain constant or even better convergence rates even when the density components of the mixture concentrate around isolated modes. We also show using mReLD with $K$ additional LDs can achieve the same result while the exchange frequency only needs to be $(1/K)$-th power of the one in ReLD.