Source author record

Yingbin Liang

Yingbin Liang appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Machine Learning math.OC Information Theory math.IT Cryptography and Security math.ST Statistics Theory

Catalog footprint

What is connected

36works

7topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2026arXiv

Breaking the Computational Barrier: Provably Efficient Actor-Critic for Low-Rank MDPs

Reinforcement learning (RL) is a fundamental framework for sequential decision-making, in which an agent learns an optimal policy through interactions with an unknown environment. In settings with function approximation, many existing RL algorithms achieve favorable sample complexity, but often rely on computationally intractable oracles. In this paper, we use supervised learning as a computational proxy to establish a clear hierarchy of commonly adopted RL oracles under low-rank Markov Decision Processes (MDPs). This hierarchy shows that policy evaluation is the most computationally efficient oracle, provided that supervised learning can be efficiently solved. Motivated by this observation, we propose a novel optimistic actor-critic algorithm that relies solely on the policy evaluation oracle. We prove that our algorithm outperforms the existing sample complexity guarantees for low-rank MDPs while avoiding computationally expensive planning or optimization oracles commonly assumed in prior works. We further extend our theoretical results to approximately low-rank MDPs and demonstrate that this setting captures a broad class of real-world environments. Finally, we validate our theoretical results with experiments on several standard Gym environments.

preprint2022arXiv

A Primal-Dual Approach to Bilevel Optimization with Multiple Inner Minima

Bilevel optimization has found extensive applications in modern machine learning problems such as hyperparameter optimization, neural architecture search, meta-learning, etc. While bilevel problems with a unique inner minimal point (e.g., where the inner function is strongly convex) are well understood, such a problem with multiple inner minimal points remains to be challenging and open. Existing algorithms designed for such a problem were applicable to restricted situations and do not come with a full guarantee of convergence. In this paper, we adopt a reformulation of bilevel optimization to constrained optimization, and solve the problem via a primal-dual bilevel optimization (PDBO) algorithm. PDBO not only addresses the multiple inner minima challenge, but also features fully first-order efficiency without involving second-order Hessian and Jacobian computations, as opposed to most existing gradient-based bilevel algorithms. We further characterize the convergence rate of PDBO, which serves as the first known non-asymptotic convergence guarantee for bilevel optimization with multiple inner minima. Our experiments demonstrate desired performance of the proposed approach.

preprint2022arXiv

Data Sampling Affects the Complexity of Online SGD over Dependent Data

Conventional machine learning applications typically assume that data samples are independently and identically distributed (i.i.d.). However, practical scenarios often involve a data-generating process that produces highly dependent data samples, which are known to heavily bias the stochastic optimization process and slow down the convergence of learning. In this paper, we conduct a fundamental study on how different stochastic data sampling schemes affect the sample complexity of online stochastic gradient descent (SGD) over highly dependent data. Specifically, with a $ϕ$-mixing model of data dependence, we show that online SGD with proper periodic data-subsampling achieves an improved sample complexity over the standard online SGD in the full spectrum of the data dependence level. Interestingly, even subsampling a subset of data samples can accelerate the convergence of online SGD over highly dependent data. Moreover, we show that online SGD with mini-batch sampling can further substantially improve the sample complexity over online SGD with periodic data-subsampling over highly dependent data. Numerical experiments validate our theoretical results.

preprint2022arXiv

Lower Bounds and Accelerated Algorithms for Bilevel Optimization

Bilevel optimization has recently attracted growing interests due to its wide applications in modern machine learning problems. Although recent studies have characterized the convergence rate for several such popular algorithms, it is still unclear how much further these convergence rates can be improved. In this paper, we address this fundamental question from two perspectives. First, we provide the first-known lower complexity bounds of $\widetildeΩ(\frac{1}{\sqrt{μ_x}μ_y})$ and $\widetilde Ω\big(\frac{1}{\sqrtε}\min\{\frac{1}{μ_y},\frac{1}{\sqrt{ε^{3}}}\}\big)$ respectively for strongly-convex-strongly-convex and convex-strongly-convex bilevel optimizations. Second, we propose an accelerated bilevel optimizer named AccBiO, for which we provide the first-known complexity bounds without the gradient boundedness assumption (which was made in existing analyses) under the two aforementioned geometries. We also provide significantly tighter upper bounds than the existing complexity when the bounded gradient assumption does hold. We show that AccBiO achieves the optimal results (i.e., the upper and lower bounds match up to logarithmic factors) when the inner-level problem takes a quadratic form with a constant-level condition number. Interestingly, our lower bounds under both geometries are larger than the corresponding optimal complexities of minimax optimization, establishing that bilevel optimization is provably more challenging than minimax optimization.

preprint2022arXiv

Model-Based Offline Meta-Reinforcement Learning with Regularization

Existing offline reinforcement learning (RL) methods face a few major challenges, particularly the distributional shift between the learned policy and the behavior policy. Offline Meta-RL is emerging as a promising approach to address these challenges, aiming to learn an informative meta-policy from a collection of tasks. Nevertheless, as shown in our empirical studies, offline Meta-RL could be outperformed by offline single-task RL methods on tasks with good quality of datasets, indicating that a right balance has to be delicately calibrated between "exploring" the out-of-distribution state-actions by following the meta-policy and "exploiting" the offline dataset by staying close to the behavior policy. Motivated by such empirical analysis, we explore model-based offline Meta-RL with regularized Policy Optimization (MerPO), which learns a meta-model for efficient task structure inference and an informative meta-policy for safe exploration of out-of-distribution state-actions. In particular, we devise a new meta-Regularized model-based Actor-Critic (RAC) method for within-task policy optimization, as a key building block of MerPO, using conservative policy evaluation and regularized policy improvement; and the intrinsic tradeoff therein is achieved via striking the right balance between two regularizers, one based on the behavior policy and the other on the meta-policy. We theoretically show that the learnt policy offers guaranteed improvement over both the behavior policy and the meta-policy, thus ensuring the performance improvement on new tasks via offline Meta-RL. Experiments corroborate the superior performance of MerPO over existing offline Meta-RL methods.

preprint2022arXiv

On the Convergence Theory for Hessian-Free Bilevel Algorithms

Bilevel optimization has arisen as a powerful tool in modern machine learning. However, due to the nested structure of bilevel optimization, even gradient-based methods require second-order derivative approximations via Jacobian- or/and Hessian-vector computations, which can be costly and unscalable in practice. Recently, Hessian-free bilevel schemes have been proposed to resolve this issue, where the general idea is to use zeroth- or first-order methods to approximate the full hypergradient of the bilevel problem. However, we empirically observe that such approximation can lead to large variance and unstable training, but estimating only the response Jacobian matrix as a partial component of the hypergradient turns out to be extremely effective. To this end, we propose a new Hessian-free method, which adopts the zeroth-order-like method to approximate the response Jacobian matrix via taking difference between two optimization paths. Theoretically, we provide the convergence rate analysis for the proposed algorithms, where our key challenge is to characterize the approximation and smoothness properties of the trajectory-dependent estimator, which can be of independent interest. This is the first known convergence rate result for this type of Hessian-free bilevel algorithms. Experimentally, we demonstrate that the proposed algorithms outperform baseline bilevel optimizers on various bilevel problems. Particularly, in our experiment on few-shot meta-learning with ResNet-12 network over the miniImageNet dataset, we show that our algorithm outperforms baseline meta-learning algorithms, while other baseline bilevel optimizers do not solve such meta-learning problems within a comparable time frame.

preprint2022arXiv

PER-ETD: A Polynomially Efficient Emphatic Temporal Difference Learning Method

Emphatic temporal difference (ETD) learning (Sutton et al., 2016) is a successful method to conduct the off-policy value function evaluation with function approximation. Although ETD has been shown to converge asymptotically to a desirable value function, it is well-known that ETD often encounters a large variance so that its sample complexity can increase exponentially fast with the number of iterations. In this work, we propose a new ETD method, called PER-ETD (i.e., PEriodically Restarted-ETD), which restarts and updates the follow-on trace only for a finite period for each iteration of the evaluation parameter. Further, PER-ETD features a design of the logarithmical increase of the restart period with the number of iterations, which guarantees the best trade-off between the variance and bias and keeps both vanishing sublinearly. We show that PER-ETD converges to the same desirable fixed point as ETD, but improves the exponential sample complexity of ETD to be polynomials. Our experiments validate the superior performance of PER-ETD and its advantage over ETD.

preprint2022arXiv

Provable Generalization of Overparameterized Meta-learning Trained with SGD

Despite the superior empirical success of deep meta-learning, theoretical understanding of overparameterized meta-learning is still limited. This paper studies the generalization of a widely used meta-learning approach, Model-Agnostic Meta-Learning (MAML), which aims to find a good initialization for fast adaptation to new tasks. Under a mixed linear regression model, we analyze the generalization properties of MAML trained with SGD in the overparameterized regime. We provide both upper and lower bounds for the excess risk of MAML, which captures how SGD dynamics affect these generalization bounds. With such sharp characterizations, we further explore how various learning parameters impact the generalization capability of overparameterized MAML, including explicitly identifying typical data and task distributions that can achieve diminishing generalization error with overparameterization, and characterizing the impact of adaptation learning rate on both excess risk and the early stopping time. Our theoretical findings are further validated by experiments.

preprint2022arXiv

Will Bilevel Optimizers Benefit from Loops

Bilevel optimization has arisen as a powerful tool for solving a variety of machine learning problems. Two current popular bilevel optimizers AID-BiO and ITD-BiO naturally involve solving one or two sub-problems, and consequently, whether we solve these problems with loops (that take many iterations) or without loops (that take only a few iterations) can significantly affect the overall computational efficiency. Existing studies in the literature cover only some of those implementation choices, and the complexity bounds available are not refined enough to enable rigorous comparison among different implementations. In this paper, we first establish unified convergence analysis for both AID-BiO and ITD-BiO that are applicable to all implementation choices of loops. We then specialize our results to characterize the computational complexity for all implementations, which enable an explicit comparison among them. Our result indicates that for AID-BiO, the loop for estimating the optimal point of the inner function is beneficial for overall efficiency, although it causes higher complexity for each update step, and the loop for approximating the outer-level Hessian-inverse-vector product reduces the gradient complexity. For ITD-BiO, the two loops always coexist, and our convergence upper and lower bounds show that such loops are necessary to guarantee a vanishing convergence error, whereas the no-loop scheme suffers from an unavoidable non-vanishing convergence error. Our numerical experiments further corroborate our theoretical results.

preprint2021arXiv

Improving Sample Complexity Bounds for (Natural) Actor-Critic Algorithms

The actor-critic (AC) algorithm is a popular method to find an optimal policy in reinforcement learning. In the infinite horizon scenario, the finite-sample convergence rate for the AC and natural actor-critic (NAC) algorithms has been established recently, but under independent and identically distributed (i.i.d.) sampling and single-sample update at each iteration. In contrast, this paper characterizes the convergence rate and sample complexity of AC and NAC under Markovian sampling, with mini-batch data for each iteration, and with actor having general policy class approximation. We show that the overall sample complexity for a mini-batch AC to attain an $ε$-accurate stationary point improves the best known sample complexity of AC by an order of $\mathcal{O}(ε^{-1}\log(1/ε))$, and the overall sample complexity for a mini-batch NAC to attain an $ε$-accurate globally optimal point improves the existing sample complexity of NAC by an order of $\mathcal{O}(ε^{-1}/\log(1/ε))$. Moreover, the sample complexity of AC and NAC characterized in this work outperforms that of policy gradient (PG) and natural policy gradient (NPG) by a factor of $\mathcal{O}((1-γ)^{-3})$ and $\mathcal{O}((1-γ)^{-4}ε^{-1}/\log(1/ε))$, respectively. This is the first theoretical study establishing that AC and NAC attain orderwise performance improvement over PG and NPG under infinite horizon due to the incorporation of critic.

preprint2021arXiv

Proximal Gradient Descent-Ascent: Variable Convergence under KŁ Geometry

The gradient descent-ascent (GDA) algorithm has been widely applied to solve minimax optimization problems. In order to achieve convergent policy parameters for minimax optimization, it is important that GDA generates convergent variable sequences rather than convergent sequences of function values or gradient norms. However, the variable convergence of GDA has been proved only under convexity geometries, and there lacks understanding for general nonconvex minimax optimization. This paper fills such a gap by studying the convergence of a more general proximal-GDA for regularized nonconvex-strongly-concave minimax optimization. Specifically, we show that proximal-GDA admits a novel Lyapunov function, which monotonically decreases in the minimax optimization process and drives the variable sequence to a critical point. By leveraging this Lyapunov function and the KŁ geometry that parameterizes the local geometries of general nonconvex functions, we formally establish the variable convergence of proximal-GDA to a critical point $x^*$, i.e., $x_t\to x^*, y_t\to y^*(x^*)$. Furthermore, over the full spectrum of the KŁ-parameterized geometry, we show that proximal-GDA achieves different types of convergence rates ranging from sublinear convergence up to finite-step convergence, depending on the geometry associated with the KŁ parameter. This is the first theoretical result on the variable convergence for nonconvex minimax optimization.

preprint2020arXiv

Analysis of Q-learning with Adaptation and Momentum Restart for Gradient Descent

Existing convergence analyses of Q-learning mostly focus on the vanilla stochastic gradient descent (SGD) type of updates. Despite the Adaptive Moment Estimation (Adam) has been commonly used for practical Q-learning algorithms, there has not been any convergence guarantee provided for Q-learning with such type of updates. In this paper, we first characterize the convergence rate for Q-AMSGrad, which is the Q-learning algorithm with AMSGrad update (a commonly adopted alternative of Adam for theoretical analysis). To further improve the performance, we propose to incorporate the momentum restart scheme to Q-AMSGrad, resulting in the so-called Q-AMSGradR algorithm. The convergence rate of Q-AMSGradR is also established. Our experiments on a linear quadratic regulator problem show that the two proposed Q-learning algorithms outperform the vanilla Q-learning with SGD updates. The two algorithms also exhibit significantly better performance than the DQN learning method over a batch of Atari 2600 games.

preprint2020arXiv

Guaranteed Recovery of One-Hidden-Layer Neural Networks via Cross Entropy

We study model recovery for data classification, where the training labels are generated from a one-hidden-layer neural network with sigmoid activations, also known as a single-layer feedforward network, and the goal is to recover the weights of the neural network. We consider two network models, the fully-connected network (FCN) and the non-overlapping convolutional neural network (CNN). We prove that with Gaussian inputs, the empirical risk based on cross entropy exhibits strong convexity and smoothness {\em uniformly} in a local neighborhood of the ground truth, as soon as the sample complexity is sufficiently large. This implies that if initialized in this neighborhood, gradient descent converges linearly to a critical point that is provably close to the ground truth. Furthermore, we show such an initialization can be obtained via the tensor method. This establishes the global convergence guarantee for empirical risk minimization using cross entropy via gradient descent for learning one-hidden-layer neural networks, at the near-optimal sample and computational complexity with respect to the network input dimension without unrealistic assumptions such as requiring a fresh set of samples at each iteration.

preprint2020arXiv

History-Gradient Aided Batch Size Adaptation for Variance Reduced Algorithms

Variance-reduced algorithms, although achieve great theoretical performance, can run slowly in practice due to the periodic gradient estimation with a large batch of data. Batch-size adaptation thus arises as a promising approach to accelerate such algorithms. However, existing schemes either apply prescribed batch-size adaption rule or exploit the information along optimization path via additional backtracking and condition verification steps. In this paper, we propose a novel scheme, which eliminates backtracking line search but still exploits the information along optimization path by adapting the batch size via history stochastic gradients. We further theoretically show that such a scheme substantially reduces the overall complexity for popular variance-reduced algorithms SVRG and SARAH/SPIDER for both conventional nonconvex optimization and reinforcement learning problems. To this end, we develop a new convergence analysis framework to handle the dependence of the batch size on history stochastic gradients. Extensive experiments validate the effectiveness of the proposed batch-size adaptation scheme.

preprint2020arXiv

Momentum Q-learning with Finite-Sample Convergence Guarantee

Existing studies indicate that momentum ideas in conventional optimization can be used to improve the performance of Q-learning algorithms. However, the finite-sample analysis for momentum-based Q-learning algorithms is only available for the tabular case without function approximations. This paper analyzes a class of momentum-based Q-learning algorithms with finite-sample guarantee. Specifically, we propose the MomentumQ algorithm, which integrates the Nesterov's and Polyak's momentum schemes, and generalizes the existing momentum-based Q-learning algorithms. For the infinite state-action space case, we establish the convergence guarantee for MomentumQ with linear function approximations and Markovian sampling. In particular, we characterize the finite-sample convergence rate which is provably faster than the vanilla Q-learning. This is the first finite-sample analysis for momentum-based Q-learning algorithms with function approximations. For the tabular case under synchronous sampling, we also obtain a finite-sample convergence rate that is slightly better than the SpeedyQ \citep{azar2011speedy} when choosing a special family of step sizes. Finally, we demonstrate through various experiments that the proposed MomentumQ outperforms other momentum-based Q-learning algorithms.

preprint2020arXiv

Non-asymptotic Convergence Analysis of Two Time-scale (Natural) Actor-Critic Algorithms

As an important type of reinforcement learning algorithms, actor-critic (AC) and natural actor-critic (NAC) algorithms are often executed in two ways for finding optimal policies. In the first nested-loop design, actor's one update of policy is followed by an entire loop of critic's updates of the value function, and the finite-sample analysis of such AC and NAC algorithms have been recently well established. The second two time-scale design, in which actor and critic update simultaneously but with different learning rates, has much fewer tuning parameters than the nested-loop design and is hence substantially easier to implement. Although two time-scale AC and NAC have been shown to converge in the literature, the finite-sample convergence rate has not been established. In this paper, we provide the first such non-asymptotic convergence rate for two time-scale AC and NAC under Markovian sampling and with actor having general policy class approximation. We show that two time-scale AC requires the overall sample complexity at the order of $\mathcal{O}(ε^{-2.5}\log^3(ε^{-1}))$ to attain an $ε$-accurate stationary point, and two time-scale NAC requires the overall sample complexity at the order of $\mathcal{O}(ε^{-4}\log^2(ε^{-1}))$ to attain an $ε$-accurate global optimal point. We develop novel techniques for bounding the bias error of the actor due to dynamically changing Markovian sampling and for analyzing the convergence rate of the linear critic with dynamically changing base functions and transition kernel.

preprint2020arXiv

Non-asymptotic Convergence of Adam-type Reinforcement Learning Algorithms under Markovian Sampling

Despite the wide applications of Adam in reinforcement learning (RL), the theoretical convergence of Adam-type RL algorithms has not been established. This paper provides the first such convergence analysis for two fundamental RL algorithms of policy gradient (PG) and temporal difference (TD) learning that incorporate AMSGrad updates (a standard alternative of Adam in theoretical analysis), referred to as PG-AMSGrad and TD-AMSGrad, respectively. Moreover, our analysis focuses on Markovian sampling for both algorithms. We show that under general nonlinear function approximation, PG-AMSGrad with a constant stepsize converges to a neighborhood of a stationary point at the rate of $\mathcal{O}(1/T)$ (where $T$ denotes the number of iterations), and with a diminishing stepsize converges exactly to a stationary point at the rate of $\mathcal{O}(\log^2 T/\sqrt{T})$. Furthermore, under linear function approximation, TD-AMSGrad with a constant stepsize converges to a neighborhood of the global optimum at the rate of $\mathcal{O}(1/T)$, and with a diminishing stepsize converges exactly to the global optimum at the rate of $\mathcal{O}(\log T/\sqrt{T})$. Our study develops new techniques for analyzing the Adam-type RL algorithms under Markovian sampling.

preprint2020arXiv

Proximal Gradient Algorithm with Momentum and Flexible Parameter Restart for Nonconvex Optimization

Various types of parameter restart schemes have been proposed for accelerated gradient algorithms to facilitate their practical convergence in convex optimization. However, the convergence properties of accelerated gradient algorithms under parameter restart remain obscure in nonconvex optimization. In this paper, we propose a novel accelerated proximal gradient algorithm with parameter restart (named APG-restart) for solving nonconvex and nonsmooth problems. Our APG-restart is designed to 1) allow for adopting flexible parameter restart schemes that cover many existing ones; 2) have a global sub-linear convergence rate in nonconvex and nonsmooth optimization; and 3) have guaranteed convergence to a critical point and have various types of asymptotic convergence rates depending on the parameterization of local geometry in nonconvex and nonsmooth optimization. Numerical experiments demonstrate the effectiveness of our proposed algorithm.

preprint2020arXiv

Reanalysis of Variance Reduced Temporal Difference Learning

Temporal difference (TD) learning is a popular algorithm for policy evaluation in reinforcement learning, but the vanilla TD can substantially suffer from the inherent optimization variance. A variance reduced TD (VRTD) algorithm was proposed by Korda and La (2015), which applies the variance reduction technique directly to the online TD learning with Markovian samples. In this work, we first point out the technical errors in the analysis of VRTD in Korda and La (2015), and then provide a mathematically solid analysis of the non-asymptotic convergence of VRTD and its variance reduction performance. We show that VRTD is guaranteed to converge to a neighborhood of the fixed-point solution of TD at a linear convergence rate. Furthermore, the variance error (for both i.i.d.\ and Markovian sampling) and the bias error (for Markovian sampling) of VRTD are significantly reduced by the batch size of variance reduction in comparison to those of vanilla TD. As a result, the overall computational complexity of VRTD to attain a given accurate solution outperforms that of TD under Markov sampling and outperforms that of TD under i.i.d.\ sampling for a sufficiently small conditional number.

preprint2020arXiv

Robust Stochastic Bandit Algorithms under Probabilistic Unbounded Adversarial Attack

The multi-armed bandit formalism has been extensively studied under various attack models, in which an adversary can modify the reward revealed to the player. Previous studies focused on scenarios where the attack value either is bounded at each round or has a vanishing probability of occurrence. These models do not capture powerful adversaries that can catastrophically perturb the revealed reward. This paper investigates the attack model where an adversary attacks with a certain probability at each round, and its attack value can be arbitrary and unbounded if it attacks. Furthermore, the attack value does not necessarily follow a statistical distribution. We propose a novel sample median-based and exploration-aided UCB algorithm (called med-E-UCB) and a median-based $ε$-greedy algorithm (called med-$ε$-greedy). Both of these algorithms are provably robust to the aforementioned attack model. More specifically we show that both algorithms achieve $\mathcal{O}(\log T)$ pseudo-regret (i.e., the optimal regret without attacks). We also provide a high probability guarantee of $\mathcal{O}(\log T)$ regret with respect to random rewards and random occurrence of attacks. These bounds are achieved under arbitrary and unbounded reward perturbation as long as the attack probability does not exceed a certain constant threshold. We provide multiple synthetic simulations of the proposed algorithms to verify these claims and showcase the inability of existing techniques to achieve sublinear regret. We also provide experimental results of the algorithm operating in a cognitive radio setting using multiple software-defined radios.

preprint2020arXiv

SpiderBoost and Momentum: Faster Stochastic Variance Reduction Algorithms

SARAH and SPIDER are two recently developed stochastic variance-reduced algorithms, and SPIDER has been shown to achieve a near-optimal first-order oracle complexity in smooth nonconvex optimization. However, SPIDER uses an accuracy-dependent stepsize that slows down the convergence in practice, and cannot handle objective functions that involve nonsmooth regularizers. In this paper, we propose SpiderBoost as an improved scheme, which allows to use a much larger constant-level stepsize while maintaining the same near-optimal oracle complexity, and can be extended with proximal mapping to handle composite optimization (which is nonsmooth and nonconvex) with provable convergence guarantee. In particular, we show that proximal SpiderBoost achieves an oracle complexity of $\mathcal{O}(\min\{n^{1/2}ε^{-2},ε^{-3}\})$ in composite nonconvex optimization, improving the state-of-the-art result by a factor of $\mathcal{O}(\min\{n^{1/6},ε^{-1/3}\})$. We further develop a novel momentum scheme to accelerate SpiderBoost for composite optimization, which achieves the near-optimal oracle complexity in theory and substantial improvement in experiments.

preprint2020arXiv

Theoretical Convergence of Multi-Step Model-Agnostic Meta-Learning

As a popular meta-learning approach, the model-agnostic meta-learning (MAML) algorithm has been widely used due to its simplicity and effectiveness. However, the convergence of the general multi-step MAML still remains unexplored. In this paper, we develop a new theoretical framework to provide such convergence guarantee for two types of objective functions that are of interest in practice: (a) resampling case (e.g., reinforcement learning), where loss functions take the form in expectation and new data are sampled as the algorithm runs; and (b) finite-sum case (e.g., supervised learning), where loss functions take the finite-sum form with given samples. For both cases, we characterize the convergence rate and the computational complexity to attain an $ε$-accurate solution for multi-step MAML in the general nonconvex setting. In particular, our results suggest that an inner-stage stepsize needs to be chosen inversely proportional to the number $N$ of inner-stage steps in order for $N$-step MAML to have guaranteed convergence. From the technical perspective, we develop novel techniques to deal with the nested structure of the meta gradient for multi-step MAML, which can be of independent interest.

preprint2020arXiv

When Will Generative Adversarial Imitation Learning Algorithms Attain Global Convergence

Generative adversarial imitation learning (GAIL) is a popular inverse reinforcement learning approach for jointly optimizing policy and reward from expert trajectories. A primary question about GAIL is whether applying a certain policy gradient algorithm to GAIL attains a global minimizer (i.e., yields the expert policy), for which existing understanding is very limited. Such global convergence has been shown only for the linear (or linear-type) MDP and linear (or linearizable) reward. In this paper, we study GAIL under general MDP and for nonlinear reward function classes (as long as the objective function is strongly concave with respect to the reward parameter). We characterize the global convergence with a sublinear rate for a broad range of commonly used policy gradient algorithms, all of which are implemented in an alternating manner with stochastic gradient ascent for reward update, including projected policy gradient (PPG)-GAIL, Frank-Wolfe policy gradient (FWPG)-GAIL, trust region policy optimization (TRPO)-GAIL and natural policy gradient (NPG)-GAIL. This is the first systematic theoretical study of GAIL for global convergence.

preprint2016arXiv

A Kernel-Based Nonparametric Test for Anomaly Detection over Line Networks

The nonparametric problem of detecting existence of an anomalous interval over a one dimensional line network is studied. Nodes corresponding to an anomalous interval (if exists) receive samples generated by a distribution q, which is different from the distribution p that generates samples for other nodes. If anomalous interval does not exist, then all nodes receive samples generated by p. It is assumed that the distributions p and q are arbitrary, and are unknown. In order to detect whether an anomalous interval exists, a test is built based on mean embeddings of distributions into a reproducing kernel Hilbert space (RKHS) and the metric of maximummean discrepancy (MMD). It is shown that as the network size n goes to infinity, if the minimum length of candidate anomalous intervals is larger than a threshold which has the order O(log n), the proposed test is asymptotically successful, i.e., the probability of detection error approaches zero asymptotically. An efficient algorithm to perform the test with substantial computational complexity reduction is proposed, and is shown to be asymptotically successful if the condition on the minimum length of candidate anomalous interval is satisfied. Numerical results are provided, which are consistent with the theoretical results.

preprint2016arXiv

Nonparametric Detection of Anomalous Data Streams

A nonparametric anomalous hypothesis testing problem is investigated, in which there are totally n sequences with s anomalous sequences to be detected. Each typical sequence contains m independent and identically distributed (i.i.d.) samples drawn from a distribution p, whereas each anomalous sequence contains m i.i.d. samples drawn from a distribution q that is distinct from p. The distributions p and q are assumed to be unknown in advance. Distribution-free tests are constructed using maximum mean discrepancy as the metric, which is based on mean embeddings of distributions into a reproducing kernel Hilbert space. The probability of error is bounded as a function of the sample size m, the number s of anomalous sequences and the number n of sequences. It is then shown that with s known, the constructed test is exponentially consistent if m is greater than a constant factor of log n, for any p and q, whereas with s unknown, m should has an order strictly greater than log n. Furthermore, it is shown that no test can be consistent for arbitrary p and q if m is less than a constant factor of log n, thus the order-level optimality of the proposed test is established. Numerical results are provided to demonstrate that our tests outperform (or perform as well as) the tests based on other competitive approaches under various cases.

preprint2016arXiv

Reshaped Wirtinger Flow and Incremental Algorithm for Solving Quadratic System of Equations

We study the phase retrieval problem, which solves quadratic system of equations, i.e., recovers a vector $\boldsymbol{x}\in \mathbb{R}^n$ from its magnitude measurements $y_i=|\langle \boldsymbol{a}_i, \boldsymbol{x}\rangle|, i=1,..., m$. We develop a gradient-like algorithm (referred to as RWF representing reshaped Wirtinger flow) by minimizing a nonconvex nonsmooth loss function. In comparison with existing nonconvex Wirtinger flow (WF) algorithm \cite{candes2015phase}, although the loss function becomes nonsmooth, it involves only the second power of variable and hence reduces the complexity. We show that for random Gaussian measurements, RWF enjoys geometric convergence to a global optimal point as long as the number $m$ of measurements is on the order of $n$, the dimension of the unknown $\boldsymbol{x}$. This improves the sample complexity of WF, and achieves the same sample complexity as truncated Wirtinger flow (TWF) \cite{chen2015solving}, but without truncation in gradient loop. Furthermore, RWF costs less computationally than WF, and runs faster numerically than both WF and TWF. We further develop the incremental (stochastic) reshaped Wirtinger flow (IRWF) and show that IRWF converges linearly to the true signal. We further establish performance guarantee of an existing Kaczmarz method for the phase retrieval problem based on its connection to IRWF. We also empirically demonstrate that IRWF outperforms existing ITWF algorithm (stochastic version of TWF) as well as other batch algorithms.

preprint2015arXiv

Universal Outlying sequence detection For Continuous Observations

The following detection problem is studied, in which there are $M$ sequences of samples out of which one outlier sequence needs to be detected. Each typical sequence contains $n$ independent and identically distributed (i.i.d.) continuous observations from a known distribution $π$, and the outlier sequence contains $n$ i.i.d. observations from an outlier distribution $μ$, which is distinct from $π$, but otherwise unknown. A universal test based on KL divergence is built to approximate the maximum likelihood test, with known $π$ and unknown $μ$. A data-dependent partitions based KL divergence estimator is employed. Such a KL divergence estimator is further shown to converge to its true value exponentially fast when the density ratio satisfies $0<K_1\leq \frac{dμ}{dπ}\leq K_2$, where $K_1$ and $K_2$ are positive constants, and this further implies that the test is exponentially consistent. The performance of the test is compared with that of a recently introduced test for this problem based on the machine learning approach of maximum mean discrepancy (MMD). We identify regimes in which the KL divergence based test is better than the MMD based test.

preprint2014arXiv

An Information Theoretic Approach to Secret Sharing

A novel information theoretic approach is proposed to solve the secret sharing problem, in which a dealer distributes one or multiple secrets among a set of participants that for each secret only qualified sets of users can recover it by pooling their shares together while non-qualified sets of users obtain no information about the secret even if they pool their shares together. While existing secret sharing systems (implicitly) assume that communications between the dealer and participants are noiseless, this paper takes a more practical assumption that the dealer delivers shares to the participants via a noisy broadcast channel. An information theoretic approach is proposed, which exploits the channel as additional resources to achieve secret sharing requirements. In this way, secret sharing problems can be reformulated as equivalent secure communication problems via wiretap channels, and can be solved by employing powerful information theoretic security techniques. This approach is first developed for the classic secret sharing problem, in which only one secret is to be shared. This classic problem is shown to be equivalent to a communication problem over a compound wiretap channel. The lower and upper bounds on the secrecy capacity of the compound channel provide the corresponding bounds on the secret sharing rate. The power of the approach is further demonstrated by a more general layered multi-secret sharing problem, which is shown to be equivalent to the degraded broadcast multiple-input multiple-output (MIMO) channel with layered decoding and secrecy constraints. The secrecy capacity region for the degraded MIMO broadcast channel is characterized, which provides the secret sharing capacity region. Furthermore, these secure encoding schemes that achieve the secrecy capacity region provide an information theoretic scheme for sharing the secrets.

preprint2014arXiv

Parallel Gaussian Networks with a Common State-Cognitive Helper

A class of state-dependent parallel networks with a common state-cognitive helper, in which $K$ transmitters wish to send $K$ messages to their corresponding receivers over $K$ state-corrupted parallel channels, and a helper who knows the state information noncausally wishes to assist these receivers to cancel state interference. Furthermore, the helper also has its own message to be sent simultaneously to its corresponding receiver. Since the state information is known only to the helper, but not to the corresponding transmitters $1,\dots,K$, transmitter-side state cognition and receiver-side state interference are mismatched. Our focus is on the high state power regime, i.e., the state power goes to infinity. Three (sub)models are studied. Model I serves as a basic model, which consists of only one transmitter-receiver (with state corruption) pair in addition to a helper that assists the receiver to cancel state in addition to transmitting its own message. Model II consists of two transmitter-receiver pairs in addition to a helper, and only one receiver is interfered by a state sequence. Model III generalizes model I include multiple transmitter-receiver pairs with each receiver corrupted by independent state. For all models, inner and outer bounds on the capacity region are derived, and comparison of the two bounds leads to characterization of either full or partial boundary of the capacity region under various channel parameters.

preprint2014arXiv

The Capacity Region of the Source-Type Model for Secret Key and Private Key Generation

The problem of simultaneously generating a secret key (SK) and private key (PK) pair among three terminals via public discussion is investigated, in which each terminal observes a component of correlated sources. All three terminals are required to generate a common secret key concealed from an eavesdropper that has access to public discussion, while two designated terminals are required to generate an extra private key concealed from both the eavesdropper and the remaining terminal. An outer bound on the SK-PK capacity region was established in [1], and was shown to be achievable for one case. In this paper, achievable schemes are designed to achieve the outer bound for the remaining two cases, and hence the SK-PK capacity region is established in general. The main technique lies in the novel design of a random binning-joint decoding scheme that achieves the existing outer bound.

preprint2013arXiv

Sharp Threshold for Multivariate Multi-Response Linear Regression via Block Regularized Lasso

In this paper, we investigate a multivariate multi-response (MVMR) linear regression problem, which contains multiple linear regression models with differently distributed design matrices, and different regression and output vectors. The goal is to recover the support union of all regression vectors using $l_1/l_2$-regularized Lasso. We characterize sufficient and necessary conditions on sample complexity \emph{as a sharp threshold} to guarantee successful recovery of the support union. Namely, if the sample size is above the threshold, then $l_1/l_2$-regularized Lasso correctly recovers the support union; and if the sample size is below the threshold, $l_1/l_2$-regularized Lasso fails to recover the support union. In particular, the threshold precisely captures the impact of the sparsity of regression vectors and the statistical properties of the design matrices on sample complexity. Therefore, the threshold function also captures the advantages of joint support union recovery using multi-task Lasso over individual support recovery using single-task Lasso.

preprint2009arXiv

Multiple-Input Multiple-Output Gaussian Broadcast Channels with Common and Confidential Messages

This paper considers the problem of the multiple-input multiple-output (MIMO) Gaussian broadcast channel with two receivers (receivers 1 and 2) and two messages: a common message intended for both receivers and a confidential message intended only for receiver 1 but needing to be kept asymptotically perfectly secure from receiver 2. A matrix characterization of the secrecy capacity region is established via a channel enhancement argument. The enhanced channel is constructed by first splitting receiver 1 into two virtual receivers and then enhancing only the virtual receiver that decodes the confidential message. The secrecy capacity region of the enhanced channel is characterized using an extremal entropy inequality previously established for characterizing the capacity region of a degraded compound MIMO Gaussian broadcast channel.

preprint2009arXiv

On the Compound MIMO Broadcast Channels with Confidential Messages

We study the compound multi-input multi-output (MIMO) broadcast channel with confidential messages (BCC), where one transmitter sends a common message to two receivers and two confidential messages respectively to each receiver. The channel state may take one of a finite set of states, and the transmitter knows the state set but does not know the realization of the state. We study achievable rates with perfect secrecy in the high SNR regime by characterizing an achievable secrecy degree of freedom (s.d.o.f.) region for two models, the Gaussian MIMO-BCC and the ergodic fading multi-input single-output (MISO)-BCC without a common message. We show that by exploiting an additional temporal dimension due to state variation in the ergodic fading model, the achievable s.d.o.f. region can be significantly improved compared to the Gaussian model with a constant state, although at the price of a larger delay.

preprint2007arXiv

Opportunistic Communications in an Orthogonal Multiaccess Relay Channel

The problem of resource allocation is studied for a two-user fading orthogonal multiaccess relay channel (MARC) where both users (sources) communicate with a destination in the presence of a relay. A half-duplex relay is considered that transmits on a channel orthogonal to that used by the sources. The instantaneous fading state between every transmit-receive pair in this network is assumed to be known at both the transmitter and receiver. Under an average power constraint at each source and the relay, the sum-rate for the achievable strategy of decode-and-forward (DF) is maximized over all power allocations (policies) at the sources and relay. It is shown that the sum-rate maximizing policy exploits the multiuser fading diversity to reveal the optimality of opportunistic channel use by each user. A geometric interpretation of the optimal power policy is also presented.

preprint2007arXiv

Secrecy Capacity Region of Fading Broadcast Channels

The fading broadcast channel with confidential messages (BCC) is investigated, where a source node has common information for two receivers (receivers 1 and 2), and has confidential information intended only for receiver 1. The confidential information needs to be kept as secret as possible from receiver 2. The channel state information (CSI) is assumed to be known at both the transmitter and the receivers. The secrecy capacity region is first established for the parallel Gaussian BCC, and the optimal source power allocations that achieve the boundary of the secrecy capacity region are derived. In particular, the secrecy capacity region is established for the Gaussian case of the Csiszar-Korner BCC model. The secrecy capacity results are then applied to give the ergodic secrecy capacity region for the fading BCC.

preprint2007arXiv

Secure Nested Codes for Type II Wiretap Channels

This paper considers the problem of secure coding design for a type II wiretap channel, where the main channel is noiseless and the eavesdropper channel is a general binary-input symmetric-output memoryless channel. The proposed secure error-correcting code has a nested code structure. Two secure nested coding schemes are studied for a type II Gaussian wiretap channel. The nesting is based on cosets of a good code sequence for the first scheme and on cosets of the dual of a good code sequence for the second scheme. In each case, the corresponding achievable rate-equivocation pair is derived based on the threshold behavior of good code sequences. The two secure coding schemes together establish an achievable rate-equivocation region, which almost covers the secrecy capacity-equivocation region in this case study. The proposed secure coding scheme is extended to a type II binary symmetric wiretap channel. A new achievable perfect secrecy rate, which improves upon the previously reported result by Thangaraj et al., is derived for this channel.

Yingbin Liang

What is connected

Connect this record

See the researcher in context

Building this map preview

36 published item(s)

Breaking the Computational Barrier: Provably Efficient Actor-Critic for Low-Rank MDPs

A Primal-Dual Approach to Bilevel Optimization with Multiple Inner Minima

Data Sampling Affects the Complexity of Online SGD over Dependent Data

Lower Bounds and Accelerated Algorithms for Bilevel Optimization

Model-Based Offline Meta-Reinforcement Learning with Regularization

On the Convergence Theory for Hessian-Free Bilevel Algorithms

PER-ETD: A Polynomially Efficient Emphatic Temporal Difference Learning Method

Provable Generalization of Overparameterized Meta-learning Trained with SGD

Will Bilevel Optimizers Benefit from Loops

Improving Sample Complexity Bounds for (Natural) Actor-Critic Algorithms

Proximal Gradient Descent-Ascent: Variable Convergence under KŁ Geometry

Analysis of Q-learning with Adaptation and Momentum Restart for Gradient Descent

Guaranteed Recovery of One-Hidden-Layer Neural Networks via Cross Entropy

History-Gradient Aided Batch Size Adaptation for Variance Reduced Algorithms

Momentum Q-learning with Finite-Sample Convergence Guarantee

Non-asymptotic Convergence Analysis of Two Time-scale (Natural) Actor-Critic Algorithms

Non-asymptotic Convergence of Adam-type Reinforcement Learning Algorithms under Markovian Sampling

Proximal Gradient Algorithm with Momentum and Flexible Parameter Restart for Nonconvex Optimization

Reanalysis of Variance Reduced Temporal Difference Learning

Robust Stochastic Bandit Algorithms under Probabilistic Unbounded Adversarial Attack

SpiderBoost and Momentum: Faster Stochastic Variance Reduction Algorithms

Theoretical Convergence of Multi-Step Model-Agnostic Meta-Learning

When Will Generative Adversarial Imitation Learning Algorithms Attain Global Convergence

A Kernel-Based Nonparametric Test for Anomaly Detection over Line Networks

Nonparametric Detection of Anomalous Data Streams

Reshaped Wirtinger Flow and Incremental Algorithm for Solving Quadratic System of Equations

Universal Outlying sequence detection For Continuous Observations

An Information Theoretic Approach to Secret Sharing

Parallel Gaussian Networks with a Common State-Cognitive Helper

The Capacity Region of the Source-Type Model for Secret Key and Private Key Generation

Sharp Threshold for Multivariate Multi-Response Linear Regression via Block Regularized Lasso

Multiple-Input Multiple-Output Gaussian Broadcast Channels with Common and Confidential Messages

On the Compound MIMO Broadcast Channels with Confidential Messages

Opportunistic Communications in an Orthogonal Multiaccess Relay Channel

Secrecy Capacity Region of Fading Broadcast Channels

Secure Nested Codes for Type II Wiretap Channels