Source author record

Tengyang Xie

Tengyang Xie appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Machine Learning Artificial Intelligence

Catalog footprint

What is connected

5works

2topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2022arXiv

Adversarially Trained Actor Critic for Offline Reinforcement Learning

We propose Adversarially Trained Actor Critic (ATAC), a new model-free algorithm for offline reinforcement learning (RL) under insufficient data coverage, based on the concept of relative pessimism. ATAC is designed as a two-player Stackelberg game: A policy actor competes against an adversarially trained value critic, who finds data-consistent scenarios where the actor is inferior to the data-collection behavior policy. We prove that, when the actor attains no regret in the two-player game, running ATAC produces a policy that provably 1) outperforms the behavior policy over a wide range of hyperparameters that control the degree of pessimism, and 2) competes with the best policy covered by data with appropriately chosen hyperparameters. Compared with existing works, notably our framework offers both theoretical guarantees for general function approximation and a deep RL implementation scalable to complex environments and large datasets. In the D4RL benchmark, ATAC consistently outperforms state-of-the-art offline RL algorithms on a range of continuous control tasks.

preprint2022arXiv

Policy Finetuning: Bridging Sample-Efficient Offline and Online Reinforcement Learning

Recent theoretical work studies sample-efficient reinforcement learning (RL) extensively in two settings: learning interactively in the environment (online RL), or learning from an offline dataset (offline RL). However, existing algorithms and theories for learning near-optimal policies in these two settings are rather different and disconnected. Towards bridging this gap, this paper initiates the theoretical study of policy finetuning, that is, online RL where the learner has additional access to a "reference policy" $μ$ close to the optimal policy $π_\star$ in a certain sense. We consider the policy finetuning problem in episodic Markov Decision Processes (MDPs) with $S$ states, $A$ actions, and horizon length $H$. We first design a sharp offline reduction algorithm -- which simply executes $μ$ and runs offline policy optimization on the collected dataset -- that finds an $\varepsilon$ near-optimal policy within $\widetilde{O}(H^3SC^\star/\varepsilon^2)$ episodes, where $C^\star$ is the single-policy concentrability coefficient between $μ$ and $π_\star$. This offline result is the first that matches the sample complexity lower bound in this setting, and resolves a recent open question in offline RL. We then establish an $Ω(H^3S\min\{C^\star, A\}/\varepsilon^2)$ sample complexity lower bound for any policy finetuning algorithm, including those that can adaptively explore the environment. This implies that -- perhaps surprisingly -- the optimal policy finetuning algorithm is either offline reduction or a purely online RL algorithm that does not use $μ$. Finally, we design a new hybrid offline/online algorithm for policy finetuning that achieves better sample complexity than both vanilla offline reduction and purely online RL algorithms, in a relaxed setting where $μ$ only satisfies concentrability partially up to a certain time step.

preprint2020arXiv

Provably Efficient Q-Learning with Low Switching Cost

We take initial steps in studying PAC-MDP algorithms with limited adaptivity, that is, algorithms that change its exploration policy as infrequently as possible during regret minimization. This is motivated by the difficulty of running fully adaptive algorithms in real-world applications (such as medical domains), and we propose to quantify adaptivity using the notion of local switching cost. Our main contribution, Q-Learning with UCB2 exploration, is a model-free algorithm for H-step episodic MDP that achieves sublinear regret whose local switching cost in K episodes is $O(H^3SA\log K)$, and we provide a lower bound of $Ω(HSA)$ on the local switching cost for any no-regret algorithm. Our algorithm can be naturally adapted to the concurrent setting, which yields nontrivial results that improve upon prior work in certain aspects.

preprint2020arXiv

Q* Approximation Schemes for Batch Reinforcement Learning: A Theoretical Comparison

We prove performance guarantees of two algorithms for approximating $Q^\star$ in batch reinforcement learning. Compared to classical iterative methods such as Fitted Q-Iteration---whose performance loss incurs quadratic dependence on horizon---these methods estimate (some forms of) the Bellman error and enjoy linear-in-horizon error propagation, a property established for the first time for algorithms that rely solely on batch data and output stationary policies. One of the algorithms uses a novel and explicit importance-weighting correction to overcome the infamous "double sampling" difficulty in Bellman error estimation, and does not use any squared losses. Our analyses reveal its distinct characteristics and potential advantages compared to classical algorithms.

preprint2020arXiv

Towards Optimal Off-Policy Evaluation for Reinforcement Learning with Marginalized Importance Sampling

Motivated by the many real-world applications of reinforcement learning (RL) that require safe-policy iterations, we consider the problem of off-policy evaluation (OPE) -- the problem of evaluating a new policy using the historical data obtained by different behavior policies -- under the model of nonstationary episodic Markov Decision Processes (MDP) with a long horizon and a large action space. Existing importance sampling (IS) methods often suffer from large variance that depends exponentially on the RL horizon $H$. To solve this problem, we consider a marginalized importance sampling (MIS) estimator that recursively estimates the state marginal distribution for the target policy at every step. MIS achieves a mean-squared error of $$ \frac{1}{n} \sum\nolimits_{t=1}^H\mathbb{E}_μ\left[\frac{d_t^π(s_t)^2}{d_t^μ(s_t)^2} \mathrm{Var}_μ\left[\frac{π_t(a_t|s_t)}{μ_t(a_t|s_t)}\big( V_{t+1}^π(s_{t+1}) + r_t\big) \middle| s_t\right]\right] + \tilde{O}(n^{-1.5}) $$ where $μ$ and $π$ are the logging and target policies, $d_t^μ(s_t)$ and $d_t^π(s_t)$ are the marginal distribution of the state at $t$th step, $H$ is the horizon, $n$ is the sample size and $V_{t+1}^π$ is the value function of the MDP under $π$. The result matches the Cramer-Rao lower bound in \citet{jiang2016doubly} up to a multiplicative factor of $H$. To the best of our knowledge, this is the first OPE estimation error bound with a polynomial dependence on $H$. Besides theory, we show empirical superiority of our method in time-varying, partially observable, and long-horizon RL environments.

Tengyang Xie

What is connected

Connect this record

See the researcher in context

Building this map preview

5 published item(s)

Adversarially Trained Actor Critic for Offline Reinforcement Learning

Policy Finetuning: Bridging Sample-Efficient Offline and Online Reinforcement Learning

Provably Efficient Q-Learning with Low Switching Cost

Q* Approximation Schemes for Batch Reinforcement Learning: A Theoretical Comparison

Towards Optimal Off-Policy Evaluation for Reinforcement Learning with Marginalized Importance Sampling