Researcher profile

Csaba Szepesvari

Csaba Szepesvari contributes to research discovery and scholarly infrastructure.

Researcher | Affiliation not imported | Open to collaborate

Trust snapshot

Quick read

Trust 21 - Emerging | Verification L1 | Unclaimed author
16 works
0 followers
6 topics
4 close collaborators

Published work

16 published item(s)

preprint | 2023 | arXiv

The Role of Baselines in Policy Gradient Optimization

We study the effect of baselines in on-policy stochastic policy gradient optimization, and close the gap between the theory and practice of policy optimization methods. Our first contribution is to show that the state value baseline allows on-policy stochastic natural policy gradient (NPG) to converge to a globally optimal policy at an $O(1/t)$ rate, which was not previously known. The analysis relies on two novel findings: the expected progress of the NPG update satisfies a stochastic version of the non-uniform Łojasiewicz (NŁ) inequality, and with probability 1 the state value baseline prevents the optimal action's probability from vanishing, thus ensuring sufficient exploration. Importantly, these results provide a new understanding of the role of baselines in stochastic policy gradient: by showing that the variance of natural policy gradient estimates remains unbounded with or without a baseline, we find that variance reduction cannot explain their utility in this setting. Instead, the analysis reveals that the primary effect of the value baseline is to reduce the aggressiveness of the updates rather than their variance. That is, we demonstrate that a finite variance is not necessary for almost sure convergence of stochastic NPG, while controlling update aggressiveness is both necessary and sufficient. Additional experimental results verify these theoretical findings.

preprint | 2022 | arXiv

Leveraging Non-uniformity in First-order Non-convex Optimization

Classical global convergence results for first-order methods rely on uniform smoothness and the Łojasiewicz inequality. Motivated by properties of objective functions that arise in machine learning, we propose a non-uniform refinement of these notions, leading to Non-uniform Smoothness (NS) and the Non-uniform Łojasiewicz inequality (NŁ). The new definitions inspire new geometry-aware first-order methods that are able to converge to global optimality faster than the classical $Ω(1/t^2)$ lower bounds. To illustrate the power of these geometry-aware methods and their corresponding non-uniform analysis, we consider two important problems in machine learning: policy gradient optimization in reinforcement learning (PG), and generalized linear model training in supervised learning (GLM). For PG, we find that normalizing the gradient ascent method can accelerate convergence to $O(e^{-t})$ while incurring less overhead than existing algorithms. For GLM, we show that geometry-aware normalized gradient descent can also achieve a linear convergence rate, which significantly improves the best known results. We additionally show that the proposed geometry-aware descent methods escape landscape plateaus faster than standard gradient descent. Experimental results are used to illustrate and complement the theoretical findings.
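
The effect of normalizing the gradient in PG can be illustrated on a toy problem. The following is a minimal sketch, not the paper's experimental setup: it assumes a hypothetical three-armed bandit with known rewards `r`, a softmax policy, exact gradients, and an illustrative step size `eta` shared by both methods. The function names are made up for this example.

```python
import numpy as np

def softmax(theta):
    # Numerically stable softmax over action preferences.
    z = np.exp(theta - theta.max())
    return z / z.sum()

def policy_gradient(theta, r):
    # Exact gradient of the expected reward pi_theta . r for a softmax policy.
    pi = softmax(theta)
    return pi * (r - pi @ r)

def run(r, eta, steps, normalize):
    # Gradient ascent from the uniform policy; optionally divide the
    # gradient by its Euclidean norm before taking the step.
    theta = np.zeros_like(r)
    for _ in range(steps):
        g = policy_gradient(theta, r)
        if normalize:
            norm = np.linalg.norm(g)
            if norm == 0.0:  # gradient underflowed; nothing left to do
                break
            g = g / norm
        theta = theta + eta * g
    # Sub-optimality gap of the final policy.
    return r.max() - softmax(theta) @ r

r = np.array([1.0, 0.8, 0.2])  # hypothetical reward vector (assumption)
initial_gap = r.max() - r.mean()
gap_pg = run(r, eta=0.4, steps=100, normalize=False)
gap_normalized = run(r, eta=0.4, steps=100, normalize=True)
print(f"standard PG gap:   {gap_pg:.3e}")
print(f"normalized PG gap: {gap_normalized:.3e}")
```

In this toy setting the normalized update keeps moving at a constant speed in parameter space even when the gradient is tiny, which is one intuition for the faster rate described in the abstract.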

preprint | 2022 | arXiv

On the Global Convergence Rates of Softmax Policy Gradient Methods

We make three contributions toward better understanding policy gradient methods in the tabular setting. First, we show that with the true gradient, policy gradient with a softmax parametrization converges at a $O(1/t)$ rate, with constants depending on the problem and initialization. This result significantly expands the recent asymptotic convergence results. The analysis relies on two findings: that the softmax policy gradient satisfies a Łojasiewicz inequality, and the minimum probability of an optimal action during optimization can be bounded in terms of its initial value. Second, we analyze entropy regularized policy gradient and show that it enjoys a significantly faster linear convergence rate $O(e^{-c \cdot t})$ toward softmax optimal policy $(c > 0)$. This result resolves an open question in the recent literature. Finally, combining the above two results and additional new $Ω(1/t)$ lower bound results, we explain how entropy regularization improves policy optimization, even with the true gradient, from the perspective of convergence rate. The separation of rates is further explained using the notion of non-uniform Łojasiewicz degree. These results provide a theoretical understanding of the impact of entropy and corroborate existing empirical studies.
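
The entropy-regularized result can be checked numerically in the simplest case. The sketch below assumes a hypothetical one-state problem (a bandit) with reward vector `r`, temperature `tau`, and an illustrative step size `eta`; none of these values come from the paper. In this special case the maximizer of the regularized objective is `softmax(r / tau)`, so the iterates should approach it.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax.
    z = np.exp(x - x.max())
    return z / z.sum()

def entropy_regularized_pg(r, tau, eta, steps):
    # Exact gradient ascent on pi . (r - tau * log pi) with a softmax policy,
    # started from the uniform policy.
    theta = np.zeros_like(r)
    for _ in range(steps):
        pi = softmax(theta)
        adjusted = r - tau * np.log(pi)          # entropy-adjusted rewards
        grad = pi * (adjusted - pi @ adjusted)   # softmax Jacobian applied to them
        theta = theta + eta * grad
    return softmax(theta)

r = np.array([1.0, 0.8, 0.2])   # hypothetical rewards (assumption)
tau = 0.5                       # illustrative temperature (assumption)
pi_final = entropy_regularized_pg(r, tau, eta=0.4, steps=2000)
pi_target = softmax(r / tau)    # regularized optimum in the bandit case
print(pi_final, pi_target)
```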

preprint | 2022 | arXiv

Towards Painless Policy Optimization for Constrained MDPs

We study policy optimization in an infinite horizon, $γ$-discounted constrained Markov decision process (CMDP). Our objective is to return a policy that achieves large expected reward with a small constraint violation. We consider the online setting with linear function approximation and assume global access to the corresponding features. We propose a generic primal-dual framework that allows us to bound the reward sub-optimality and constraint violation for arbitrary algorithms in terms of their primal and dual regret on online linear optimization problems. We instantiate this framework to use coin-betting algorithms and propose the Coin Betting Politex (CBP) algorithm. Assuming that the action-value functions are $\varepsilon_b$-close to the span of the $d$-dimensional state-action features and no sampling errors, we prove that $T$ iterations of CBP result in an $O\left(\frac{1}{(1 - γ)^3 \sqrt{T}} + \frac{\varepsilon_b\sqrt{d}}{(1 - γ)^2} \right)$ reward sub-optimality and an $O\left(\frac{1}{(1 - γ)^2 \sqrt{T}} + \frac{\varepsilon_b \sqrt{d}}{1 - γ} \right)$ constraint violation. Importantly, unlike gradient descent-ascent and other recent methods, CBP does not require extensive hyperparameter tuning. Via experiments on synthetic and Cartpole environments, we demonstrate the effectiveness and robustness of CBP.

preprint | 2021 | arXiv

Adaptive Approximate Policy Iteration

Model-free reinforcement learning algorithms combined with value function approximation have recently achieved impressive performance in a variety of application domains. However, the theoretical understanding of such algorithms is limited, and existing results are largely focused on episodic or discounted Markov decision processes (MDPs). In this work, we present adaptive approximate policy iteration (AAPI), a learning scheme which enjoys a $\tilde{O}(T^{2/3})$ regret bound for undiscounted, continuing learning in uniformly ergodic MDPs. This is an improvement over the best existing bound of $\tilde{O}(T^{3/4})$ for the average-reward case with function approximation. Our algorithm and analysis rely on online learning techniques, where value functions are treated as losses. The main technical novelty is the use of a data-dependent adaptive learning rate coupled with a so-called optimistic prediction of upcoming losses. In addition to theoretical guarantees, we demonstrate the advantages of our approach empirically on several environments.

preprint | 2021 | arXiv

Improved Regret Bound and Experience Replay in Regularized Policy Iteration

In this work, we study algorithms for learning in infinite-horizon undiscounted Markov decision processes (MDPs) with function approximation. We first show that the regret analysis of the Politex algorithm (a version of regularized policy iteration) can be sharpened from $O(T^{3/4})$ to $O(\sqrt{T})$ under nearly identical assumptions, and instantiate the bound with linear function approximation. Our result provides the first high-probability $O(\sqrt{T})$ regret bound for a computationally efficient algorithm in this setting. The exact implementation of Politex with neural network function approximation is inefficient in terms of memory and computation. Since our analysis suggests that we need to approximate the average of the action-value functions of past policies well, we propose a simple efficient implementation where we train a single Q-function on a replay buffer with past data. We show that this often leads to superior performance over other implementation choices, especially in terms of wall-clock time. Our work also provides a novel theoretical justification for using experience replay within policy iteration algorithms.
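
The link between the policy and the average of past action-value functions can be sketched as follows. This is an illustration under assumptions rather than the paper's implementation: it uses a tabular setting, a reward (not cost) convention, random arrays standing in for learned Q-estimates, and hypothetical function names. It shows that a softmax policy over the sum of past Q-estimates is the same as one over their average scaled by the number of iterations, which is why a single Q-function approximating that average is enough.

```python
import numpy as np

def softmax_rows(logits):
    # Row-wise, numerically stable softmax: one distribution per state.
    shifted = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(shifted)
    return probs / probs.sum(axis=1, keepdims=True)

def policy_from_sum(q_history, eta):
    # Policy proportional to exp(eta * sum of past Q-estimates).
    return softmax_rows(eta * np.sum(q_history, axis=0))

def policy_from_average(q_average, num_iterations, eta):
    # Same policy, computed from a single averaged Q-function.
    return softmax_rows(eta * num_iterations * q_average)

rng = np.random.default_rng(0)
# Stand-ins for Q-estimates of 5 past policies, 4 states x 3 actions (assumption).
q_history = [rng.normal(size=(4, 3)) for _ in range(5)]
eta = 0.5  # illustrative learning rate (assumption)

pi_sum = policy_from_sum(q_history, eta)
pi_avg = policy_from_average(np.mean(q_history, axis=0), len(q_history), eta)
print(np.abs(pi_sum - pi_avg).max())
```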

preprint | 2021 | arXiv

Meta-Learning Bandit Policies by Gradient Ascent

Most bandit policies are designed to either minimize regret in any problem instance, making very few assumptions about the underlying environment, or in a Bayesian sense, assuming a prior distribution over environment parameters. The former are often too conservative in practical settings, while the latter require assumptions that are hard to verify in practice. We study bandit problems that fall between these two extremes, where the learning agent has access to sampled bandit instances from an unknown prior distribution $\mathcal{P}$ and aims to achieve high reward on average over the bandit instances drawn from $\mathcal{P}$. This setting is of particular importance because it lays foundations for meta-learning of bandit policies and reflects more realistic assumptions in many practical domains. We propose the use of parameterized bandit policies that are differentiable and can be optimized using policy gradients. This provides a broadly applicable framework that is easy to implement. We derive reward gradients that reflect the structure of bandit problems and policies, for both non-contextual and contextual settings, and propose a number of interesting policies that are both differentiable and have low regret. Our algorithmic and theoretical contributions are supported by extensive experiments that show the importance of baseline subtraction, learned biases, and the practicality of our approach on a range of problems.

preprint | 2021 | arXiv

Nearly Minimax Optimal Reinforcement Learning for Linear Mixture Markov Decision Processes

We study reinforcement learning (RL) with linear function approximation where the underlying transition probability kernel of the Markov decision process (MDP) is a linear mixture model (Jia et al., 2020; Ayoub et al., 2020; Zhou et al., 2020) and the learning agent has access to either an integration or a sampling oracle of the individual basis kernels. We propose a new Bernstein-type concentration inequality for self-normalized martingales for linear bandit problems with bounded noise. Based on the new inequality, we propose a new, computationally efficient algorithm with linear function approximation named $\text{UCRL-VTR}^{+}$ for the aforementioned linear mixture MDPs in the episodic undiscounted setting. We show that $\text{UCRL-VTR}^{+}$ attains an $\tilde O(dH\sqrt{T})$ regret where $d$ is the dimension of feature mapping, $H$ is the length of the episode and $T$ is the number of interactions with the MDP. We also prove a matching lower bound $Ω(dH\sqrt{T})$ for this setting, which shows that $\text{UCRL-VTR}^{+}$ is minimax optimal up to logarithmic factors. In addition, we propose the $\text{UCLK}^{+}$ algorithm for the same family of MDPs under discounting and show that it attains an $\tilde O(d\sqrt{T}/(1-γ)^{1.5})$ regret, where $γ\in [0,1)$ is the discount factor. Our upper bound matches the lower bound $Ω(d\sqrt{T}/(1-γ)^{1.5})$ proved by Zhou et al. (2020) up to logarithmic factors, suggesting that $\text{UCLK}^{+}$ is nearly minimax optimal. To the best of our knowledge, these are the first computationally efficient, nearly minimax optimal algorithms for RL with linear function approximation.

preprint | 2020 | arXiv

Adaptive Exploration in Linear Contextual Bandit

Contextual bandits serve as a fundamental model for many sequential decision making tasks. The most popular theoretically justified approaches are based on the optimism principle. While these algorithms can be practical, they are known to be suboptimal asymptotically. On the other hand, existing asymptotically optimal algorithms for this problem do not exploit the linear structure in an optimal way and suffer from lower-order terms that dominate the regret in all practically interesting regimes. We start to bridge the gap by designing an algorithm that is asymptotically optimal and has good finite-time empirical performance. At the same time, we make connections to the recent literature on when exploration-free methods are effective. Indeed, if the distribution of contexts is well behaved, then our algorithm acts mostly greedily and enjoys sub-logarithmic regret. Furthermore, our approach is adaptive in the sense that it automatically detects the nice case. Numerical results demonstrate significant regret reductions by our method relative to several baselines.

preprint | 2020 | arXiv

Behaviour Suite for Reinforcement Learning

This paper introduces the Behaviour Suite for Reinforcement Learning, or bsuite for short. bsuite is a collection of carefully-designed experiments that investigate core capabilities of reinforcement learning (RL) agents with two objectives. First, to collect clear, informative and scalable problems that capture key issues in the design of general and efficient learning algorithms. Second, to study agent behaviour through their performance on these shared benchmarks. To complement this effort, we open source github.com/deepmind/bsuite, which automates evaluation and analysis of any agent on bsuite. This library facilitates reproducible and accessible research on the core issues in RL, and ultimately the design of superior learning algorithms. Our code is Python, and easy to use within existing projects. We include examples with OpenAI Baselines, Dopamine as well as new reference implementations. Going forward, we hope to incorporate more excellent experiments from the research community, and commit to a periodic review of bsuite from a committee of prominent researchers.

preprint | 2020 | arXiv

Differentiable Bandit Exploration

Exploration policies in Bayesian bandits maximize the average reward over problem instances drawn from some distribution $\mathcal{P}$. In this work, we learn such policies for an unknown distribution $\mathcal{P}$ using samples from $\mathcal{P}$. Our approach is a form of meta-learning and exploits properties of $\mathcal{P}$ without making strong assumptions about its form. To do this, we parameterize our policies in a differentiable way and optimize them by policy gradients, an approach that is general and easy to implement. We derive effective gradient estimators and introduce novel variance reduction techniques. We also analyze and experiment with various bandit policy classes, including neural networks and a novel softmax policy. The latter has regret guarantees and is a natural starting point for our optimization. Our experiments show the versatility of our approach. We also observe that neural network policies can learn implicit biases expressed only through the sampled instances.

preprint | 2020 | arXiv

Empirical Bayes Regret Minimization

Most bandit algorithm designs are purely theoretical. Therefore, they have strong regret guarantees, but also are often too conservative in practice. In this work, we pioneer the idea of algorithm design by minimizing the empirical Bayes regret, the average regret over problem instances sampled from a known distribution. We focus on a tractable instance of this problem, the confidence interval and posterior width tuning, and propose an efficient algorithm for solving it. The tuning algorithm is analyzed and evaluated in multi-armed, linear, and generalized linear bandits. We report several-fold reductions in Bayes regret for state-of-the-art bandit algorithms, simply by optimizing over a small sample from a distribution.

preprint | 2020 | arXiv

Gradient Descent for Sparse Rank-One Matrix Completion for Crowd-Sourced Aggregation of Sparsely Interacting Workers

We consider worker skill estimation for the single-coin Dawid-Skene crowdsourcing model. In practice, skill estimation is challenging because worker assignments are sparse and irregular due to the arbitrary and uncontrolled availability of workers. We formulate skill estimation as a rank-one correlation-matrix completion problem, where the observed components correspond to observed label correlations between workers. We show that the correlation matrix can be successfully recovered and skills are identifiable if and only if the sampling matrix (observed components) does not have a bipartite connected component. We then propose a projected gradient descent scheme and show that skill estimates converge to the desired global optima for such sampling matrices. Our proof is original and the results are surprising in light of the fact that even the weighted rank-one matrix factorization problem is NP-hard in general. Next, we derive sample complexity bounds in terms of spectral properties of the signless Laplacian of the sampling matrix. Our proposed scheme achieves state-of-the-art performance on a number of real-world datasets.

preprint | 2020 | arXiv

Learning with Good Feature Representations in Bandits and in RL with a Generative Model

The construction by Du et al. (2019) implies that even if a learner is given linear features in $\mathbb R^d$ that approximate the rewards in a bandit with a uniform error of $ε$, then searching for an action that is optimal up to $O(ε)$ requires examining essentially all actions. We use the Kiefer-Wolfowitz theorem to prove a positive result that by checking only a few actions, a learner can always find an action that is suboptimal with an error of at most $O(ε\sqrt{d})$. Thus, features are useful when the approximation error is small relative to the dimensionality of the features. The idea is applied to stochastic bandits and reinforcement learning with a generative model where the learner has access to $d$-dimensional linear features that approximate the action-value functions for all policies to an accuracy of $ε$. For linear bandits, we prove a bound on the regret of order $\sqrt{dn \log(k)} + εn \sqrt{d} \log(n)$ with $k$ the number of actions and $n$ the horizon. For RL we show that approximate policy iteration can learn a policy that is optimal up to an additive error of order $ε\sqrt{d}/(1 - γ)^2$ and using $d/(ε^2(1 - γ)^4)$ samples from a generative model. These bounds are independent of the finer details of the features. We also investigate how the structure of the feature set impacts the tradeoff between sample complexity and estimation error.

preprint | 2020 | arXiv

Model-Based Reinforcement Learning with Value-Targeted Regression

This paper studies model-based reinforcement learning (RL) for regret minimization. We focus on finite-horizon episodic RL where the transition model $P$ belongs to a known family of models $\mathcal{P}$, a special case of which is when models in $\mathcal{P}$ take the form of linear mixtures: $P_θ = \sum_{i=1}^{d} θ_{i}P_{i}$. We propose a model-based RL algorithm based on the optimism principle: in each episode, the set of models that are "consistent" with the data collected is constructed. The criterion of consistency is based on the total squared error that the model incurs on the task of predicting values as determined by the last value estimate along the transitions. The next value function is then chosen by solving the optimistic planning problem with the constructed set of models. We derive a bound on the regret which, in the special case of linear mixtures, takes the form $\tilde{\mathcal{O}}(d\sqrt{H^{3}T})$, where $H$, $T$ and $d$ are the horizon, total number of steps and dimension of $θ$, respectively. In particular, this regret bound is independent of the total number of states or actions, and is close to a lower bound $Ω(\sqrt{HdT})$. For a general model family $\mathcal{P}$, the regret bound is derived using the notion of the so-called Eluder dimension proposed by Russo & Van Roy (2014).
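
For the linear mixture case, the value-prediction criterion reduces to a linear regression: for a fixed value estimate $V$, the expected next-state value is linear in $θ$ with features $(P_i V)(s)$. The sketch below illustrates only this regression step, not the optimistic planning or the confidence sets. It assumes a hypothetical 5-state chain without actions, made-up deterministic basis kernels, a fixed `value` vector, and ridge regularization `lam`; all names and numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
num_states, d = 5, 3

# Hypothetical basis kernels: P_i moves state s to (s + i + 1) mod num_states.
basis = np.zeros((d, num_states, num_states))
for i in range(d):
    for s in range(num_states):
        basis[i, s, (s + i + 1) % num_states] = 1.0

theta_true = np.array([0.5, 0.3, 0.2])             # unknown mixture weights
P_theta = np.tensordot(theta_true, basis, axes=1)  # P_theta = sum_i theta_i P_i

value = np.arange(num_states, dtype=float)         # fixed value estimate V (assumption)
features = (basis @ value).T                       # features[s, i] = (P_i V)(s)

# Collect transitions from the true model.
n = 20000
states = rng.integers(num_states, size=n)
next_states = np.array([rng.choice(num_states, p=P_theta[s]) for s in states])

# Regress the observed next-state values V(s') on the features of s.
X = features[states]
y = value[next_states]
lam = 1.0  # ridge regularization (assumption)
theta_hat = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
print(theta_hat)
```

The regression targets are scalar values rather than full next-state distributions, which is why the sample requirement in this sketch depends on `d` and not on the number of states.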

preprint | 2020 | arXiv

Variational Policy Gradient Method for Reinforcement Learning with General Utilities

In recent years, reinforcement learning (RL) systems with general goals beyond a cumulative sum of rewards have gained traction, such as in constrained problems, exploration, and acting upon prior experiences. In this paper, we consider policy optimization in Markov Decision Problems, where the objective is a general concave utility function of the state-action occupancy measure, which subsumes several of the aforementioned examples as special cases. Such generality invalidates the Bellman equation. As this means that dynamic programming no longer works, we focus on direct policy search. Analogously to the Policy Gradient Theorem (Sutton et al., 2000) available for RL with cumulative rewards, we derive a new Variational Policy Gradient Theorem for RL with general utilities, which establishes that the parametrized policy gradient may be obtained as the solution of a stochastic saddle point problem involving the Fenchel dual of the utility function. We develop a variational Monte Carlo gradient estimation algorithm to compute the policy gradient based on sample paths. We prove that the variational policy gradient scheme converges globally to the optimal policy for the general objective, though the optimization problem is nonconvex. We also establish its rate of convergence of the order $O(1/t)$ by exploiting the hidden convexity of the problem, and prove that it converges exponentially when the problem admits hidden strong convexity. Our analysis applies to the standard RL problem with cumulative rewards as a special case, in which case our result improves the available convergence rate.