Source author record

Sean P. Meyn

Sean P. Meyn appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

math.OC, math.PR, Systems and Control, Machine Learning, eess.SY, math.NA, Information Theory, math.IT

Catalog footprint

16 works, 8 topics, 4 close collaborators

Preprint, 2022, arXiv

Controlled Interacting Particle Algorithms for Simulation-based Reinforcement Learning

This paper is concerned with optimal control problems for control systems in continuous time, and interacting particle system methods designed to construct approximate control solutions. Particular attention is given to the linear quadratic (LQ) control problem. There is a growing interest in revisiting this classical problem, in part due to the successes of reinforcement learning (RL). The main question of this body of research (and also of our paper) is how to approximate the optimal control law without explicitly solving the Riccati equation. A novel simulation-based algorithm, namely a dual ensemble Kalman filter (EnKF), is introduced. The algorithm is used to obtain formulae for optimal control, expressed entirely in terms of the EnKF particles. An extension to the nonlinear case is also presented. The theoretical results and algorithms are illustrated with numerical experiments.

Preprint, 2020, arXiv

Convex Q-Learning, Part 1: Deterministic Optimal Control

It is well known that the extension of Watkins' algorithm to general function approximation settings is challenging: does the projected Bellman equation have a solution? If so, is the solution useful in the sense of generating a good policy? And, if the preceding questions are answered in the affirmative, is the algorithm consistent? These questions are unanswered even in the special case of Q-function approximations that are linear in the parameter. The challenge seems paradoxical, given the long history of convex analytic approaches to dynamic programming. The paper begins with a brief survey of linear programming approaches to optimal control, leading to a particular over-parameterization that lends itself to applications in reinforcement learning. The main conclusions are summarized as follows: (i) A new class of convex Q-learning algorithms is introduced based on the convex relaxation of the Bellman equation. Convergence is established under general conditions, including a linear function approximation for the Q-function. (ii) A batch implementation appears similar to the famed DQN algorithm (one engine behind AlphaZero). It is shown that in fact the algorithms are very different: while convex Q-learning solves a convex program that approximates the Bellman equation, theory for DQN is no stronger than for Watkins' algorithm with function approximation: (a) it is shown that both seek solutions to the same fixed point equation, and (b) the ODE approximations for the two algorithms coincide, and little is known about the stability of this ODE. These results are obtained for deterministic nonlinear systems with total cost criterion. Many extensions are proposed, including kernel implementation, and extension to MDP models.

Preprint, 2020, arXiv

Differential Temporal Difference Learning

Value functions derived from Markov decision processes arise as a central component of algorithms as well as performance metrics in many statistics and engineering applications of machine learning techniques. Computation of the solution to the associated Bellman equations is challenging in most practical cases of interest. A popular class of approximation techniques, known as Temporal Difference (TD) learning algorithms, is an important subclass of general reinforcement learning methods. The algorithms introduced in this paper are intended to resolve two well-known difficulties of TD-learning approaches: their slow convergence due to very high variance, and the fact that, for the problem of computing the relative value function, consistent algorithms exist only in special cases. First we show that the gradients of these value functions admit a representation that lends itself to algorithm design. Based on this result, a new class of differential TD-learning algorithms is introduced. For Markovian models on Euclidean space with smooth dynamics, the algorithms are shown to be consistent under general conditions. Numerical results show dramatic variance reduction when compared to standard methods.

Preprint, 2020, arXiv

Q-learning with Uniformly Bounded Variance: Large Discounting is Not a Barrier to Fast Learning

Sample complexity bounds are a common performance metric in the Reinforcement Learning literature. In the discounted cost, infinite horizon setting, all of the known bounds have a factor that is a polynomial in $1/(1-\gamma)$, where $\gamma < 1$ is the discount factor. For a large discount factor, these bounds seem to imply that a very large number of samples is required to achieve an $\varepsilon$-optimal policy. The objective of the present work is to introduce a new class of algorithms that have sample complexity uniformly bounded for all $\gamma < 1$. One may argue that this is impossible, due to a recent min-max lower bound. The explanation is that this previous lower bound is for a specific problem, which we modify, without compromising the ultimate objective of obtaining an $\varepsilon$-optimal policy. Specifically, we show that the asymptotic covariance of the Q-learning algorithm with an optimized step-size sequence is a quadratic function of $1/(1-\gamma)$; an expected, and essentially known result. The new relative Q-learning algorithm proposed here is shown to have asymptotic covariance that is a quadratic in $1/(1-\rho^*\gamma)$, where $1 - \rho^* > 0$ is an upper bound on the spectral gap of an optimal transition matrix.
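To make the scaling claim concrete, here is a rough back-of-the-envelope comparison of the two quadratic factors. The values of the discount factor (0.99) and of the spectral parameter (0.9) are illustrative assumptions, not taken from the paper, and all constant factors are ignored.

```latex
\[
\frac{1}{(1-\gamma)^2} = \frac{1}{(0.01)^2} = 10^4,
\qquad
\frac{1}{(1-\rho^*\gamma)^2} = \frac{1}{(1-0.891)^2} = \frac{1}{(0.109)^2} \approx 84.2 .
\]
\[
\text{As } \gamma \uparrow 1: \quad
\frac{1}{(1-\gamma)^2} \to \infty,
\qquad
\frac{1}{(1-\rho^*\gamma)^2} \to \frac{1}{(1-\rho^*)^2} = 100 .
\]
```

Under these assumed values the second factor stays bounded as the discount factor approaches one, which is the sense in which the bound is uniform.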

Preprint, 2020, arXiv

Variance Reduction in Simulation of Multiclass Processing Networks

We use simulation to estimate the steady-state performance of a stable multiclass queueing network. Standard estimators have been seen to perform poorly when the network is heavily loaded. We introduce two new simulation estimators. The first provides substantial variance reductions in moderately-loaded networks at very little additional computational cost. The second estimator provides substantial variance reductions in heavy traffic, again for a small additional computational cost. Both methods employ the variance reduction method of control variates, and differ in terms of how the control variates are constructed.
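The general control-variate idea can be sketched on a toy problem. The example below is a textbook illustration, not the paper's queueing-network estimators: it assumes the target is the mean of exp(U) for U uniform on [0, 1], and uses U - 1/2 (known mean zero) as the control. All function names are hypothetical.

```python
import math
import random
import statistics


def control_variate_estimates(n, seed=0):
    """Return (plain, adjusted) Monte Carlo estimates of E[exp(U)], U ~ Uniform(0, 1)."""
    rng = random.Random(seed)
    u = [rng.random() for _ in range(n)]
    y = [math.exp(v) for v in u]   # quantity whose mean we want
    c = [v - 0.5 for v in u]       # control variate with known mean zero

    y_bar = statistics.fmean(y)
    c_bar = statistics.fmean(c)
    # Estimated optimal coefficient: beta = Cov(Y, C) / Var(C)
    cov_yc = sum((a - y_bar) * (b - c_bar) for a, b in zip(y, c)) / (n - 1)
    beta = cov_yc / statistics.variance(c)

    plain = y_bar
    adjusted = y_bar - beta * c_bar  # subtract the scaled control
    return plain, adjusted


# Repeat the experiment with different seeds to compare estimator spread.
runs = [control_variate_estimates(500, seed=s) for s in range(200)]
plain_sd = statistics.stdev([p for p, _ in runs])
adjusted_sd = statistics.stdev([a for _, a in runs])
adjusted_mean = statistics.fmean([a for _, a in runs])
```

Because exp(U) and U are strongly correlated, the adjusted estimator should show a much smaller spread across replications than the plain sample mean, at the cost of one extra covariance computation.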

Preprint, 2020, arXiv

Zap Q-Learning With Nonlinear Function Approximation

Zap Q-learning is a recent class of reinforcement learning algorithms, motivated primarily as a means to accelerate convergence. Stability theory has been absent outside of two restrictive classes: the tabular setting, and optimal stopping. This paper introduces a new framework for analysis of a more general class of recursive algorithms known as stochastic approximation. Based on this general theory, it is shown that Zap Q-learning is consistent under a non-degeneracy assumption, even when the function approximation architecture is nonlinear. Zap Q-learning with neural network function approximation emerges as a special case, and is tested on examples from OpenAI Gym. Based on multiple experiments with a range of neural network sizes, it is found that the new algorithms converge quickly and are robust to choice of function approximation architecture.

Preprint, 2016, arXiv

Approximating a Diffusion by a Hidden Markov Model

For a wide class of continuous-time Markov processes, including all irreducible hypoelliptic diffusions evolving on an open, connected subset of $\mathbb{R}^d$, the following are shown to be equivalent: (i) The process satisfies (a slightly weaker version of) the classical Donsker-Varadhan conditions; (ii) The transition semigroup of the process can be approximated by a finite-state hidden Markov model, in a strong sense in terms of an associated operator norm; (iii) The resolvent kernel of the process is "$v$-separable", that is, it can be approximated arbitrarily well in operator norm by finite-rank kernels. Under any (hence all) of the above conditions, the Markov process is shown to have a purely discrete spectrum on a naturally associated weighted $L_\infty$ space.

Preprint, 2016, arXiv

Error Estimates for the Kernel Gain Function Approximation in the Feedback Particle Filter

This paper is concerned with the analysis of the kernel-based algorithm for gain function approximation in the feedback particle filter. The exact gain function is the solution of a Poisson equation involving a probability-weighted Laplacian. The kernel-based method, introduced in our prior work, allows one to approximate this solution using only particles sampled from the probability distribution. This paper describes new representations and algorithms based on the kernel-based method. Theory surrounding the approximation is improved and a novel formula for the gain function approximation is derived. A procedure for carrying out error analysis of the approximation is introduced. Certain asymptotic estimates for bias and variance are derived for the general nonlinear non-Gaussian case. Comparison with the constant gain function approximation is provided. The results are illustrated with the aid of some numerical experiments.

Preprint, 2016, arXiv

Rationally inattentive control of Markov processes

The article poses a general model for optimal control subject to information constraints, motivated in part by recent work of Sims and others on information-constrained decision-making by economic agents. In the average-cost optimal control framework, the general model introduced in this paper reduces to a variant of the linear-programming representation of the average-cost optimal control problem, subject to an additional mutual information constraint on the randomized stationary policy. The resulting optimization problem is convex and admits a decomposition based on the Bellman error, which is the object of study in approximate dynamic programming. The theory is illustrated through the example of an information-constrained linear-quadratic-Gaussian (LQG) control problem. Some results on the infinite-horizon discounted-cost criterion are also presented.

Preprint, 2014, arXiv

Poisson's equation in nonlinear filtering

The aim of this paper is to provide a variational interpretation of the nonlinear filter in continuous time. A time-stepping procedure is introduced, consisting of successive minimization problems in the space of probability densities. The weak form of the nonlinear filter is derived via analysis of the first-order optimality conditions for these problems. The derivation shows the nonlinear filter dynamics may be regarded as a gradient flow, or a steepest descent, for a certain energy functional with respect to the Kullback-Leibler divergence. The second part of the paper is concerned with derivation of the feedback particle filter algorithm, based again on the analysis of the first variation. The algorithm is shown to be exact. That is, the posterior distribution of the particle matches exactly the true posterior, provided the filter is initialized with the true prior.

Preprint, 2013, arXiv

Feedback Particle Filter

A new formulation of the particle filter for nonlinear filtering is presented, based on concepts from optimal control and from mean-field game theory. The optimal control is chosen so that the posterior distribution of a particle matches as closely as possible the posterior distribution of the true state given the observations. This is achieved by introducing a cost function, defined by the Kullback-Leibler (K-L) divergence between the actual posterior, and the posterior of any particle. The optimal control input is characterized by a certain Euler-Lagrange (E-L) equation, and is shown to admit an innovation error-based feedback structure. For diffusions with continuous observations, the value of the optimal control solution is ideal. The two posteriors match exactly, provided they are initialized with identical priors. The feedback particle filter is defined by a family of stochastic systems, each evolving under this optimal control law. A numerical algorithm is introduced and implemented in two general examples, and a neuroscience application involving coupled oscillators. Some preliminary numerical comparisons between the feedback particle filter and the bootstrap particle filter are described.

Preprint, 2013, arXiv

Multivariable Feedback Particle Filter

In recent work it is shown that importance sampling can be avoided in the particle filter through an innovation structure inspired by traditional nonlinear filtering combined with Mean-Field Game formalisms. The resulting feedback particle filter (FPF) offers significant variance improvements; in particular, the algorithm can be applied to systems that are not stable. The filter comes with an up-front computational cost to obtain the filter gain. This paper describes new representations and algorithms to compute the gain in the general multivariable setting. The main contributions are: (i) Theory surrounding the FPF is improved: consistency is established in the multivariate setting, as well as well-posedness of the associated PDE to obtain the filter gain. (ii) The gain can be expressed as the gradient of a function, which is precisely the solution to Poisson's equation for a related MCMC diffusion (the Smoluchowski equation). This provides a bridge to MCMC as well as to approximate optimal filtering approaches such as TD-learning, which can in turn be used to approximate the gain. (iii) Motivated by a weak formulation of Poisson's equation, a Galerkin finite-element algorithm is proposed for approximation of the gain. Its performance is illustrated in numerical experiments.

Preprint, 2012, arXiv

Random-Time, State-Dependent Stochastic Drift for Markov Chains and Application to Stochastic Stabilization Over Erasure Channels

It is known that state-dependent, multi-step Lyapunov bounds lead to greatly simplified verification theorems for stability for large classes of Markov chain models. This is one component of the "fluid model" approach to stability of stochastic networks. In this paper we extend the general theory to randomized multi-step Lyapunov theory to obtain criteria for stability and steady-state performance bounds, such as finite moments. These results are applied to a remote stabilization problem, in which a controller receives measurements from an erasure channel with limited capacity. Based on the general results in the paper it is shown that stability of the closed loop system is assured provided that the channel capacity is greater than the logarithm of the unstable eigenvalue, plus an additional correction term. The existence of a finite second moment in steady-state is established under additional conditions.

Preprint, 2012, arXiv

Tail asymptotics for busy periods

The busy period for a queue is cast as the area swept under the random walk until it first returns to zero, $B$. Encompassing non-i.i.d. increments, the large-deviations asymptotics of $B$ is addressed, under the assumption that the increments satisfy standard conditions, including a negative drift. The main conclusions provide insight on the probability of a large busy period, and the manner in which this occurs: I) The scaled probability of a large busy period has the asymptote, for any $b>0$, $$\lim_{n\to\infty} \frac{1}{\sqrt{n}} \log P(B\geq bn) = -K\sqrt{b}, \quad \text{where} \quad K = 2 \sqrt{-\int_0^{\lambda^*} \Lambda(\theta)\, d\theta}, \quad \text{with } \lambda^*=\sup\{\theta:\Lambda(\theta)\leq 0\},$$ and with $\Lambda$ denoting the scaled cumulant generating function of the increments process. II) The most likely path to a large swept area is found to be a simple rescaling of the path on $[0,1]$ given by $$\psi^*(t) = -\Lambda(\lambda^*(1-t))/\lambda^*.$$ In contrast to the piecewise linear most likely path leading the random walk to hit a high level, this is strictly concave in general. While these two most likely paths have very different forms, their derivatives coincide at the start of their trajectories, and at their first return to zero. These results partially answer an open problem of Kulick and Palmowski regarding the tail of the work done during a busy period at a single server queue. The paper concludes with applications of these results to the estimation of the busy period statistics $(\lambda^*, K)$ based on observations of the increments, offering the possibility of estimating the likelihood of a large busy period in advance of observing one.
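As a sanity check on these formulae, consider the special case of i.i.d. Gaussian increments with mean $-\mu < 0$ and variance $\sigma^2$. This case is an assumption chosen here for illustration and is not worked out in the abstract; it presumes that such increments fall within the paper's standard conditions.

```latex
\begin{align*}
\Lambda(\theta) &= -\mu\theta + \tfrac{1}{2}\sigma^2\theta^2,
\qquad
\lambda^* = \sup\{\theta : \Lambda(\theta) \le 0\} = \frac{2\mu}{\sigma^2}, \\
\int_0^{\lambda^*} \Lambda(\theta)\, d\theta
&= -\frac{\mu(\lambda^*)^2}{2} + \frac{\sigma^2(\lambda^*)^3}{6}
 = -\frac{2\mu^3}{\sigma^4} + \frac{4\mu^3}{3\sigma^4}
 = -\frac{2\mu^3}{3\sigma^4}, \\
K &= 2\sqrt{\frac{2\mu^3}{3\sigma^4}}
   = \frac{2\mu^{3/2}}{\sigma^2}\sqrt{\frac{2}{3}}, \\
\psi^*(t) &= -\frac{\Lambda(\lambda^*(1-t))}{\lambda^*} = \mu\, t(1-t).
\end{align*}
```

In this assumed case the most likely path is a concave parabola on $[0,1]$ with slope $\mu$ at $t=0$ and slope $-\mu$ at $t=1$, consistent with the strict concavity described in the abstract.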

Preprint, 2010, arXiv

Most likely paths to error when estimating the mean of a reflected random walk

It is known that simulation of the mean position of a Reflected Random Walk (RRW) $\{W_n\}$ exhibits non-standard behavior, even for light-tailed increment distributions with negative drift. The Large Deviation Principle (LDP) holds for deviations below the mean, but for deviations at the usual speed above the mean the rate function is null. This paper takes a deeper look at this phenomenon. Conditional on a large sample mean, a complete sample path LDP analysis is obtained. Let $I$ denote the rate function for the one dimensional increment process. If $I$ is coercive, then given a large simulated mean position, under general conditions our results imply that the most likely asymptotic behavior, $\psi^*$, of the paths $n^{-1} W_{\lfloor tn\rfloor}$ is to be zero apart from on an interval $[T_0,T_1]\subset[0,1]$ and to satisfy the functional equation $$\nabla I\left(\frac{d}{dt}\psi^*(t)\right)=\lambda^*(T_1-t) \quad \text{whenever } \psi^*(t)\neq 0.$$ If $I$ is non-coercive, a similar, but slightly more involved, result holds. These results prove, in broad generality, that Monte Carlo estimates of the steady-state mean position of an RRW have a high likelihood of over-estimation. This has serious implications for the performance evaluation of queueing systems by simulation techniques where steady state expected queue-length and waiting time are key performance metrics. The results show that naïve estimates of these quantities from simulation are highly likely to be conservative.

Preprint, 2009, arXiv

Estimating Loynes' exponent

Loynes' distribution, which characterizes the one dimensional marginal of the stationary solution to Lindley's recursion, possesses an ultimately exponential tail for a large class of increment processes. If one can observe increments but does not know their probabilistic properties, what are the statistical limits of estimating the tail exponent of Loynes' distribution? We conjecture that in broad generality a consistent sequence of non-parametric estimators can be constructed that satisfies a large deviation principle. We present rigorous support for this conjecture under restrictive assumptions and simulation evidence indicating why we believe it to be true in greater generality.
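For orientation, Lindley's recursion and a naive plug-in estimate of the tail exponent can be sketched as follows. This is a minimal sketch and not the estimator studied in the paper: it assumes i.i.d. light-tailed increments, for which the exponent is the positive root of the log moment generating function, and simply replaces that function by its sample version. The function names are hypothetical.

```python
import math


def lindley(increments, w0=0.0):
    """Iterate Lindley's recursion W_{n+1} = max(W_n + X_{n+1}, 0)."""
    w = w0
    path = []
    for x in increments:
        w = max(w + x, 0.0)
        path.append(w)
    return path


def empirical_log_mgf(theta, increments):
    """Log of the sample average of exp(theta * X)."""
    return math.log(sum(math.exp(theta * x) for x in increments) / len(increments))


def plugin_exponent(increments, tol=1e-10):
    """Positive root of the empirical log-MGF, found by bisection.

    Hypothetical plug-in estimate for illustration only. A positive root
    exists when the sample mean is negative and some increment is positive.
    """
    if sum(increments) >= 0 or max(increments) <= 0:
        raise ValueError("need a negative sample mean and a positive increment")
    hi = 1.0
    while empirical_log_mgf(hi, increments) <= 0.0:  # expand until past the root
        hi *= 2.0
    lo = 0.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if empirical_log_mgf(mid, increments) <= 0.0:
            lo = mid   # still at or below the root
        else:
            hi = mid   # already past the root
    return 0.5 * (lo + hi)


path = lindley([1, -2, 3, -1])
theta_hat = plugin_exponent([-1, -1, 1])
```

For the three-point sample used above, the empirical root can be checked by hand: with y = exp(theta), the equation 2/y + y = 3 has the positive non-trivial solution y = 2, so the estimate is log 2.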
