Researcher profile

A. Rupam Mahmood

A. Rupam Mahmood contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 19 - UnverifiedVerification L1Unclaimed author
5works
0followers
3topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

5 published item(s)

preprint2022arXiv

A Temporal-Difference Approach to Policy Gradient Estimation

The policy gradient theorem (Sutton et al., 2000) prescribes the usage of a cumulative discounted state distribution under the target policy to approximate the gradient. Most algorithms based on this theorem, in practice, break this assumption, introducing a distribution shift that can cause the convergence to poor solutions. In this paper, we propose a new approach of reconstructing the policy gradient from the start state without requiring a particular sampling strategy. The policy gradient calculation in this form can be simplified in terms of a gradient critic, which can be recursively estimated due to a new Bellman equation of gradients. By using temporal-difference updates of the gradient critic from an off-policy data stream, we develop the first estimator that sidesteps the distribution shift issue in a model-free way. We prove that, under certain realizability conditions, our estimator is unbiased regardless of the sampling strategy. We empirically show that our technique achieves a superior bias-variance trade-off and performance in presence of off-policy samples.

preprint2022arXiv

An Alternate Policy Gradient Estimator for Softmax Policies

Policy gradient (PG) estimators are ineffective in dealing with softmax policies that are sub-optimally saturated, which refers to the situation when the policy concentrates its probability mass on sub-optimal actions. Sub-optimal policy saturation may arise from bad policy initialization or sudden changes in the environment that occur after the policy has already converged. Current softmax PG estimators require a large number of updates to overcome policy saturation, which causes low sample efficiency and poor adaptability to new situations. To mitigate this problem, we propose a novel PG estimator for softmax policies that utilizes the bias in the critic estimate and the noise present in the reward signal to escape the saturated regions of the policy parameter space. Our theoretical analysis and experiments, conducted on bandits and various reinforcement learning environments, show that this new estimator is significantly more robust to policy saturation.

preprint2022arXiv

Asynchronous Reinforcement Learning for Real-Time Control of Physical Robots

An oft-ignored challenge of real-world reinforcement learning is that the real world does not pause when agents make learning updates. As standard simulated environments do not address this real-time aspect of learning, most available implementations of RL algorithms process environment interactions and learning updates sequentially. As a consequence, when such implementations are deployed in the real world, they may make decisions based on significantly delayed observations and not act responsively. Asynchronous learning has been proposed to solve this issue, but no systematic comparison between sequential and asynchronous reinforcement learning was conducted using real-world environments. In this work, we set up two vision-based tasks with a robotic arm, implement an asynchronous learning system that extends a previous architecture, and compare sequential and asynchronous reinforcement learning across different action cycle times, sensory data dimensions, and mini-batch sizes. Our experiments show that when the time cost of learning updates increases, the action cycle time in sequential implementation could grow excessively long, while the asynchronous implementation can always maintain an appropriate action cycle time. Consequently, when learning updates are expensive, the performance of sequential learning diminishes and is outperformed by asynchronous learning by a substantial margin. Our system learns in real-time to reach and track visual targets from pixels within two hours of experience and does so directly using real robots, learning completely from scratch.

preprint2022arXiv

Continual Backprop: Stochastic Gradient Descent with Persistent Randomness

The Backprop algorithm for learning in neural networks utilizes two mechanisms: first, stochastic gradient descent and second, initialization with small random weights, where the latter is essential to the effectiveness of the former. We show that in continual learning setups, Backprop performs well initially, but over time its performance degrades. Stochastic gradient descent alone is insufficient to learn continually; the initial randomness enables only initial learning but not continual learning. To the best of our knowledge, ours is the first result showing this degradation in Backprop's ability to learn. To address this degradation in Backprop's plasticity, we propose an algorithm that continually injects random features alongside gradient descent using a new generate-and-test process. We call this the \textit{Continual Backprop} algorithm. We show that, unlike Backprop, Continual Backprop is able to continually adapt in both supervised and reinforcement learning (RL) problems. Continual Backprop has the same computational complexity as Backprop and can be seen as a natural extension of Backprop for continual learning.

preprint2022arXiv

Greedification Operators for Policy Optimization: Investigating Forward and Reverse KL Divergences

Approximate Policy Iteration (API) algorithms alternate between (approximate) policy evaluation and (approximate) greedification. Many different approaches have been explored for approximate policy evaluation, but less is understood about approximate greedification and what choices guarantee policy improvement. In this work, we investigate approximate greedification when reducing the KL divergence between the parameterized policy and the Boltzmann distribution over action values. In particular, we investigate the difference between the forward and reverse KL divergences, with varying degrees of entropy regularization. We show that the reverse KL has stronger policy improvement guarantees, but that reducing the forward KL can result in a worse policy. We also demonstrate, however, that a large enough reduction of the forward KL can induce improvement under additional assumptions. Empirically, we show on simple continuous-action environments that the forward KL can induce more exploration, but at the cost of a more suboptimal policy. No significant differences were observed in the discrete-action setting or on a suite of benchmark problems. Throughout, we highlight that many policy gradient methods can be seen as an instance of API, with either the forward or reverse KL for the policy update, and discuss next steps for understanding and improving our policy optimization algorithms.