Researcher profile

Peter Stone

Peter Stone contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
38works
0followers
14topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

38 published item(s)

preprint2026arXiv

VGC-Bench: Towards Mastering Diverse Team Strategies in Competitive Pokémon

Developing AI agents that can robustly adapt to varying strategic landscapes without retraining is a central challenge in multi-agent learning. Pokémon Video Game Championships (VGC) is a domain with a vast space of approximately $10^{139}$ team configurations, far larger than those of other games such as Chess, Go, Poker, StarCraft, or Dota. The combinatorial nature of team building in Pokémon VGC causes optimal strategies to vary substantially depending on both the controlled team and the opponent's team, making generalization uniquely challenging. To advance research on this problem, we introduce VGC-Bench: a benchmark that provides critical infrastructure, standardizes evaluation protocols, and supplies a human-play dataset of over 700,000 battle logs and a range of baseline agents based on heuristics, large language models, behavior cloning, and multi-agent reinforcement learning with empirical game-theoretic methods such as self-play, fictitious play, and double oracle. In the restricted setting where an agent is trained and evaluated in a mirror match with a single team configuration, our methods can win against a professional VGC competitor. We repeat this training and evaluation with progressively larger team sets and find that as the number of teams increases, the best-performing algorithm in the single-team setting has worse performance and is more exploitable, but has improved generalization to unseen teams. Our code and dataset are open-sourced at https://github.com/cameronangliss/vgc-bench and https://huggingface.co/datasets/cameronangliss/vgc-battle-logs.

preprint2024arXiv

Minimum Coverage Sets for Training Robust Ad Hoc Teamwork Agents

Robustly cooperating with unseen agents and human partners presents significant challenges due to the diverse cooperative conventions these partners may adopt. Existing Ad Hoc Teamwork (AHT) methods address this challenge by training an agent with a population of diverse teammate policies obtained through maximizing specific diversity metrics. However, prior heuristic-based diversity metrics do not always maximize the agent's robustness in all cooperative problems. In this work, we first propose that maximizing an AHT agent's robustness requires it to emulate policies in the minimum coverage set (MCS), the set of best-response policies to any partner policies in the environment. We then introduce the L-BRDiv algorithm that generates a set of teammate policies that, when used for AHT training, encourage agents to emulate policies from the MCS. L-BRDiv works by solving a constrained optimization problem to jointly train teammate policies for AHT training and approximating AHT agent policies that are members of the MCS. We empirically demonstrate that L-BRDiv produces more robust AHT agents than state-of-the-art methods in a broader range of two-player cooperative problems without the need for extensive hyperparameter tuning for its objectives. Our study shows that L-BRDiv outperforms the baseline methods by prioritizing discovering distinct members of the MCS instead of repeatedly finding redundant policies.

preprint2023arXiv

Metric Residual Networks for Sample Efficient Goal-Conditioned Reinforcement Learning

Goal-conditioned reinforcement learning (GCRL) has a wide range of potential real-world applications, including manipulation and navigation problems in robotics. Especially in such robotics tasks, sample efficiency is of the utmost importance for GCRL since, by default, the agent is only rewarded when it reaches its goal. While several methods have been proposed to improve the sample efficiency of GCRL, one relatively under-studied approach is the design of neural architectures to support sample efficiency. In this work, we introduce a novel neural architecture for GCRL that achieves significantly better sample efficiency than the commonly-used monolithic network architecture. The key insight is that the optimal action-value function Q^*(s, a, g) must satisfy the triangle inequality in a specific sense. Furthermore, we introduce the metric residual network (MRN) that deliberately decomposes the action-value function Q(s,a,g) into the negated summation of a metric plus a residual asymmetric component. MRN provably approximates any optimal action-value function Q^*(s,a,g), thus making it a fitting neural architecture for GCRL. We conduct comprehensive experiments across 12 standard benchmark environments in GCRL. The empirical results demonstrate that MRN uniformly outperforms other state-of-the-art GCRL neural architectures in terms of sample efficiency.

preprint2023arXiv

Temporal-Logic-Based Reward Shaping for Continuing Reinforcement Learning Tasks

In continuing tasks, average-reward reinforcement learning may be a more appropriate problem formulation than the more common discounted reward formulation. As usual, learning an optimal policy in this setting typically requires a large amount of training experiences. Reward shaping is a common approach for incorporating domain knowledge into reinforcement learning in order to speed up convergence to an optimal policy. However, to the best of our knowledge, the theoretical properties of reward shaping have thus far only been established in the discounted setting. This paper presents the first reward shaping framework for average-reward learning and proves that, under standard assumptions, the optimal policy under the original reward function can be recovered. In order to avoid the need for manual construction of the shaping function, we introduce a method for utilizing domain knowledge expressed as a temporal logic formula. The formula is automatically translated to a shaping function that provides additional reward throughout the learning process. We evaluate the proposed method on three continuing tasks. In all cases, shaping speeds up the average-reward learning rate without any reduction in the performance of the learned policy compared to relevant baselines.

preprint2022arXiv

A Survey of Ad Hoc Teamwork Research

Ad hoc teamwork is the research problem of designing agents that can collaborate with new teammates without prior coordination. This survey makes a two-fold contribution: First, it provides a structured description of the different facets of the ad hoc teamwork problem. Second, it discusses the progress that has been made in the field so far, and identifies the immediate and long-term open problems that need to be addressed in ad hoc teamwork.

preprint2022arXiv

Adversarial Imitation Learning from Video using a State Observer

The imitation learning research community has recently made significant progress towards the goal of enabling artificial agents to imitate behaviors from video demonstrations alone. However, current state-of-the-art approaches developed for this problem exhibit high sample complexity due, in part, to the high-dimensional nature of video observations. Towards addressing this issue, we introduce here a new algorithm called Visual Generative Adversarial Imitation from Observation using a State Observer VGAIfO-SO. At its core, VGAIfO-SO seeks to address sample inefficiency using a novel, self-supervised state observer, which provides estimates of lower-dimensional proprioceptive state representations from high-dimensional images. We show experimentally in several continuous control environments that VGAIfO-SO is more sample efficient than other IfO algorithms at learning from video-only demonstrations and can sometimes even achieve performance close to the Generative Adversarial Imitation from Observation (GAIfO) algorithm that has privileged access to the demonstrator's proprioceptive state information.

preprint2022arXiv

APPL: Adaptive Planner Parameter Learning

While current autonomous navigation systems allow robots to successfully drive themselves from one point to another in specific environments, they typically require extensive manual parameter re-tuning by human robotics experts in order to function in new environments. Furthermore, even for just one complex environment, a single set of fine-tuned parameters may not work well in different regions of that environment. These problems prohibit reliable mobile robot deployment by non-expert users. As a remedy, we propose Adaptive Planner Parameter Learning (APPL), a machine learning framework that can leverage non-expert human interaction via several modalities -- including teleoperated demonstrations, corrective interventions, and evaluative feedback -- and also unsupervised reinforcement learning to learn a parameter policy that can dynamically adjust the parameters of classical navigation systems in response to changes in the environment. APPL inherits safety and explainability from classical navigation systems while also enjoying the benefits of machine learning, i.e., the ability to adapt and improve from experience. We present a suite of individual APPL methods and also a unifying cycle-of-learning scheme that combines all the proposed methods in a framework that can improve navigation performance through continual, iterative human interaction and simulation training.

preprint2022arXiv

Autonomous Ground Navigation in Highly Constrained Spaces: Lessons learned from The BARN Challenge at ICRA 2022

The BARN (Benchmark Autonomous Robot Navigation) Challenge took place at the 2022 IEEE International Conference on Robotics and Automation (ICRA 2022) in Philadelphia, PA. The aim of the challenge was to evaluate state-of-the-art autonomous ground navigation systems for moving robots through highly constrained environments in a safe and efficient manner. Specifically, the task was to navigate a standardized, differential-drive ground robot from a predefined start location to a goal location as quickly as possible without colliding with any obstacles, both in simulation and in the real world. Five teams from all over the world participated in the qualifying simulation competition, three of which were invited to compete with each other at a set of physical obstacle courses at the conference center in Philadelphia. The competition results suggest that autonomous ground navigation in highly constrained spaces, despite seeming ostensibly simple even for experienced roboticists, is actually far from being a solved problem. In this article, we discuss the challenge, the approaches used by the top three winning teams, and lessons learned to direct future research.

preprint2022arXiv

Bottom-Up Skill Discovery from Unsegmented Demonstrations for Long-Horizon Robot Manipulation

We tackle real-world long-horizon robot manipulation tasks through skill discovery. We present a bottom-up approach to learning a library of reusable skills from unsegmented demonstrations and use these skills to synthesize prolonged robot behaviors. Our method starts with constructing a hierarchical task structure from each demonstration through agglomerative clustering. From the task structures of multi-task demonstrations, we identify skills based on the recurring patterns and train goal-conditioned sensorimotor policies with hierarchical imitation learning. Finally, we train a meta controller to compose these skills to solve long-horizon manipulation tasks. The entire model can be trained on a small set of human demonstrations collected within 30 minutes without further annotations, making it amendable to real-world deployment. We systematically evaluated our method in simulation environments and on a real robot. Our method has shown superior performance over state-of-the-art imitation learning methods in multi-stage manipulation tasks. Furthermore, skills discovered from multi-task demonstrations boost the average task success by $8\%$ compared to those discovered from individual tasks.

preprint2022arXiv

Causal Dynamics Learning for Task-Independent State Abstraction

Learning dynamics models accurately is an important goal for Model-Based Reinforcement Learning (MBRL), but most MBRL methods learn a dense dynamics model which is vulnerable to spurious correlations and therefore generalizes poorly to unseen states. In this paper, we introduce Causal Dynamics Learning for Task-Independent State Abstraction (CDL), which first learns a theoretically proved causal dynamics model that removes unnecessary dependencies between state variables and the action, thus generalizing well to unseen states. A state abstraction can then be derived from the learned dynamics, which not only improves sample efficiency but also applies to a wider range of tasks than existing state abstraction methods. Evaluated on two simulated environments and downstream tasks, both the dynamics model and policies learned by the proposed method generalize well to unseen states and the derived state abstraction improves sample efficiency compared to learning without it.

preprint2022arXiv

Conflict Avoidance in Social Navigation -- a Survey

A major goal in robotics is to enable intelligent mobile robots to operate smoothly in shared human-robot environments. One of the most fundamental capabilities in service of this goal is competent navigation in this ``social" context. As a result, there has been a recent surge of research on social navigation; and especially as it relates to the handling of conflicts between agents during social navigation. These developments introduce a variety of models and algorithms, however as this research area is inherently interdisciplinary, many of the relevant papers are not comparable and there is no shared standard vocabulary. This survey aims to bridge this gap by introducing such a common language, using it to survey existing work, and highlighting open problems. It starts by defining the boundaries of this survey to a limited, yet highly common type of social navigation - conflict avoidance. Within this proposed scope, this survey introduces a detailed taxonomy of the conflict avoidance components. This survey then maps existing work into this taxonomy, while discussing papers using its framing. Finally, this paper proposes some future research directions and open problems that are currently on the frontier of social navigation to aid ongoing and future research.

preprint2022arXiv

Continual Learning and Private Unlearning

As intelligent agents become autonomous over longer periods of time, they may eventually become lifelong counterparts to specific people. If so, it may be common for a user to want the agent to master a task temporarily but later on to forget the task due to privacy concerns. However enabling an agent to \emph{forget privately} what the user specified without degrading the rest of the learned knowledge is a challenging problem. With the aim of addressing this challenge, this paper formalizes this continual learning and private unlearning (CLPU) problem. The paper further introduces a straightforward but exactly private solution, CLPU-DER++, as the first step towards solving the CLPU problem, along with a set of carefully designed benchmark problems to evaluate the effectiveness of the proposed solution. The code is available at https://github.com/Cranial-XIX/Continual-Learning-Private-Unlearning.

preprint2022arXiv

Dynamic Sparse Training for Deep Reinforcement Learning

Deep reinforcement learning (DRL) agents are trained through trial-and-error interactions with the environment. This leads to a long training time for dense neural networks to achieve good performance. Hence, prohibitive computation and memory resources are consumed. Recently, learning efficient DRL agents has received increasing attention. Yet, current methods focus on accelerating inference time. In this paper, we introduce for the first time a dynamic sparse training approach for deep reinforcement learning to accelerate the training process. The proposed approach trains a sparse neural network from scratch and dynamically adapts its topology to the changing data distribution during training. Experiments on continuous control tasks show that our dynamic sparse agents achieve higher performance than the equivalent dense methods, reduce the parameter count and floating-point operations (FLOPs) by 50%, and have a faster learning speed that enables reaching the performance of dense agents with 40-50% reduction in the training steps.

preprint2022arXiv

Effective Mutation Rate Adaptation through Group Elite Selection

Evolutionary algorithms are sensitive to the mutation rate (MR); no single value of this parameter works well across domains. Self-adaptive MR approaches have been proposed but they tend to be brittle: Sometimes they decay the MR to zero, thus halting evolution. To make self-adaptive MR robust, this paper introduces the Group Elite Selection of Mutation Rates (GESMR) algorithm. GESMR co-evolves a population of solutions and a population of MRs, such that each MR is assigned to a group of solutions. The resulting best mutational change in the group, instead of average mutational change, is used for MR selection during evolution, thus avoiding the vanishing MR problem. With the same number of function evaluations and with almost no overhead, GESMR converges faster and to better solutions than previous approaches on a wide range of continuous test optimization problems. GESMR also scales well to high-dimensional neuroevolution for supervised image-classification tasks and for reinforcement learning control tasks. Remarkably, GESMR produces MRs that are optimal in the long-term, as demonstrated through a comprehensive look-ahead grid search. Thus, GESMR and its theoretical and empirical analysis demonstrate how self-adaptation can be harnessed to improve performance in several applications of evolutionary computation.

preprint2022arXiv

Motion Planning and Control for Mobile Robot Navigation Using Machine Learning: a Survey

Moving in complex environments is an essential capability of intelligent mobile robots. Decades of research and engineering have been dedicated to developing sophisticated navigation systems to move mobile robots from one point to another. Despite their overall success, a recently emerging research thrust is devoted to developing machine learning techniques to address the same problem, based in large part on the success of deep learning. However, to date, there has not been much direct comparison between the classical and emerging paradigms to this problem. In this article, we survey recent works that apply machine learning for motion planning and control in mobile robot navigation, within the context of classical navigation systems. The surveyed works are classified into different categories, which delineate the relationship of the learning approaches to classical methods. Based on this classification, we identify common challenges and promising future directions.

preprint2022arXiv

Real-world challenges for multi-agent reinforcement learning in grid-interactive buildings

Building upon prior research that highlighted the need for standardizing environments for building control research, and inspired by recently introduced challenges for real life reinforcement learning control, here we propose a non-exhaustive set of nine real world challenges for reinforcement learning control in grid-interactive buildings. We argue that research in this area should be expressed in this framework in addition to providing a standardized environment for repeatability. Advanced controllers such as model predictive control and reinforcement learning (RL) control have both advantages and disadvantages that prevent them from being implemented in real world problems. Comparisons between the two are rare, and often biased. By focusing on the challenges, we can investigate the performance of the controllers under a variety of situations and generate a fair comparison. As a demonstration, we implement the offline learning challenge in CityLearn and study the impact of different levels of domain knowledge and complexity of RL algorithms. We show that the sequence of operations utilized in a rule based controller (RBC) used for offline training affects the performance of the RL agents when evaluated on a set of four energy flexibility metrics. Longer offline learning from an optimized RBC leads to improved performance in the long run. RL agents that learn from a simplified RBC risk poorer performance as the offline learning period increases. We also observe no impact on performance from information sharing amongst agents. We call for a more interdisciplinary effort of the research community to address the real world challenges, and unlock the potential of grid-interactive building

preprint2022arXiv

Reward (Mis)design for Autonomous Driving

This article considers the problem of diagnosing certain common errors in reward design. Its insights are also applicable to the design of cost functions and performance metrics more generally. To diagnose common errors, we develop 8 simple sanity checks for identifying flaws in reward functions. These sanity checks are applied to reward functions from past work on reinforcement learning (RL) for autonomous driving (AD), revealing near-universal flaws in reward design for AD that might also exist pervasively across reward design for other tasks. Lastly, we explore promising directions that may aid the design of reward functions for AD in subsequent research, following a process of inquiry that can be adapted to other domains.

preprint2022arXiv

Scalable Multiagent Driving Policies For Reducing Traffic Congestion

Traffic congestion is a major challenge in modern urban settings. The industry-wide development of autonomous and automated vehicles (AVs) motivates the question of how can AVs contribute to congestion reduction. Past research has shown that in small scale mixed traffic scenarios with both AVs and human-driven vehicles, a small fraction of AVs executing a controlled multiagent driving policy can mitigate congestion. In this paper, we scale up existing approaches and develop new multiagent driving policies for AVs in scenarios with greater complexity. We start by showing that a congestion metric used by past research is manipulable in open road network scenarios where vehicles dynamically join and leave the road. We then propose using a different metric that is robust to manipulation and reflects open network traffic efficiency. Next, we propose a modular transfer reinforcement learning approach, and use it to scale up a multiagent driving policy to outperform human-like traffic and existing approaches in a simulated realistic scenario, which is an order of magnitude larger than past scenarios (hundreds instead of tens of vehicles). Additionally, our modular transfer learning approach saves up to 80% of the training time in our experiments, by focusing its data collection on key locations in the network. Finally, we show for the first time a distributed multiagent policy that improves congestion over human-driven traffic. The distributed approach is more realistic and practical, as it relies solely on existing sensing and actuation capabilities, and does not require adding new communication infrastructure.

preprint2022arXiv

Skeletal Feature Compensation for Imitation Learning with Embodiment Mismatch

Learning from demonstrations in the wild (e.g. YouTube videos) is a tantalizing goal in imitation learning. However, for this goal to be achieved, imitation learning algorithms must deal with the fact that the demonstrators and learners may have bodies that differ from one another. This condition -- "embodiment mismatch" -- is ignored by many recent imitation learning algorithms. Our proposed imitation learning technique, SILEM (\textbf{S}keletal feature compensation for \textbf{I}mitation \textbf{L}earning with \textbf{E}mbodiment \textbf{M}ismatch), addresses a particular type of embodiment mismatch by introducing a learned affine transform to compensate for differences in the skeletal features obtained from the learner and expert. We create toy domains based on PyBullet's HalfCheetah and Ant to assess SILEM's benefits for this type of embodiment mismatch. We also provide qualitative and quantitative results on more realistic problems -- teaching simulated humanoid agents, including Atlas from Boston Dynamics, to walk by observing human demonstrations.

preprint2022arXiv

Socially Compliant Navigation Dataset (SCAND): A Large-Scale Dataset of Demonstrations for Social Navigation

Social navigation is the capability of an autonomous agent, such as a robot, to navigate in a 'socially compliant' manner in the presence of other intelligent agents such as humans. With the emergence of autonomously navigating mobile robots in human populated environments (e.g., domestic service robots in homes and restaurants and food delivery robots on public sidewalks), incorporating socially compliant navigation behaviors on these robots becomes critical to ensuring safe and comfortable human robot coexistence. To address this challenge, imitation learning is a promising framework, since it is easier for humans to demonstrate the task of social navigation rather than to formulate reward functions that accurately capture the complex multi objective setting of social navigation. The use of imitation learning and inverse reinforcement learning to social navigation for mobile robots, however, is currently hindered by a lack of large scale datasets that capture socially compliant robot navigation demonstrations in the wild. To fill this gap, we introduce Socially CompliAnt Navigation Dataset (SCAND) a large scale, first person view dataset of socially compliant navigation demonstrations. Our dataset contains 8.7 hours, 138 trajectories, 25 miles of socially compliant, human teleoperated driving demonstrations that comprises multi modal data streams including 3D lidar, joystick commands, odometry, visual and inertial information, collected on two morphologically different mobile robots a Boston Dynamics Spot and a Clearpath Jackal by four different human demonstrators in both indoor and outdoor environments. We additionally perform preliminary analysis and validation through real world robot experiments and show that navigation policies learned by imitation learning on SCAND generate socially compliant behaviors

preprint2022arXiv

VI-IKD: High-Speed Accurate Off-Road Navigation using Learned Visual-Inertial Inverse Kinodynamics

One of the key challenges in high speed off road navigation on ground vehicles is that the kinodynamics of the vehicle terrain interaction can differ dramatically depending on the terrain. Previous approaches to addressing this challenge have considered learning an inverse kinodynamics (IKD) model, conditioned on inertial information of the vehicle to sense the kinodynamic interactions. In this paper, we hypothesize that to enable accurate high-speed off-road navigation using a learned IKD model, in addition to inertial information from the past, one must also anticipate the kinodynamic interactions of the vehicle with the terrain in the future. To this end, we introduce Visual-Inertial Inverse Kinodynamics (VI-IKD), a novel learning based IKD model that is conditioned on visual information from a terrain patch ahead of the robot in addition to past inertial information, enabling it to anticipate kinodynamic interactions in the future. We validate the effectiveness of VI-IKD in accurate high-speed off-road navigation experimentally on a scale 1/5 UT-AlphaTruck off-road autonomous vehicle in both indoor and outdoor environments and show that compared to other state-of-the-art approaches, VI-IKD enables more accurate and robust off-road navigation on a variety of different terrains at speeds of up to 3.5 m/s.

preprint2022arXiv

Visually Grounded Task and Motion Planning for Mobile Manipulation

Task and motion planning (TAMP) algorithms aim to help robots achieve task-level goals, while maintaining motion-level feasibility. This paper focuses on TAMP domains that involve robot behaviors that take extended periods of time (e.g., long-distance navigation). In this paper, we develop a visual grounding approach to help robots probabilistically evaluate action feasibility, and introduce a TAMP algorithm, called GROP, that optimizes both feasibility and efficiency. We have collected a dataset that includes 96,000 simulated trials of a robot conducting mobile manipulation tasks, and then used the dataset to learn to ground symbolic spatial relationships for action feasibility evaluation. Compared with competitive TAMP baselines, GROP exhibited a higher task-completion rate while maintaining lower or comparable action costs. In addition to these extensive experiments in simulation, GROP is fully implemented and tested on a real robot system.

preprint2021arXiv

A Lifelong Learning Approach to Mobile Robot Navigation

This paper presents a self-improving lifelong learning framework for a mobile robot navigating in different environments. Classical static navigation methods require environment-specific in-situ system adjustment, e.g. from human experts, or may repeat their mistakes regardless of how many times they have navigated in the same environment. Having the potential to improve with experience, learning-based navigation is highly dependent on access to training resources, e.g. sufficient memory and fast computation, and is prone to forgetting previously learned capability, especially when facing different environments. In this work, we propose Lifelong Learning for Navigation (LLfN) which (1) improves a mobile robot's navigation behavior purely based on its own experience, and (2) retains the robot's capability to navigate in previous environments after learning in new ones. LLfN is implemented and tested entirely onboard a physical robot with a limited memory and computation budget.

preprint2021arXiv

Agile Robot Navigation through Hallucinated Learning and Sober Deployment

Learning from Hallucination (LfH) is a recent machine learning paradigm for autonomous navigation, which uses training data collected in completely safe environments and adds numerous imaginary obstacles to make the environment densely constrained, to learn navigation planners that produce feasible navigation even in highly constrained (more dangerous) spaces. However, LfH requires hallucinating the robot perception during deployment to match with the hallucinated training data, which creates a need for sometimes-infeasible prior knowledge and tends to generate very conservative planning. In this work, we propose a new LfH paradigm that does not require runtime hallucination -- a feature we call "sober deployment" -- and can therefore adapt to more realistic navigation scenarios. This novel Hallucinated Learning and Sober Deployment (HLSD) paradigm is tested in a benchmark testbed of 300 simulated navigation environments with a wide range of difficulty levels, and in the real-world. In most cases, HLSD outperforms both the original LfH method and a classical navigation planner.

preprint2021arXiv

RAIL: A modular framework for Reinforcement-learning-based Adversarial Imitation Learning

While Adversarial Imitation Learning (AIL) algorithms have recently led to state-of-the-art results on various imitation learning benchmarks, it is unclear as to what impact various design decisions have on performance. To this end, we present here an organizing, modular framework called Reinforcement-learning-based Adversarial Imitation Learning (RAIL) that encompasses and generalizes a popular subclass of existing AIL approaches. Using the view espoused by RAIL, we create two new IfO (Imitation from Observation) algorithms, which we term SAIfO: SAC-based Adversarial Imitation from Observation and SILEM (Skeletal Feature Compensation for Imitation Learning with Embodiment Mismatch). We go into greater depth about SILEM in a separate technical report. In this paper, we focus on SAIfO, evaluating it on a suite of locomotion tasks from OpenAI Gym, and showing that it outperforms contemporaneous RAIL algorithms that perform IfO.

preprint2021arXiv

Toward Agile Maneuvers in Highly Constrained Spaces: Learning from Hallucination

While classical approaches to autonomous robot navigation currently enable operation in certain environments, they break down in tightly constrained spaces, e.g., where the robot needs to engage in agile maneuvers to squeeze between obstacles. Recent machine learning techniques have the potential to address this shortcoming, but existing approaches require vast amounts of navigation experience for training, during which the robot must operate in close proximity to obstacles and risk collision. In this paper, we propose to side-step this requirement by introducing a new machine learning paradigm for autonomous navigation called learning from hallucination (LfH), which can use training data collected in completely safe environments to compute navigation controllers that result in fast, smooth, and safe navigation in highly constrained environments. Our experimental results show that the proposed LfH system outperforms three autonomous navigation baselines on a real robot and generalizes well to unseen environments, including those based on both classical and machine learning techniques.

preprint2021arXiv

VOILA: Visual-Observation-Only Imitation Learning for Autonomous Navigation

While imitation learning for vision based autonomous mobile robot navigation has recently received a great deal of attention in the research community, existing approaches typically require state action demonstrations that were gathered using the deployment platform. However, what if one cannot easily outfit their platform to record these demonstration signals or worse yet the demonstrator does not have access to the platform at all? Is imitation learning for vision based autonomous navigation even possible in such scenarios? In this work, we hypothesize that the answer is yes and that recent ideas from the Imitation from Observation (IfO) literature can be brought to bear such that a robot can learn to navigate using only ego centric video collected by a demonstrator, even in the presence of viewpoint mismatch. To this end, we introduce a new algorithm, Visual Observation only Imitation Learning for Autonomous navigation (VOILA), that can successfully learn navigation policies from a single video demonstration collected from a physically different agent. We evaluate VOILA in the photorealistic AirSim simulator and show that VOILA not only successfully imitates the expert, but that it also learns navigation policies that can generalize to novel environments. Further, we demonstrate the effectiveness of VOILA in a real world setting by showing that it allows a wheeled Jackal robot to successfully imitate a human walking in an environment using a video recorded using a mobile phone camera.

preprint2020arXiv

An Imitation from Observation Approach to Transfer Learning with Dynamics Mismatch

We examine the problem of transferring a policy learned in a source environment to a target environment with different dynamics, particularly in the case where it is critical to reduce the amount of interaction with the target environment during learning. This problem is particularly important in sim-to-real transfer because simulators inevitably model real-world dynamics imperfectly. In this paper, we show that one existing solution to this transfer problem - grounded action transformation - is closely related to the problem of imitation from observation (IfO): learning behaviors that mimic the observations of behavior demonstrations. After establishing this relationship, we hypothesize that recent state-of-the-art approaches from the IfO literature can be effectively repurposed for grounded transfer learning.To validate our hypothesis we derive a new algorithm - generative adversarial reinforced action transformation (GARAT) - based on adversarial imitation from observation techniques. We run experiments in several domains with mismatched dynamics, and find that agents trained with GARAT achieve higher returns in the target environment compared to existing black-box transfer methods

preprint2020arXiv

APPLD: Adaptive Planner Parameter Learning from Demonstration

Existing autonomous robot navigation systems allow robots to move from one point to another in a collision-free manner. However, when facing new environments, these systems generally require re-tuning by expert roboticists with a good understanding of the inner workings of the navigation system. In contrast, even users who are unversed in the details of robot navigation algorithms can generate desirable navigation behavior in new environments via teleoperation. In this paper, we introduce APPLD, Adaptive Planner Parameter Learning from Demonstration, that allows existing navigation systems to be successfully applied to new complex environments, given only a human teleoperated demonstration of desirable navigation. APPLD is verified on two robots running different navigation systems in different environments. Experimental results show that APPLD can outperform navigation systems with the default and expert-tuned parameters, and even the human demonstrator themselves.

preprint2020arXiv

Artificial Musical Intelligence: A Survey

Computers have been used to analyze and create music since they were first introduced in the 1950s and 1960s. Beginning in the late 1990s, the rise of the Internet and large scale platforms for music recommendation and retrieval have made music an increasingly prevalent domain of machine learning and artificial intelligence research. While still nascent, several different approaches have been employed to tackle what may broadly be referred to as "musical intelligence." This article provides a definition of musical intelligence, introduces a taxonomy of its constituent components, and surveys the wide range of AI methods that can be, and have been, brought to bear in its pursuit, with a particular emphasis on machine learning methods.

preprint2020arXiv

Curriculum Learning for Reinforcement Learning Domains: A Framework and Survey

Reinforcement learning (RL) is a popular paradigm for addressing sequential decision tasks in which the agent has only limited environmental feedback. Despite many advances over the past three decades, learning in many domains still requires a large amount of interaction with the environment, which can be prohibitively expensive in realistic scenarios. To address this problem, transfer learning has been applied to reinforcement learning such that experience gained in one task can be leveraged when starting to learn the next, harder task. More recently, several lines of research have explored how tasks, or data samples themselves, can be sequenced into a curriculum for the purpose of learning a problem that may otherwise be too difficult to learn from scratch. In this article, we present a framework for curriculum learning (CL) in reinforcement learning, and use it to survey and classify existing CL methods in terms of their assumptions, capabilities, and goals. Finally, we use our framework to find open problems and suggest directions for future RL curriculum learning research.

preprint2020arXiv

Deep R-Learning for Continual Area Sweeping

Coverage path planning is a well-studied problem in robotics in which a robot must plan a path that passes through every point in a given area repeatedly, usually with a uniform frequency. To address the scenario in which some points need to be visited more frequently than others, this problem has been extended to non-uniform coverage planning. This paper considers the variant of non-uniform coverage in which the robot does not know the distribution of relevant events beforehand and must nevertheless learn to maximize the rate of detecting events of interest. This continual area sweeping problem has been previously formalized in a way that makes strong assumptions about the environment, and to date only a greedy approach has been proposed. We generalize the continual area sweeping formulation to include fewer environmental constraints, and propose a novel approach based on reinforcement learning in a Semi-Markov Decision Process. This approach is evaluated in an abstract simulation and in a high fidelity Gazebo simulation. These evaluations show significant improvement upon the existing approach in general settings, which is especially relevant in the growing area of service robotics.

preprint2020arXiv

Learning and Reasoning for Robot Dialog and Navigation Tasks

Reinforcement learning and probabilistic reasoning algorithms aim at learning from interaction experiences and reasoning with probabilistic contextual knowledge respectively. In this research, we develop algorithms for robot task completions, while looking into the complementary strengths of reinforcement learning and probabilistic reasoning techniques. The robots learn from trial-and-error experiences to augment their declarative knowledge base, and the augmented knowledge can be used for speeding up the learning process in potentially different tasks. We have implemented and evaluated the developed algorithms using mobile robots conducting dialog and navigation tasks. From the results, we see that our robot's performance can be improved by both reasoning with human knowledge and learning from task-completion experience. More interestingly, the robot was able to learn from navigation tasks to improve its dialog strategies.

preprint2020arXiv

Policy Evaluation in Continuous MDPs with Efficient Kernelized Gradient Temporal Difference

We consider policy evaluation in infinite-horizon discounted Markov decision problems (MDPs) with infinite spaces. We reformulate this task a compositional stochastic program with a function-valued decision variable that belongs to a reproducing kernel Hilbert space (RKHS). We approach this problem via a new functional generalization of stochastic quasi-gradient methods operating in tandem with stochastic sparse subspace projections. The result is an extension of gradient temporal difference learning that yields nonlinearly parameterized value function estimates of the solution to the Bellman evaluation equation. Our main contribution is a memory-efficient non-parametric stochastic method guaranteed to converge exactly to the Bellman fixed point with probability $1$ with attenuating step-sizes. Further, with constant step-sizes, we obtain mean convergence to a neighborhood and that the value function estimates have finite complexity. In the Mountain Car domain, we observe faster convergence to lower Bellman error solutions than existing approaches with a fraction of the required memory.

preprint2020arXiv

Reducing Sampling Error in Batch Temporal Difference Learning

Temporal difference (TD) learning is one of the main foundations of modern reinforcement learning. This paper studies the use of TD(0), a canonical TD algorithm, to estimate the value function of a given policy from a batch of data. In this batch setting, we show that TD(0) may converge to an inaccurate value function because the update following an action is weighted according to the number of times that action occurred in the batch -- not the true probability of the action under the given policy. To address this limitation, we introduce \textit{policy sampling error corrected}-TD(0) (PSEC-TD(0)). PSEC-TD(0) first estimates the empirical distribution of actions in each state in the batch and then uses importance sampling to correct for the mismatch between the empirical weighting and the correct weighting for updates following each action. We refine the concept of a certainty-equivalence estimate and argue that PSEC-TD(0) is a more data efficient estimator than TD(0) for a fixed batch of data. Finally, we conduct an empirical evaluation of PSEC-TD(0) on three batch value function learning tasks, with a hyperparameter sensitivity analysis, and show that PSEC-TD(0) produces value function estimates with lower mean squared error than TD(0).

preprint2020arXiv

Reinforced Grounded Action Transformation for Sim-to-Real Transfer

Robots can learn to do complex tasks in simulation, but often, learned behaviors fail to transfer well to the real world due to simulator imperfections (the reality gap). Some existing solutions to this sim-to-real problem, such as Grounded Action Transformation (GAT), use a small amount of real-world experience to minimize the reality gap by grounding the simulator. While very effective in certain scenarios, GAT is not robust on problems that use complex function approximation techniques to model a policy. In this paper, we introduce Reinforced Grounded Action Transformation(RGAT), a new sim-to-real technique that uses Reinforcement Learning (RL) not only to update the target policy in simulation, but also to perform the grounding step itself. This novel formulation allows for end-to-end training during the grounding step, which, compared to GAT, produces a better grounded simulator. Moreover, we show experimentally in several MuJoCo domains that our approach leads to successful transfer for policies modeled using neural networks.

preprint2020arXiv

RIDM: Reinforced Inverse Dynamics Modeling for Learning from a Single Observed Demonstration

Augmenting reinforcement learning with imitation learning is often hailed as a method by which to improve upon learning from scratch. However, most existing methods for integrating these two techniques are subject to several strong assumptions---chief among them that information about demonstrator actions is available. In this paper, we investigate the extent to which this assumption is necessary by introducing and evaluating reinforced inverse dynamics modeling (RIDM), a novel paradigm for combining imitation from observation (IfO) and reinforcement learning with no dependence on demonstrator action information. Moreover, RIDM requires only a single demonstration trajectory and is able to operate directly on raw (unaugmented) state features. We find experimentally that RIDM performs favorably compared to a baseline approach for several tasks in simulation as well as for tasks on a real UR5 robot arm. Experiment videos can be found at https://sites.google.com/view/ridm-reinforced-inverse-dynami.

preprint2020arXiv

Stochastic Grounded Action Transformation for Robot Learning in Simulation

Robot control policies learned in simulation do not often transfer well to the real world. Many existing solutions to this sim-to-real problem, such as the Grounded Action Transformation (GAT) algorithm, seek to correct for or ground these differences by matching the simulator to the real world. However, the efficacy of these approaches is limited if they do not explicitly account for stochasticity in the target environment. In this work, we analyze the problems associated with grounding a deterministic simulator in a stochastic real world environment, and we present examples where GAT fails to transfer a good policy due to stochastic transitions in the target domain. In response, we introduce the Stochastic Grounded Action Transformation(SGAT) algorithm,which models this stochasticity when grounding the simulator. We find experimentally for both simulated and physical target domains that SGAT can find policies that are robust to stochasticity in the target domain