Source author record

J. Andrew Bagnell

J. Andrew Bagnell appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Machine Learning Robotics Artificial Intelligence Computer Science and Game Theory Computer Vision Systems and Control eess.SY Multiagent Systems

Catalog footprint

What is connected

34works

8topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2023arXiv

Minimax Optimal Online Imitation Learning via Replay Estimation

Online imitation learning is the problem of how best to mimic expert demonstrations, given access to the environment or an accurate simulator. Prior work has shown that in the infinite sample regime, exact moment matching achieves value equivalence to the expert policy. However, in the finite sample regime, even if one has no optimization error, empirical variance can lead to a performance gap that scales with $H^2 / N$ for behavioral cloning and $H / \sqrt{N}$ for online moment matching, where $H$ is the horizon and $N$ is the size of the expert dataset. We introduce the technique of replay estimation to reduce this empirical variance: by repeatedly executing cached expert actions in a stochastic simulator, we compute a smoother expert visitation distribution estimate to match. In the presence of general function approximation, we prove a meta theorem reducing the performance gap of our approach to the parameter estimation error for offline classification (i.e. learning the expert policy). In the tabular setting or with linear function approximation, our meta theorem shows that the performance gap incurred by our approach achieves the optimal $\widetilde{O} \left( \min({H^{3/2}} / {N}, {H} / {\sqrt{N}} \right)$ dependency, under significantly weaker assumptions compared to prior work. We implement multiple instantiations of our approach on several continuous control tasks and find that we are able to significantly improve policy performance across a variety of dataset sizes.

preprint2023arXiv

Sequence Model Imitation Learning with Unobserved Contexts

We consider imitation learning problems where the learner's ability to mimic the expert increases throughout the course of an episode as more information is revealed. One example of this is when the expert has access to privileged information: while the learner might not be able to accurately reproduce expert behavior early on in an episode, by considering the entire history of states and actions, they might be able to eventually identify the hidden context and act as the expert would. We prove that on-policy imitation learning algorithms (with or without access to a queryable expert) are better equipped to handle these sorts of asymptotically realizable problems than off-policy methods. This is because on-policy algorithms provably learn to recover from their initially suboptimal actions, while off-policy methods treat their suboptimal past actions as though they came from the expert. This often manifests as a latching behavior: a naive repetition of past actions. We conduct experiments in a toy bandit domain that show that there exist sharp phase transitions of whether off-policy approaches are able to match expert performance asymptotically, in contrast to the uniformly good performance of on-policy approaches. We demonstrate that on several continuous control tasks, on-policy approaches are able to use history to identify the context while off-policy approaches actually perform worse when given access to history.

preprint2022arXiv

Causal Imitation Learning under Temporally Correlated Noise

We develop algorithms for imitation learning from policy data that was corrupted by temporally correlated noise in expert actions. When noise affects multiple timesteps of recorded data, it can manifest as spurious correlations between states and actions that a learner might latch on to, leading to poor policy performance. To break up these spurious correlations, we apply modern variants of the instrumental variable regression (IVR) technique of econometrics, enabling us to recover the underlying policy without requiring access to an interactive expert. In particular, we present two techniques, one of a generative-modeling flavor (DoubIL) that can utilize access to a simulator, and one of a game-theoretic flavor (ResiduIL) that can be run entirely offline. We find both of our algorithms compare favorably to behavioral cloning on simulated control tasks.

preprint2022arXiv

Game-Theoretic Algorithms for Conditional Moment Matching

A variety of problems in econometrics and machine learning, including instrumental variable regression and Bellman residual minimization, can be formulated as satisfying a set of conditional moment restrictions (CMR). We derive a general, game-theoretic strategy for satisfying CMR that scales to nonlinear problems, is amenable to gradient-based optimization, and is able to account for finite sample uncertainty. We recover the approaches of Dikkala et al. and Dai et al. as special cases of our general framework before detailing various extensions and how to efficiently solve the game defined by CMR.

preprint2021arXiv

Feedback in Imitation Learning: The Three Regimes of Covariate Shift

Imitation learning practitioners have often noted that conditioning policies on previous actions leads to a dramatic divergence between "held out" error and performance of the learner in situ. Interactive approaches can provably address this divergence but require repeated querying of a demonstrator. Recent work identifies this divergence as stemming from a "causal confound" in predicting the current action, and seek to ablate causal aspects of current state using tools from causal inference. In this work, we argue instead that this divergence is simply another manifestation of covariate shift, exacerbated particularly by settings of feedback between decisions and input features. The learner often comes to rely on features that are strongly predictive of decisions, but are subject to strong covariate shift. Our work demonstrates a broad class of problems where this shift can be mitigated, both theoretically and practically, by taking advantage of a simulator but without any further querying of expert demonstration. We analyze existing benchmarks used to test imitation learning approaches and find that these benchmarks are realizable and simple and thus insufficient for capturing the harder regimes of error compounding seen in real-world decision making problems. We find, in a surprising contrast with previous literature, but consistent with our theory, that naive behavioral cloning provides excellent results. We detail the need for new standardized benchmarks that capture the phenomena seen in robotics problems.

preprint2020arXiv

Exploration in Action Space

Parameter space exploration methods with black-box optimization have recently been shown to outperform state-of-the-art approaches in continuous control reinforcement learning domains. In this paper, we examine reasons why these methods work better and the situations in which they are worse than traditional action space exploration methods. Through a simple theoretical analysis, we show that when the parametric complexity required to solve the reinforcement learning problem is greater than the product of action space dimensionality and horizon length, exploration in action space is preferred. This is also shown empirically by comparing simple exploration methods on several toy problems.

preprint2020arXiv

TRON: A Fast Solver for Trajectory Optimization with Non-Smooth Cost Functions

Trajectory optimization is an important tool for control and planning of complex, underactuated robots, and has shown impressive results in real world robotic tasks. However, in applications where the cost function to be optimized is non-smooth, modern trajectory optimization methods have extremely slow convergence. In this work, we present TRON, an iterative solver that can be used for efficient trajectory optimization in applications with non-smooth cost functions that are composed of smooth components. TRON achieves this by exploiting the structure of the objective to adaptively smooth the cost function, resulting in a sequence of objectives that can be efficiently optimized. TRON is provably guaranteed to converge to the global optimum of the non-smooth convex cost function when the dynamics are linear, and to a stationary point when the dynamics are nonlinear. Empirically, we show that TRON has faster convergence and lower final costs when compared to other trajectory optimization methods on a range of simulated tasks including collision-free motion planning for a mobile robot, sparse optimal control for surgical needle, and a satellite rendezvous problem.

preprint2016arXiv

A Convex Polynomial Force-Motion Model for Planar Sliding: Identification and Application

We propose a polynomial force-motion model for planar sliding. The set of generalized friction loads is the 1-sublevel set of a polynomial whose gradient directions correspond to generalized velocities. Additionally, the polynomial is confined to be convex even-degree homogeneous in order to obey the maximum work inequality, symmetry, shape invariance in scale, and fast invertibility. We present a simple and statistically-efficient model identification procedure using a sum-of-squares convex relaxation. Simulation and robotic experiments validate the accuracy and efficiency of our approach. We also show practical applications of our model including stable pushing of objects and free sliding dynamic simulations.

preprint2016arXiv

A Discriminative Framework for Anomaly Detection in Large Videos

We address an anomaly detection setting in which training sequences are unavailable and anomalies are scored independently of temporal ordering. Current algorithms in anomaly detection are based on the classical density estimation approach of learning high-dimensional models and finding low-probability events. These algorithms are sensitive to the order in which anomalies appear and require either training data or early context assumptions that do not hold for longer, more complex videos. By defining anomalies as examples that can be distinguished from other examples in the same video, our definition inspires a shift in approaches from classical density estimation to simple discriminative learning. Our contributions include a novel framework for anomaly detection that is (1) independent of temporal ordering of anomalies, and (2) unsupervised, requiring no separate training sequences. We show that our algorithm can achieve state-of-the-art results even when we adjust the setting by removing training sequences from standard datasets.

preprint2016arXiv

Efficient Feature Group Sequencing for Anytime Linear Prediction

We consider \textit{anytime} linear prediction in the common machine learning setting, where features are in groups that have costs. We achieve anytime (or interruptible) predictions by sequencing the computation of feature groups and reporting results using the computed features at interruption. We extend Orthogonal Matching Pursuit (OMP) and Forward Regression (FR) to learn the sequencing greedily under this group setting with costs. We theoretically guarantee that our algorithms achieve near-optimal linear predictions at each budget when a feature group is chosen. With a novel analysis of OMP, we improve its theoretical bound to the same strength as that of FR. In addition, we develop a novel algorithm that consumes cost $4B$ to approximate the optimal performance of \textit{any} cost $B$, and prove that with cost less than $4B$, such an approximation is impossible. To our knowledge, these are the first anytime bounds at \textit{all} budgets. We test our algorithms on two real-world data-sets and evaluate them in terms of anytime linear prediction performance against cost-weighted Group Lasso and alternative greedy algorithms.

preprint2016arXiv

Introspective Perception: Learning to Predict Failures in Vision Systems

As robots aspire for long-term autonomous operations in complex dynamic environments, the ability to reliably take mission-critical decisions in ambiguous situations becomes critical. This motivates the need to build systems that have situational awareness to assess how qualified they are at that moment to make a decision. We call this self-evaluating capability as introspection. In this paper, we take a small step in this direction and propose a generic framework for introspective behavior in perception systems. Our goal is to learn a model to reliably predict failures in a given system, with respect to a task, directly from input sensor data. We present this in the context of vision-based autonomous MAV flight in outdoor natural environments, and show that it effectively handles uncertain situations.

preprint2016arXiv

Learning to Filter with Predictive State Inference Machines

Latent state space models are a fundamental and widely used tool for modeling dynamical systems. However, they are difficult to learn from data and learned models often lack performance guarantees on inference tasks such as filtering and prediction. In this work, we present the PREDICTIVE STATE INFERENCE MACHINE (PSIM), a data-driven method that considers the inference procedure on a dynamical system as a composition of predictors. The key idea is that rather than first learning a latent state space model, and then using the learned model for inference, PSIM directly learns predictors for inference in predictive state space. We provide theoretical guarantees for inference, in both realizable and agnostic settings, and showcase practical performance on a variety of simulated and real world robotics benchmarks.

preprint2016arXiv

Learning Transferable Policies for Monocular Reactive MAV Control

The ability to transfer knowledge gained in previous tasks into new contexts is one of the most important mechanisms of human learning. Despite this, adapting autonomous behavior to be reused in partially similar settings is still an open problem in current robotics research. In this paper, we take a small step in this direction and propose a generic framework for learning transferable motion policies. Our goal is to solve a learning problem in a target domain by utilizing the training data in a different but related source domain. We present this in the context of an autonomous MAV flight using monocular reactive control, and demonstrate the efficacy of our proposed approach through extensive real-world flight experiments in outdoor cluttered environments.

preprint2016arXiv

Robust Monocular Flight in Cluttered Outdoor Environments

Recently, there have been numerous advances in the development of biologically inspired lightweight Micro Aerial Vehicles (MAVs). While autonomous navigation is fairly straight-forward for large UAVs as expensive sensors and monitoring devices can be employed, robust methods for obstacle avoidance remains a challenging task for MAVs which operate at low altitude in cluttered unstructured environments. Due to payload and power constraints, it is necessary for such systems to have autonomous navigation and flight capabilities using mostly passive sensors such as cameras. In this paper, we describe a robust system that enables autonomous navigation of small agile quad-rotors at low altitude through natural forest environments. We present a direct depth estimation approach that is capable of producing accurate, semi-dense depth-maps in real time. Furthermore, a novel wind-resistant control scheme is presented that enables stable way-point tracking even in the presence of strong winds. We demonstrate the performance of our system through extensive experiments on real images and field tests in a cluttered outdoor environment.

preprint2015arXiv

Autonomy Infused Teleoperation with Application to BCI Manipulation

Robot teleoperation systems face a common set of challenges including latency, low-dimensional user commands, and asymmetric control inputs. User control with Brain-Computer Interfaces (BCIs) exacerbates these problems through especially noisy and erratic low-dimensional motion commands due to the difficulty in decoding neural activity. We introduce a general framework to address these challenges through a combination of computer vision, user intent inference, and arbitration between the human input and autonomous control schemes. Adjustable levels of assistance allow the system to balance the operator's capabilities and feelings of comfort and control while compensating for a task's difficulty. We present experimental results demonstrating significant performance improvement using the shared-control assistance framework on adapted rehabilitation benchmarks with two subjects implanted with intracortical brain-computer interfaces controlling a seven degree-of-freedom robotic manipulator as a prosthetic. Our results further indicate that shared assistance mitigates perceived user difficulty and even enables successful performance on previously infeasible tasks. We showcase the extensibility of our architecture with applications to quality-of-life tasks such as opening a door, pouring liquids from containers, and manipulation with novel objects in densely cluttered environments.

preprint2015arXiv

Shared Autonomy via Hindsight Optimization

In shared autonomy, user input and robot autonomy are combined to control a robot to achieve a goal. Often, the robot does not know a priori which goal the user wants to achieve, and must both predict the user's intended goal, and assist in achieving that goal. We formulate the problem of shared autonomy as a Partially Observable Markov Decision Process with uncertainty over the user's goal. We utilize maximum entropy inverse optimal control to estimate a distribution over the user's goal based on the history of inputs. Ideally, the robot assists the user by solving for an action which minimizes the expected cost-to-go for the (unknown) goal. As solving the POMDP to select the optimal action is intractable, we use hindsight optimization to approximate the solution. In a user study, we compare our method to a standard predict-then-blend approach. We find that our method enables users to accomplish tasks more quickly while utilizing less input. However, when asked to rate each system, users were mixed in their assessment, citing a tradeoff between maintaining control authority and accomplishing tasks quickly.

preprint2015arXiv

Visual Chunking: A List Prediction Framework for Region-Based Object Detection

We consider detecting objects in an image by iteratively selecting from a set of arbitrarily shaped candidate regions. Our generic approach, which we term visual chunking, reasons about the locations of multiple object instances in an image while expressively describing object boundaries. We design an optimization criterion for measuring the performance of a list of such detections as a natural extension to a common per-instance metric. We present an efficient algorithm with provable performance for building a high-quality list of detections from any candidate set of region-based proposals. We also develop a simple class-specific algorithm to generate a candidate region instance in near-linear time in the number of low-level superpixels that outperforms other region generating methods. In order to make predictions on novel images at testing time without access to ground truth, we develop learning approaches to emulate these algorithms' behaviors. We demonstrate that our new approach outperforms sophisticated baselines on benchmark datasets.

preprint2014arXiv

A Unified View of Large-scale Zero-sum Equilibrium Computation

The task of computing approximate Nash equilibria in large zero-sum extensive-form games has received a tremendous amount of attention due mainly to the Annual Computer Poker Competition. Immediately after its inception, two competing and seemingly different approaches emerged---one an application of no-regret online learning, the other a sophisticated gradient method applied to a convex-concave saddle-point formulation. Since then, both approaches have grown in relative isolation with advancements on one side not effecting the other. In this paper, we rectify this by dissecting and, in a sense, unify the two views.

preprint2014arXiv

Knapsack Constrained Contextual Submodular List Prediction with Application to Multi-document Summarization

We study the problem of predicting a set or list of options under knapsack constraint. The quality of such lists are evaluated by a submodular reward function that measures both quality and diversity. Similar to DAgger (Ross et al., 2010), by a reduction to online learning, we show how to adapt two sequence prediction models to imitate greedy maximization under knapsack constraint problems: CONSEQOPT (Dey et al., 2012) and SCP (Ross et al., 2013). Experiments on extractive multi-document summarization show that our approach outperforms existing state-of-the-art methods.

preprint2014arXiv

Near Optimal Bayesian Active Learning for Decision Making

How should we gather information to make effective decisions? We address Bayesian active learning and experimental design problems, where we sequentially select tests to reduce uncertainty about a set of hypotheses. Instead of minimizing uncertainty per se, we consider a set of overlapping decision regions of these hypotheses. Our goal is to drive uncertainty into a single decision region as quickly as possible. We identify necessary and sufficient conditions for correctly identifying a decision region that contains all hypotheses consistent with observations. We develop a novel Hyperedge Cutting (HEC) algorithm for this problem, and prove that is competitive with the intractable optimal policy. Our efficient implementation of the algorithm relies on computing subsets of the complete homogeneous symmetric polynomials. Finally, we demonstrate its effectiveness on two practical applications: approximate comparison-based learning and active localization using a robot manipulator.

preprint2014arXiv

Reinforcement and Imitation Learning via Interactive No-Regret Learning

Recent work has demonstrated that problems-- particularly imitation learning and structured prediction-- where a learner's predictions influence the input-distribution it is tested on can be naturally addressed by an interactive approach and analyzed using no-regret online learning. These approaches to imitation learning, however, neither require nor benefit from information about the cost of actions. We extend existing results in two directions: first, we develop an interactive imitation learning approach that leverages cost information; second, we extend the technique to address reinforcement learning. The results provide theoretical support to the commonly observed successes of online approximate policy iteration. Our approach suggests a broad new family of algorithms and provides a unifying view of existing techniques for imitation and reinforcement learning.

preprint2014arXiv

Solving Games with Functional Regret Estimation

We propose a novel online learning method for minimizing regret in large extensive-form games. The approach learns a function approximator online to estimate the regret for choosing a particular action. A no-regret algorithm uses these estimates in place of the true regrets to define a sequence of policies. We prove the approach sound by providing a bound relating the quality of the function approximation and regret of the algorithm. A corollary being that the method is guaranteed to converge to a Nash equilibrium in self-play so long as the regrets are ultimately realizable by the function approximator. Our technique can be understood as a principled generalization of existing work on abstraction in large games; in our work, both the abstraction as well as the equilibrium are learned during self-play. We demonstrate empirically the method achieves higher quality strategies than state-of-the-art abstraction techniques given the same resources.

preprint2014arXiv

Vision and Learning for Deliberative Monocular Cluttered Flight

Cameras provide a rich source of information while being passive, cheap and lightweight for small and medium Unmanned Aerial Vehicles (UAVs). In this work we present the first implementation of receding horizon control, which is widely used in ground vehicles, with monocular vision as the only sensing mode for autonomous UAV flight in dense clutter. We make it feasible on UAVs via a number of contributions: novel coupling of perception and control via relevant and diverse, multiple interpretations of the scene around the robot, leveraging recent advances in machine learning to showcase anytime budgeted cost-sensitive feature selection, and fast non-linear regression for monocular depth prediction. We empirically demonstrate the efficacy of our novel pipeline via real world experiments of more than 2 kms through dense trees with a quadrotor built from off-the-shelf parts. Moreover our pipeline is designed to combine information from other modalities like stereo and lidar as well if available.

preprint2013arXiv

Computational Rationalization: The Inverse Equilibrium Problem

Modeling the purposeful behavior of imperfect agents from a small number of observations is a challenging task. When restricted to the single-agent decision-theoretic setting, inverse optimal control techniques assume that observed behavior is an approximately optimal solution to an unknown decision problem. These techniques learn a utility function that explains the example behavior and can then be used to accurately predict or imitate future behavior in similar observed or unobserved situations. In this work, we consider similar tasks in competitive and cooperative multi-agent domains. Here, unlike single-agent settings, a player cannot myopically maximize its reward; it must speculate on how the other agents may act to influence the game's outcome. Employing the game-theoretic notion of regret and the principle of maximum entropy, we introduce a technique for predicting and generalizing behavior.

preprint2013arXiv

Efficient Touch Based Localization through Submodularity

Many robotic systems deal with uncertainty by performing a sequence of information gathering actions. In this work, we focus on the problem of efficiently constructing such a sequence by drawing an explicit connection to submodularity. Ideally, we would like a method that finds the optimal sequence, taking the minimum amount of time while providing sufficient information. Finding this sequence, however, is generally intractable. As a result, many well-established methods select actions greedily. Surprisingly, this often performs well. Our work first explains this high performance -- we note a commonly used metric, reduction of Shannon entropy, is submodular under certain assumptions, rendering the greedy solution comparable to the optimal plan in the offline setting. However, reacting online to observations can increase performance. Recently developed notions of adaptive submodularity provide guarantees for a greedy algorithm in this online setting. In this work, we develop new methods based on adaptive submodularity for selecting a sequence of information gathering actions online. In addition to providing guarantees, we can capitalize on submodularity to attain additional computational speedups. We demonstrate the effectiveness of these methods in simulation and on a robot.

preprint2013arXiv

Learning Policies for Contextual Submodular Prediction

Many prediction domains, such as ad placement, recommendation, trajectory prediction, and document summarization, require predicting a set or list of options. Such lists are often evaluated using submodular reward functions that measure both quality and diversity. We propose a simple, efficient, and provably near-optimal approach to optimizing such prediction problems based on no-regret learning. Our method leverages a surprising result from online submodular optimization: a single no-regret online learner can compete with an optimal sequence of predictions. Compared to previous work, which either learn a sequence of classifiers or rely on stronger assumptions such as realizability, we ensure both data-efficiency as well as performance guarantees in the fully agnostic setting. Experiments validate the efficiency and applicability of the approach on a wide range of problems including manipulator trajectory optimization, news recommendation and document summarization.

preprint2013arXiv

SpeedMachines: Anytime Structured Prediction

Structured prediction plays a central role in machine learning applications from computational biology to computer vision. These models require significantly more computation than unstructured models, and, in many applications, algorithms may need to make predictions within a computational budget or in an anytime fashion. In this work we propose an anytime technique for learning structured prediction that, at training time, incorporates both structural elements and feature computation trade-offs that affect test-time inference. We apply our technique to the challenging problem of scene understanding in computer vision and demonstrate efficient and anytime predictions that gradually improve towards state-of-the-art classification performance as the allotted time increases.

preprint2012arXiv

Agnostic System Identification for Model-Based Reinforcement Learning

A fundamental problem in control is to learn a model of a system from observations that is useful for controller synthesis. To provide good performance guarantees, existing methods must assume that the real system is in the class of models considered during learning. We present an iterative method with strong guarantees even in the agnostic case where the system is not in the class. In particular, we show that any no-regret online learning algorithm can be used to obtain a near-optimal policy, provided some model achieves low training error and access to a good exploration distribution. Our approach applies to both discrete and continuous domains. We demonstrate its efficacy and scalability on a challenging helicopter domain from the literature.

preprint2012arXiv

Generalized Boosting Algorithms for Convex Optimization

Boosting is a popular way to derive powerful learners from simpler hypothesis classes. Following previous work (Mason et al., 1999; Friedman, 2000) on general boosting frameworks, we analyze gradient-based descent algorithms for boosting with respect to any convex objective and introduce a new measure of weak learner performance into this setting which generalizes existing work. We present the weak to strong learning guarantees for the existing gradient boosting work for strongly-smooth, strongly-convex objectives under this new measure of performance, and also demonstrate that this work fails for non-smooth objectives. To address this issue, we present new algorithms which extend this boosting approach to arbitrary convex loss functions and give corresponding weak to strong convergence results. In addition, we demonstrate experimental results that support our analysis and demonstrate the need for the new algorithms we present.

preprint2012arXiv

Learning Monocular Reactive UAV Control in Cluttered Natural Environments

Autonomous navigation for large Unmanned Aerial Vehicles (UAVs) is fairly straight-forward, as expensive sensors and monitoring devices can be employed. In contrast, obstacle avoidance remains a challenging task for Micro Aerial Vehicles (MAVs) which operate at low altitude in cluttered environments. Unlike large vehicles, MAVs can only carry very light sensors, such as cameras, making autonomous navigation through obstacles much more challenging. In this paper, we describe a system that navigates a small quadrotor helicopter autonomously at low altitude through natural forest environments. Using only a single cheap camera to perceive the environment, we are able to maintain a constant velocity of up to 1.5m/s. Given a small set of human pilot demonstrations, we use recent state-of-the-art imitation learning techniques to train a controller that can avoid trees by adapting the MAVs heading. We demonstrate the performance of our system in a more controlled environment indoors, and in real natural forest environments outdoors.

preprint2012arXiv

Predicting Contextual Sequences via Submodular Function Maximization

Sequence optimization, where the items in a list are ordered to maximize some reward has many applications such as web advertisement placement, search, and control libraries in robotics. Previous work in sequence optimization produces a static ordering that does not take any features of the item or context of the problem into account. In this work, we propose a general approach to order the items within the sequence based on the context (e.g., perceptual information, environment description, and goals). We take a simple, efficient, reduction-based approach where the choice and order of the items is established by repeatedly learning simple classifiers or regressors for each "slot" in the sequence. Our approach leverages recent work on submodular function maximization to provide a formal regret reduction from submodular sequence optimization to simple cost-sensitive prediction. We apply our contextual sequence prediction algorithm to optimize control libraries and demonstrate results on two robotics problems: manipulator trajectory prediction and mobile robot path planning.

preprint2011arXiv

A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning

Sequential prediction problems such as imitation learning, where future observations depend on previous predictions (actions), violate the common i.i.d. assumptions made in statistical learning. This leads to poor performance in theory and often in practice. Some recent approaches provide stronger guarantees in this setting, but remain somewhat unsatisfactory as they train either non-stationary or stochastic policies and require a large number of iterations. In this paper, we propose a new iterative algorithm, which trains a stationary deterministic policy, that can be seen as a no regret algorithm in an online learning setting. We show that any such no regret algorithm, combined with additional reduction assumptions, must find a policy with good performance under the distribution of observations it induces in such sequential settings. We demonstrate that this new approach outperforms previous approaches on two challenging imitation learning problems and a benchmark sequence labeling problem.

preprint2011arXiv

Computational Rationalization: The Inverse Equilibrium Problem

Modeling the purposeful behavior of imperfect agents from a small number of observations is a challenging task. When restricted to the single-agent decision-theoretic setting, inverse optimal control techniques assume that observed behavior is an approximately optimal solution to an unknown decision problem. These techniques learn a utility function that explains the example behavior and can then be used to accurately predict or imitate future behavior in similar observed or unobserved situations. In this work, we consider similar tasks in competitive and cooperative multi-agent domains. Here, unlike single-agent settings, a player cannot myopically maximize its reward --- it must speculate on how the other agents may act to influence the game's outcome. Employing the game-theoretic notion of regret and the principle of maximum entropy, we introduce a technique for predicting and generalizing behavior, as well as recovering a reward function in these domains.

preprint2011arXiv

Stability Conditions for Online Learnability

Stability is a general notion that quantifies the sensitivity of a learning algorithm's output to small change in the training dataset (e.g. deletion or replacement of a single training sample). Such conditions have recently been shown to be more powerful to characterize learnability in the general learning setting under i.i.d. samples where uniform convergence is not necessary for learnability, but where stability is both sufficient and necessary for learnability. We here show that similar stability conditions are also sufficient for online learnability, i.e. whether there exists a learning algorithm such that under any sequence of examples (potentially chosen adversarially) produces a sequence of hypotheses that has no regret in the limit with respect to the best hypothesis in hindsight. We introduce online stability, a stability condition related to uniform-leave-one-out stability in the batch setting, that is sufficient for online learnability. In particular we show that popular classes of online learners, namely algorithms that fall in the category of Follow-the-(Regularized)-Leader, Mirror Descent, gradient-based methods and randomized algorithms like Weighted Majority and Hedge, are guaranteed to have no regret if they have such online stability property. We provide examples that suggest the existence of an algorithm with such stability condition might in fact be necessary for online learnability. For the more restricted binary classification setting, we establish that such stability condition is in fact both sufficient and necessary. We also show that for a large class of online learnable problems in the general learning setting, namely those with a notion of sub-exponential covering, no-regret online algorithms that have such stability condition exists.

J. Andrew Bagnell

What is connected

Connect this record

See the researcher in context

Building this map preview

34 published item(s)

Minimax Optimal Online Imitation Learning via Replay Estimation

Sequence Model Imitation Learning with Unobserved Contexts

Causal Imitation Learning under Temporally Correlated Noise

Game-Theoretic Algorithms for Conditional Moment Matching

Feedback in Imitation Learning: The Three Regimes of Covariate Shift

Exploration in Action Space

TRON: A Fast Solver for Trajectory Optimization with Non-Smooth Cost Functions

A Convex Polynomial Force-Motion Model for Planar Sliding: Identification and Application

A Discriminative Framework for Anomaly Detection in Large Videos

Efficient Feature Group Sequencing for Anytime Linear Prediction

Introspective Perception: Learning to Predict Failures in Vision Systems

Learning to Filter with Predictive State Inference Machines

Learning Transferable Policies for Monocular Reactive MAV Control

Robust Monocular Flight in Cluttered Outdoor Environments

Autonomy Infused Teleoperation with Application to BCI Manipulation

Shared Autonomy via Hindsight Optimization

Visual Chunking: A List Prediction Framework for Region-Based Object Detection

A Unified View of Large-scale Zero-sum Equilibrium Computation

Knapsack Constrained Contextual Submodular List Prediction with Application to Multi-document Summarization

Near Optimal Bayesian Active Learning for Decision Making

Reinforcement and Imitation Learning via Interactive No-Regret Learning

Solving Games with Functional Regret Estimation

Vision and Learning for Deliberative Monocular Cluttered Flight

Computational Rationalization: The Inverse Equilibrium Problem

Efficient Touch Based Localization through Submodularity

Learning Policies for Contextual Submodular Prediction

SpeedMachines: Anytime Structured Prediction

Agnostic System Identification for Model-Based Reinforcement Learning

Generalized Boosting Algorithms for Convex Optimization

Learning Monocular Reactive UAV Control in Cluttered Natural Environments

Predicting Contextual Sequences via Submodular Function Maximization

A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning

Computational Rationalization: The Inverse Equilibrium Problem

Stability Conditions for Online Learnability