Source author record

Marek Petrik

Marek Petrik appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.


Machine Learning, Artificial Intelligence, math.OC, astro-ph.IM, Computer Science and Game Theory, physics.space-ph, q-fin.PM, q-fin.RM

Catalog footprint

19 works, 8 topics, 4 close collaborators


preprint · 2021 · arXiv

Optimizing Percentile Criterion Using Robust MDPs

We address the problem of computing reliable policies in reinforcement learning problems with limited data. In particular, we compute policies that achieve good returns with high confidence when deployed. This objective, known as the percentile criterion, can be optimized using Robust MDPs (RMDPs). RMDPs generalize MDPs to allow for uncertain transition probabilities chosen adversarially from given ambiguity sets. We show that the RMDP solution's sub-optimality depends on the spans of the ambiguity sets along the value function. We then propose new algorithms that minimize the span of ambiguity sets defined by weighted $L_1$ and $L_\infty$ norms. Our primary focus is on Bayesian guarantees, but we also describe how our methods apply to frequentist guarantees and derive new concentration inequalities for weighted $L_1$ and $L_\infty$ norms. Experimental results indicate that our optimized ambiguity sets improve significantly on prior construction methods.
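For orientation, the two objectives named in this abstract are commonly written as follows. The notation is an assumption of this note, not taken from the abstract: $\rho(\pi, P)$ is the return of policy $\pi$ under transition model $P$, $f$ is a posterior distribution over models, $\delta$ is the tolerated failure probability, and $\mathcal{P}$ is the ambiguity set.

```latex
% Percentile criterion: the best return y that holds with confidence 1 - delta
\[
  \max_{\pi,\, y} \; y
  \quad \text{s.t.} \quad
  \mathbb{P}_{P \sim f}\bigl[\rho(\pi, P) \ge y\bigr] \ge 1 - \delta
\]
% Robust MDP objective: worst case over the ambiguity set
\[
  \max_{\pi} \; \min_{P \in \mathcal{P}} \; \rho(\pi, P)
\]
```

Under this reading, if $\mathcal{P}$ contains the true model with probability at least $1 - \delta$, the robust objective is a lower bound on the percentile objective, which is why the shape and size of the ambiguity set matter.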

preprint · 2021 · arXiv

Robust Maximum Entropy Behavior Cloning

Imitation learning (IL) algorithms use expert demonstrations to learn a specific task. Most existing approaches assume that all expert demonstrations are reliable and trustworthy, but what if some of the given demonstrations are adversarial? This may result in poor decision-making performance. We propose a novel general framework that directly generates a policy from demonstrations and autonomously detects the adversarial demonstrations and excludes them from the dataset. At the same time, it is sample- and time-efficient and does not require a simulator. To model such adversarial demonstrations, we propose a min-max problem that leverages the entropy of the model to assign weights to each demonstration. This allows us to learn the behavior using only the correct demonstrations or a mixture of correct demonstrations.

preprint · 2021 · arXiv

Soft-Robust Algorithms for Batch Reinforcement Learning

In reinforcement learning, robust policies for high-stakes decision-making problems with limited data are usually computed by optimizing the percentile criterion, which minimizes the probability of a catastrophic failure. Unfortunately, such policies are typically overly conservative as the percentile criterion is non-convex, difficult to optimize, and ignores the mean performance. To overcome these shortcomings, we study the soft-robust criterion, which uses risk measures to better balance the mean and percentile criteria. In this paper, we establish the soft-robust criterion's fundamental properties, show that it is NP-hard to optimize, and propose and analyze two algorithms to approximately optimize it. Our theoretical analyses and empirical evaluations demonstrate that our algorithms compute much less conservative solutions than the existing approximate methods for optimizing the percentile criterion.

preprint · 2020 · arXiv

Entropic Risk Constrained Soft-Robust Policy Optimization

Having a perfect model to compute the optimal policy is often infeasible in reinforcement learning. It is important in high-stakes domains to quantify and manage risk induced by model uncertainties. The entropic risk measure is an exponential utility-based convex risk measure that satisfies many reasonable properties. In this paper, we propose entropic-risk-constrained policy gradient and actor-critic algorithms that are risk-averse to the model uncertainty. We demonstrate the usefulness of our algorithms on several problem domains.
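As a rough illustration of the risk measure named in this abstract, the sketch below estimates the entropic risk of sampled returns. It assumes the rewards convention $-\frac{1}{\alpha}\log \mathbb{E}[e^{-\alpha X}]$ with risk-aversion level $\alpha > 0$; the function name `entropic_risk` and the sample data are hypothetical, and the paper's own formulation may differ.

```python
import numpy as np

def entropic_risk(returns, alpha):
    """Sample estimate of the entropic risk of returns (rewards convention).

    Computes -(1/alpha) * log(mean(exp(-alpha * returns))) for alpha > 0.
    Assumption of this sketch: larger returns are better and alpha is the
    risk-aversion level; neither detail is specified in the abstract.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    x = np.asarray(returns, dtype=float)
    z = -alpha * x
    z_max = z.max()
    # log-mean-exp, shifted by the maximum for numerical stability
    log_mean_exp = z_max + np.log(np.mean(np.exp(z - z_max)))
    return -log_mean_exp / alpha

# Hypothetical returns of one policy under several sampled models
sampled_returns = [0.0, 10.0]
print(entropic_risk(sampled_returns, alpha=1.0))   # well below the mean of 5
print(entropic_risk(sampled_returns, alpha=1e-6))  # close to the mean
```

By Jensen's inequality the value never exceeds the sample mean, and it moves toward the worst sampled return as `alpha` grows, which is what makes it usable as a risk-averse constraint.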

preprint · 2020 · arXiv

Finite-Sample Analysis of Proximal Gradient TD Algorithms

In this paper, we analyze the convergence rate of the gradient temporal difference learning (GTD) family of algorithms. Previous analyses of this class of algorithms use ODE techniques to prove asymptotic convergence, and to the best of our knowledge, no finite-sample analysis has been done. Moreover, there has not been much work on finite-sample analysis for convergent off-policy reinforcement learning algorithms. In this paper, we formulate GTD methods as stochastic gradient algorithms w.r.t. a primal-dual saddle-point objective function, and then conduct a saddle-point error analysis to obtain finite-sample bounds on their performance. Two revised algorithms are also proposed, namely projected GTD2 and GTD2-MP, which offer improved convergence guarantees and acceleration, respectively. The results of our theoretical analysis show that the GTD family of algorithms is indeed comparable to the existing LSTD methods in off-policy learning scenarios.

preprint · 2020 · arXiv

MMS SITL Ground Loop: Automating the burst data selection process

Global-scale energy flow throughout Earth's magnetosphere (MSP) is catalyzed by processes that occur at Earth's magnetopause (MP). Magnetic reconnection is one process responsible for solar wind entry into and global convection within the MSP, and the MP location, orientation, and motion have an impact on the dynamics. Statistical studies that focus on these and other MP phenomena and characteristics inherently require MP identification in their event search criteria, a task that can be automated using machine learning. We introduce a Long Short-Term Memory (LSTM) Recurrent Neural Network model to detect MP crossings and assist studies of energy transfer into the MSP. As its first application, the LSTM has been implemented into the operational data stream of the Magnetospheric Multiscale (MMS) mission. MMS focuses on the electron diffusion region (EDR) of reconnection, where electron dynamics break magnetic field lines and plasma is energized. MMS employs automated burst triggers onboard the spacecraft and a Scientist-in-the-Loop (SITL) on the ground to select intervals likely to contain diffusion regions. Only low-resolution data is available to the SITL, which is insufficient to resolve electron dynamics. A strategy for the SITL, then, is to select all MP crossings. Of all 219 SITL selections classified as MP crossings during the first five months of model operations, the model predicted 166 (76%) of them, and of all 360 model predictions, 257 (71%) were selected by the SITL. Most predictions that were not classified as MP crossings by the SITL were still MP-like; the intervals contained mixed magnetosheath and magnetospheric plasmas. The LSTM model and its predictions are public to ease the burden of arduous event searches involving the MP, including those for EDRs. For MMS, this helps reduce mission operation costs by consolidating manual classification processes into automated routines.

preprint · 2020 · arXiv

Partial Policy Iteration for L1-Robust Markov Decision Processes

Robust Markov decision processes (MDPs) make it possible to compute reliable solutions for dynamic decision problems whose evolution is modeled by rewards and partially-known transition probabilities. Unfortunately, accounting for uncertainty in the transition probabilities significantly increases the computational complexity of solving robust MDPs, which severely limits their scalability. This paper describes new efficient algorithms for solving the common class of robust MDPs with s- and sa-rectangular ambiguity sets defined by weighted $L_1$ norms. We propose partial policy iteration, a new, efficient, flexible, and general policy iteration scheme for robust MDPs. We also propose fast methods for computing the robust Bellman operator in quasi-linear time, nearly matching the linear complexity of the non-robust Bellman operator. Our experimental results indicate that the proposed methods are many orders of magnitude faster than the state-of-the-art approach, which uses linear programming solvers combined with a robust value iteration.
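To make the inner problem of the robust Bellman operator concrete, the sketch below solves the simplest special case: the worst-case expected value over an unweighted $L_1$ ball around a nominal distribution. This is a well-known sort-based greedy routine, not the weighted-norm method the paper proposes; the function name `worst_case_l1` and the example numbers are hypothetical.

```python
import numpy as np

def worst_case_l1(values, p_nominal, kappa):
    """Minimize p @ values over distributions p with ||p - p_nominal||_1 <= kappa.

    Unweighted L1 ball only (an assumption of this sketch). The greedy rule
    moves up to kappa / 2 of probability mass onto the lowest-value state and
    takes it away from the highest-value states first. Sorting dominates the
    cost, so the routine runs in O(n log n) time.
    """
    if kappa < 0:
        raise ValueError("kappa must be non-negative")
    v = np.asarray(values, dtype=float)
    p = np.array(p_nominal, dtype=float)  # copy so the input is not modified
    i_min = int(np.argmin(v))
    # Mass to shift: limited by the budget and by what the other states hold
    eps = min(kappa / 2.0, 1.0 - p[i_min])
    p[i_min] += eps
    for i in np.argsort(v)[::-1]:  # visit states from highest value down
        if eps <= 0.0:
            break
        if i == i_min:
            continue
        delta = min(eps, p[i])
        p[i] -= delta
        eps -= delta
    return float(p @ v), p

# Hypothetical example: three successor states with uniform nominal probabilities
value, p_worst = worst_case_l1([1.0, 2.0, 3.0], [1 / 3, 1 / 3, 1 / 3], kappa=0.4)
print(value, p_worst)
```

With `kappa = 0` the routine returns the nominal expectation, and with `kappa = 2` it puts all mass on the lowest-value state, so the budget interpolates between the non-robust and the fully adversarial Bellman update.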

preprint · 2020 · arXiv

Proximal Gradient Temporal Difference Learning: Stable Reinforcement Learning with Polynomial Sample Complexity

In this paper, we introduce proximal gradient temporal difference learning, which provides a principled way of designing and analyzing true stochastic gradient temporal difference learning algorithms. We show how gradient TD (GTD) reinforcement learning methods can be formally derived, not by starting from their original objective functions, as previously attempted, but rather from a primal-dual saddle-point objective function. We also conduct a saddle-point error analysis to obtain finite-sample bounds on their performance. Previous analyses of this class of algorithms use stochastic approximation techniques to prove asymptotic convergence, and do not provide any finite-sample analysis. We also propose an accelerated algorithm, called GTD2-MP, that uses proximal "mirror maps" to yield an improved convergence rate. The results of our theoretical analysis imply that the GTD family of algorithms are comparable and may indeed be preferred over existing least squares TD methods for off-policy learning, due to their linear complexity. We provide experimental results showing the improved performance of our accelerated gradient TD methods.
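The saddle-point view described here can be illustrated with the standard GTD2 update under linear function approximation. The sketch below is a minimal version under stated assumptions: importance-sampling weights for off-policy data are omitted, the step sizes are fixed, and the name `gtd2_step` and the toy transition are hypothetical. It does not implement the accelerated GTD2-MP variant.

```python
import numpy as np

def gtd2_step(theta, w, phi, reward, phi_next, gamma, alpha, beta):
    """One GTD2 update with linear features.

    theta is the primal (value) weight vector and w is the auxiliary (dual)
    vector. Read as stochastic gradient steps on the saddle-point objective
    <b - A theta, w> - 0.5 * w' C w, with A = E[phi (phi - gamma phi_next)'],
    b = E[reward * phi] and C = E[phi phi'], the update descends in theta
    and ascends in w. Both updates use the values from before the step.
    """
    delta = reward + gamma * (phi_next @ theta) - phi @ theta  # TD error
    theta_new = theta + alpha * (phi - gamma * phi_next) * (phi @ w)
    w_new = w + beta * (delta - phi @ w) * phi
    return theta_new, w_new

# Hypothetical transition with two one-hot features
theta, w = np.zeros(2), np.zeros(2)
phi, phi_next = np.array([1.0, 0.0]), np.array([0.0, 1.0])
for _ in range(2):
    theta, w = gtd2_step(theta, w, phi, 1.0, phi_next,
                         gamma=0.9, alpha=0.1, beta=0.5)
print(theta, w)
```

Each step costs time linear in the number of features, which is the complexity advantage over least-squares TD methods that the abstract refers to.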

preprint · 2016 · arXiv

Building an Interpretable Recommender via Loss-Preserving Transformation

We propose a method for building an interpretable recommender system for personalizing online content and promotions. Historical data available for the system consists of customer features, provided content (promotions), and user responses. Unlike in a standard multi-class classification setting, misclassification costs depend on both recommended actions and customers. Our method transforms such a data set to a new set which can be used with standard interpretable multi-class classification algorithms. The transformation has the desirable property that minimizing the standard misclassification penalty in this new space is equivalent to minimizing the custom cost function.

preprint · 2016 · arXiv

Safe Policy Improvement by Minimizing Robust Baseline Regret

An important problem in sequential decision-making under uncertainty is to use limited data to compute a safe policy, i.e., a policy that is guaranteed to perform at least as well as a given baseline strategy. In this paper, we develop and analyze a new model-based approach to compute a safe policy when we have access to an inaccurate dynamics model of the system with known accuracy guarantees. Our proposed robust method uses this (inaccurate) model to directly minimize the (negative) regret w.r.t. the baseline policy. Contrary to the existing approaches, minimizing the regret allows one to improve the baseline policy in states with accurate dynamics and to seamlessly fall back to the baseline policy otherwise. We show that our formulation is NP-hard and propose an approximate algorithm. Our empirical results on several domains show that even this relatively simple approximate algorithm can significantly outperform standard approaches.

preprint · 2015 · arXiv

Robust Partially-Compressed Least-Squares

Randomized matrix compression techniques, such as the Johnson-Lindenstrauss transform, have emerged as an effective and practical way of solving large-scale problems efficiently. With a focus on computational efficiency, however, forsaking solution quality and accuracy becomes the trade-off. In this paper, we investigate compressed least-squares problems and propose new models and algorithms that address the issue of error and noise introduced by compression. While maintaining computational efficiency, our models provide robust solutions that are more accurate, relative to solutions of uncompressed least-squares, than those of classical compressed variants. We introduce tools from robust optimization together with a form of partial compression to improve the error-time trade-offs of compressed least-squares solvers. We develop an efficient solution algorithm for our Robust Partially-Compressed (RPC) model based on a reduction to a one-dimensional search. We also derive the first approximation error bounds for Partially-Compressed least-squares solutions. Empirical results comparing numerous alternatives suggest that robust and partially compressed solutions are effectively insulated against aggressive randomized transforms.
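For readers unfamiliar with the baseline being improved on, the sketch below shows classical (fully) compressed least squares with a Gaussian random projection. It is not the robust or partially-compressed model from the paper; the function name `compressed_least_squares`, the Gaussian sketch matrix, and the problem sizes are assumptions chosen for illustration.

```python
import numpy as np

def compressed_least_squares(A, b, sketch_rows, seed=0):
    """Classical sketch-and-solve least squares.

    Replaces min ||A x - b|| by min ||Phi A x - Phi b||, where Phi is a
    random Gaussian matrix with sketch_rows rows (one common
    Johnson-Lindenstrauss-style transform; other transforms are possible).
    """
    rng = np.random.default_rng(seed)
    n_rows = A.shape[0]
    Phi = rng.standard_normal((sketch_rows, n_rows)) / np.sqrt(sketch_rows)
    x, *_ = np.linalg.lstsq(Phi @ A, Phi @ b, rcond=None)
    return x

# Hypothetical tall problem: 2000 equations, 5 unknowns, noisy right-hand side
rng = np.random.default_rng(1)
A = rng.standard_normal((2000, 5))
x_true = np.arange(1.0, 6.0)
b = A @ x_true + 0.1 * rng.standard_normal(2000)

x_full, *_ = np.linalg.lstsq(A, b, rcond=None)
x_sketch = compressed_least_squares(A, b, sketch_rows=50)
# The compressed solve works on a 50 x 5 system but is usually less accurate
print(np.linalg.norm(x_full - x_true), np.linalg.norm(x_sketch - x_true))
```

The compressed system is much smaller, but the projection injects additional error whenever `b` is noisy; that error-time trade-off is what the robust and partially-compressed models in the abstract aim to improve.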

preprint · 2015 · arXiv

Robust Policy Optimization with Baseline Guarantees

Our goal is to compute a policy that guarantees improved return over a baseline policy even when the available MDP model is inaccurate. The inaccurate model may be constructed, for example, by system identification techniques when the true model is inaccessible. When the modeling error is large, the standard solution to the constructed model has no performance guarantees with respect to the true model. In this paper, we develop algorithms that provide such performance guarantees and show a trade-off between their complexity and conservatism. Our novel model-based safe policy search algorithms leverage recent advances in robust optimization techniques. Furthermore, we illustrate the effectiveness of these algorithms using a numerical example.

preprint · 2014 · arXiv

A Bilinear Programming Approach for Multiagent Planning

Multiagent planning and coordination problems are common and known to be computationally hard. We show that a wide range of two-agent problems can be formulated as bilinear programs. We present a successive approximation algorithm that significantly outperforms the coverage set algorithm, which is the state-of-the-art method for this class of multiagent problems. Because the algorithm is formulated for bilinear programs, it is more general and simpler to implement. The new algorithm can be terminated at any time and, unlike the coverage set algorithm, it facilitates the derivation of a useful online performance bound. It is also much more efficient, on average reducing the computation time of the optimal solution by about four orders of magnitude. Finally, we introduce an automatic dimensionality reduction method that improves the effectiveness of the algorithm, extending its applicability to new domains and providing a new way to analyze a subclass of bilinear programs.

preprint · 2013 · arXiv

Solution Methods for Constrained Markov Decision Process with Continuous Probability Modulation

We propose solution methods for previously-unsolved constrained MDPs in which actions can continuously modify the transition probabilities within some acceptable sets. While many methods have been proposed to solve regular MDPs with large state sets, there are few practical approaches for solving constrained MDPs with large action sets. In particular, we show that the continuous action sets can be replaced by their extreme points when the rewards are linear in the modulation. We also develop a tractable optimization formulation for concave reward functions and, surprisingly, also extend it to non-concave reward functions by using their concave envelopes. We evaluate the effectiveness of the approach on the problem of managing delinquencies in a portfolio of loans.

preprint · 2013 · arXiv

Tight Approximations of Dynamic Risk Measures

This paper compares two different frameworks recently introduced in the literature for measuring risk in a multi-period setting. The first corresponds to applying a single coherent risk measure to the cumulative future costs, while the second involves applying a composition of one-step coherent risk mappings. We summarize the relative strengths of the two methods, characterize several necessary and sufficient conditions under which one of the measurements always dominates the other, and introduce a metric to quantify how close the two risk measures are. Using this notion, we address the question of how tightly a given coherent measure can be approximated by lower or upper-bounding compositional measures. We exhibit an interesting asymmetry between the two cases: the tightest possible upper-bound can be exactly characterized, and corresponds to a popular construction in the literature, while the tightest-possible lower bound is not readily available. We show that testing domination and computing the approximation factors is generally NP-hard, even when the risk measures in question are comonotonic and law-invariant. However, we characterize conditions and discuss several examples where polynomial-time algorithms are possible. One such case is the well-known Conditional Value-at-Risk measure, which is further explored in our companion paper [Huang, Iancu, Petrik and Subramanian, "Static and Dynamic Conditional Value at Risk" (2012)]. Our theoretical and algorithmic constructions exploit interesting connections between the study of risk measures and the theory of submodularity and combinatorial optimization, which may be of independent interest.
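One common way to write the two frameworks compared in this abstract is shown below. The notation is an assumption of this note rather than the paper's own: $c_t$ is the cost incurred in period $t$, $\mu$ is a coherent risk measure applied once to the total cost, and each $\mu_t$ is a one-step conditional coherent risk mapping.

```latex
% Framework 1: a single coherent risk measure on the cumulative cost
\[
  \mu\bigl(c_1 + c_2 + \dots + c_T\bigr)
\]
% Framework 2: a composition of one-step coherent risk mappings
\[
  \mu_1\Bigl(c_1 + \mu_2\bigl(c_2 + \dots + \mu_T(c_T)\bigr)\Bigr)
\]
```

In this notation, the questions studied in the abstract are when one of the two expressions dominates the other for every cost sequence, and how closely the nested form can bound the single-measure form from above or below.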

preprint · 2012 · arXiv

An Approximate Solution Method for Large Risk-Averse Markov Decision Processes

Stochastic domains often involve risk-averse decision makers. While recent work has focused on how to model risk in Markov decision processes using risk measures, it has not addressed the problem of solving large risk-averse formulations. In this paper, we propose and analyze a new method for solving large risk-averse MDPs with hybrid continuous-discrete state spaces and continuous action spaces. The proposed method iteratively improves a bound on the value function using a linearity structure of the MDP. We demonstrate the utility and properties of the method on a portfolio optimization problem.

preprint · 2012 · arXiv

Approximate Dynamic Programming By Minimizing Distributionally Robust Bounds

Approximate dynamic programming is a popular method for solving large Markov decision processes. This paper describes a new class of approximate dynamic programming (ADP) methods, distributionally robust ADP (DRADP), that address the curse of dimensionality by minimizing a pessimistic bound on the policy loss. This approach turns ADP into an optimization problem, for which we derive new mathematical program formulations and analyze their properties. DRADP improves on the theoretical guarantees of existing ADP methods: it guarantees convergence and $L_1$-norm-based error bounds. The empirical evaluation of DRADP shows that the theoretical guarantees translate well into good performance on benchmark problems.

preprint · 2010 · arXiv

Feature Selection Using Regularization in Approximate Linear Programs for Markov Decision Processes

Approximate dynamic programming has been used successfully in a large variety of domains, but it relies on a small set of provided approximation features to calculate solutions reliably. Large and rich sets of features can cause existing algorithms to overfit because of a limited number of samples. We address this shortcoming using $L_1$ regularization in approximate linear programming. Because the proposed method can automatically select the appropriate richness of features, its performance does not degrade with an increasing number of features. These results rely on new and stronger sampling bounds for regularized approximate linear programs. We also propose a computationally efficient homotopy method. The empirical evaluation of the approach shows that the proposed method performs well on simple MDPs and standard benchmark problems.

preprint · 2010 · arXiv

Global Optimization for Value Function Approximation

Existing value function approximation methods have been successfully used in many applications, but they often lack useful a priori error bounds. We propose a new approximate bilinear programming formulation of value function approximation, which employs global optimization. The formulation provides strong a priori guarantees on both robust and expected policy loss by minimizing specific norms of the Bellman residual. Solving a bilinear program optimally is NP-hard, but this is unavoidable because the Bellman-residual minimization itself is NP-hard. We describe and analyze both optimal and approximate algorithms for solving bilinear programs. The analysis shows that this algorithm offers a convergent generalization of approximate policy iteration. We also briefly analyze the behavior of bilinear programming algorithms under incomplete samples. Finally, we demonstrate that the proposed approach can consistently minimize the Bellman residual on simple benchmark problems.
