Researcher profile

Mateo Perez

Mateo Perez contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 15 - UnverifiedVerification L1Unclaimed author
3works
0followers
6topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

3 published item(s)

preprint2022arXiv

Alternating Good-for-MDP Automata

When omega-regular objectives were first proposed in model-free reinforcement learning (RL) for controlling MDPs, deterministic Rabin automata were used in an attempt to provide a direct translation from their transitions to scalar values. While these translations failed, it has turned out that it is possible to repair them by using good-for-MDPs (GFM) Büchi automata instead. These are nondeterministic Büchi automata with a restricted type of nondeterminism, albeit not as restricted as in good-for-games automata. Indeed, deterministic Rabin automata have a pretty straightforward translation to such GFM automata, which is bi-linear in the number of states and pairs. Interestingly, the same cannot be said for deterministic Streett automata: a translation to nondeterministic Rabin or Büchi automata comes at an exponential cost, even without requiring the target automaton to be good-for-MDPs. Do we have to pay more than that to obtain a good-for-MDP automaton? The surprising answer is that we have to pay significantly less when we instead expand the good-for-MDP property to alternating automata: like the nondeterministic GFM automata obtained from deterministic Rabin automata, the alternating good-for-MDP automata we produce from deterministic Streett automata are bi-linear in the the size of the deterministic automaton and its index, and can therefore be exponentially more succinct than minimal nondeterministic Büchi automata.

preprint2022arXiv

Compositional Reinforcement Learning for Discrete-Time Stochastic Control Systems

We propose a compositional approach to synthesize policies for networks of continuous-space stochastic control systems with unknown dynamics using model-free reinforcement learning (RL). The approach is based on implicitly abstracting each subsystem in the network with a finite Markov decision process with unknown transition probabilities, synthesizing a strategy for each abstract model in an assume-guarantee fashion using RL, and then mapping the results back over the original network with approximate optimality guarantees. We provide lower bounds on the satisfaction probability of the overall network based on those over individual subsystems. A key contribution is to leverage the convergence results for adversarial RL (minimax Q-learning) on finite stochastic arenas to provide control strategies maximizing the probability of satisfaction over the network of continuous-space systems. We consider finite-horizon properties expressed in the syntactically co-safe fragment of linear temporal logic. These properties can readily be converted into automata-based reward functions, providing scalar reward signals suitable for RL. Since such reward functions are often sparse, we supply a potential-based reward shaping technique to accelerate learning by producing dense rewards. The effectiveness of the proposed approaches is demonstrated via two physical benchmarks including regulation of a room temperature network and control of a road traffic network.

preprint2022arXiv

Recursive Reinforcement Learning

Recursion is the fundamental paradigm to finitely describe potentially infinite objects. As state-of-the-art reinforcement learning (RL) algorithms cannot directly reason about recursion, they must rely on the practitioner's ingenuity in designing a suitable "flat" representation of the environment. The resulting manual feature constructions and approximations are cumbersome and error-prone; their lack of transparency hampers scalability. To overcome these challenges, we develop RL algorithms capable of computing optimal policies in environments described as a collection of Markov decision processes (MDPs) that can recursively invoke one another. Each constituent MDP is characterized by several entry and exit points that correspond to input and output values of these invocations. These recursive MDPs (or RMDPs) are expressively equivalent to probabilistic pushdown systems (with call-stack playing the role of the pushdown stack), and can model probabilistic programs with recursive procedural calls. We introduce Recursive Q-learning -- a model-free RL algorithm for RMDPs -- and prove that it converges for finite, single-exit and deterministic multi-exit RMDPs under mild assumptions.