Researcher profile

Ryohei Oura

Ryohei Oura contributes to research discovery and scholarly infrastructure.

Researcher
Affiliation not imported
Open to collaborate

Trust snapshot

Quick read

Trust 15 - Unverified
Verification L1
Unclaimed author
3 works
0 followers
6 topics
2 close collaborators

Published work

3 published items

Preprint, 2022, arXiv

Bounded Synthesis and Reinforcement Learning of Supervisors for Stochastic Discrete Event Systems with LTL Specifications

In this paper, we consider supervisory control of stochastic discrete event systems (SDESs) under linear temporal logic (LTL) specifications. Applying bounded synthesis, we reduce supervisor synthesis to the problem of satisfying a safety condition. First, we consider the synthesis of a directed controller using the safety condition. We assign a negative reward to the unsafe states and introduce an expected return with a state-dependent discount factor. We compute a winning region and a directed controller with the maximum satisfaction probability using a dynamic programming method, where the expected return is used as a value function. Next, we construct a permissive supervisor via the optimal value function. We show that the supervisor achieves the maximum satisfaction probability and maximizes the reachable set within the winning region. Finally, for an unknown SDES, we propose a two-stage model-free reinforcement learning method for efficient learning of the winning region and the directed controllers with the maximum satisfaction probability. We also demonstrate the effectiveness of the proposed method by simulation.
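The dynamic programming step described in this abstract can be illustrated with a small sketch. It is not the paper's algorithm: the toy transition table `P`, the reward of -1 on unsafe states, and the discount of 0 on unsafe states and 1 elsewhere are all assumptions chosen so that the value of a state equals minus the smallest achievable probability of ever becoming unsafe.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy model: P[state][action] = [(next_state, probability), ...]
Transitions = Dict[int, Dict[str, List[Tuple[int, float]]]]

P: Transitions = {
    0: {"a": [(0, 1.0)], "b": [(3, 0.5), (0, 0.5)]},
    1: {"a": [(0, 0.5), (3, 0.5)], "b": [(3, 1.0)]},
    2: {"a": [(2, 0.9), (3, 0.1)]},
    3: {"a": [(3, 1.0)]},
}
UNSAFE: Set[int] = {3}


def expected_value(outcomes: List[Tuple[int, float]], values: Dict[int, float]) -> float:
    """Expected value of the successor state under one action."""
    return sum(prob * values[nxt] for nxt, prob in outcomes)


def safety_value_iteration(P: Transitions, unsafe: Set[int], sweeps: int = 1000) -> Dict[int, float]:
    """Value iteration with a negative reward and a state-dependent discount.

    Assumed scheme: reward -1 and discount 0 in unsafe states,
    reward 0 and discount 1 in safe states.
    """
    values = {s: 0.0 for s in P}
    for _ in range(sweeps):
        updated = {}
        for s, actions in P.items():
            reward = -1.0 if s in unsafe else 0.0
            discount = 0.0 if s in unsafe else 1.0
            best = max(expected_value(out, values) for out in actions.values())
            updated[s] = reward + discount * best
        values = updated
    return values


values = safety_value_iteration(P, UNSAFE)

# States from which the unsafe set can be avoided with probability one.
winning_region = [s for s in P if values[s] > -1e-9]

# Greedy (directed) controller: one best action per safe state.
controller = {
    s: max(acts, key=lambda a: expected_value(acts[a], values))
    for s, acts in P.items()
    if s not in UNSAFE
}
```

In this toy model only state 0 lies in the winning region, while state 1 can at best stay safe with probability 0.5 by choosing action `a`.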

Preprint, 2022, arXiv

Learning-based Bounded Synthesis for Semi-MDPs with LTL Specifications

This letter proposes a learning-based bounded synthesis for a semi-Markov decision process (SMDP) with a linear temporal logic (LTL) specification. In the product of the SMDP and the deterministic $K$-co-Büchi automaton (d$K$cBA) converted from the LTL specification, we learn both the winning region for satisfying the LTL specification and the dynamics therein based on reinforcement learning and Bayesian inference. Then, we synthesize an optimal policy satisfying the following two conditions. (1) It maximizes the probability of reaching the winning region. (2) It minimizes a long-term risk for the dwell time within the winning region. The minimization of the long-term risk is based on the estimated dynamics and value iteration. We show that, if the discount factor is sufficiently close to one, the synthesized policy converges to the optimal policy as the amount of data obtained by exploration goes to infinity.
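As background for the d$K$cBA mentioned in this abstract: a $K$-co-Büchi acceptance condition allows a run to visit rejecting states at most $K$ times, so a violation is detectable on a finite prefix, which is what makes it a safety condition. The sketch below only illustrates that counting idea; the function name, the integer state encoding, and the example run are hypothetical and not taken from the letter.

```python
from typing import Iterable, Set


def violates_k_co_buchi(run: Iterable[int], rejecting: Set[int], k: int) -> bool:
    """Return True once the run prefix has visited rejecting states more than k times."""
    visits = 0
    for state in run:
        if state in rejecting:
            visits += 1
            if visits > k:
                return True
    return False


# Hypothetical run prefix over automaton states; state 1 is rejecting.
prefix = [0, 1, 0, 1, 1]
violated_with_k2 = violates_k_co_buchi(prefix, {1}, k=2)
violated_with_k3 = violates_k_co_buchi(prefix, {1}, k=3)
```

The prefix visits the rejecting state three times, so it violates the condition for $K = 2$ but not for $K = 3$.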

Preprint, 2020, arXiv

Reinforcement Learning of Control Policy for Linear Temporal Logic Specifications Using Limit-Deterministic Generalized Büchi Automata

This letter proposes a novel reinforcement learning method for the synthesis of a control policy satisfying a control specification described by a linear temporal logic formula. We assume that the controlled system is modeled by a Markov decision process (MDP). We convert the specification to a limit-deterministic generalized Büchi automaton (LDGBA) with several accepting sets that accepts all infinite sequences satisfying the formula. The LDGBA is augmented so that it explicitly records the previous visits to accepting sets. We take a product of the augmented LDGBA and the MDP, based on which we define a reward function. The agent gets rewards whenever state transitions are in an accepting set that has not been visited for a certain number of steps. Consequently, sparsity of rewards is relaxed and optimal circulations among the accepting sets are learned. We show that the proposed method can learn an optimal policy when the discount factor is sufficiently close to one.
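The reward idea in this abstract can be sketched as follows. This is an illustration under stated assumptions, not the construction from the letter: the memory is assumed to be one boolean flag per accepting set, it is assumed to reset once every set has been visited, and the reward value of 1.0 and the name `step_memory` are hypothetical.

```python
from typing import Sequence, Tuple


def step_memory(memory: Tuple[bool, ...], hit: Sequence[bool]) -> Tuple[Tuple[bool, ...], float]:
    """Update the visit record for one transition and compute its reward.

    memory[j] is True if accepting set j was visited since the last reset.
    hit[j] is True if the current transition belongs to accepting set j.
    """
    # Reward only transitions that reach an accepting set not yet recorded.
    newly_visited = any(h and not m for h, m in zip(hit, memory))
    updated = tuple(m or h for m, h in zip(memory, hit))
    # Assumed reset rule: clear the record once all accepting sets were visited.
    if all(updated):
        updated = tuple(False for _ in updated)
    return updated, (1.0 if newly_visited else 0.0)


memory = (False, False)
memory, r1 = step_memory(memory, (True, False))   # first visit to set 0
memory, r2 = step_memory(memory, (True, False))   # repeated visit to set 0
memory, r3 = step_memory(memory, (False, True))   # first visit to set 1, then reset
```

Because repeated visits to an already recorded set earn nothing, the agent is pushed to cycle through all accepting sets rather than loop inside one of them.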