Online Learning of Non-Markovian Reward Models
- URL: http://arxiv.org/abs/2009.12600v2
- Date: Wed, 30 Sep 2020 08:56:39 GMT
- Title: Online Learning of Non-Markovian Reward Models
- Authors: Gavin Rens, Jean-François Raskin, Raphaël Reynouad, Giuseppe Marra
- Abstract summary: We consider a Markov decision process (MDP) that models the dynamics of the environment in which the agent evolves, together with a non-Markovian reward function.
While the MDP is known by the agent, the reward function is unknown to the agent and must be learned.
We use Angluin's $L^*$ active learning algorithm to learn a Mealy machine representing the underlying non-Markovian reward machine.
- Score: 2.064612766965483
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: There are situations in which an agent should receive rewards only after
having accomplished a series of previous tasks, that is, rewards are
non-Markovian. One natural and quite general way to represent history-dependent
rewards is via a Mealy machine, a finite state automaton that produces output
sequences from input sequences. In our formal setting, we consider a Markov
decision process (MDP) that models the dynamics of the environment in which the
agent evolves and a Mealy machine synchronized with this MDP to formalize the
non-Markovian reward function. While the MDP is known by the agent, the reward
function is unknown to the agent and must be learned.
Our approach to overcome this challenge is to use Angluin's $L^*$ active
learning algorithm to learn a Mealy machine representing the underlying
non-Markovian reward machine (MRM). Formal methods are used to determine the
optimal strategy for answering so-called membership queries posed by $L^*$.
Moreover, we prove that the expected reward achieved will eventually be at
least as much as a given, reasonable value provided by a domain expert. We
evaluate our framework on three problems. The results show that using $L^*$ to
learn an MRM in a non-Markovian reward decision process is effective.
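To make the setup concrete, here is a minimal sketch (hypothetical class and names, not the authors' code) of a Mealy machine used as a reward model: a transition function and an output function over the labels emitted by the MDP, so that reward is paid only after a particular sequence of labels has been observed.

```python
# Minimal sketch of a Mealy-machine reward model (hypothetical names, not the
# paper's implementation). A Mealy machine maps input sequences to output
# sequences via delta: Q x Sigma -> Q (transition) and lam: Q x Sigma -> R (output).

class MealyRewardMachine:
    def __init__(self, initial_state, delta, lam):
        self.initial_state = initial_state
        self.delta = delta    # dict: (machine_state, label) -> next machine_state
        self.lam = lam        # dict: (machine_state, label) -> reward
        self.state = initial_state

    def reset(self):
        self.state = self.initial_state

    def step(self, label):
        """Consume one MDP label, emit a reward, and advance the machine."""
        reward = self.lam[(self.state, label)]
        self.state = self.delta[(self.state, label)]
        return reward

# Example: reward 10 only after observing label "a" and then "b" (history-dependent).
delta = {(0, "a"): 1, (0, "b"): 0, (1, "a"): 1, (1, "b"): 2,
         (2, "a"): 2, (2, "b"): 2}
lam = {(0, "a"): 0.0, (0, "b"): 0.0, (1, "a"): 0.0, (1, "b"): 10.0,
       (2, "a"): 0.0, (2, "b"): 0.0}
mrm = MealyRewardMachine(0, delta, lam)
print([mrm.step(x) for x in ["b", "a", "b"]])  # [0.0, 0.0, 10.0]
```

In this setting, a membership query posed by $L^*$ asks for the reward sequence produced by a given label sequence; answering it requires steering the known MDP so that this label sequence is actually generated, which is where the paper's formal-methods machinery comes in.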
Related papers
- Efficient Reinforcement Learning in Probabilistic Reward Machines [15.645062155050732]
We design an algorithm for Probabilistic Reward Machines (PRMs) that achieves a regret bound of $\widetilde{O}(\sqrt{HOAT})$.
We also present a new simulation lemma for non-Markovian rewards, which enables reward-free exploration for any non-Markovian reward.
arXiv Detail & Related papers (2024-08-19T19:51:53Z) - Walking the Values in Bayesian Inverse Reinforcement Learning [66.68997022043075]
A key challenge in Bayesian IRL is bridging the computational gap between the hypothesis space of possible rewards and the likelihood.
We propose ValueWalk, a new Markov chain Monte Carlo method designed to bridge this gap.
arXiv Detail & Related papers (2024-07-15T17:59:52Z) - Detecting Hidden Triggers: Mapping Non-Markov Reward Functions to Markov [2.486161976966064]
This paper proposes a framework for mapping non-Markov reward functions into equivalent Markov ones by learning Reward Machines.
Unlike the general practice of learning Reward Machines, we do not require a set of high-level propositional symbols from which to learn.
We empirically validate our approach by learning black-box, non-Markov reward functions in the Officeworld domain.
arXiv Detail & Related papers (2024-01-20T21:09:27Z) - STARC: A General Framework For Quantifying Differences Between Reward
Functions [55.33869271912095]
We provide a class of pseudometrics on the space of all reward functions that we call STARC metrics.
We show that STARC metrics induce both an upper and a lower bound on worst-case regret.
We also identify a number of issues with reward metrics proposed by earlier works.
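The generic shape of such a pseudometric can be written schematically; the display below is a hedged paraphrase with placeholder operators (an assumed canonicalisation $c$ and normalisation $n$), not the exact STARC construction.

```latex
% Schematic form of a pseudometric on the space of reward functions.
% c and n are placeholders (an assumed canonicalisation and normalisation);
% this is a hedged paraphrase, not the exact STARC definition.
\[
  d(R_1, R_2) \;=\; \bigl\| \, n\!\left(c(R_1)\right) - n\!\left(c(R_2)\right) \, \bigr\|
\]
```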
arXiv Detail & Related papers (2023-09-26T20:31:19Z) - Provably Efficient Offline Reinforcement Learning with Trajectory-Wise
Reward [66.81579829897392]
We propose a novel offline reinforcement learning algorithm called Pessimistic vAlue iteRaTion with rEward Decomposition (PARTED).
PARTED decomposes the trajectory return into per-step proxy rewards via least-squares-based reward redistribution, and then performs pessimistic value iteration based on the learned proxy rewards.
To the best of our knowledge, PARTED is the first offline RL algorithm that is provably efficient in general MDPs with trajectory-wise reward.
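The redistribution step can be illustrated with a toy least-squares fit (assumed linear features and names, noise-free, and omitting PARTED's pessimism machinery): per-step proxy rewards are fit so that their sum matches the observed trajectory-wise return.

```python
import numpy as np

# Toy least-squares reward redistribution (illustrative, not PARTED itself).
# Each trajectory i has per-step features phi[i, h] in R^d and a single
# trajectory-level return. We fit theta so that sum_h phi[i, h]^T theta
# matches the return; the learned per-step proxy reward is phi^T theta.

rng = np.random.default_rng(0)
n_traj, horizon, d = 200, 10, 5

phi = rng.normal(size=(n_traj, horizon, d))           # per-step features
theta_true = rng.normal(size=d)                       # unknown reward weights (toy ground truth)
returns = phi.sum(axis=1) @ theta_true                # trajectory-wise returns (noise-free toy)

# Least squares on summed features: minimise ||phi_sum @ theta - returns||^2
phi_sum = phi.sum(axis=1)                             # shape (n_traj, d)
theta_hat, *_ = np.linalg.lstsq(phi_sum, returns, rcond=None)

proxy_rewards = phi @ theta_hat                       # per-step proxy rewards, shape (n_traj, horizon)
print(np.allclose(theta_hat, theta_true, atol=1e-6))  # exact recovery in this noise-free toy
```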
arXiv Detail & Related papers (2022-06-13T19:11:22Z) - Learning Dynamic Mechanisms in Unknown Environments: A Reinforcement
Learning Approach [130.9259586568977]
We propose novel learning algorithms to recover the dynamic Vickrey-Clarke-Grove (VCG) mechanism over multiple rounds of interaction.
A key contribution of our approach is incorporating reward-free online Reinforcement Learning (RL) to aid exploration over a rich policy space.
arXiv Detail & Related papers (2022-02-25T16:17:23Z) - Reinforcement Learning in Reward-Mixing MDPs [74.41782017817808]
We study episodic reinforcement learning in a reward-mixing Markov decision process (MDP).
We learn an $\epsilon$-optimal policy after exploring $\widetilde{O}(\mathrm{poly}(H,\epsilon^{-1}) \cdot S^2 A^2)$ episodes, where $H$ is the time-horizon and $S, A$ are the number of states and actions respectively.
arXiv Detail & Related papers (2021-10-07T18:55:49Z) - Learning Probabilistic Reward Machines from Non-Markovian Stochastic
Reward Processes [8.800797834097764]
We introduce probabilistic reward machines (PRMs) as a representation of non-Markovian rewards.
We present an algorithm to learn PRMs from the underlying decision process as well as to learn the PRM representation of a given decision-making policy.
arXiv Detail & Related papers (2021-07-09T19:00:39Z) - Model-Augmented Q-learning [112.86795579978802]
We propose a model-free RL (MFRL) framework that is augmented with the components of model-based RL.
Specifically, we propose to estimate not only the $Q$-values but also both the transition and the reward with a shared network.
We show that the proposed scheme, called Model-augmented $Q$-learning (MQL), obtains a policy-invariant solution which is identical to the solution obtained by learning with true reward.
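The shared-network idea admits a small sketch (assumed layer sizes, head names, and PyTorch as the framework; not the authors' architecture): one trunk feeds separate heads for the $Q$-values, the predicted next state, and the predicted reward.

```python
import torch
import torch.nn as nn

# Minimal sketch of a Q-network with a shared trunk and three heads
# (illustrative; sizes and names are assumptions, not the MQL paper's code).
class SharedQTransitionRewardNet(nn.Module):
    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, hidden), nn.ReLU())
        self.q_head = nn.Linear(hidden, n_actions)                       # Q(s, .)
        self.transition_head = nn.Linear(hidden, n_actions * state_dim)  # predicted next state per action
        self.reward_head = nn.Linear(hidden, n_actions)                  # predicted reward per action

    def forward(self, state):
        h = self.trunk(state)
        q = self.q_head(h)
        next_state = self.transition_head(h).view(-1, q.shape[-1], state.shape[-1])
        reward = self.reward_head(h)
        return q, next_state, reward

net = SharedQTransitionRewardNet(state_dim=4, n_actions=2)
q, next_state, reward = net(torch.randn(8, 4))
print(q.shape, next_state.shape, reward.shape)  # (8, 2), (8, 2, 4), (8, 2)
```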
arXiv Detail & Related papers (2021-02-07T17:56:50Z) - Learning and Solving Regular Decision Processes [15.533842336139067]
Regular Decision Processes (RDPs) are a recently introduced model that extends MDPs with non-Markovian dynamics and rewards.
We build on automata learning techniques with history clustering to learn a Mealy machine representation of the RDP, and solve it by adapting MCTS.
arXiv Detail & Related papers (2020-03-02T16:36:16Z) - Learning Non-Markovian Reward Models in MDPs [0.0]
We show how to formalise the non-Markovian reward function using a Mealy machine.
In our formal setting, we consider a Markov decision process (MDP) that models the dynamics of the environment in which the agent evolves.
While the MDP is known by the agent, the reward function is unknown to the agent and must be learned.
arXiv Detail & Related papers (2020-01-25T10:51:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.