Online Learning of Non-Markovian Reward Models
- URL: http://arxiv.org/abs/2009.12600v2
- Date: Wed, 30 Sep 2020 08:56:39 GMT
- Title: Online Learning of Non-Markovian Reward Models
- Authors: Gavin Rens, Jean-François Raskin, Raphaël Reynouad, Giuseppe Marra
- Abstract summary: We consider a Markov decision process (MDP) that models the dynamics of the environment in which the agent evolves, together with a non-Markovian reward function.
While the MDP is known by the agent, the reward function is unknown to the agent and must be learned.
We use Angluin's $L^*$ active learning algorithm to learn a Mealy machine representing the underlying non-Markovian reward machine.
- Score: 2.064612766965483
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: There are situations in which an agent should receive rewards only after
having accomplished a series of previous tasks, that is, rewards are
non-Markovian. One natural and quite general way to represent history-dependent
rewards is via a Mealy machine, a finite state automaton that produces output
sequences from input sequences. In our formal setting, we consider a Markov
decision process (MDP) that models the dynamics of the environment in which the
agent evolves and a Mealy machine synchronized with this MDP to formalize the
non-Markovian reward function. While the MDP is known by the agent, the reward
function is unknown to the agent and must be learned.
Our approach to overcome this challenge is to use Angluin's $L^*$ active
learning algorithm to learn a Mealy machine representing the underlying
non-Markovian reward machine (MRM). Formal methods are used to determine the
optimal strategy for answering so-called membership queries posed by $L^*$.
Moreover, we prove that the expected reward achieved will eventually be at
least as much as a given, reasonable value provided by a domain expert. We
evaluate our framework on three problems. The results show that using $L^*$ to
learn an MRM in a non-Markovian reward decision process is effective.
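To make the formal setting concrete, here is a minimal sketch, assuming a toy "fetch coffee, then deliver it to the office" task: a known MDP supplies the dynamics, while a hidden Mealy machine (the MRM) reads the label of each visited state and emits the reward, so the reward depends on the label history rather than on the current state alone. All state names, labels, and reward values below are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch (not the paper's implementation) of a Mealy-machine reward
# model synchronized with a known MDP. Task, labels and rewards are hypothetical.
import random

# Mealy machine: (machine_state, observed label) -> (next machine_state, reward)
MRM = {
    ("start", "coffee"):      ("has_coffee", 0.0),
    ("start", "office"):      ("start", 0.0),      # reaching the office too early pays nothing
    ("start", "none"):        ("start", 0.0),
    ("has_coffee", "office"): ("start", 1.0),      # reward only after the full sequence
    ("has_coffee", "coffee"): ("has_coffee", 0.0),
    ("has_coffee", "none"):   ("has_coffee", 0.0),
}

def mrm_step(q, label):
    """Advance the reward machine on one observed label; return (next state, reward)."""
    return MRM[(q, label)]

def rollout(policy, mdp_step, labeling, init_mdp_state, horizon=20):
    """Run a policy in the (known) MDP while the hidden MRM produces the rewards."""
    s, q, total = init_mdp_state, "start", 0.0
    for _ in range(horizon):
        a = policy(s)
        s = mdp_step(s, a)                # known MDP dynamics
        q, r = mrm_step(q, labeling(s))   # reward depends on the label history via q
        total += r
    return total

# Toy 1-D corridor MDP: coffee machine at cell 0, office at cell 4.
def mdp_step(s, a):
    return max(0, min(4, s + a))

def labeling(s):
    return {0: "coffee", 4: "office"}.get(s, "none")

if __name__ == "__main__":
    policy = lambda s: random.choice([-1, 1])
    print("return of a random policy:", rollout(policy, mdp_step, labeling, init_mdp_state=2))
```

In the paper's framework it is exactly this hidden machine that $L^*$ reconstructs: each membership query asks for the reward output of a label sequence, and the agent answers it by steering the known MDP so as to realize that sequence.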
Related papers
- Process Reinforcement through Implicit Rewards [95.7442934212076]
Dense process rewards have proven a more effective alternative to sparse outcome-level rewards in the inference-time scaling of large language models (LLMs).
Dense rewards also offer an appealing choice for the reinforcement learning (RL) of LLMs, since their fine-grained signal has the potential to address some inherent issues of outcome rewards.
In practice, however, they remain hard to apply, primarily because of the challenges of training process reward models (PRMs) online, where collecting high-quality process labels is prohibitively expensive.
We propose PRIME, which enables online PRM updates using only policy rollouts and outcome labels, through implicit process rewards.
arXiv Detail & Related papers (2025-02-03T15:43:48Z) - Free Process Rewards without Process Labels [55.14044050782222]
We show that an implicit PRM can be obtained at no additional cost, by simply training an outcome reward model (ORM) on the cheaper response-level labels.
We show that our implicit PRM, when instantiated with the cross-entropy (CE) loss, is more data-efficient and can keep improving generation models even when trained with only one response per instruction.
arXiv Detail & Related papers (2024-12-02T21:20:02Z) - Efficient Reinforcement Learning in Probabilistic Reward Machines [15.645062155050732]
We design an algorithm for Probabilistic Reward Machines (PRMs) that achieves a regret bound of $\widetilde{O}(\sqrt{HOAT})$.
We also present a new simulation lemma for non-Markovian rewards, which enables reward-free exploration for any non-Markovian reward.
arXiv Detail & Related papers (2024-08-19T19:51:53Z) - Detecting Hidden Triggers: Mapping Non-Markov Reward Functions to Markov [2.486161976966064]
This paper proposes a framework for mapping non-Markov reward functions into equivalent Markov ones by learning Reward Machines.
Unlike the general practice of learning Reward Machines, we do not require a set of high-level propositional symbols from which to learn.
We empirically validate our approach by learning black-box, non-Markov reward functions in the Officeworld domain.
arXiv Detail & Related papers (2024-01-20T21:09:27Z) - STARC: A General Framework For Quantifying Differences Between Reward Functions [52.69620361363209]
We provide a class of pseudometrics on the space of all reward functions that we call STARC metrics.
We show that STARC metrics induce both an upper and a lower bound on worst-case regret.
We also identify a number of issues with reward metrics proposed by earlier works.
arXiv Detail & Related papers (2023-09-26T20:31:19Z) - Provably Efficient Offline Reinforcement Learning with Trajectory-Wise
Reward [66.81579829897392]
We propose a novel offline reinforcement learning algorithm called Pessimistic vAlue iteRaTion with rEward Decomposition (PARTED).
PARTED decomposes the trajectory return into per-step proxy rewards via least-squares-based reward redistribution, and then performs pessimistic value iteration based on the learned proxy rewards (a minimal sketch of this least-squares decomposition appears after this list).
To the best of our knowledge, PARTED is the first offline RL algorithm that is provably efficient in general MDP with trajectory-wise reward.
arXiv Detail & Related papers (2022-06-13T19:11:22Z) - Reinforcement Learning in Reward-Mixing MDPs [74.41782017817808]
We study episodic reinforcement learning in a reward-mixing Markov decision process (MDP).
Our algorithm learns an $\epsilon$-optimal policy after exploring $\tilde{O}(\mathrm{poly}(H,\epsilon^{-1}) \cdot S^2 A^2)$ episodes, where $H$ is the time horizon and $S, A$ are the numbers of states and actions, respectively.
arXiv Detail & Related papers (2021-10-07T18:55:49Z) - Learning Probabilistic Reward Machines from Non-Markovian Stochastic
Reward Processes [8.800797834097764]
We introduce probabilistic reward machines (PRMs) as a representation of non-Markovian rewards.
We present an algorithm to learn PRMs from the underlying decision process as well as to learn the PRM representation of a given decision-making policy.
arXiv Detail & Related papers (2021-07-09T19:00:39Z) - Model-Augmented Q-learning [112.86795579978802]
We propose a model-free RL (MFRL) framework that is augmented with the components of model-based RL.
Specifically, we propose to estimate not only the $Q$-values but also both the transition and the reward with a shared network.
We show that the proposed scheme, called Model-augmented $Q$-learning (MQL), obtains a policy-invariant solution identical to the solution obtained by learning with the true reward.
arXiv Detail & Related papers (2021-02-07T17:56:50Z) - Learning and Solving Regular Decision Processes [15.533842336139067]
Regular Decision Processes (RDPs) are a recently introduced model that extends MDPs with non-Markovian dynamics and rewards.
We build on automata learning techniques with history clustering to learn a Mealy-machine representation of the RDP, and solve it by adapting MCTS.
arXiv Detail & Related papers (2020-03-02T16:36:16Z) - Learning Non-Markovian Reward Models in MDPs [0.0]
We show how to formalise the non-Markovian reward function using a Mealy machine.
In our formal setting, we consider a Markov decision process (MDP) that models the dynamics of the environment in which the agent evolves.
While the MDP is known by the agent, the reward function is unknown to the agent and must be learned.
arXiv Detail & Related papers (2020-01-25T10:51:42Z)
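As referenced in the PARTED entry above, the following is a rough sketch of the least-squares return-decomposition idea only (not PARTED itself, which additionally applies pessimism and value iteration): fit a linear per-step reward model so that its per-step predictions sum to each observed trajectory-wise return. The feature dimension, data, and ridge regularization are illustrative assumptions.

```python
# Rough sketch of least-squares reward redistribution (illustrative, not PARTED):
# find per-step proxy rewards r(s,a) ~ phi(s,a)^T w whose sums along each
# trajectory match the observed trajectory-wise returns.
import numpy as np

def redistribute_returns(trajectories, returns, ridge=1e-3):
    """
    trajectories: list of arrays, each of shape (T_i, d) holding per-step features phi(s_t, a_t)
    returns:      array of shape (N,) with the trajectory-wise returns
    Returns a weight vector w such that sum_t phi_t^T w approximates each return.
    """
    # Each trajectory contributes one regression row: the sum of its per-step features.
    X = np.stack([traj.sum(axis=0) for traj in trajectories])   # (N, d)
    y = np.asarray(returns, dtype=float)                        # (N,)
    d = X.shape[1]
    # Ridge-regularized least squares: w = (X^T X + ridge*I)^{-1} X^T y
    return np.linalg.solve(X.T @ X + ridge * np.eye(d), X.T @ y)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, true_w = 5, np.array([1.0, -0.5, 0.0, 2.0, 0.3])
    trajs = [rng.normal(size=(rng.integers(3, 10), d)) for _ in range(200)]
    # Per-step rewards exist in the simulator, but we observe only their sums.
    rets = np.array([(traj @ true_w).sum() for traj in trajs])
    w_hat = redistribute_returns(trajs, rets)
    print("recovered per-step reward weights:", np.round(w_hat, 2))
```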