Learning Reward Machines from Partially Observed Optimal Policies
- URL: http://arxiv.org/abs/2502.03762v1
- Date: Thu, 06 Feb 2025 03:48:25 GMT
- Title: Learning Reward Machines from Partially Observed Optimal Policies
- Authors: Mohamad Louai Shehab, Antoine Aspeel, Necmiye Ozay
- Abstract summary: Inverse reinforcement learning is the problem of inferring a reward function from an optimal policy.
Our goal is to identify the true reward machine using finite information.
- Score: 0.40964539027092917
- License:
- Abstract: Inverse reinforcement learning is the problem of inferring a reward function from an optimal policy. In this work, it is assumed that the reward is expressed as a reward machine whose transitions depend on atomic propositions associated with the state of a Markov Decision Process (MDP). Our goal is to identify the true reward machine using finite information. To this end, we first introduce the notion of a prefix tree policy which associates a distribution of actions to each state of the MDP and each attainable finite sequence of atomic propositions. Then, we characterize an equivalence class of reward machines that can be identified given the prefix tree policy. Finally, we propose a SAT-based algorithm that uses information extracted from the prefix tree policy to solve for a reward machine. It is proved that if the prefix tree policy is known up to a sufficient (but finite) depth, our algorithm recovers the exact reward machine up to the equivalence class. This sufficient depth is derived as a function of the number of MDP states and (an upper bound on) the number of states of the reward machine. Several examples are used to demonstrate the effectiveness of the approach.
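To make the abstract's objects concrete, here is a minimal Python sketch of a deterministic reward machine whose transitions and rewards are driven by sets of atomic propositions. The class, field names, and toy example are illustrative assumptions for exposition, not the authors' implementation, and the SAT-based identification step is not shown.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Tuple

Label = FrozenSet[str]  # the set of atomic propositions true at one MDP step


@dataclass
class RewardMachine:
    """Illustrative deterministic reward machine (names are assumptions, not the paper's code)."""
    initial: int
    delta: Dict[Tuple[int, Label], int]      # (machine state, label) -> next machine state
    reward: Dict[Tuple[int, Label], float]   # (machine state, label) -> reward emitted

    def run(self, labels) -> float:
        """Total reward collected along a finite sequence of labels."""
        u, total = self.initial, 0.0
        for lab in labels:
            key = (u, frozenset(lab))
            total += self.reward.get(key, 0.0)
            u = self.delta.get(key, u)       # stay put on unspecified transitions
        return total


# Toy machine: reward 1 the first time the proposition "goal" holds, 0 afterwards.
rm = RewardMachine(
    initial=0,
    delta={(0, frozenset({"goal"})): 1},
    reward={(0, frozenset({"goal"})): 1.0},
)
print(rm.run([set(), {"goal"}, {"goal"}]))  # -> 1.0
```

In this vocabulary, a prefix tree policy would attach an action distribution to every pair of an MDP state and an attainable finite label sequence such as the one passed to `run` above.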
Related papers
- Off-Policy Maximum Entropy RL with Future State and Action Visitation Measures [1.75493501156941]
We introduce a new maximum entropy reinforcement learning framework based on the distribution of states and actions visited by a policy.
For each state and action, this intrinsic reward is the relative entropy of the discounted distribution of states and actions visited during the next time steps (a standard form of this visitation measure is sketched below).
arXiv Detail & Related papers (2024-12-09T16:56:06Z)
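For reference, the discounted future state-action visitation measure mentioned in this entry is commonly written as follows; this is the standard textbook form, and the reference distribution against which the relative entropy is taken follows the paper itself rather than anything shown here.

```latex
% Discounted distribution over future states and actions visited by policy \pi,
% conditioned on starting from the state-action pair (s, a).
d^{\pi}_{\gamma}(s', a' \mid s, a)
  = (1 - \gamma) \sum_{t \ge 0} \gamma^{t}\,
    \Pr^{\pi}\!\left(S_t = s',\, A_t = a' \,\middle|\, S_0 = s,\, A_0 = a\right)
```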
- Walking the Values in Bayesian Inverse Reinforcement Learning [66.68997022043075]
A key challenge in Bayesian IRL is bridging the computational gap between the hypothesis space of possible rewards and the likelihood.
We propose ValueWalk - a new Markov chain Monte Carlo method based on this insight.
arXiv Detail & Related papers (2024-07-15T17:59:52Z)
- On the Global Convergence of Policy Gradient in Average Reward Markov Decision Processes [50.68789924454235]
We present the first finite time global convergence analysis of policy gradient in the context of average reward Markov decision processes (MDPs)
Our analysis shows that the policy gradient iterates converge to the optimal policy at a sublinear rate of $O\left(\frac{1}{T}\right)$, which translates to $O\left(\log(T)\right)$ regret, where $T$ represents the number of iterations (see the worked sum below).
arXiv Detail & Related papers (2024-03-11T15:25:03Z)
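The step from the convergence rate to the regret bound is the usual harmonic-sum argument; the sketch below assumes the suboptimality gap of the $t$-th iterate is bounded by $C/t$ for some constant $C$, which is a simplification of the paper's exact statement.

```latex
% If the gap of the t-th iterate satisfies V^* - V^{\pi_t} \le C / t, then
\sum_{t=1}^{T} \left( V^{*} - V^{\pi_t} \right)
  \;\le\; \sum_{t=1}^{T} \frac{C}{t}
  \;\le\; C \left( 1 + \log T \right)
  \;=\; O(\log T)
```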
- STARC: A General Framework For Quantifying Differences Between Reward Functions [52.69620361363209]
We provide a class of pseudometrics on the space of all reward functions that we call STARC metrics.
We show that STARC metrics induce both an upper and a lower bound on worst-case regret.
We also identify a number of issues with reward metrics proposed by earlier works.
arXiv Detail & Related papers (2023-09-26T20:31:19Z)
- Maximum-Likelihood Inverse Reinforcement Learning with Finite-Time Guarantees [56.848265937921354]
Inverse reinforcement learning (IRL) aims to recover the reward function and the associated optimal policy.
Many algorithms for IRL have an inherently nested structure.
We develop a novel single-loop algorithm for IRL that does not compromise reward estimation accuracy.
arXiv Detail & Related papers (2022-10-04T17:13:45Z)
- Provably Efficient Offline Reinforcement Learning with Trajectory-Wise Reward [66.81579829897392]
We propose a novel offline reinforcement learning algorithm called Pessimistic vAlue iteRaTion with rEward Decomposition (PARTED)
PARTED decomposes the trajectory return into per-step proxy rewards via least-squares-based reward redistribution, and then performs pessimistic value iteration based on the learned proxy reward (a toy redistribution step is sketched below).
To the best of our knowledge, PARTED is the first offline RL algorithm that is provably efficient in general MDPs with trajectory-wise reward.
arXiv Detail & Related papers (2022-06-13T19:11:22Z)
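As a toy illustration of least-squares reward redistribution (all names, the linear feature model, and the noiseless data below are assumptions for exposition, not PARTED's implementation): fit a weight vector so that the per-step proxy rewards sum to the observed trajectory return.

```python
import numpy as np

# Hypothetical setup: each trajectory provides per-step feature vectors
# phi_t in R^d and a single trajectory-level return R.  We fit theta so
# that sum_t phi_t^T theta approximates R, i.e. least-squares reward
# redistribution into per-step proxy rewards r_t = phi_t^T theta.
rng = np.random.default_rng(0)
d, n_traj, horizon = 5, 200, 10

true_theta = rng.normal(size=d)
features = rng.normal(size=(n_traj, horizon, d))      # phi_t for every step
returns = features.sum(axis=1) @ true_theta           # trajectory returns

# Design matrix: one row per trajectory, summing features over the horizon.
X = features.sum(axis=1)                               # shape (n_traj, d)
theta_hat, *_ = np.linalg.lstsq(X, returns, rcond=None)

proxy_rewards = features @ theta_hat                   # per-step proxy rewards
print(np.allclose(theta_hat, true_theta, atol=1e-6))   # noiseless case: exact recovery
```

In this noiseless linear toy case, the trajectory-level least-squares problem recovers the weights exactly, so the per-step proxy rewards coincide with the true step rewards of the model.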
- Reinforcement Learning in Reward-Mixing MDPs [74.41782017817808]
We consider episodic reinforcement learning in a reward-mixing Markov decision process (MDP).
An $\epsilon$-optimal policy is learned after exploring $\tilde{O}(\mathrm{poly}(H,\epsilon^{-1}) \cdot S^2 A^2)$ episodes, where $H$ is the time-horizon and $S, A$ are the number of states and actions respectively.
arXiv Detail & Related papers (2021-10-07T18:55:49Z)
- Simplified Belief-Dependent Reward MCTS Planning with Guaranteed Tree Consistency [11.688030627514532]
Partially Observable Markov Decision Processes (POMDPs) are notoriously hard to solve.
Most state-of-the-art online solvers leverage ideas from Monte Carlo Tree Search (MCTS)
We present a novel variant of the MCTS algorithm that considers information-theoretic rewards but avoids the need to calculate them completely.
arXiv Detail & Related papers (2021-05-29T07:25:11Z)
- Online Learning of Non-Markovian Reward Models [2.064612766965483]
We consider a Markov decision process (MDP) with a non-Markovian reward; the MDP models the dynamics of the environment in which the agent evolves.
While the MDP is known by the agent, the reward function is unknown to the agent and must be learned.
We use Angluin's $L^*$ active learning algorithm to learn a Mealy machine representing the underlying non-Markovian reward machine.
arXiv Detail & Related papers (2020-09-26T13:54:34Z)
- Learning Non-Markovian Reward Models in MDPs [0.0]
We show how to formalise the non-Markovian reward function using a Mealy machine.
In our formal setting, we consider a Markov decision process (MDP) that models the dynamics of the environment in which the agent evolves.
While the MDP is known by the agent, the reward function is unknown to the agent and must be learnt.
arXiv Detail & Related papers (2020-01-25T10:51:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.