Maximally Permissive Reward Machines
- URL: http://arxiv.org/abs/2408.08059v1
- Date: Thu, 15 Aug 2024 09:59:26 GMT
- Title: Maximally Permissive Reward Machines
- Authors: Giovanni Varricchione, Natasha Alechina, Mehdi Dastani, Brian Logan
- Abstract summary: We propose a new approach to synthesising reward machines based on the set of partial order plans for a goal.
We prove that learning using such "maximally permissive" reward machines results in higher rewards than learning using RMs based on a single plan.
- Score: 8.425937972214667
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reward machines allow the definition of rewards for temporally extended tasks and behaviors. Specifying "informative" reward machines can be challenging. One way to address this is to generate reward machines from a high-level abstract description of the learning environment, using techniques such as AI planning. However, previous planning-based approaches generate a reward machine based on a single (sequential or partial-order) plan, and do not allow maximum flexibility to the learning agent. In this paper we propose a new approach to synthesising reward machines which is based on the set of partial order plans for a goal. We prove that learning using such "maximally permissive" reward machines results in higher rewards than learning using RMs based on a single plan. We present experimental results which support our theoretical claims by showing that our approach obtains higher rewards than the single-plan approach in practice.
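To make the construction concrete, here is a minimal Python sketch (not the authors' implementation) of a reward machine whose states track the set of subgoals achieved so far, so that every event order consistent with a partial-order plan is accepted. The subgoals and the precedence relation are hypothetical illustrations.

```python
from typing import Dict, FrozenSet, Tuple

# Hypothetical subgoals and a partial-order constraint: 'a' before 'c',
# with 'b' unordered relative to both.
SUBGOALS = {"a", "b", "c"}
BEFORE = {("a", "c")}

State = FrozenSet[str]

def achievable(done: State) -> Dict[str, State]:
    """Subgoals whose partial-order preconditions are already satisfied."""
    out = {}
    for g in SUBGOALS - set(done):
        if all(pre in done for (pre, post) in BEFORE if post == g):
            out[g] = frozenset(done | {g})
    return out

def build_rm() -> Dict[Tuple[State, str], Tuple[State, float]]:
    """Enumerate RM transitions; reward 1.0 on completing all subgoals."""
    rm: Dict[Tuple[State, str], Tuple[State, float]] = {}
    frontier = [frozenset()]
    while frontier:
        s = frontier.pop()
        for label, t in achievable(s).items():
            if (s, label) not in rm:
                rm[(s, label)] = (t, 1.0 if t == frozenset(SUBGOALS) else 0.0)
                frontier.append(t)
    return rm

# Print all transitions, ordered by how many subgoals are already done.
for (s, label), (t, r) in sorted(build_rm().items(), key=lambda kv: len(kv[0][0])):
    print(sorted(s), f"--{label}/{r}->", sorted(t))
```

Running the sketch yields transitions covering the orders a-b-c, a-c-b, and b-a-c, whereas an RM built from the single sequential plan a-b-c would reward only that one order.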
Related papers
- Walking the Values in Bayesian Inverse Reinforcement Learning [66.68997022043075]
A key challenge in Bayesian IRL is bridging the computational gap between the hypothesis space of possible rewards and the likelihood.
We propose ValueWalk, a new Markov chain Monte Carlo method that performs inference in the space of Q-values, from which rewards can be recovered cheaply.
arXiv Detail & Related papers (2024-07-15T17:59:52Z)
- Dense Reward for Free in Reinforcement Learning from Human Feedback [64.92448888346125]
We leverage the fact that the reward model contains more information than just its scalar output.
We use these attention weights to redistribute the reward along the whole completion.
Empirically, we show that it stabilises training, accelerates the rate of learning, and, in practical cases, may lead to better local optima.
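As a rough sketch of the mechanism described in this summary (the attention values and the plain normalisation below are illustrative assumptions, not the paper's exact scheme):

```python
import numpy as np

def redistribute(scalar_reward: float, attn: np.ndarray) -> np.ndarray:
    """Spread a sequence-level reward over tokens in proportion to the
    reward model's attention mass on each token."""
    weights = attn / attn.sum()
    return scalar_reward * weights

# Hypothetical attention mass over a five-token completion.
attn = np.array([0.10, 0.30, 0.05, 0.35, 0.20])
print(redistribute(1.0, attn))  # dense per-token rewards that sum to 1.0
```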
arXiv Detail & Related papers (2024-02-01T17:10:35Z)
- DreamSmooth: Improving Model-based Reinforcement Learning via Reward Smoothing [60.21269454707625]
DreamSmooth learns to predict a temporally-smoothed reward, instead of the exact reward at the given timestep.
We show that DreamSmooth achieves state-of-the-art performance on long-horizon sparse-reward tasks.
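A minimal sketch of temporal reward smoothing; a Gaussian kernel is one plausible choice, and the kernel and its width here are assumptions rather than the paper's exact recipe:

```python
import numpy as np

def smooth_rewards(rewards: np.ndarray, sigma: float = 1.5) -> np.ndarray:
    """Replace each per-step reward with a Gaussian-weighted average of its
    neighbours; the smoothed sequence becomes the model's prediction target."""
    t = np.arange(len(rewards), dtype=float)
    kernel = np.exp(-0.5 * ((t[:, None] - t[None, :]) / sigma) ** 2)
    kernel /= kernel.sum(axis=1, keepdims=True)
    return kernel @ rewards

# A sparse terminal reward becomes a soft ramp that is easier to predict.
print(smooth_rewards(np.array([0.0, 0.0, 0.0, 0.0, 1.0])))
```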
arXiv Detail & Related papers (2023-11-02T17:57:38Z)
- Reward Collapse in Aligning Large Language Models [64.98482888193267]
We study the phenomenon of "reward collapse", an empirical observation where the prevailing ranking-based approach results in an identical reward distribution.
Our experimental results suggest that our proposed prompt-aware utility functions significantly alleviate reward collapse during the training of reward models.
arXiv Detail & Related papers (2023-05-28T02:12:00Z)
- Unpacking Reward Shaping: Understanding the Benefits of Reward Engineering on Sample Complexity [114.88145406445483]
Reinforcement learning provides an automated framework for learning behaviors from high-level reward specifications.
In practice the choice of reward function can be crucial for good results.
arXiv Detail & Related papers (2022-10-18T04:21:25Z)
- Automatic Reward Design via Learning Motivation-Consistent Intrinsic Rewards [46.068337522093096]
We introduce the concept of motivation which captures the underlying goal of maximizing certain rewards.
Our method performs better than the state-of-the-art methods in handling problems of delayed reward, exploration, and credit assignment.
arXiv Detail & Related papers (2022-07-29T14:52:02Z)
- A Hierarchical Bayesian Approach to Inverse Reinforcement Learning with Symbolic Reward Machines [7.661766773170363]
A misspecified reward can degrade sample efficiency and induce undesired behaviors in reinforcement learning problems.
We propose symbolic reward machines for incorporating high-level task knowledge when specifying the reward signals.
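One way to picture a symbolic reward machine: an RM whose transition rewards are symbolic parameters to be inferred from demonstrations rather than fixed numbers. The sketch below is a hypothetical illustration; states, events, and parameter names are invented for exposition.

```python
# (RM state, event) -> (next RM state, symbolic reward parameter).
SYMBOLIC_RM = {
    ("u0", "pick_key"): ("u1", "theta_key"),
    ("u1", "open_door"): ("u2", "theta_door"),
}

def concretise(rm, theta):
    """Bind symbolic reward parameters to numeric values, e.g. a sample
    from a posterior inferred from demonstrations."""
    return {k: (nxt, theta[sym]) for k, (nxt, sym) in rm.items()}

print(concretise(SYMBOLIC_RM, {"theta_key": 0.2, "theta_door": 1.0}))
```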
arXiv Detail & Related papers (2022-04-20T20:22:00Z)
- Learning Probabilistic Reward Machines from Non-Markovian Stochastic Reward Processes [8.800797834097764]
We introduce probabilistic reward machines (PRMs) as a representation of non-Markovian rewards.
We present an algorithm to learn PRMs from the underlying decision process as well as to learn the PRM representation of a given decision-making policy.
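A minimal sketch of the PRM representation, assuming hypothetical states, labels, and probabilities: for each (RM state, observed label), a distribution over next state and reward.

```python
import random

# (RM state, label) -> list of (probability, next state, reward) outcomes.
PRM = {
    ("u0", "pickup"): [(0.8, "u1", 0.0), (0.2, "u0", 0.0)],
    ("u1", "deliver"): [(1.0, "u2", 1.0)],
}

def step(state, label):
    """Sample the PRM transition triggered by an observed label."""
    outcomes = PRM.get((state, label), [(1.0, state, 0.0)])
    probs = [p for p, _, _ in outcomes]
    _, nxt, reward = random.choices(outcomes, weights=probs)[0]
    return nxt, reward

print(step("u0", "pickup"))  # ('u1', 0.0) with probability 0.8
```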
arXiv Detail & Related papers (2021-07-09T19:00:39Z)
- Disentangled Planning and Control in Vision Based Robotics via Reward Machines [13.486750561133634]
We augment a Deep Q-Learning agent with a Reward Machine (DQRM) to speed up the learning of vision-based policies for robot tasks.
A reward machine (RM) is a finite state machine that decomposes a task into a discrete planning graph and equips the agent with a reward function to guide it toward task completion.
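Concretely, augmenting a Q-learning agent with an RM amounts to learning over the product of environment state and RM state. The tabular sketch below uses the standard Q-learning update; the states, actions, and rewards are hypothetical stand-ins, not the DQRM architecture itself.

```python
from collections import defaultdict
import random

ALPHA, GAMMA, EPS = 0.1, 0.99, 0.1
Q = defaultdict(float)  # keys: (env state, RM state, action)

def choose(s, u, actions):
    """Epsilon-greedy action choice over the product state (s, u)."""
    if random.random() < EPS:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, u, a)])

def update(s, u, a, r, s2, u2, actions):
    """One Q-learning backup on the product of env and RM states."""
    target = r + GAMMA * max(Q[(s2, u2, b)] for b in actions)
    Q[(s, u, a)] += ALPHA * (target - Q[(s, u, a)])

actions = ["left", "right"]
a = choose("s0", "u0", actions)
update("s0", "u0", a, 1.0, "s1", "u1", actions)
```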
arXiv Detail & Related papers (2020-12-28T19:54:40Z)
- Semi-supervised reward learning for offline reinforcement learning [71.6909757718301]
Training agents usually requires reward functions, but rewards are seldom available in practice and their engineering is challenging and laborious.
We propose semi-supervised learning algorithms that learn from limited annotations and incorporate unlabelled data.
In our experiments with a simulated robotic arm, we greatly improve upon behavioural cloning and closely approach the performance achieved with ground truth rewards.
arXiv Detail & Related papers (2020-12-12T20:06:15Z)
- Reward Machines: Exploiting Reward Function Structure in Reinforcement Learning [22.242379207077217]
We show how to expose the reward function's code to the RL agent so that it can exploit the function's internal structure to learn optimal policies.
First, we propose reward machines, a type of finite state machine that supports the specification of reward functions.
We then describe different methodologies to exploit this structure to support learning, including automated reward shaping, task decomposition, and counterfactual reasoning with off-policy learning.
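As a sketch of the counterfactual idea: because the RM's transition function is known, one environment step can be replayed from every RM state to generate extra off-policy experience. The toy RM below is a hypothetical stand-in.

```python
RM_STATES = ["u0", "u1"]

def delta(u, label):
    """Toy RM: observing 'goal' in state u0 moves to u1 and pays reward 1."""
    if u == "u0" and label == "goal":
        return "u1", 1.0
    return u, 0.0

def counterfactual_experiences(s, a, s2, label):
    """Replay one environment transition (s, a, s2) under every RM state,
    yielding (product state, action, reward, next product state) tuples."""
    out = []
    for u in RM_STATES:
        u2, r = delta(u, label)
        out.append(((s, u), a, r, (s2, u2)))
    return out

for exp in counterfactual_experiences("s3", "right", "s4", "goal"):
    print(exp)
```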
arXiv Detail & Related papers (2020-10-06T00:10:16Z)