Reward Machines: Exploiting Reward Function Structure in Reinforcement
Learning
- URL: http://arxiv.org/abs/2010.03950v2
- Date: Mon, 17 Jan 2022 18:12:57 GMT
- Title: Reward Machines: Exploiting Reward Function Structure in Reinforcement
Learning
- Authors: Rodrigo Toro Icarte, Toryn Q. Klassen, Richard Valenzano, Sheila A.
McIlraith
- Abstract summary: We show how to expose the reward function's code to the RL agent so that it can exploit the function's internal structure to learn optimal policies.
First, we propose reward machines, a type of finite state machine that supports the specification of reward functions.
We then describe different methodologies to exploit this structure to support learning, including automated reward shaping, task decomposition, and counterfactual reasoning with off-policy learning.
- Score: 22.242379207077217
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement learning (RL) methods usually treat reward functions as black
boxes. As such, these methods must extensively interact with the environment in
order to discover rewards and optimal policies. In most RL applications,
however, users have to program the reward function and, hence, there is the
opportunity to make the reward function visible -- to show the reward
function's code to the RL agent so it can exploit the function's internal
structure to learn optimal policies in a more sample efficient manner. In this
paper, we show how to accomplish this idea in two steps. First, we propose
reward machines, a type of finite state machine that supports the specification
of reward functions while exposing reward function structure. We then describe
different methodologies to exploit this structure to support learning,
including automated reward shaping, task decomposition, and counterfactual
reasoning with off-policy learning. Experiments on tabular and continuous
domains, across different tasks and RL agents, show the benefits of exploiting
reward structure with respect to sample efficiency and the quality of resultant
policies. Finally, by virtue of being a form of finite state machine, reward
machines have the expressive power of a regular language and as such support
loops, sequences and conditionals, as well as the expression of temporally
extended properties typical of linear temporal logic and non-Markovian reward
specification.
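
To make the first step concrete, a reward machine can be encoded directly as a finite state machine whose transitions are labelled with propositional events and carry rewards; the agent tracks the machine's state alongside the environment state, so an otherwise non-Markovian reward becomes Markovian over the product. The sketch below is a minimal, hypothetical Python encoding (the "deliver coffee" task, the event names, and the counterfactual_experiences helper are illustrative assumptions, not the paper's implementation); the second function illustrates the idea behind counterfactual reasoning with off-policy learning, where a single labelled environment transition is replayed from every reward-machine state.

```python
# Minimal sketch of a reward machine (RM) for a hypothetical "deliver coffee"
# task. The states, events, and rewards below are illustrative, not taken from
# the paper's experiments.

class RewardMachine:
    def __init__(self, initial_state, transitions, terminal_states):
        # transitions: {(rm_state, event): (next_rm_state, reward)}
        self.initial_state = initial_state
        self.transitions = transitions
        self.terminal_states = terminal_states
        self.states = ({initial_state}
                       | {u for (u, _event) in transitions}
                       | {v for (v, _reward) in transitions.values()})

    def step(self, rm_state, event):
        """Advance on one labelled event; unlisted events self-loop with reward 0."""
        next_state, reward = self.transitions.get((rm_state, event), (rm_state, 0.0))
        return next_state, reward, next_state in self.terminal_states


# "Get coffee, then bring it to the office": reward 1 only on completion.
rm = RewardMachine(
    initial_state="u0",
    transitions={
        ("u0", "coffee"): ("u1", 0.0),   # coffee picked up
        ("u1", "office"): ("u2", 1.0),   # coffee delivered -> task complete
    },
    terminal_states={"u2"},
)


def counterfactual_experiences(rm, s, a, s_next, event):
    """Replay one labelled environment transition (s, a, s_next) from every
    non-terminal RM state, yielding one synthetic experience per RM state that
    an off-policy learner (e.g. Q-learning over (s, u) pairs) can consume."""
    experiences = []
    for u in rm.states - rm.terminal_states:
        u_next, reward, done = rm.step(u, event)
        experiences.append(((s, u), a, reward, (s_next, u_next), done))
    return experiences
```

In use, the event labels (e.g. "coffee") would come from the environment's labelling function, and the pair (s, u) is the state fed to any standard off-policy learner.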
Related papers
- Automated Rewards via LLM-Generated Progress Functions [47.50772243693897]
Large Language Models (LLMs) have the potential to automate reward engineering by leveraging their broad domain knowledge across various tasks.
This paper introduces an LLM-driven reward generation framework that is able to produce state-of-the-art policies on the challenging Bi-DexHands benchmark.
arXiv Detail & Related papers (2024-10-11T18:41:15Z)
- REBEL: A Regularization-Based Solution for Reward Overoptimization in Robotic Reinforcement Learning from Human Feedback [61.54791065013767]
A misalignment between the reward function and user intentions, values, or social norms can be catastrophic in the real world.
Current methods to mitigate this misalignment work by learning reward functions from human preferences.
We propose a novel concept of reward regularization within the robotic RLHF framework.
arXiv Detail & Related papers (2023-12-22T04:56:37Z)
- STARC: A General Framework For Quantifying Differences Between Reward Functions [55.33869271912095]
We provide a class of pseudometrics on the space of all reward functions that we call STARC metrics.
We show that STARC metrics induce both an upper and a lower bound on worst-case regret.
We also identify a number of issues with reward metrics proposed by earlier works.
arXiv Detail & Related papers (2023-09-26T20:31:19Z)
- Reward Collapse in Aligning Large Language Models [64.98482888193267]
We study the phenomenon of "reward collapse", an empirical observation where the prevailing ranking-based approach results in an identical reward distribution.
Our experimental results suggest that our proposed prompt-aware utility functions significantly alleviate reward collapse during the training of reward models.
arXiv Detail & Related papers (2023-05-28T02:12:00Z)
- Preprocessing Reward Functions for Interpretability [2.538209532048867]
We propose exploiting the intrinsic structure of reward functions by first preprocessing them into simpler but equivalent reward functions.
Our empirical evaluation shows that preprocessed rewards are often significantly easier to understand than the original reward.
arXiv Detail & Related papers (2022-03-25T10:19:35Z)
- Model-Free Reinforcement Learning for Symbolic Automata-encoded Objectives [0.0]
Reinforcement learning (RL) is a popular approach for robotic path planning in uncertain environments.
Poorly designed rewards can lead to policies that achieve maximal reward yet fail to satisfy the desired task objectives or are unsafe.
We propose using formal specifications in the form of symbolic automata.
arXiv Detail & Related papers (2022-02-04T21:54:36Z)
- Dynamics-Aware Comparison of Learned Reward Functions [21.159457412742356]
The ability to learn reward functions plays an important role in enabling the deployment of intelligent agents in the real world.
Reward functions are typically compared by considering the behavior of optimized policies, but this approach conflates deficiencies in the reward function with those of the policy search algorithm used to optimize it.
We propose the Dynamics-Aware Reward Distance (DARD), a new reward pseudometric.
arXiv Detail & Related papers (2022-01-25T03:48:00Z)
- Replacing Rewards with Examples: Example-Based Policy Search via Recursive Classification [133.20816939521941]
In the standard Markov decision process formalism, users specify tasks by writing down a reward function.
In many scenarios, the user is unable to describe the task in words or numbers, but can readily provide examples of what the world would look like if the task were solved.
Motivated by this observation, we derive a control algorithm that aims to visit states that have a high probability of leading to successful outcomes, given only examples of successful outcome states.
arXiv Detail & Related papers (2021-03-23T16:19:55Z)
- Efficient Exploration of Reward Functions in Inverse Reinforcement Learning via Bayesian Optimization [43.51553742077343]
Inverse reinforcement learning (IRL) is relevant to a variety of tasks, including value alignment and robot learning from demonstration.
This paper presents an IRL framework called Bayesian optimization-IRL (BO-IRL) which identifies multiple solutions consistent with the expert demonstrations.
arXiv Detail & Related papers (2020-11-17T10:17:45Z)
- Learning to Utilize Shaping Rewards: A New Approach of Reward Shaping [71.214923471669]
Reward shaping is an effective technique for incorporating domain knowledge into reinforcement learning (RL).
In this paper, we consider the problem of adaptively utilizing a given shaping reward function.
Experiments in sparse-reward cartpole and MuJoCo environments show that our algorithms can fully exploit beneficial shaping rewards.
arXiv Detail & Related papers (2020-11-05T05:34:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.