Reward Machines: Exploiting Reward Function Structure in Reinforcement
Learning
- URL: http://arxiv.org/abs/2010.03950v2
- Date: Mon, 17 Jan 2022 18:12:57 GMT
- Title: Reward Machines: Exploiting Reward Function Structure in Reinforcement
Learning
- Authors: Rodrigo Toro Icarte, Toryn Q. Klassen, Richard Valenzano, Sheila A.
McIlraith
- Abstract summary: We show how to expose the reward function's code to the RL agent so that it can exploit the function's internal structure to learn optimal policies.
First, we propose reward machines, a type of finite state machine that supports the specification of reward functions.
We then describe different methodologies to exploit this structure to support learning, including automated reward shaping, task decomposition, and counterfactual reasoning with off-policy learning.
- Score: 22.242379207077217
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement learning (RL) methods usually treat reward functions as black
boxes. As such, these methods must extensively interact with the environment in
order to discover rewards and optimal policies. In most RL applications,
however, users have to program the reward function and, hence, there is the
opportunity to make the reward function visible -- to show the reward
function's code to the RL agent so it can exploit the function's internal
structure to learn optimal policies in a more sample efficient manner. In this
paper, we show how to accomplish this idea in two steps. First, we propose
reward machines, a type of finite state machine that supports the specification
of reward functions while exposing reward function structure. We then describe
different methodologies to exploit this structure to support learning,
including automated reward shaping, task decomposition, and counterfactual
reasoning with off-policy learning. Experiments on tabular and continuous
domains, across different tasks and RL agents, show the benefits of exploiting
reward structure with respect to sample efficiency and the quality of resultant
policies. Finally, by virtue of being a form of finite state machine, reward
machines have the expressive power of a regular language and as such support
loops, sequences and conditionals, as well as the expression of temporally
extended properties typical of linear temporal logic and non-Markovian reward
specification.
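
To make the first step concrete, a reward machine can be encoded directly as a finite state machine whose transitions are labelled with propositional events and carry rewards; the agent tracks the machine's state alongside the environment state, so an otherwise non-Markovian reward becomes Markovian over the product. The sketch below is a minimal, hypothetical Python encoding (the "deliver coffee" task, the event names, and the counterfactual_experiences helper are illustrative assumptions, not the paper's implementation); the second function illustrates the idea behind counterfactual reasoning with off-policy learning, where a single labelled environment transition is replayed from every reward-machine state.

```python
# Minimal sketch of a reward machine (RM) for a hypothetical "deliver coffee"
# task. The states, events, and rewards below are illustrative, not taken from
# the paper's experiments.

class RewardMachine:
    def __init__(self, initial_state, transitions, terminal_states):
        # transitions: {(rm_state, event): (next_rm_state, reward)}
        self.initial_state = initial_state
        self.transitions = transitions
        self.terminal_states = terminal_states
        self.states = ({initial_state}
                       | {u for (u, _event) in transitions}
                       | {v for (v, _reward) in transitions.values()})

    def step(self, rm_state, event):
        """Advance on one labelled event; unlisted events self-loop with reward 0."""
        next_state, reward = self.transitions.get((rm_state, event), (rm_state, 0.0))
        return next_state, reward, next_state in self.terminal_states


# "Get coffee, then bring it to the office": reward 1 only on completion.
rm = RewardMachine(
    initial_state="u0",
    transitions={
        ("u0", "coffee"): ("u1", 0.0),   # coffee picked up
        ("u1", "office"): ("u2", 1.0),   # coffee delivered -> task complete
    },
    terminal_states={"u2"},
)


def counterfactual_experiences(rm, s, a, s_next, event):
    """Replay one labelled environment transition (s, a, s_next) from every
    non-terminal RM state, yielding one synthetic experience per RM state that
    an off-policy learner (e.g. Q-learning over (s, u) pairs) can consume."""
    experiences = []
    for u in rm.states - rm.terminal_states:
        u_next, reward, done = rm.step(u, event)
        experiences.append(((s, u), a, reward, (s_next, u_next), done))
    return experiences
```

In use, the event labels (e.g. "coffee") would come from the environment's labelling function, and the pair (s, u) is the state fed to any standard off-policy learner.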
Related papers
- Automated Rewards via LLM-Generated Progress Functions [47.50772243693897]
Large Language Models (LLMs) have the potential to automate reward engineering by leveraging their broad domain knowledge across various tasks.
This paper introduces an LLM-driven reward generation framework that is able to produce state-of-the-art policies on the challenging Bi-DexHands benchmark.
arXiv Detail & Related papers (2024-10-11T18:41:15Z)
- REBEL: A Regularization-Based Solution for Reward Overoptimization in Robotic Reinforcement Learning from Human Feedback [61.54791065013767]
A misalignment between the reward function and user intentions, values, or social norms can be catastrophic in the real world.
Current methods to mitigate this misalignment work by learning reward functions from human preferences.
We propose a novel concept of reward regularization within the robotic RLHF framework.
arXiv Detail & Related papers (2023-12-22T04:56:37Z)
- STARC: A General Framework For Quantifying Differences Between Reward Functions [55.33869271912095]
We provide a class of pseudometrics on the space of all reward functions that we call STARC metrics.
We show that STARC metrics induce both an upper and a lower bound on worst-case regret.
We also identify a number of issues with reward metrics proposed by earlier works.
arXiv Detail & Related papers (2023-09-26T20:31:19Z)
- Reward Collapse in Aligning Large Language Models [64.98482888193267]
We study the phenomenon of "reward collapse", an empirical observation where the prevailing ranking-based approach results in an identical reward distribution.
Our experimental results suggest that our proposed prompt-aware utility functions significantly alleviate reward collapse during the training of reward models.
arXiv Detail & Related papers (2023-05-28T02:12:00Z)
- Preprocessing Reward Functions for Interpretability [2.538209532048867]
We propose exploiting the intrinsic structure of reward functions by first preprocessing them into simpler but equivalent reward functions.
Our empirical evaluation shows that preprocessed rewards are often significantly easier to understand than the original reward.
arXiv Detail & Related papers (2022-03-25T10:19:35Z)
- Model-Free Reinforcement Learning for Symbolic Automata-encoded Objectives [0.0]
Reinforcement learning (RL) is a popular approach for robotic path planning in uncertain environments.
Poorly designed rewards can lead to policies that achieve maximal reward yet fail to satisfy the desired task objectives or are unsafe.
We propose using formal specifications in the form of symbolic automata.
arXiv Detail & Related papers (2022-02-04T21:54:36Z)
- Dynamics-Aware Comparison of Learned Reward Functions [21.159457412742356]
The ability to learn reward functions plays an important role in enabling the deployment of intelligent agents in the real world.
Reward functions are typically compared by considering the behavior of optimized policies, but this approach conflates deficiencies in the reward function with those of the policy search algorithm used to optimize it.
We propose the Dynamics-Aware Reward Distance (DARD), a new reward pseudometric.
arXiv Detail & Related papers (2022-01-25T03:48:00Z)
- Replacing Rewards with Examples: Example-Based Policy Search via Recursive Classification [133.20816939521941]
In the standard Markov decision process formalism, users specify tasks by writing down a reward function.
In many scenarios, the user is unable to describe the task in words or numbers, but can readily provide examples of what the world would look like if the task were solved.
Motivated by this observation, we derive a control algorithm that aims to visit states that have a high probability of leading to successful outcomes, given only examples of successful outcome states.
arXiv Detail & Related papers (2021-03-23T16:19:55Z)
- Efficient Exploration of Reward Functions in Inverse Reinforcement Learning via Bayesian Optimization [43.51553742077343]
Inverse reinforcement learning (IRL) is relevant to a variety of tasks, including value alignment and robot learning from demonstration.
This paper presents an IRL framework called Bayesian optimization-IRL (BO-IRL) which identifies multiple solutions consistent with the expert demonstrations.
arXiv Detail & Related papers (2020-11-17T10:17:45Z)
- Learning to Utilize Shaping Rewards: A New Approach of Reward Shaping [71.214923471669]
Reward shaping is an effective technique for incorporating domain knowledge into reinforcement learning (RL).
In this paper, we consider the problem of adaptively utilizing a given shaping reward function.
Experiments in sparse-reward cartpole and MuJoCo environments show that our algorithms can fully exploit beneficial shaping rewards.
arXiv Detail & Related papers (2020-11-05T05:34:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.