Designing Rewards for Fast Learning
- URL: http://arxiv.org/abs/2205.15400v1
- Date: Mon, 30 May 2022 19:48:52 GMT
- Title: Designing Rewards for Fast Learning
- Authors: Henry Sowerby, Zhiyuan Zhou, Michael L. Littman
- Abstract summary: We look at how reward-design choices impact learning speed and seek to identify principles of good reward design that quickly induce target behavior.
We propose a linear-programming based algorithm that efficiently finds a reward function that maximizes action gap and minimizes subjective discount.
- Score: 18.032654606016447
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To convey desired behavior to a Reinforcement Learning (RL) agent, a designer
must choose a reward function for the environment, arguably the most important
knob designers have in interacting with RL agents. Although many reward
functions induce the same optimal behavior (Ng et al., 1999), in practice, some
of them result in faster learning than others. In this paper, we look at how
reward-design choices impact learning speed and seek to identify principles of
good reward design that quickly induce target behavior. This
reward-identification problem is framed as an optimization problem: Firstly, we
advocate choosing state-based rewards that maximize the action gap, making
optimal actions easy to distinguish from suboptimal ones. Secondly, we propose
minimizing a measure of the horizon, something we call the "subjective
discount", over which rewards need to be optimized to encourage agents to make
optimal decisions with less lookahead. To solve this optimization problem, we
propose a linear-programming based algorithm that efficiently finds a reward
function that maximizes action gap and minimizes subjective discount. We test
the rewards generated with the algorithm in tabular environments with
Q-Learning, and empirically show they lead to faster learning. Although we only
focus on Q-Learning because it is perhaps the simplest and most well understood
RL algorithm, preliminary results with R-max (Brafman and Tennenholtz, 2000)
suggest our results are much more general. Our experiments support three
principles of reward design: 1) consistent with existing results, penalizing
each step taken induces faster learning than rewarding only the goal; 2) when
rewarding subgoals along the target trajectory, rewards should gradually
increase as the goal gets closer; and 3) a dense reward that is nonzero on every
state helps only if it is designed carefully.
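To make the linear-programming formulation concrete, the sketch below fixes a subjective discount and searches for a bounded state-based reward that maximizes the minimum action gap of a target policy on a small deterministic chain MDP. The 5-state chain, the reward bounds, and the use of scipy's `linprog` are illustrative assumptions rather than the authors' released implementation; the full method additionally trades the action gap off against the subjective discount.
```python
# A minimal sketch, not the paper's released code: for a fixed subjective
# discount GAMMA, find a state-based reward R(s) in [-1, 1] that maximizes the
# minimum action gap of a given target policy, posed as a linear program.
import numpy as np
from scipy.optimize import linprog

S, A = 5, 2                      # states 0..4 (state 4 is an absorbing goal); actions: 0=right, 1=left
GAMMA = 0.8                      # fixed "subjective discount" for this slice of the problem
target = np.zeros(S, dtype=int)  # target policy: always move right

# Deterministic transition table P[s, a] -> next state (illustrative chain MDP)
P = np.zeros((S, A), dtype=int)
for s in range(S):
    P[s, 0] = min(s + 1, S - 1)  # right
    P[s, 1] = max(s - 1, 0)      # left
P[S - 1, :] = S - 1              # goal state is absorbing

# Decision variables: x = [R(0..S-1), V(0..S-1), delta]; maximize delta.
n = 2 * S + 1
c = np.zeros(n)
c[-1] = -1.0                     # linprog minimizes, so minimize -delta

# Bellman consistency for the target policy: V(s) = R(s) + GAMMA * V(P[s, pi(s)])
A_eq, b_eq = [], []
for s in range(S):
    row = np.zeros(n)
    row[S + s] = 1.0                     # +V(s)
    row[s] = -1.0                        # -R(s)
    row[S + P[s, target[s]]] -= GAMMA    # -GAMMA * V(next state under pi)
    A_eq.append(row); b_eq.append(0.0)

# Action-gap constraints for non-target actions (skip the absorbing goal state):
# R(s) + GAMMA * V(P[s, a]) - V(s) + delta <= 0
A_ub, b_ub = [], []
for s in range(S - 1):
    for a in range(A):
        if a == target[s]:
            continue
        row = np.zeros(n)
        row[s] = 1.0
        row[S + P[s, a]] += GAMMA
        row[S + s] -= 1.0
        row[-1] = 1.0
        A_ub.append(row); b_ub.append(0.0)

bounds = [(-1.0, 1.0)] * S + [(None, None)] * S + [(0.0, None)]
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=bounds, method="highs")
R, V, delta = res.x[:S], res.x[S:2 * S], res.x[-1]
print("reward per state:", np.round(R, 3))
print("min action gap:  ", round(delta, 3))
```
Rerunning the sketch with smaller values of GAMMA is one way to probe the trade-off the paper studies between a large action gap and a short subjective discount.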
Related papers
- To the Max: Reinventing Reward in Reinforcement Learning [1.5498250598583487]
In reinforcement learning (RL), different reward functions can define the same optimal policy but result in drastically different learning performance.
We introduce max-reward RL, where an agent optimizes the maximum rather than the cumulative reward.
In experiments, we study the performance of max-reward RL algorithms in two goal-reaching environments from Gymnasium-Robotics.
arXiv Detail & Related papers (2024-02-02T12:29:18Z) - Dense Reward for Free in Reinforcement Learning from Human Feedback [64.92448888346125]
We leverage the fact that the reward model contains more information than just its scalar output.
We use these attention weights to redistribute the reward along the whole completion.
Empirically, we show that it stabilises training, accelerates the rate of learning, and, in practical cases, may lead to better local optima.
arXiv Detail & Related papers (2024-02-01T17:10:35Z) - REBEL: A Regularization-Based Solution for Reward Overoptimization in Robotic Reinforcement Learning from Human Feedback [61.54791065013767]
A misalignment between the reward function and user intentions, values, or social norms can be catastrophic in the real world.
Current methods to mitigate this misalignment work by learning reward functions from human preferences.
We propose a novel concept of reward regularization within the robotic RLHF framework.
arXiv Detail & Related papers (2023-12-22T04:56:37Z) - Behavior Alignment via Reward Function Optimization [23.92721220310242]
We introduce a new framework that integrates auxiliary rewards reflecting a designer's domain knowledge with the environment's primary rewards.
We evaluate our method's efficacy on a diverse set of tasks, from small-scale experiments to high-dimensional control challenges.
arXiv Detail & Related papers (2023-10-29T13:45:07Z) - STARC: A General Framework For Quantifying Differences Between Reward Functions [55.33869271912095]
We provide a class of pseudometrics on the space of all reward functions that we call STARC metrics.
We show that STARC metrics induce both an upper and a lower bound on worst-case regret.
We also identify a number of issues with reward metrics proposed by earlier works.
arXiv Detail & Related papers (2023-09-26T20:31:19Z) - Mind the Gap: Offline Policy Optimization for Imperfect Rewards [14.874900923808408]
We propose a unified offline policy optimization approach, RGM (Reward Gap Minimization), which can handle diverse types of imperfect rewards.
By exploiting the duality of the lower layer, we derive a tractable algorithm that enables sample-based learning without any online interactions.
arXiv Detail & Related papers (2023-02-03T11:39:50Z) - Basis for Intentions: Efficient Inverse Reinforcement Learning using Past Experience [89.30876995059168]
This paper addresses the problem of inverse reinforcement learning (IRL) -- inferring the reward function of an agent from observing its behavior.
arXiv Detail & Related papers (2022-08-09T17:29:49Z) - Automatic Reward Design via Learning Motivation-Consistent Intrinsic Rewards [46.068337522093096]
We introduce the concept of motivation which captures the underlying goal of maximizing certain rewards.
Our method performs better than the state-of-the-art methods in handling problems of delayed reward, exploration, and credit assignment.
arXiv Detail & Related papers (2022-07-29T14:52:02Z) - A Study on Dense and Sparse (Visual) Rewards in Robot Policy Learning [19.67628391301068]
We study the performance of multiple state-of-the-art deep reinforcement learning algorithms under different types of reward.
Our results show that visual dense rewards are more successful than visual sparse rewards and that there is no single best algorithm for all tasks.
arXiv Detail & Related papers (2021-08-06T17:47:48Z) - Learning to Utilize Shaping Rewards: A New Approach of Reward Shaping [71.214923471669]
Reward shaping is an effective technique for incorporating domain knowledge into reinforcement learning (RL).
In this paper, we consider the problem of adaptively utilizing a given shaping reward function.
Experiments in sparse-reward cartpole and MuJoCo environments show that our algorithms can fully exploit beneficial shaping rewards.
arXiv Detail & Related papers (2020-11-05T05:34:14Z)