To the Max: Reinventing Reward in Reinforcement Learning
- URL: http://arxiv.org/abs/2402.01361v2
- Date: Mon, 29 Jul 2024 18:07:08 GMT
- Title: To the Max: Reinventing Reward in Reinforcement Learning
- Authors: Grigorii Veviurko, Wendelin Böhmer, Mathijs de Weerdt
- Abstract summary: In reinforcement learning (RL), different reward functions can define the same optimal policy but result in drastically different learning performance.
We introduce max-reward RL, where an agent optimizes the maximum rather than the cumulative reward.
In experiments, we study the performance of max-reward RL algorithms in two goal-reaching environments from Gymnasium-Robotics.
- Score: 1.5498250598583487
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In reinforcement learning (RL), different reward functions can define the same optimal policy but result in drastically different learning performance. For some, the agent gets stuck with a suboptimal behavior, and for others, it solves the task efficiently. Choosing a good reward function is hence an extremely important yet challenging problem. In this paper, we explore an alternative approach for using rewards for learning. We introduce max-reward RL, where an agent optimizes the maximum rather than the cumulative reward. Unlike earlier works, our approach works for deterministic and stochastic environments and can be easily combined with state-of-the-art RL algorithms. In the experiments, we study the performance of max-reward RL algorithms in two goal-reaching environments from Gymnasium-Robotics and demonstrate its benefits over standard RL. The code is available at https://github.com/veviurko/To-the-Max.
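For intuition, the following is a minimal sketch (not taken from the paper) contrasting the standard discounted return with a max-reward objective on a single trajectory of rewards; the function names, discount factor, and example rewards are illustrative assumptions.

```python
# Minimal sketch (illustrative, not from the paper): comparing the standard
# discounted return with a max-reward objective on one trajectory of rewards.

def discounted_return(rewards, gamma=0.99):
    """Standard RL objective: sum of discounted rewards along the trajectory."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

def max_reward_return(rewards):
    """Max-reward objective: the single best reward seen along the trajectory."""
    return max(rewards)

# A sparse goal-reaching trajectory: zero reward until the goal is reached once.
trajectory = [0.0, 0.0, 0.0, 1.0, 0.0]

print(discounted_return(trajectory))  # ~0.970, depends on when the goal is hit
print(max_reward_return(trajectory))  # 1.0, independent of when the goal is hit
```

Under such sparse goal-reaching rewards, the max-reward return depends only on whether the goal is ever reached rather than on exactly when, which is the kind of goal-reaching setting studied in the paper's experiments.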
Related papers
- REBEL: A Regularization-Based Solution for Reward Overoptimization in Robotic Reinforcement Learning from Human Feedback [61.54791065013767]
A misalignment between the reward function and user intentions, values, or social norms can be catastrophic in the real world.
Current methods to mitigate this misalignment work by learning reward functions from human preferences.
We propose a novel concept of reward regularization within the robotic RLHF framework.
arXiv Detail & Related papers (2023-12-22T04:56:37Z) - $f$-Policy Gradients: A General Framework for Goal Conditioned RL using $f$-Divergences [44.91973620442546]
This paper introduces a novel way to encourage exploration called $f$-Policy Gradients.
We show that $f$-PG has better performance compared to standard policy methods on a challenging gridworld.
arXiv Detail & Related papers (2023-10-10T17:07:05Z) - Reinforcement Learning with Foundation Priors: Let the Embodied Agent Efficiently Learn on Its Own [59.11934130045106]
We propose Reinforcement Learning with Foundation Priors (RLFP) to utilize guidance and feedback from policy, value, and success-reward foundation models.
Within this framework, we introduce the Foundation-guided Actor-Critic (FAC) algorithm, which enables embodied agents to explore more efficiently with automatic reward functions.
Our method achieves remarkable performance in various manipulation tasks on both real robots and in simulation.
arXiv Detail & Related papers (2023-10-04T07:56:42Z) - Provably Feedback-Efficient Reinforcement Learning via Active Reward Learning [26.067411894141863]
An appropriate reward function is of paramount importance in specifying a task in reinforcement learning (RL).
Human-in-the-loop (HiL) RL allows humans to communicate complex goals to the RL agent by providing various types of feedback.
We provide an active-learning-based RL algorithm that first explores the environment without specifying a reward function.
arXiv Detail & Related papers (2023-04-18T12:36:09Z) - Extreme Q-Learning: MaxEnt RL without Entropy [88.97516083146371]
Modern Deep Reinforcement Learning (RL) algorithms require estimates of the maximal Q-value, which are difficult to compute in continuous domains.
We introduce a new update rule for online and offline RL which directly models the maximal value using Extreme Value Theory (EVT).
Using EVT, we derive our Extreme Q-Learning framework and consequently online and, for the first time, offline MaxEnt Q-learning algorithms.
arXiv Detail & Related papers (2023-01-05T23:14:38Z) - Designing Rewards for Fast Learning [18.032654606016447]
We look at how reward-design choices impact learning speed and seek to identify principles of good reward design that quickly induce target behavior.
We propose a linear-programming based algorithm that efficiently finds a reward function that maximizes action gap and minimizes subjective discount.
arXiv Detail & Related papers (2022-05-30T19:48:52Z) - Maximum Entropy RL (Provably) Solves Some Robust RL Problems [94.80212602202518]
We prove theoretically that standard maximum entropy RL is robust to some disturbances in the dynamics and the reward function.
Our results suggest that MaxEnt RL by itself is robust to certain disturbances, without requiring any additional modifications.
arXiv Detail & Related papers (2021-03-10T18:45:48Z) - Information Directed Reward Learning for Reinforcement Learning [64.33774245655401]
We learn a model of the reward function that allows standard RL algorithms to achieve high expected return with as few expert queries as possible.
In contrast to prior active reward learning methods designed for specific types of queries, IDRL naturally accommodates different query types.
We support our findings with extensive evaluations in multiple environments and with different types of queries.
arXiv Detail & Related papers (2021-02-24T18:46:42Z) - Active Finite Reward Automaton Inference and Reinforcement Learning Using Queries and Counterexamples [31.31937554018045]
Deep reinforcement learning (RL) methods require intensive data from the exploration of the environment to achieve satisfactory performance.
We propose a framework that enables an RL agent to reason over its exploration process and distill high-level knowledge for effectively guiding its future explorations.
Specifically, we propose a novel RL algorithm that learns high-level knowledge in the form of a finite reward automaton by using the L* learning algorithm.
arXiv Detail & Related papers (2020-06-28T21:13:08Z) - Rewriting History with Inverse RL: Hindsight Inference for Policy Improvement [137.29281352505245]
We show that hindsight relabeling is inverse RL, an observation that suggests that we can use inverse RL in tandem with RL algorithms to efficiently solve many tasks.
Our experiments confirm that relabeling data using inverse RL accelerates learning in general multi-task settings (a generic relabeling sketch follows this list).
arXiv Detail & Related papers (2020-02-25T18:36:31Z)
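As a rough illustration of the hindsight relabeling idea mentioned in the last entry above, here is a generic goal-relabeling sketch (it shows the general technique, not that paper's inverse-RL formulation); the Transition fields and the sparse-reward rule are illustrative assumptions.

```python
# Rough sketch of hindsight goal relabeling (the general technique only; not
# the inverse-RL formulation of the paper above). Fields are illustrative.

from dataclasses import dataclass, replace
from typing import List


@dataclass
class Transition:
    state: tuple
    action: int
    goal: tuple      # goal the agent was originally conditioned on
    achieved: tuple  # goal state actually reached after this step
    reward: float


def relabel_with_final_goal(episode: List[Transition]) -> List[Transition]:
    """Relabel every transition as if the episode's final achieved state had
    been the intended goal, recomputing the sparse reward accordingly."""
    final_goal = episode[-1].achieved
    return [
        replace(t, goal=final_goal,
                reward=1.0 if t.achieved == final_goal else 0.0)
        for t in episode
    ]
```

The relabeled transitions can be stored alongside the originals, so that even unsuccessful episodes provide successful examples for some goal.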
This list is automatically generated from the titles and abstracts of the papers in this site.