REBEL: A Regularization-Based Solution for Reward Overoptimization in Robotic Reinforcement Learning from Human Feedback
- URL: http://arxiv.org/abs/2312.14436v2
- Date: Sun, 14 Apr 2024 20:07:19 GMT
- Title: REBEL: A Regularization-Based Solution for Reward Overoptimization in Robotic Reinforcement Learning from Human Feedback
- Authors: Souradip Chakraborty, Anukriti Singh, Amisha Bhaskar, Pratap Tokekar, Dinesh Manocha, Amrit Singh Bedi
- Abstract summary: A misalignment between the reward function and user intentions, values, or social norms can be catastrophic in the real world.
Current methods to mitigate this misalignment work by learning reward functions from human preferences.
We propose a novel concept of reward regularization within the robotic RLHF framework.
- Score: 61.54791065013767
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The effectiveness of reinforcement learning (RL) agents in continuous control robotics tasks is heavily dependent on the design of the underlying reward function. However, a misalignment between the reward function and user intentions, values, or social norms can be catastrophic in the real world. Current methods mitigate this misalignment by learning reward functions from human preferences; however, they inadvertently introduce a risk of reward overoptimization. In this work, we address this challenge by advocating for the adoption of regularized reward functions that more accurately mirror the intended behaviors. We propose a novel concept of reward regularization within the robotic RLHF (RL from Human Feedback) framework, which we refer to as "agent preferences". Our approach uniquely incorporates not just human feedback in the form of preferences but also considers the preferences of the RL agent itself during the reward function learning process. This dual consideration significantly mitigates the issue of reward function overoptimization in RL. We provide a theoretical justification for the proposed approach by formulating the robotic RLHF problem as a bilevel optimization problem. We demonstrate the efficiency of our algorithm REBEL on several continuous control benchmarks, including the DeepMind Control Suite (Tassa et al., 2018) and MetaWorld (Yu et al., 2021), as well as high-dimensional visual environments, with an improvement of more than 70% in sample efficiency in comparison to current SOTA baselines. This showcases our approach's effectiveness in aligning reward functions with true behavioral intentions, setting a new benchmark in the field.
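The abstract describes a bilevel structure: an upper level that learns the reward from human preferences plus an agent-preference regularizer, and a lower level that optimizes the policy against the current learned reward. Below is a minimal PyTorch-style sketch of that structure, not the paper's implementation: the names RewardModel, reward_update, and agent_preference_term are invented for illustration, and the placeholder agent term (mean learned reward on the agent's own rollouts) merely stands in for REBEL's actual regularizer.

```python
# Minimal sketch (assumptions, not REBEL's implementation) of preference-based
# reward learning with an added agent-preference regularizer.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Simple state-action reward network (architecture is an assumption)."""
    def __init__(self, obs_dim, act_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)

def human_preference_loss(rm, seg_pref, seg_other):
    """Bradley-Terry loss: the human-preferred segment should receive the higher
    return under the learned reward. Each segment is (obs, act) of shape (T, d)."""
    return -F.logsigmoid(rm(*seg_pref).sum() - rm(*seg_other).sum())

def agent_preference_term(rm, agent_obs, agent_act):
    """Placeholder for the agent-preference regularizer: mean learned reward on
    the current policy's own rollouts (an illustrative stand-in, not the
    paper's exact formulation)."""
    return rm(agent_obs, agent_act).mean()

def reward_update(rm, optimizer, human_batch, agent_batch, lam=0.1):
    """Upper-level step: fit human preferences while regularizing with the agent
    term; the lower level (any RL algorithm trained on rm) refreshes
    agent_batch between these updates."""
    seg_pref, seg_other = human_batch
    loss = human_preference_loss(rm, seg_pref, seg_other) \
           - lam * agent_preference_term(rm, *agent_batch)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```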
Related papers
- Behavior Alignment via Reward Function Optimization [23.92721220310242]
We introduce a new framework that integrates auxiliary rewards reflecting a designer's domain knowledge with the environment's primary rewards.
We evaluate our method's efficacy on a diverse set of tasks, from small-scale experiments to high-dimensional control challenges.
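For context, a toy sketch of blending a designer's auxiliary reward with the environment's primary reward; the cited paper learns how to combine the two so that optimizing the shaped reward stays aligned with the primary task, whereas the fixed weight below is purely an illustrative assumption.

```python
# Toy illustration only: a fixed-weight blend of primary and auxiliary rewards.
# The cited framework instead learns the combination; the weight here is assumed.
def shaped_reward(r_primary: float, r_auxiliary: float, weight: float = 0.1) -> float:
    """Reward actually handed to the RL agent at each step."""
    return r_primary + weight * r_auxiliary
```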
arXiv Detail & Related papers (2023-10-29T13:45:07Z)
- Contrastive Preference Learning: Learning from Human Feedback without RL [71.77024922527642]
We introduce Contrastive Preference Learning (CPL), an algorithm for learning optimal policies from preferences without learning reward functions.
CPL is fully off-policy, uses only a simple contrastive objective, and can be applied to arbitrary MDPs.
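A hedged sketch of a contrastive preference objective in the spirit of CPL: score each segment by the discounted log-likelihood the policy assigns to its actions, then apply a logistic loss so the preferred segment scores higher, with no reward model or value function. The temperature alpha, discount gamma, and tensor shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def segment_score(log_probs: torch.Tensor, gamma: float = 0.99,
                  alpha: float = 0.1) -> torch.Tensor:
    """log_probs: (T,) tensor of log pi(a_t | s_t) along one segment."""
    discounts = gamma ** torch.arange(log_probs.shape[0], dtype=log_probs.dtype)
    return alpha * (discounts * log_probs).sum()

def contrastive_preference_loss(logp_preferred: torch.Tensor,
                                logp_other: torch.Tensor) -> torch.Tensor:
    """Logistic (Bradley-Terry-style) loss: the preferred segment should score
    higher under the current policy; no reward model or value function needed."""
    return -F.logsigmoid(segment_score(logp_preferred) - segment_score(logp_other))
```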
arXiv Detail & Related papers (2023-10-20T16:37:56Z)
- A Multiplicative Value Function for Safe and Efficient Reinforcement Learning [131.96501469927733]
We propose a safe model-free RL algorithm with a novel multiplicative value function consisting of a safety critic and a reward critic.
The safety critic predicts the probability of constraint violation and discounts the reward critic that only estimates constraint-free returns.
We evaluate our method in four safety-focused environments, including classical RL benchmarks augmented with safety constraints and robot navigation tasks with images and raw Lidar scans as observations.
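A hedged sketch of the multiplicative composition described above, assuming simple MLP critics: the safety critic's estimated probability of a constraint violation scales down the reward critic's constraint-free return estimate. Network sizes and names are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class MultiplicativeCritic(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        def mlp():
            return nn.Sequential(nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))
        self.reward_critic = mlp()   # estimates return assuming no violation
        self.safety_critic = mlp()   # logit of P(constraint violation)

    def forward(self, obs, act):
        x = torch.cat([obs, act], dim=-1)
        q_reward = self.reward_critic(x)
        p_violation = torch.sigmoid(self.safety_critic(x))
        # Value used for policy improvement: discounted by the safety estimate.
        return (1.0 - p_violation) * q_reward
```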
arXiv Detail & Related papers (2023-03-07T18:29:15Z)
- Zeroth-Order Optimization Meets Human Feedback: Provable Learning via Ranking Oracles [16.916111322004557]
The paper studies a setting in which a black-box objective function can only be gauged via a ranking oracle.
We introduce ZO-RankSGD, an innovative zeroth-order optimization algorithm.
We show that ZO-RankSGD can significantly enhance the detail of generated images with only a few rounds of human feedback.
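A hedged sketch of one rank-weighted zeroth-order step, illustrating how ranking feedback alone can drive ascent: perturb the current point, ask the oracle to order the candidates, and move along a rank-weighted combination of the probe directions. The linear rank weights and step sizes are assumptions for illustration, not the paper's exact estimator.

```python
import numpy as np

def zo_rank_step(x, rank_oracle, m=8, mu=0.1, lr=0.05, rng=None):
    """One ascent step using only ranking feedback.

    x           : current parameters, 1-D numpy array.
    rank_oracle : callable taking a list of candidate arrays and returning
                  their indices ordered from best to worst.
    """
    rng = np.random.default_rng() if rng is None else rng
    directions = rng.standard_normal((m, x.size))   # random probe directions
    candidates = [x + mu * d for d in directions]   # perturbed query points
    order = rank_oracle(candidates)                 # best candidate first
    # Rank-weighted combination: best-ranked directions get positive weight,
    # worst-ranked get negative weight, giving a rough ascent direction.
    weights = np.linspace(1.0, -1.0, m)
    ascent = sum(w * directions[i] for w, i in zip(weights, order)) / m
    return x + lr * ascent
```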
arXiv Detail & Related papers (2023-03-07T09:20:43Z)
- Mastering the Unsupervised Reinforcement Learning Benchmark from Pixels [112.63440666617494]
Reinforcement learning algorithms can succeed but require large amounts of interaction between the agent and the environment.
We propose a new method that uses unsupervised model-based RL to pre-train the agent.
We show robust performance on the Real-World RL benchmark, hinting at resiliency to environment perturbations during adaptation.
arXiv Detail & Related papers (2022-09-24T14:22:29Z)
- Basis for Intentions: Efficient Inverse Reinforcement Learning using Past Experience [89.30876995059168]
This paper addresses the problem of inverse reinforcement learning (IRL) -- inferring the reward function of an agent from observing its behavior -- and makes it more efficient by leveraging past experience.
arXiv Detail & Related papers (2022-08-09T17:29:49Z)
- Reward Uncertainty for Exploration in Preference-based Reinforcement Learning [88.34958680436552]
We present an exploration method specifically for preference-based reinforcement learning algorithms.
Our main idea is to design an intrinsic reward by measuring novelty based on the learned reward.
Our experiments show that exploration bonus from uncertainty in learned reward improves both feedback- and sample-efficiency of preference-based RL algorithms.
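A hedged sketch of the exploration bonus described above, assuming an ensemble of learned reward models whose disagreement serves as the intrinsic reward; the reward-model API, the use of the ensemble mean as the extrinsic reward, and beta are assumptions.

```python
import torch

def reward_with_exploration_bonus(reward_ensemble, obs, act, beta=0.05):
    """reward_ensemble: list of callables mapping (obs, act) -> (batch,) rewards."""
    preds = torch.stack([rm(obs, act) for rm in reward_ensemble], dim=0)  # (K, B)
    extrinsic = preds.mean(dim=0)        # reward the agent is trained on
    intrinsic = preds.std(dim=0)         # large where the reward models disagree
    return extrinsic + beta * intrinsic  # beta would typically be decayed over training
```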
arXiv Detail & Related papers (2022-05-24T23:22:10Z)