Mind the Gap: Offline Policy Optimization for Imperfect Rewards
- URL: http://arxiv.org/abs/2302.01667v1
- Date: Fri, 3 Feb 2023 11:39:50 GMT
- Title: Mind the Gap: Offline Policy Optimization for Imperfect Rewards
- Authors: Jianxiong Li, Xiao Hu, Haoran Xu, Jingjing Liu, Xianyuan Zhan,
Qing-Shan Jia, Ya-Qin Zhang
- Abstract summary: We propose a unified offline policy optimization approach, RGM (Reward Gap Minimization), which can handle diverse types of imperfect rewards.
By exploiting the duality of the lower layer, we derive a tractable algorithm that enables sample-based learning without any online interactions.
- Score: 14.874900923808408
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The reward function is essential in reinforcement learning (RL): it
serves as the guiding signal that incentivizes agents to solve given tasks, yet
it is also notoriously difficult to design. In many cases, only imperfect
rewards are available, which inflicts substantial performance loss on RL
agents. In this study, we propose a unified offline policy optimization
approach, RGM (Reward Gap Minimization), which can handle diverse types of
imperfect rewards. RGM is formulated as a bi-level optimization problem: the
upper layer optimizes a reward correction term that performs visitation
distribution matching w.r.t. some expert data; the lower layer solves a
pessimistic RL problem with the corrected rewards. By exploiting the duality of
the lower layer, we derive a tractable algorithm that enables sample-based
learning without any online interactions. Comprehensive experiments demonstrate
that RGM achieves superior performance to existing methods under diverse
settings of imperfect rewards. Furthermore, RGM can effectively correct rewards
that are wrong or inconsistent with expert preferences and can retrieve useful
information from biased rewards.
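The bi-level structure described in the abstract can be sketched roughly as follows. The notation here ($\hat r$ for the given imperfect reward, $\Delta r$ for the reward correction term, $d^\pi$, $d^E$, and $d^D$ for the visitation distributions of the learned policy, the expert data, and the offline dataset, $D_f$ an $f$-divergence, $\alpha$ a pessimism weight) is assumed for illustration and may differ from the paper's own formulation.

```latex
% A minimal sketch of the bi-level problem described in the abstract
% (requires amsmath and amssymb). All symbols are illustrative assumptions,
% not taken verbatim from the paper.
\begin{align*}
  &\text{Upper layer (reward correction):}\quad
    \min_{\Delta r}\; D_f\!\big(d^{\pi^{*}(\Delta r)} \,\big\|\, d^{E}\big) \\
  &\text{Lower layer (pessimistic RL):}\quad
    \pi^{*}(\Delta r) \in \arg\max_{\pi}\;
    \mathbb{E}_{(s,a)\sim d^{\pi}}\!\big[\hat{r}(s,a) + \Delta r(s,a)\big]
    \;-\; \alpha\, D_f\!\big(d^{\pi} \,\big\|\, d^{D}\big)
\end{align*}
```

Under this reading, the divergence penalty against the offline data distribution supplies the pessimism, and taking the Lagrangian dual of the lower layer is what turns the problem into a sample-based objective over the offline dataset, as the abstract indicates.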
Related papers
- R3HF: Reward Redistribution for Enhancing Reinforcement Learning from Human Feedback [25.27230140274847]
Reinforcement learning from human feedback (RLHF) provides a paradigm for aligning large language models (LLMs) with human preferences.
This paper proposes a novel reward redistribution method called R3HF, which facilitates a more fine-grained, token-level reward allocation.
arXiv Detail & Related papers (2024-11-13T02:45:21Z)
- Reward-Augmented Data Enhances Direct Preference Alignment of LLMs [56.24431208419858]
We introduce reward-conditioned Large Language Models (LLMs) that learn from the entire spectrum of response quality within the dataset.
We propose an effective yet simple data relabeling method that conditions the preference pairs on quality scores to construct a reward-augmented dataset.
arXiv Detail & Related papers (2024-10-10T16:01:51Z)
- Self-supervised Preference Optimization: Enhance Your Language Model with Preference Degree Awareness [27.43137305486112]
We propose a novel Self-supervised Preference Optimization (SPO) framework, which constructs a self-supervised preference degree loss combined with the alignment loss.
The results demonstrate that SPO can be seamlessly integrated with existing preference optimization methods to achieve state-of-the-art performance.
arXiv Detail & Related papers (2024-09-26T12:37:26Z)
- ELO-Rated Sequence Rewards: Advancing Reinforcement Learning Models [3.8616427106430677]
Reinforcement Learning (RL) is highly dependent on the meticulous design of the reward function.
We propose a novel reward estimation algorithm: ELO-Rating based RL (ERRL).
arXiv Detail & Related papers (2024-09-05T07:14:03Z)
- Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences.
To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model.
Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
arXiv Detail & Related papers (2024-05-26T05:38:50Z)
- REBEL: A Regularization-Based Solution for Reward Overoptimization in Robotic Reinforcement Learning from Human Feedback [61.54791065013767]
A misalignment between the reward function and user intentions, values, or social norms can be catastrophic in the real world.
Current methods to mitigate this misalignment work by learning reward functions from human preferences.
We propose a novel concept of reward regularization within the robotic RLHF framework.
arXiv Detail & Related papers (2023-12-22T04:56:37Z)
- Behavior Alignment via Reward Function Optimization [23.92721220310242]
We introduce a new framework that integrates auxiliary rewards reflecting a designer's domain knowledge with the environment's primary rewards.
We evaluate our method's efficacy on a diverse set of tasks, from small-scale experiments to high-dimensional control challenges.
arXiv Detail & Related papers (2023-10-29T13:45:07Z)
- Provably Efficient Offline Reinforcement Learning with Trajectory-Wise Reward [66.81579829897392]
We propose a novel offline reinforcement learning algorithm called Pessimistic vAlue iteRaTion with rEward Decomposition (PARTED).
PARTED decomposes the trajectory return into per-step proxy rewards via least-squares-based reward redistribution (a minimal sketch of this step appears after this list), and then performs pessimistic value iteration based on the learned proxy rewards.
To the best of our knowledge, PARTED is the first offline RL algorithm that is provably efficient in general MDP with trajectory-wise reward.
arXiv Detail & Related papers (2022-06-13T19:11:22Z)
- Information Directed Reward Learning for Reinforcement Learning [64.33774245655401]
We learn a model of the reward function that allows standard RL algorithms to achieve high expected return with as few expert queries as possible.
In contrast to prior active reward learning methods designed for specific types of queries, IDRL naturally accommodates different query types.
We support our findings with extensive evaluations in multiple environments and with different types of queries.
arXiv Detail & Related papers (2021-02-24T18:46:42Z)
- Learning to Utilize Shaping Rewards: A New Approach of Reward Shaping [71.214923471669]
Reward shaping is an effective technique for incorporating domain knowledge into reinforcement learning (RL).
In this paper, we consider the problem of adaptively utilizing a given shaping reward function.
Experiments in sparse-reward cartpole and MuJoCo environments show that our algorithms can fully exploit beneficial shaping rewards.
arXiv Detail & Related papers (2020-11-05T05:34:14Z)
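The least-squares reward redistribution mentioned in the trajectory-wise-reward entry above can be illustrated with a short sketch. The linear proxy-reward model, the feature representation, and the function name below are assumptions made purely for illustration; the paper's actual model and its pessimistic value iteration step are not reproduced here.

```python
# Minimal sketch of least-squares reward redistribution (in the spirit of PARTED).
# A linear proxy reward r_hat(s, a) = phi(s, a) @ w is assumed purely for illustration.
import numpy as np

def redistribute_rewards(trajectories, returns):
    """Fit per-step proxy rewards so that they sum to each trajectory's return.

    trajectories: list of arrays of shape (T_i, d) holding per-step features phi(s_t, a_t)
    returns:      array of shape (n,) with the observed trajectory-wise returns
    """
    # Summing the features over each trajectory turns the constraint
    # "per-step proxy rewards sum to the trajectory return" into an
    # ordinary least-squares problem in the weight vector w.
    A = np.stack([traj.sum(axis=0) for traj in trajectories])      # shape (n, d)
    w, *_ = np.linalg.lstsq(A, np.asarray(returns), rcond=None)    # shape (d,)
    # Per-step proxy rewards for each trajectory.
    return [traj @ w for traj in trajectories]

# Toy usage: two random trajectories with 3-dimensional features.
trajs = [np.random.randn(5, 3), np.random.randn(7, 3)]
rets = np.array([1.0, -0.5])
per_step_rewards = redistribute_rewards(trajs, rets)
```

The redistributed per-step rewards would then feed the downstream offline RL stage; according to the summary above, PARTED pairs them with pessimistic value iteration.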
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.