Mind the Gap: Offline Policy Optimization for Imperfect Rewards
- URL: http://arxiv.org/abs/2302.01667v1
- Date: Fri, 3 Feb 2023 11:39:50 GMT
- Title: Mind the Gap: Offline Policy Optimization for Imperfect Rewards
- Authors: Jianxiong Li, Xiao Hu, Haoran Xu, Jingjing Liu, Xianyuan Zhan,
Qing-Shan Jia, Ya-Qin Zhang
- Abstract summary: We propose a unified offline policy optimization approach, RGM (Reward Gap Minimization), which can handle diverse types of imperfect rewards.
By exploiting the duality of the lower layer, we derive a tractable algorithm that enables sample-based learning without any online interactions.
- Score: 14.874900923808408
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The reward function is essential in reinforcement learning (RL), serving as
the guiding signal that incentivizes agents to solve given tasks; however, it is also
notoriously difficult to design. In many cases, only imperfect rewards are
available, which inflicts substantial performance loss on RL agents. In this
study, we propose a unified offline policy optimization approach, \textit{RGM
(Reward Gap Minimization)}, which can handle diverse types of imperfect
rewards. RGM is formulated as a bi-level optimization problem: the upper layer
optimizes a reward correction term that performs visitation distribution
matching w.r.t. some expert data; the lower layer solves a pessimistic RL
problem with the corrected rewards. By exploiting the duality of the lower
layer, we derive a tractable algorithm that enables sample-based learning
without any online interactions. Comprehensive experiments demonstrate that RGM
achieves performance superior to existing methods under diverse settings of
imperfect rewards. Furthermore, RGM can effectively correct rewards that are wrong
or inconsistent with expert preferences and retrieve useful information from
biased rewards.
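To make the bi-level formulation concrete, here is a minimal toy sketch on a small tabular MDP. It is an illustration under assumed names and a count-based pessimism penalty, not the paper's actual dual-form algorithm: the outer loop nudges a reward correction term toward expert visitation matching, and the inner solver runs pessimistic value iteration under the corrected reward.

```python
# Toy sketch of the bi-level structure described above (illustrative
# assumptions only; this is NOT the paper's dual-form RGM algorithm).
import numpy as np

def pessimistic_value_iteration(r, P, N, gamma=0.99, beta=1.0, iters=300):
    """Lower layer: pessimistic RL under the corrected reward r.
    P: transitions (S, A, S); N: offline dataset visit counts (S, A)."""
    penalty = beta / np.sqrt(N + 1.0)           # count-based pessimism
    q = np.zeros_like(r)
    for _ in range(iters):
        q = (r - penalty) + gamma * P @ q.max(axis=1)
    return q

def visitation(policy, P, d0, gamma=0.99, horizon=300):
    """Discounted state-action visitation of a deterministic policy."""
    n_s, n_a, _ = P.shape
    d_s, total = d0.copy(), np.zeros(n_s)
    for t in range(horizon):
        total += (1.0 - gamma) * gamma**t * d_s
        d_s = d_s @ P[np.arange(n_s), policy]   # next-state distribution
    d_sa = np.zeros((n_s, n_a))
    d_sa[np.arange(n_s), policy] = total
    return d_sa

def rgm_toy(r_hat, d_expert, P, N, d0, outer_iters=50, lr=0.5, gamma=0.99):
    """Upper layer: adjust a reward correction term `delta` so the lower-layer
    policy's visitation moves toward the expert visitation distribution."""
    delta = np.zeros_like(r_hat)
    for _ in range(outer_iters):
        q = pessimistic_value_iteration(r_hat + delta, P, N, gamma)
        policy = q.argmax(axis=1)
        delta += lr * (d_expert - visitation(policy, P, d0, gamma))
    return policy, delta

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_s, n_a = 4, 2
    P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))        # random MDP
    r_hat = rng.normal(size=(n_s, n_a))                      # imperfect reward
    N = rng.integers(1, 20, size=(n_s, n_a)).astype(float)   # dataset counts
    d_expert = rng.dirichlet(np.ones(n_s * n_a)).reshape(n_s, n_a)
    policy, delta = rgm_toy(r_hat, d_expert, P, N, np.ones(n_s) / n_s)
    print("greedy policy after correction:", policy)
```

RGM itself avoids this nested loop: by exploiting the duality of the lower-level problem, the paper derives a single tractable objective that can be optimized purely from samples, without online interaction.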
Related papers
- Process Reinforcement through Implicit Rewards [95.7442934212076]
Dense process rewards have proven a more effective alternative to the sparse outcome-level rewards in the inference-time scaling of large language models (LLMs)
Dense rewards also offer an appealing choice for the reinforcement learning (RL) of LLMs since their fine-grained rewards have the potential to address some inherent issues of outcome rewards.
In practice, however, dense rewards remain underexploited, primarily due to the challenges of training process reward models (PRMs) online, where collecting high-quality process labels is prohibitively expensive.
We propose PRIME, which enables online PRM updates using only policy rollouts and outcome labels through implicit process rewards.
arXiv Detail & Related papers (2025-02-03T15:43:48Z) - R3HF: Reward Redistribution for Enhancing Reinforcement Learning from Human Feedback [25.27230140274847]
Reinforcement learning from human feedback (RLHF) provides a paradigm for aligning large language models (LLMs) with human preferences.
This paper proposes a novel reward redistribution method called R3HF, which facilitates a more fine-grained, token-level reward allocation.
arXiv Detail & Related papers (2024-11-13T02:45:21Z) - Reward-Augmented Data Enhances Direct Preference Alignment of LLMs [63.32585910975191]
We introduce reward-conditioned Large Language Models (LLMs) that learn from the entire spectrum of response quality within the dataset.
We propose an effective yet simple data relabeling method that conditions the preference pairs on quality scores to construct a reward-augmented dataset.
arXiv Detail & Related papers (2024-10-10T16:01:51Z) - Self-supervised Preference Optimization: Enhance Your Language Model with Preference Degree Awareness [27.43137305486112]
We propose a novel Self-supervised Preference Optimization (SPO) framework, which constructs a self-supervised preference degree loss combined with the alignment loss.
The results demonstrate that SPO can be seamlessly integrated with existing preference optimization methods to achieve state-of-the-art performance.
arXiv Detail & Related papers (2024-09-26T12:37:26Z) - ELO-Rated Sequence Rewards: Advancing Reinforcement Learning Models [3.8616427106430677]
Reinforcement Learning (RL) is highly dependent on the meticulous design of the reward function.
We propose a novel reward estimation algorithm: ELO-Rating based RL (ERRL)
arXiv Detail & Related papers (2024-09-05T07:14:03Z) - REBEL: Reward Regularization-Based Approach for Robotic Reinforcement Learning from Human Feedback [61.54791065013767]
A misalignment between the reward function and human preferences can lead to catastrophic outcomes in the real world.
Recent methods aim to mitigate misalignment by learning reward functions from human preferences.
We propose a novel concept of reward regularization within the robotic RLHF framework.
arXiv Detail & Related papers (2023-12-22T04:56:37Z) - Behavior Alignment via Reward Function Optimization [23.92721220310242]
We introduce a new framework that integrates auxiliary rewards reflecting a designer's domain knowledge with the environment's primary rewards.
We evaluate our method's efficacy on a diverse set of tasks, from small-scale experiments to high-dimensional control challenges.
arXiv Detail & Related papers (2023-10-29T13:45:07Z) - Provably Efficient Offline Reinforcement Learning with Trajectory-Wise Reward [66.81579829897392]
We propose a novel offline reinforcement learning algorithm called Pessimistic vAlue iteRaTion with rEward Decomposition (PARTED)
PARTED decomposes the trajectory return into per-step proxy rewards via least-squares-based reward redistribution, and then performs pessimistic value iteration based on the learned proxy rewards (a minimal redistribution sketch appears after this list).
To the best of our knowledge, PARTED is the first offline RL algorithm that is provably efficient in general MDP with trajectory-wise reward.
arXiv Detail & Related papers (2022-06-13T19:11:22Z) - Information Directed Reward Learning for Reinforcement Learning [64.33774245655401]
We learn a model of the reward function that allows standard RL algorithms to achieve high expected return with as few expert queries as possible.
In contrast to prior active reward learning methods designed for specific types of queries, IDRL naturally accommodates different query types.
We support our findings with extensive evaluations in multiple environments and with different types of queries.
arXiv Detail & Related papers (2021-02-24T18:46:42Z) - Learning to Utilize Shaping Rewards: A New Approach of Reward Shaping [71.214923471669]
Reward shaping is an effective technique for incorporating domain knowledge into reinforcement learning (RL)
In this paper, we consider the problem of adaptively utilizing a given shaping reward function.
Experiments in sparse-reward cartpole and MuJoCo environments show that our algorithms can fully exploit beneficial shaping rewards.
arXiv Detail & Related papers (2020-11-05T05:34:14Z)
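The PARTED entry above mentions least-squares-based reward redistribution for trajectory-wise rewards. The sketch below illustrates that general idea with a hypothetical linear reward model over assumed per-step features; it is not PARTED's actual estimator.

```python
# Minimal sketch (assumed linear reward model and feature names): fit per-step
# proxy rewards so that their sums along each trajectory match the observed
# trajectory-level returns, via ordinary least squares.
import numpy as np

def redistribute_rewards(trajectories, returns):
    """trajectories: list of per-step feature arrays, each of shape (T_i, d);
    returns: array of trajectory-level returns, shape (n,).
    Solves min_w sum_i (phi_i . w - R_i)^2 with phi_i the summed features."""
    X = np.stack([traj.sum(axis=0) for traj in trajectories])   # (n, d)
    w, *_ = np.linalg.lstsq(X, returns, rcond=None)             # least-squares fit
    return [traj @ w for traj in trajectories]                  # per-step proxy rewards

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    true_w = rng.normal(size=5)
    trajs = [rng.normal(size=(rng.integers(3, 8), 5)) for _ in range(50)]
    R = np.array([(t @ true_w).sum() for t in trajs])           # trajectory-wise returns
    proxies = redistribute_rewards(trajs, R)
    print("first trajectory proxy rewards:", np.round(proxies[0], 3))
```

With a linear model, the redistribution reduces to a least-squares problem over trajectory-level feature sums; the recovered per-step rewards can then feed a pessimistic offline RL solver.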