Corruption Robust Offline Reinforcement Learning with Human Feedback
- URL: http://arxiv.org/abs/2402.06734v1
- Date: Fri, 9 Feb 2024 19:09:48 GMT
- Title: Corruption Robust Offline Reinforcement Learning with Human Feedback
- Authors: Debmalya Mandal, Andi Nika, Parameswaran Kamalaruban, Adish Singla,
and Goran Radanović
- Abstract summary: We study data corruption robustness for reinforcement learning with human feedback (RLHF) in an offline setting.
We aim to design algorithms that identify a near-optimal policy from the corrupted data, with provable guarantees.
- Score: 33.33154679893122
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study data corruption robustness for reinforcement learning with human
feedback (RLHF) in an offline setting. Given an offline dataset of pairs of
trajectories along with feedback about human preferences, an
$\varepsilon$-fraction of the pairs is corrupted (e.g., feedback flipped or
trajectory features manipulated), capturing an adversarial attack or noisy
human preferences. We aim to design algorithms that identify a near-optimal
policy from the corrupted data, with provable guarantees. Existing theoretical
works have separately studied the settings of corruption robust RL (learning
from scalar rewards directly under corruption) and offline RLHF (learning from
human feedback without corruption); however, they are inapplicable to our
problem of dealing with corrupted data in the offline RLHF setting. To this end, we
design novel corruption robust offline RLHF methods under various assumptions
on the coverage of the data-generating distributions. At a high level, our
methodology robustifies an offline RLHF framework by first learning a reward
model along with confidence sets and then learning a pessimistic optimal policy
over the confidence set. Our key insight is that learning the optimal policy can be
done by leveraging an offline corruption-robust RL oracle in different ways
(e.g., zero-order oracle or first-order oracle), depending on the data coverage
assumptions. To our knowledge, ours is the first work that provides provable
corruption robust offline RLHF methods.
Related papers
- Uncertainty-based Offline Variational Bayesian Reinforcement Learning for Robustness under Diverse Data Corruptions [8.666879925570331]
Real-world offline datasets are often subject to data corruptions due to sensor failures or malicious attacks.
Existing methods struggle to learn robust agents under high uncertainty caused by corrupted data.
We propose a novel robust variational Bayesian inference approach for offline RL (TRACER).
arXiv Detail & Related papers (2024-11-01T09:28:24Z) - Optimal Design for Reward Modeling in RLHF [83.3614658277817]
We formalize the reward-model training problem in Reinforcement Learning from Human Feedback.
We frame the selection of an effective dataset as a simple regret minimization task.
We derive bounds on the simple regret under appropriate assumptions.
arXiv Detail & Related papers (2024-10-22T14:36:44Z) - Robust Reinforcement Learning from Corrupted Human Feedback [86.17030012828003]
Reinforcement learning from human feedback (RLHF) provides a principled framework for aligning AI systems with human preference data.
We propose a robust RLHF approach -- $R3M$, which models potentially corrupted preference labels as sparse outliers (a minimal, hypothetical sketch of this idea appears after the list below).
Our experiments on robotic control and natural language generation with large language models (LLMs) show that $R3M$ improves the robustness of the learned reward against several types of perturbations to the preference data.
arXiv Detail & Related papers (2024-06-21T18:06:30Z) - Online Bandit Learning with Offline Preference Data [15.799929216215672]
We propose a posterior sampling algorithm for online learning that can be warm-started with an offline dataset with noisy preference feedback.
We show that by modeling the 'competence' of the expert that generated it, we are able to use such a dataset most effectively.
arXiv Detail & Related papers (2024-06-13T20:25:52Z) - Value-Incentivized Preference Optimization: A Unified Approach to Online and Offline RLHF [80.32171988565999]
We introduce a unified approach to online and offline RLHF -- value-incentivized preference optimization (VPO).
VPO regularizes the maximum-likelihood estimate of the reward function with the corresponding value function.
Experiments on text summarization and dialog verify the practicality and effectiveness of VPO.
arXiv Detail & Related papers (2024-05-29T17:51:42Z) - Towards Robust Model-Based Reinforcement Learning Against Adversarial Corruption [60.958746600254884]
This study tackles the challenges of adversarial corruption in model-based reinforcement learning (RL).
We introduce an algorithm called corruption-robust optimistic MLE (CR-OMLE), which leverages total-variation (TV)-based information ratios as uncertainty weights for MLE.
We extend our weighting technique to the offline setting, and propose an algorithm named corruption-robust pessimistic MLE (CR-PMLE).
arXiv Detail & Related papers (2024-02-14T07:27:30Z) - Corruption-Robust Offline Reinforcement Learning with General Function
Approximation [60.91257031278004]
We investigate the problem of corruption in offline reinforcement learning (RL) with general function approximation.
Our goal is to find a policy that is robust to such corruption and minimizes the suboptimality gap with respect to the optimal policy for the uncorrupted Markov decision process (MDP).
arXiv Detail & Related papers (2023-10-23T04:07:26Z) - Corruption-Robust Offline Reinforcement Learning [19.300465320692066]
We study adversarial robustness in offline reinforcement learning.
We show that a worst-case $\Omega(d\varepsilon)$ optimality gap is unavoidable.
We propose robust variants of the Least-Squares Value Iteration (LSVI) algorithm.
arXiv Detail & Related papers (2021-06-11T22:41:53Z)