Corruption Robust Offline Reinforcement Learning with Human Feedback
- URL: http://arxiv.org/abs/2402.06734v1
- Date: Fri, 9 Feb 2024 19:09:48 GMT
- Title: Corruption Robust Offline Reinforcement Learning with Human Feedback
- Authors: Debmalya Mandal, Andi Nika, Parameswaran Kamalaruban, Adish Singla,
and Goran Radanović
- Abstract summary: We study data corruption robustness for reinforcement learning with human feedback (RLHF) in an offline setting.
We aim to design algorithms that identify a near-optimal policy from the corrupted data, with provable guarantees.
- Score: 33.33154679893122
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study data corruption robustness for reinforcement learning with human
feedback (RLHF) in an offline setting. Given an offline dataset of pairs of
trajectories along with feedback about human preferences, an
$\varepsilon$-fraction of the pairs is corrupted (e.g., feedback flipped or
trajectory features manipulated), capturing an adversarial attack or noisy
human preferences. We aim to design algorithms that identify a near-optimal
policy from the corrupted data, with provable guarantees. Existing theoretical
works have separately studied the settings of corruption robust RL (learning
from scalar rewards directly under corruption) and offline RLHF (learning from
human feedback without corruption); however, they are inapplicable to our
problem of dealing with corrupted data in the offline RLHF setting. To this end, we
design novel corruption robust offline RLHF methods under various assumptions
on the coverage of the data-generating distributions. At a high level, our
methodology robustifies an offline RLHF framework by first learning a reward
model along with confidence sets and then learning a pessimistic optimal policy
over the confidence set. Our key insight is that learning the optimal policy can be
done by leveraging an offline corruption-robust RL oracle in different ways
(e.g., zero-order oracle or first-order oracle), depending on the data coverage
assumptions. To our knowledge, ours is the first work that provides provable
corruption robust offline RLHF methods.
Related papers
- Uncertainty-based Offline Variational Bayesian Reinforcement Learning for Robustness under Diverse Data Corruptions [8.666879925570331]
Real-world offline datasets are often subject to data corruptions due to sensor failures or malicious attacks.
Existing methods struggle to learn robust agents under high uncertainty caused by corrupted data.
We propose a novel robust variational Bayesian inference approach for offline RL (TRACER).
arXiv Detail & Related papers (2024-11-01T09:28:24Z) - Optimal Design for Reward Modeling in RLHF [83.3614658277817]
We formalize the reward-model training problem in Reinforcement Learning from Human Feedback.
We frame the selection of an effective dataset as a simple regret minimization task.
We derive bounds on the simple regret under appropriate assumptions.
arXiv Detail & Related papers (2024-10-22T14:36:44Z) - Robust Reinforcement Learning from Corrupted Human Feedback [86.17030012828003]
Reinforcement learning from human feedback (RLHF) provides a principled framework for aligning AI systems with human preference data.
We propose a robust RLHF approach -- $R3M$, which models potentially corrupted preference labels as sparse outliers (a minimal, hypothetical sketch of this idea appears after the list below).
Our experiments on robotic control and natural language generation with large language models (LLMs) show that $R3M$ improves the robustness of the learned reward against several types of perturbations to the preference data.
arXiv Detail & Related papers (2024-06-21T18:06:30Z) - Online Bandit Learning with Offline Preference Data [15.799929216215672]
We propose a posterior sampling algorithm for online learning that can be warm-started with an offline dataset with noisy preference feedback.
We show that by modeling the 'competence' of the expert that generated it, we are able to use such a dataset most effectively.
arXiv Detail & Related papers (2024-06-13T20:25:52Z) - Value-Incentivized Preference Optimization: A Unified Approach to Online and Offline RLHF [80.32171988565999]
We introduce a unified approach to online and offline RLHF -- value-incentivized preference optimization (VPO).
VPO regularizes the maximum-likelihood estimate of the reward function with the corresponding value function.
Experiments on text summarization and dialog verify the practicality and effectiveness of VPO.
arXiv Detail & Related papers (2024-05-29T17:51:42Z) - Towards Robust Model-Based Reinforcement Learning Against Adversarial Corruption [60.958746600254884]
This study tackles the challenges of adversarial corruption in model-based reinforcement learning (RL).
We introduce an algorithm called corruption-robust optimistic MLE (CR-OMLE), which leverages total-variation (TV)-based information ratios as uncertainty weights for MLE.
We extend our weighting technique to the offline setting, and propose an algorithm named corruption-robust pessimistic MLE (CR-PMLE).
arXiv Detail & Related papers (2024-02-14T07:27:30Z) - Corruption-Robust Offline Reinforcement Learning with General Function
Approximation [60.91257031278004]
We investigate the problem of corruption in offline reinforcement learning (RL) with general function approximation.
Our goal is to find a policy that is robust to such corruption and minimizes the suboptimality gap with respect to the optimal policy for the uncorrupted Markov decision process (MDP).
arXiv Detail & Related papers (2023-10-23T04:07:26Z) - Corruption-Robust Offline Reinforcement Learning [19.300465320692066]
We study adversarial robustness in offline reinforcement learning.
We show that a worst-case $\Omega(d\varepsilon)$ optimality gap is unavoidable.
We propose robust variants of the Least-Squares Value Iteration (LSVI) algorithm.
arXiv Detail & Related papers (2021-06-11T22:41:53Z)