Corruption-Robust Offline Reinforcement Learning
- URL: http://arxiv.org/abs/2106.06630v1
- Date: Fri, 11 Jun 2021 22:41:53 GMT
- Title: Corruption-Robust Offline Reinforcement Learning
- Authors: Xuezhou Zhang, Yiding Chen, Jerry Zhu, Wen Sun
- Abstract summary: We study adversarial robustness in offline reinforcement learning.
We show that a worst-case $\Omega(d\epsilon)$ optimality gap is unavoidable.
We propose robust variants of the Least-Square Value Iteration (LSVI) algorithm.
- Score: 19.300465320692066
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study the adversarial robustness in offline reinforcement learning. Given
a batch dataset consisting of tuples $(s, a, r, s')$, an adversary is allowed
to arbitrarily modify $\epsilon$ fraction of the tuples. From the corrupted
dataset the learner aims to robustly identify a near-optimal policy. We first
show that a worst-case $\Omega(d\epsilon)$ optimality gap is unavoidable in
linear MDP of dimension $d$, even if the adversary only corrupts the reward
element in a tuple. This contrasts with dimension-free results in robust
supervised learning and the best-known lower bound in the online RL setting with
corruption. Next, we propose robust variants of the Least-Square Value
Iteration (LSVI) algorithm utilizing robust supervised learning oracles, which
achieve near-matching performances in cases both with and without full data
coverage. The algorithm requires the knowledge of $\epsilon$ to design the
pessimism bonus in the no-coverage case. Surprisingly, in this case, the
knowledge of $\epsilon$ is necessary, as we show that being adaptive to unknown
$\epsilon$ is impossible. This again contrasts with recent results on
corruption-robust online RL and implies that robust offline RL is a strictly
harder problem.
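To make the algorithmic idea concrete, below is a minimal sketch of a pessimistic, corruption-robust LSVI-style procedure. It is written under assumptions: a trimmed-least-squares fit stands in for the paper's robust supervised learning oracle, and the feature map `phi`, the bonus coefficients `beta` and `c_eps`, and the dataset layout are illustrative rather than the paper's exact construction. Note how the pessimism bonus explicitly uses $\epsilon$, mirroring the abstract's point that knowledge of $\epsilon$ is needed in the no-coverage case.
```python
import numpy as np

def trimmed_least_squares(Phi, y, eps, lam=1.0, n_iter=5):
    """Robust regression oracle (illustrative stand-in): iteratively refit
    ridge regression after discarding the eps-fraction of samples with the
    largest residuals, so a few corrupted tuples cannot dominate the fit."""
    n, d = Phi.shape
    keep = np.arange(n)
    theta = np.zeros(d)
    for _ in range(n_iter):
        A = Phi[keep].T @ Phi[keep] + lam * np.eye(d)
        theta = np.linalg.solve(A, Phi[keep].T @ y[keep])
        resid = np.abs(Phi @ theta - y)
        keep = np.argsort(resid)[: max(1, int(np.ceil((1.0 - eps) * n)))]
    return theta

def robust_pessimistic_lsvi(dataset, phi, n_actions, H, eps,
                            lam=1.0, beta=1.0, c_eps=1.0):
    """Backward value iteration with a robust regression oracle and a
    pessimism bonus.  dataset[h] holds arrays (S, A, R, S2) of transitions
    observed at step h; phi(s, a) maps a state-action pair to R^d.  The
    bonus adds a term proportional to eps on top of the usual elliptical
    width, reflecting that designing the bonus requires knowing eps."""
    d = phi(dataset[0][0][0], 0).shape[0]
    thetas = [np.zeros(d) for _ in range(H + 1)]
    covs = [np.eye(d) for _ in range(H + 1)]

    def q_value(h, s, a):
        x = phi(s, a)
        width = np.sqrt(x @ np.linalg.solve(covs[h], x))
        return max(float(x @ thetas[h]) - beta * width - c_eps * eps, 0.0)

    for h in range(H - 1, -1, -1):
        S, A, R, S2 = dataset[h]
        # Regression target: observed reward plus pessimistic next-step value.
        v_next = np.array([max(q_value(h + 1, s2, a) for a in range(n_actions))
                           for s2 in S2])
        y = R + v_next
        Phi = np.stack([phi(s, a) for s, a in zip(S, A)])
        thetas[h] = trimmed_least_squares(Phi, y, eps, lam)
        covs[h] = Phi.T @ Phi + lam * np.eye(d)

    def policy(h, s):
        # Greedy action under the pessimistic Q-estimate at step h.
        return max(range(n_actions), key=lambda a: q_value(h, s, a))

    return policy
```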
Related papers
- Misspecified $Q$-Learning with Sparse Linear Function Approximation: Tight Bounds on Approximation Error [25.777423855881878]
We give a novel elimination-based algorithm showing that one can obtain an $O\left(H\epsilon\right)$-optimal policy.
We complement our upper bound with an $\widetilde{\Omega}\left(H\epsilon\right)$-optimality lower bound, giving a complete picture of this problem.
arXiv Detail & Related papers (2024-07-18T15:58:04Z) - Robust Reinforcement Learning from Corrupted Human Feedback [86.17030012828003]
Reinforcement learning from human feedback (RLHF) provides a principled framework for aligning AI systems with human preference data.
We propose a robust RLHF approach -- $R^3M$, which models the potentially corrupted preference label as sparse outliers.
Our experiments on robotic control and natural language generation with large language models (LLMs) show that $R3M$ improves robustness of the reward against several types of perturbations to the preference data.
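As a rough illustration of the sparse-outlier idea (a sketch under assumptions, not that paper's exact objective), a Bradley-Terry preference loss can be given a per-comparison offset with an L1 penalty, so that a few flipped labels are absorbed by the offsets instead of distorting the reward model:
```python
import numpy as np

def sparse_outlier_preference_loss(reward_margin, labels, delta, lam):
    """Illustrative sketch: Bradley-Terry log-loss on corrected logits.
    reward_margin[i] = r(prompt_i, chosen_i) - r(prompt_i, rejected_i);
    delta[i] is a per-comparison offset meant to absorb a corrupted label,
    kept sparse by the L1 penalty weighted by lam."""
    z = reward_margin + delta
    p = 1.0 / (1.0 + np.exp(-z))            # prob. the labeled winner is preferred
    nll = -(labels * np.log(p + 1e-12) + (1 - labels) * np.log(1 - p + 1e-12))
    return nll.mean() + lam * np.abs(delta).mean()
```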
arXiv Detail & Related papers (2024-06-21T18:06:30Z) - Towards Robust Model-Based Reinforcement Learning Against Adversarial Corruption [60.958746600254884]
This study tackles the challenges of adversarial corruption in model-based reinforcement learning (RL).
We introduce an algorithm called corruption-robust optimistic MLE (CR-OMLE), which leverages total-variation (TV)-based information ratios as uncertainty weights for MLE.
We extend our weighting technique to the offline setting, and propose an algorithm named corruption-robust pessimistic MLE (CR-PMLE).
arXiv Detail & Related papers (2024-02-14T07:27:30Z) - Corruption-Robust Offline Reinforcement Learning with General Function Approximation [60.91257031278004]
We investigate the problem of corruption in offline reinforcement learning (RL) with general function approximation.
Our goal is to find a policy that is robust to such corruption and minimizes the suboptimality gap with respect to the optimal policy for the uncorrupted Markov decision processes (MDPs).
arXiv Detail & Related papers (2023-10-23T04:07:26Z) - A Robust Phased Elimination Algorithm for Corruption-Tolerant Gaussian Process Bandits [118.22458816174144]
We propose a novel robust elimination-type algorithm that runs in epochs, combines exploration with infrequent switching to select a small subset of actions, and plays each action for multiple time instants.
Our algorithm, GP Robust Phased Elimination (RGP-PE), successfully balances robustness to corruptions with exploration and exploitation.
We perform the first empirical study of robustness in the corrupted GP bandit setting, and show that our algorithm is robust against a variety of adversarial attacks.
arXiv Detail & Related papers (2022-02-03T21:19:36Z) - Linear Contextual Bandits with Adversarial Corruptions [91.38793800392108]
We study the linear contextual bandit problem in the presence of adversarial corruption.
We present a variance-aware algorithm that is adaptive to the level of adversarial contamination $C$.
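A recurring ingredient in this line of work is to down-weight observations whose features carry large uncertainty, so that a bounded amount of corruption has bounded influence on the estimator. The snippet below is a hedged sketch of such uncertainty-weighted ridge regression; the threshold `alpha` and the exact weighting rule are illustrative and not necessarily this paper's algorithm.
```python
import numpy as np

def uncertainty_weighted_ridge(Phi, rewards, alpha, lam=1.0):
    """Illustrative sketch: sequential ridge regression where each sample's
    weight is capped by alpha over its elliptical (uncertainty) width, so a
    few corrupted rounds cannot dominate the parameter estimate."""
    n, d = Phi.shape
    Sigma = lam * np.eye(d)
    b = np.zeros(d)
    for i in range(n):
        x, r = Phi[i], rewards[i]
        width = np.sqrt(x @ np.linalg.solve(Sigma, x))  # uncertainty of this sample
        w = min(1.0, alpha / max(width, 1e-12))         # cap influence of uncertain samples
        Sigma += w * np.outer(x, x)
        b += w * r * x
    return np.linalg.solve(Sigma, b)
```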
arXiv Detail & Related papers (2021-10-25T02:53:24Z) - Online Apprenticeship Learning [58.45089581278177]
In Apprenticeship Learning (AL), we are given a Markov Decision Process (MDP) without access to the cost function.
The goal is to find a policy that matches the expert's performance on some predefined set of cost functions.
We show that the OAL problem can be effectively solved by combining two mirror descent based no-regret algorithms.
arXiv Detail & Related papers (2021-02-13T12:57:51Z) - Robust Policy Gradient against Strong Data Corruption [30.910088777897045]
We study the problem of robust reinforcement learning under adversarial corruption on both rewards and transitions.
Our attack model assumes an adaptive adversary who can arbitrarily corrupt the reward and transition at every step within an episode.
We develop a Filtered Policy Gradient algorithm that can tolerate reward corruption and can find an $O(\epsilon^{1/4})$-optimal policy.
arXiv Detail & Related papers (2021-02-11T01:48:38Z) - Towards Deep Learning Models Resistant to Large Perturbations [0.0]
Adversarial robustness has proven to be a required property of machine learning algorithms.
We show that the well-established algorithm called "adversarial training" fails to train a deep neural network given a large, but reasonable, perturbation magnitude.
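For context, "adversarial training" here refers to the standard Madry-style procedure of training on worst-case perturbed inputs. Below is a minimal sketch of one such training step with an $\ell_\infty$ PGD attack; the hyperparameters are illustrative, and the cited paper's point is that this procedure breaks down when the perturbation budget `eps` is made large.
```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps, alpha, steps):
    """Projected gradient descent attack within the l_inf ball of radius eps."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv.detach()

def adversarial_training_step(model, optimizer, x, y,
                              eps=8 / 255, alpha=2 / 255, steps=10):
    """One minibatch step of adversarial training on PGD-perturbed inputs."""
    x_adv = pgd_attack(model, x, y, eps, alpha, steps)
    model.train()
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```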
arXiv Detail & Related papers (2020-03-30T12:03:09Z)