Robust Policy Gradient against Strong Data Corruption
- URL: http://arxiv.org/abs/2102.05800v1
- Date: Thu, 11 Feb 2021 01:48:38 GMT
- Title: Robust Policy Gradient against Strong Data Corruption
- Authors: Xuezhou Zhang, Yiding Chen, Xiaojin Zhu and Wen Sun
- Abstract summary: We study the problem of robust reinforcement learning under adversarial corruption on both rewards and transitions.
Our attack model assumes an adaptive adversary who can arbitrarily corrupt the reward and transition at every step within an episode.
We develop a Filtered Policy Gradient (FPG) algorithm that can tolerate even unbounded reward corruption and can find an $O(\epsilon^{1/4})$-optimal policy.
- Score: 30.910088777897045
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study the problem of robust reinforcement learning under adversarial
corruption on both rewards and transitions. Our attack model assumes an
\textit{adaptive} adversary who can arbitrarily corrupt the reward and
transition at every step within an episode, for at most $\epsilon$-fraction of
the learning episodes. Our attack model is strictly stronger than those
considered in prior works. Our first result shows that no algorithm can find a
better than $O(\epsilon)$-optimal policy under our attack model. Next, we show
that surprisingly the natural policy gradient (NPG) method retains a natural
robustness property if the reward corruption is bounded, and can find an
$O(\sqrt{\epsilon})$-optimal policy. Consequently, we develop a Filtered Policy
Gradient (FPG) algorithm that can tolerate even unbounded reward corruption and
can find an $O(\epsilon^{1/4})$-optimal policy. We emphasize that FPG is the
first that can achieve a meaningful learning guarantee when a constant fraction
of episodes are corrupted. Complementary to the theoretical results, we show
that a neural implementation of FPG achieves strong robust learning performance
on the MuJoCo continuous control benchmarks.
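Below is a minimal sketch of the filtering idea behind a filtered policy gradient step, assuming a REINFORCE-style gradient estimator in plain NumPy. The median/MAD outlier rule, the `threshold_mult` parameter, and the `grad_log_prob` helper are illustrative assumptions, not the paper's exact FPG construction.

```python
import numpy as np

def filtered_policy_gradient_step(episodes, grad_log_prob, threshold_mult=3.0):
    """Illustrative sketch (not the paper's exact FPG): drop episodes whose
    return looks like an outlier before averaging a REINFORCE-style gradient.

    episodes: list of (episode_return, trajectory) pairs
    grad_log_prob: callable mapping a trajectory to its score-function gradient
    """
    returns = np.array([ret for ret, _ in episodes])
    med = np.median(returns)
    mad = np.median(np.abs(returns - med)) + 1e-8  # robust scale estimate

    grads = []
    for ret, traj in episodes:
        # Filter: skip episodes whose return is implausibly far from the
        # median, which is where unbounded reward corruption would show up.
        if np.abs(ret - med) > threshold_mult * mad:
            continue
        grads.append(ret * grad_log_prob(traj))

    if not grads:
        return None  # everything was filtered; skip this update
    return np.mean(grads, axis=0)
```

The reason for thresholding on robust statistics (median and MAD rather than mean and standard deviation) is that a corrupted $\epsilon$-fraction of episodes cannot drag the threshold itself arbitrarily far.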
Related papers
- Uncertainty-Aware Reward-Free Exploration with General Function Approximation [69.27868448449755]
In this paper, we propose a reward-free reinforcement learning algorithm called GFA-RFE.
The key idea behind our algorithm is an uncertainty-aware intrinsic reward for exploring the environment.
Experiment results show that GFA-RFE outperforms or is comparable to the performance of state-of-the-art unsupervised RL algorithms.
arXiv Detail & Related papers (2024-06-24T01:37:18Z) - Convergence of a model-free entropy-regularized inverse reinforcement learning algorithm [6.481009996429766]
Inverse reinforcement learning (IRL) aims to recover a reward for which the expert is optimal.
This work proposes a model-free algorithm to solve the entropy-regularized IRL problem.
arXiv Detail & Related papers (2024-03-25T14:54:42Z) - Corruption-Robust Offline Reinforcement Learning with General Function Approximation [60.91257031278004]
We investigate the problem of corruption in offline reinforcement learning (RL) with general function approximation.
Our goal is to find a policy that is robust to such corruption and minimizes the suboptimality gap with respect to the optimal policy for the uncorrupted Markov decision processes (MDPs).
arXiv Detail & Related papers (2023-10-23T04:07:26Z) - Best of Both Worlds Policy Optimization [33.13041034490332]
We show that by properly designing the regularizer, the exploration bonus and the learning rates, one can achieve a more favorable $\mathrm{polylog}(T)$ regret when the losses are stochastic.
This is the first time a gap-dependent $\mathrm{polylog}(T)$ regret bound is shown for policy optimization.
arXiv Detail & Related papers (2023-02-18T19:46:11Z) - Corruption-Robust Algorithms with Uncertainty Weighting for Nonlinear Contextual Bandits and Markov Decision Processes [59.61248760134937]
We propose an efficient algorithm to achieve a regret of $\tilde{O}(\sqrt{T}+\zeta)$.
The proposed algorithm relies on the recently developed uncertainty-weighted least-squares regression from linear contextual bandits (a minimal sketch of this weighting idea appears after the related-papers list).
We generalize our algorithm to the episodic MDP setting and first achieve an additive dependence on the corruption level $\zeta$.
arXiv Detail & Related papers (2022-12-12T15:04:56Z) - A Robust Phased Elimination Algorithm for Corruption-Tolerant Gaussian Process Bandits [118.22458816174144]
We propose a novel robust elimination-type algorithm that runs in epochs, combines exploration with infrequent switching to select a small subset of actions, and plays each action for multiple time instants.
Our algorithm, GP Robust Phased Elimination (RGP-PE), successfully balances robustness to corruptions with exploration and exploitation.
We perform the first empirical study of robustness in the corrupted GP bandit setting, and show that our algorithm is robust against a variety of adversarial attacks.
arXiv Detail & Related papers (2022-02-03T21:19:36Z) - Linear Contextual Bandits with Adversarial Corruptions [91.38793800392108]
We study the linear contextual bandit problem in the presence of adversarial corruption.
We present a variance-aware algorithm that is adaptive to the level of adversarial contamination $C$.
arXiv Detail & Related papers (2021-10-25T02:53:24Z) - Corruption-Robust Offline Reinforcement Learning [19.300465320692066]
We study adversarial robustness in offline reinforcement learning.
We show that a worst-case $\Omega(d\epsilon)$ optimality gap is unavoidable.
We propose robust variants of the Least-Squares Value Iteration (LSVI) algorithm (one way to robustify the regression step is sketched after the related-papers list).
arXiv Detail & Related papers (2021-06-11T22:41:53Z) - Adaptive Reward-Free Exploration [48.98199700043158]
We show that our reward-free UCRL algorithm can be seen as a variant of an algorithm of Fiechter from 1994.
We further investigate the relative complexities of reward-free exploration and best-policy identification.
arXiv Detail & Related papers (2020-06-11T09:58:03Z)
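The two sketches below illustrate, under stated assumptions, techniques named in the related-papers list above; they are plain NumPy/SciPy illustrations, not the cited papers' actual algorithms.

First, a sketch of the uncertainty-weighted least-squares idea referenced in the entry on corruption-robust algorithms for nonlinear contextual bandits and MDPs: a sample is down-weighted when its feature direction is still poorly covered by the running covariance, so a few corrupted targets have limited influence on the estimate. The weight rule $\min\{1, \alpha/\lVert x\rVert_{\Sigma^{-1}}\}$ and the sequential covariance update are illustrative assumptions.

```python
import numpy as np

def uncertainty_weighted_ridge(X, y, lam=1.0, alpha=1.0):
    """Hedged sketch of uncertainty-weighted ridge regression for y ~ X @ theta.

    Samples whose feature direction is poorly covered (high uncertainty) are
    down-weighted, limiting how far a few corrupted labels can move theta.
    """
    n, d = X.shape
    Sigma = lam * np.eye(d)              # running regularized covariance
    weights = np.zeros(n)
    for i, x in enumerate(X):
        # Elliptical-norm "uncertainty" of x under the current covariance.
        u = np.sqrt(x @ np.linalg.solve(Sigma, x))
        weights[i] = min(1.0, alpha / max(u, 1e-12))  # down-weight uncertain samples
        Sigma += weights[i] * np.outer(x, x)
    W = np.diag(weights)
    theta = np.linalg.solve(X.T @ W @ X + lam * np.eye(d), X.T @ W @ y)
    return theta, weights
```

Second, one way to robustify the regression step inside Least-Squares Value Iteration, as referenced in the corruption-robust offline RL entry: replace the squared loss in the Bellman regression with a Huber loss so that corrupted targets have bounded influence. The Huber substitution and the `delta` threshold are illustrative choices, not necessarily the paper's construction.

```python
import numpy as np
from scipy.optimize import minimize

def huber_lsvi_backup(Phi, targets, lam=1.0, delta=1.0):
    """One Bellman backup using a Huber-loss regression instead of plain
    least squares. Phi: (n, d) features; targets: (n,) values r + V(s')."""
    d = Phi.shape[1]

    def objective(w):
        resid = Phi @ w - targets
        # Quadratic near zero, linear in the tails: bounded influence of outliers.
        huber = np.where(np.abs(resid) <= delta,
                         0.5 * resid ** 2,
                         delta * (np.abs(resid) - 0.5 * delta))
        return huber.sum() + 0.5 * lam * w @ w

    return minimize(objective, np.zeros(d), method="L-BFGS-B").x
```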