Robust Policy Gradient against Strong Data Corruption
- URL: http://arxiv.org/abs/2102.05800v1
- Date: Thu, 11 Feb 2021 01:48:38 GMT
- Title: Robust Policy Gradient against Strong Data Corruption
- Authors: Xuezhou Zhang, Yiding Chen, Xiaojin Zhu and Wen Sun
- Abstract summary: We study the problem of robust reinforcement learning under adversarial corruption on both rewards and transitions.
Our attack model assumes an adaptive adversary who can arbitrarily corrupt the reward and transition at every step within an episode.
We develop a Filtered Policy Gradient (FPG) algorithm that can tolerate even unbounded reward corruption and can find an $O(\epsilon^{1/4})$-optimal policy.
- Score: 30.910088777897045
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study the problem of robust reinforcement learning under adversarial
corruption on both rewards and transitions. Our attack model assumes an
\textit{adaptive} adversary who can arbitrarily corrupt the reward and
transition at every step within an episode, for at most $\epsilon$-fraction of
the learning episodes. Our attack model is strictly stronger than those
considered in prior works. Our first result shows that no algorithm can find a
better than $O(\epsilon)$-optimal policy under our attack model. Next, we show
that surprisingly the natural policy gradient (NPG) method retains a natural
robustness property if the reward corruption is bounded, and can find an
$O(\sqrt{\epsilon})$-optimal policy. Consequently, we develop a Filtered Policy
Gradient (FPG) algorithm that can tolerate even unbounded reward corruption and
can find an $O(\epsilon^{1/4})$-optimal policy. We emphasize that FPG is the
first that can achieve a meaningful learning guarantee when a constant fraction
of episodes are corrupted. Complementary to the theoretical results, we show
that a neural implementation of FPG achieves strong robust learning performance
on the MuJoCo continuous control benchmarks.
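Below is a minimal sketch of the filtering idea behind a filtered policy gradient step, assuming a REINFORCE-style gradient estimator in plain NumPy. The median/MAD outlier rule, the `threshold_mult` parameter, and the `grad_log_prob` helper are illustrative assumptions, not the paper's exact FPG construction.

```python
import numpy as np

def filtered_policy_gradient_step(episodes, grad_log_prob, threshold_mult=3.0):
    """Illustrative sketch (not the paper's exact FPG): drop episodes whose
    return looks like an outlier before averaging a REINFORCE-style gradient.

    episodes: list of (episode_return, trajectory) pairs
    grad_log_prob: callable mapping a trajectory to its score-function gradient
    """
    returns = np.array([ret for ret, _ in episodes])
    med = np.median(returns)
    mad = np.median(np.abs(returns - med)) + 1e-8  # robust scale estimate

    grads = []
    for ret, traj in episodes:
        # Filter: skip episodes whose return is implausibly far from the
        # median, which is where unbounded reward corruption would show up.
        if np.abs(ret - med) > threshold_mult * mad:
            continue
        grads.append(ret * grad_log_prob(traj))

    if not grads:
        return None  # everything was filtered; skip this update
    return np.mean(grads, axis=0)
```

The reason for thresholding on robust statistics (median and MAD rather than mean and standard deviation) is that a corrupted $\epsilon$-fraction of episodes cannot drag the threshold itself arbitrarily far.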
Related papers
- Uncertainty-Aware Reward-Free Exploration with General Function Approximation [69.27868448449755]
In this paper, we propose a reward-free reinforcement learning algorithm called GFA-RFE.
The key idea behind our algorithm is an uncertainty-aware intrinsic reward for exploring the environment.
Experiment results show that GFA-RFE outperforms or is comparable to the performance of state-of-the-art unsupervised RL algorithms.
arXiv Detail & Related papers (2024-06-24T01:37:18Z) - Convergence of a model-free entropy-regularized inverse reinforcement learning algorithm [6.481009996429766]
Inverse reinforcement learning (IRL) aims to recover a reward for which the expert is optimal.
This work proposes a model-free algorithm to solve the entropy-regularized IRL problem.
arXiv Detail & Related papers (2024-03-25T14:54:42Z) - Corruption-Robust Offline Reinforcement Learning with General Function Approximation [60.91257031278004]
We investigate the problem of corruption in offline reinforcement learning (RL) with general function approximation.
Our goal is to find a policy that is robust to such corruption and minimizes the suboptimality gap with respect to the optimal policy for the uncorrupted Markov decision processes (MDPs).
arXiv Detail & Related papers (2023-10-23T04:07:26Z) - Best of Both Worlds Policy Optimization [33.13041034490332]
We show that by properly designing the regularizer, the exploration bonus and the learning rates, one can achieve a more favorable $\mathrm{polylog}(T)$ regret when the losses are stochastic.
This is the first time a gap-dependent $\mathrm{polylog}(T)$ regret bound is shown for policy optimization.
arXiv Detail & Related papers (2023-02-18T19:46:11Z) - Corruption-Robust Algorithms with Uncertainty Weighting for Nonlinear Contextual Bandits and Markov Decision Processes [59.61248760134937]
We propose an efficient algorithm to achieve a regret of $\tilde{O}(\sqrt{T}+\zeta)$.
The proposed algorithm relies on the recently developed uncertainty-weighted least-squares regression from linear contextual bandits (a minimal sketch of this weighting idea appears after the related-papers list).
We generalize our algorithm to the episodic MDP setting and first achieve an additive dependence on the corruption level $\zeta$.
arXiv Detail & Related papers (2022-12-12T15:04:56Z) - A Robust Phased Elimination Algorithm for Corruption-Tolerant Gaussian Process Bandits [118.22458816174144]
We propose a novel robust elimination-type algorithm that runs in epochs, combines exploration with infrequent switching to select a small subset of actions, and plays each action for multiple time instants.
Our algorithm, GP Robust Phased Elimination (RGP-PE), successfully balances robustness to corruptions with exploration and exploitation.
We perform the first empirical study of robustness in the corrupted GP bandit setting, and show that our algorithm is robust against a variety of adversarial attacks.
arXiv Detail & Related papers (2022-02-03T21:19:36Z) - Linear Contextual Bandits with Adversarial Corruptions [91.38793800392108]
We study the linear contextual bandit problem in the presence of adversarial corruption.
We present a variance-aware algorithm that is adaptive to the level of adversarial contamination $C$.
arXiv Detail & Related papers (2021-10-25T02:53:24Z) - Corruption-Robust Offline Reinforcement Learning [19.300465320692066]
We study adversarial robustness in offline reinforcement learning.
We show that a worst-case $\Omega(d\epsilon)$ optimality gap is unavoidable.
We propose robust variants of the Least-Squares Value Iteration (LSVI) algorithm (one way to robustify the regression step is sketched after the related-papers list).
arXiv Detail & Related papers (2021-06-11T22:41:53Z) - Adaptive Reward-Free Exploration [48.98199700043158]
We show that our reward-free UCRL algorithm can be seen as a variant of an algorithm of Fiechter from 1994.
We further investigate the relative complexities of reward-free exploration and best-policy identification.
arXiv Detail & Related papers (2020-06-11T09:58:03Z)
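The two sketches below illustrate, under stated assumptions, techniques named in the related-papers list above; they are plain NumPy/SciPy illustrations, not the cited papers' actual algorithms.

First, a sketch of the uncertainty-weighted least-squares idea referenced in the entry on corruption-robust algorithms for nonlinear contextual bandits and MDPs: a sample is down-weighted when its feature direction is still poorly covered by the running covariance, so a few corrupted targets have limited influence on the estimate. The weight rule $\min\{1, \alpha/\lVert x\rVert_{\Sigma^{-1}}\}$ and the sequential covariance update are illustrative assumptions.

```python
import numpy as np

def uncertainty_weighted_ridge(X, y, lam=1.0, alpha=1.0):
    """Hedged sketch of uncertainty-weighted ridge regression for y ~ X @ theta.

    Samples whose feature direction is poorly covered (high uncertainty) are
    down-weighted, limiting how far a few corrupted labels can move theta.
    """
    n, d = X.shape
    Sigma = lam * np.eye(d)              # running regularized covariance
    weights = np.zeros(n)
    for i, x in enumerate(X):
        # Elliptical-norm "uncertainty" of x under the current covariance.
        u = np.sqrt(x @ np.linalg.solve(Sigma, x))
        weights[i] = min(1.0, alpha / max(u, 1e-12))  # down-weight uncertain samples
        Sigma += weights[i] * np.outer(x, x)
    W = np.diag(weights)
    theta = np.linalg.solve(X.T @ W @ X + lam * np.eye(d), X.T @ W @ y)
    return theta, weights
```

Second, one way to robustify the regression step inside Least-Squares Value Iteration, as referenced in the corruption-robust offline RL entry: replace the squared loss in the Bellman regression with a Huber loss so that corrupted targets have bounded influence. The Huber substitution and the `delta` threshold are illustrative choices, not necessarily the paper's construction.

```python
import numpy as np
from scipy.optimize import minimize

def huber_lsvi_backup(Phi, targets, lam=1.0, delta=1.0):
    """One Bellman backup using a Huber-loss regression instead of plain
    least squares. Phi: (n, d) features; targets: (n,) values r + V(s')."""
    d = Phi.shape[1]

    def objective(w):
        resid = Phi @ w - targets
        # Quadratic near zero, linear in the tails: bounded influence of outliers.
        huber = np.where(np.abs(resid) <= delta,
                         0.5 * resid ** 2,
                         delta * (np.abs(resid) - 0.5 * delta))
        return huber.sum() + 0.5 * lam * w @ w

    return minimize(objective, np.zeros(d), method="L-BFGS-B").x
```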