Corruption-Robust Offline Reinforcement Learning with General Function
Approximation
- URL: http://arxiv.org/abs/2310.14550v3
- Date: Mon, 19 Feb 2024 06:03:26 GMT
- Title: Corruption-Robust Offline Reinforcement Learning with General Function
Approximation
- Authors: Chenlu Ye, Rui Yang, Quanquan Gu, Tong Zhang
- Abstract summary: We investigate the problem of corruption in offline reinforcement learning (RL) with general function approximation.
Our goal is to find a policy that is robust to such corruption and minimizes the suboptimality gap with respect to the optimal policy for the uncorrupted Markov decision processes (MDPs)
- Score: 60.91257031278004
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We investigate the problem of corruption robustness in offline reinforcement
learning (RL) with general function approximation, where an adversary can
corrupt each sample in the offline dataset, and the corruption level
$\zeta\geq0$ quantifies the cumulative corruption amount over $n$ episodes and
$H$ steps. Our goal is to find a policy that is robust to such corruption and
minimizes the suboptimality gap with respect to the optimal policy for the
uncorrupted Markov decision processes (MDPs). Drawing inspiration from the
uncertainty-weighting technique from the robust online RL setting
\citep{he2022nearly,ye2022corruptionrobust}, we design a new uncertainty weight
iteration procedure that efficiently computes the weights on batched samples, and propose a
corruption-robust algorithm for offline RL. Notably, under the assumption of
single policy coverage and the knowledge of $\zeta$, our proposed algorithm
achieves a suboptimality bound that is worsened by an additive factor of
$\mathcal{O}(\zeta (C(\widehat{\mathcal{F}},\mu)n)^{-1})$ due to the
corruption. Here $\widehat{\mathcal{F}}$ is the confidence set constructed from the dataset
$\mathcal{Z}_n^H$, and $C(\widehat{\mathcal{F}},\mu)$ is a coefficient that
depends on $\widehat{\mathcal{F}}$ and the underlying data distribution $\mu$.
When specialized to linear MDPs, the corruption-dependent error term reduces to
$\mathcal{O}(\zeta d n^{-1})$ with $d$ being the dimension of the feature map,
which matches the existing lower bound for corrupted linear MDPs. This suggests
that our analysis is tight in terms of the corruption-dependent term.
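To make the uncertainty-weighting idea concrete, below is a minimal sketch assuming a linear function class with feature matrix `phi`; the fixed-point weight iteration, the parameters `alpha`, `lam`, `n_iters`, and the function names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def uncertainty_weight_iteration(phi, alpha=1.0, lam=1.0, n_iters=10):
    """Fixed-point iteration for per-sample uncertainty weights on a batch.

    phi   : (n, d) feature vectors of the batched samples.
    alpha : threshold controlling how strongly uncertain samples are
            down-weighted (an illustrative tuning parameter).
    Returns an array sigma of shape (n,) with sigma_i >= 1.
    """
    n, d = phi.shape
    sigma = np.ones(n)
    for _ in range(n_iters):
        # Weighted covariance Lambda = lam*I + sum_i phi_i phi_i^T / sigma_i^2.
        Lam = lam * np.eye(d) + (phi / sigma[:, None] ** 2).T @ phi
        Lam_inv = np.linalg.inv(Lam)
        # Per-sample uncertainty ||phi_i||_{Lambda^{-1}}.
        bonus = np.sqrt(np.einsum("ij,jk,ik->i", phi, Lam_inv, phi))
        # Highly uncertain samples (where corruption can hurt most) get
        # weights above 1 and are thus down-weighted in the regression.
        sigma = np.maximum(1.0, bonus / alpha)
    return sigma

def weighted_ridge_regression(phi, y, sigma, lam=1.0):
    """Uncertainty-weighted least squares: fit targets y with each sample
    scaled by 1/sigma_i^2, limiting the influence of suspicious samples."""
    d = phi.shape[1]
    Lam = lam * np.eye(d) + (phi / sigma[:, None] ** 2).T @ phi
    b = (phi / sigma[:, None] ** 2).T @ y
    return np.linalg.solve(Lam, b)
```

In an offline algorithm, such weighted estimates would typically be combined with a pessimistic penalty when selecting the policy; per the abstract, with single-policy coverage and known $\zeta$, the corruption then costs only the additive $\mathcal{O}(\zeta (C(\widehat{\mathcal{F}},\mu)n)^{-1})$ term.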
Related papers
- Robust Distribution Learning with Local and Global Adversarial Corruptions [17.22168727622332]
We develop an efficient finite-sample algorithm with error bounded by $\sqrt{\varepsilon k} + \rho + \tilde{O}(d\sqrt{k}n^{-1/(k \lor 2)})$ when $P$ has bounded covariance.
Our efficient procedure relies on a novel trace norm approximation of an ideal yet intractable 2-Wasserstein projection estimator.
arXiv Detail & Related papers (2024-06-10T17:48:36Z)
- Towards Robust Model-Based Reinforcement Learning Against Adversarial Corruption [60.958746600254884]
This study tackles the challenges of adversarial corruption in model-based reinforcement learning (RL)
We introduce an algorithm called corruption-robust optimistic MLE (CR-OMLE), which leverages total-variation (TV)-based information ratios as uncertainty weights for MLE.
We extend our weighting technique to the offline setting, and propose an algorithm named corruption-robust pessimistic MLE (CR-PMLE)
arXiv Detail & Related papers (2024-02-14T07:27:30Z)
- Corruption-Robust Algorithms with Uncertainty Weighting for Nonlinear Contextual Bandits and Markov Decision Processes [59.61248760134937]
We propose an efficient algorithm to achieve a regret of $\tilde{O}(\sqrt{T}+\zeta)$.
The proposed algorithm relies on the recently developed uncertainty-weighted least-squares regression from linear contextual bandits.
We generalize our algorithm to the episodic MDP setting and first achieve an additive dependence on the corruption level $\zeta$.
arXiv Detail & Related papers (2022-12-12T15:04:56Z)
- Near Sample-Optimal Reduction-based Policy Learning for Average Reward MDP [58.13930707612128]
This work considers the sample complexity of obtaining an $\varepsilon$-optimal policy in an average reward Markov Decision Process (AMDP)
We prove an upper bound of $\widetilde{O}(H \varepsilon^{-3} \ln \frac{1}{\delta})$ samples per state-action pair, where $H := \mathrm{sp}(h^*)$ is the span of bias of any optimal policy, $\varepsilon$ is the accuracy and $\delta$ is the failure probability.
arXiv Detail & Related papers (2022-12-01T15:57:58Z)
- Best Policy Identification in Linear MDPs [70.57916977441262]
We investigate the problem of best policy identification in discounted linear Markov Decision Processes (MDPs) in the fixed confidence setting under a generative model.
The lower bound, obtained as the solution of an intricate non-convex optimization program, can be used as the starting point to devise such algorithms.
arXiv Detail & Related papers (2022-08-11T04:12:50Z)
- Linear Contextual Bandits with Adversarial Corruptions [91.38793800392108]
We study the linear contextual bandit problem in the presence of adversarial corruption.
We present a variance-aware algorithm that is adaptive to the level of adversarial contamination $C$.
arXiv Detail & Related papers (2021-10-25T02:53:24Z)
- A Model Selection Approach for Corruption Robust Reinforcement Learning [33.39130388569606]
We develop a model selection approach to tackle reinforcement learning with adversarial corruption in both transition and reward.
Our algorithm achieves a regret bound of $\widetilde{\mathcal{O}}(\min\{\frac{1}{\Delta}, \sqrt{T}\}+C)$ where $T$ is the number of episodes, $C$ is the total amount of corruption, and $\Delta$ is the reward gap between the best and the second-best policy.
arXiv Detail & Related papers (2021-10-07T15:59:01Z)
- Corruption-Robust Offline Reinforcement Learning [19.300465320692066]
We study adversarial robustness in offline reinforcement learning.
We show that a worst-case $\Omega(d\epsilon)$ optimality gap is unavoidable.
We propose robust variants of the Least-Squares Value Iteration (LSVI) algorithm; a rough sketch of the pessimistic-LSVI idea appears after this list.
arXiv Detail & Related papers (2021-06-11T22:41:53Z)
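Complementing the LSVI-based entry above and the linear-MDP specialization in the abstract, the following is a minimal sketch of a pessimistic least-squares value iteration backward pass; the data layout, the bonus coefficient `beta`, and all names are hypothetical, and the cited robust variants additionally reweight or enlarge the penalty to handle corrupted samples.

```python
import numpy as np

def pessimistic_lsvi(dataset, H, d, beta=1.0, lam=1.0):
    """Sketch of a pessimistic (lower-confidence-bound) LSVI backward pass
    for a linear MDP.  dataset[h] is assumed to hold, for step h:
      "phi"      : (n, d)    features of the logged state-action pairs,
      "reward"   : (n,)      logged rewards,
      "next_phi" : (n, A, d) features of every action at the logged next states.
    """
    w = np.zeros((H + 1, d))                        # weights of Q_h, with Q_H = 0
    Lam_inv = np.stack([np.eye(d) / lam] * (H + 1))

    for h in range(H - 1, -1, -1):
        phi = dataset[h]["phi"]
        reward = dataset[h]["reward"]
        next_phi = dataset[h]["next_phi"]

        # Pessimistic next-step value: V_{h+1}(s') = max_a [Q_{h+1} - beta * bonus].
        q_next = next_phi @ w[h + 1]                                      # (n, A)
        bonus = beta * np.sqrt(
            np.einsum("nad,de,nae->na", next_phi, Lam_inv[h + 1], next_phi)
        )
        v_next = np.clip(q_next - bonus, 0.0, None).max(axis=1)           # (n,)

        # Ridge regression of the Bellman targets onto the step-h features.
        Lam = lam * np.eye(d) + phi.T @ phi
        Lam_inv[h] = np.linalg.inv(Lam)
        w[h] = Lam_inv[h] @ (phi.T @ (reward + v_next))

    return w, Lam_inv
```

At deployment one would act greedily with respect to the resulting pessimistic estimates $\phi(s,a)^\top w_h - \beta\,\|\phi(s,a)\|_{\Lambda_h^{-1}}$.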