A Model Selection Approach for Corruption Robust Reinforcement Learning
- URL: http://arxiv.org/abs/2110.03580v1
- Date: Thu, 7 Oct 2021 15:59:01 GMT
- Title: A Model Selection Approach for Corruption Robust Reinforcement Learning
- Authors: Chen-Yu Wei, Christoph Dann, Julian Zimmert
- Abstract summary: We develop a model selection approach to tackle reinforcement learning with adversarial corruption in both transition and reward.
Our algorithm achieves a regret bound of $\widetilde{\mathcal{O}}(\min\{\frac{1}{\Delta}, \sqrt{T}\}+C)$ where $T$ is the number of episodes, $C$ is the total amount of corruption, and $\Delta$ is the reward gap between the best and the second-best policy.
- Score: 33.39130388569606
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: We develop a model selection approach to tackle reinforcement learning with
adversarial corruption in both transition and reward. For finite-horizon
tabular MDPs, without prior knowledge on the total amount of corruption, our
algorithm achieves a regret bound of
$\widetilde{\mathcal{O}}(\min\{\frac{1}{\Delta}, \sqrt{T}\}+C)$ where $T$ is
the number of episodes, $C$ is the total amount of corruption, and $\Delta$ is
the reward gap between the best and the second-best policy. This is the first
worst-case optimal bound achieved without knowledge of $C$, improving previous
results of Lykouris et al. (2021); Chen et al. (2021); Wu et al. (2021). For
finite-horizon linear MDPs, we develop a computationally efficient algorithm
with a regret bound of $\widetilde{\mathcal{O}}(\sqrt{(1+C)T})$, and another
computationally inefficient one with $\widetilde{\mathcal{O}}(\sqrt{T}+C)$,
improving the result of Lykouris et al. (2021) and answering an open question
by Zhang et al. (2021b). Finally, our model selection framework can be easily
applied to other settings including linear bandits, linear contextual bandits,
and MDPs with general function approximation, leading to several improved or
new results.
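For readability, the regret bound stated in the abstract can be unpacked into its two regimes (this is a restatement of the bound above, not an additional result from the paper):
$$
\widetilde{\mathcal{O}}\!\left(\min\!\left\{\frac{1}{\Delta},\,\sqrt{T}\right\}+C\right)
=
\begin{cases}
\widetilde{\mathcal{O}}\!\left(\frac{1}{\Delta}+C\right) & \text{if } \frac{1}{\Delta}\le\sqrt{T} \quad \text{(gap-dependent regime)},\\[4pt]
\widetilde{\mathcal{O}}\!\left(\sqrt{T}+C\right) & \text{otherwise} \quad \text{(worst-case regime)},
\end{cases}
$$
where $T$ is the number of episodes, $C$ the total corruption, and $\Delta$ the gap between the best and second-best policy; per the abstract, the bound is attained without prior knowledge of $C$.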
Related papers
- Corruption-Robust Offline Reinforcement Learning with General Function
Approximation [60.91257031278004]
We investigate the problem of corruption in offline reinforcement learning (RL) with general function approximation.
Our goal is to find a policy that is robust to such corruption and minimizes the suboptimality gap with respect to the optimal policy for the uncorrupted Markov decision processes (MDPs).
arXiv Detail & Related papers (2023-10-23T04:07:26Z) - Towards Optimal Regret in Adversarial Linear MDPs with Bandit Feedback [30.337826346496385]
We study online reinforcement learning in linear Markov decision processes with adversarial losses and bandit feedback.
We introduce two algorithms that achieve improved regret performance compared to existing approaches.
arXiv Detail & Related papers (2023-10-17T19:43:37Z) - A General Framework for Sequential Decision-Making under Adaptivity
Constraints [112.79808969568253]
We study general sequential decision-making under two adaptivity constraints: rare policy switch and batch learning.
For the rare policy switch constraint, we provide an algorithm that achieves a $\widetilde{\mathcal{O}}(\sqrt{K}+K/B)$ regret, where $B$ is the number of batches.
For the batch learning constraint, we provide an algorithm that likewise achieves a $\widetilde{\mathcal{O}}(\sqrt{K}+K/B)$ regret, where $B$ is the number of batches.
arXiv Detail & Related papers (2023-06-26T07:20:25Z) - Tight Bounds for $\gamma$-Regret via the Decision-Estimation Coefficient [88.86699022151598]
We give a statistical characterization of the $\gamma$-regret for arbitrary structured bandit problems.
The $\gamma$-regret emerges in structured bandit problems over a function class $\mathcal{F}$ where finding an exact optimum of $f \in \mathcal{F}$ is intractable.
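For context, one common formalization of the $\gamma$-regret (a sketch of the standard definition, not quoted from this paper) is
$$
\mathrm{Reg}_{\gamma}(T) \;=\; \sum_{t=1}^{T} \Bigl( \gamma \cdot \max_{x \in \mathcal{X}} f(x) \;-\; f(x_t) \Bigr),
$$
i.e., the learner only competes with a $\gamma$-fraction ($0 < \gamma \le 1$) of the optimal value, which is natural when exactly maximizing $f \in \mathcal{F}$ is computationally intractable.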
arXiv Detail & Related papers (2023-03-06T17:54:33Z) - Refined Regret for Adversarial MDPs with Linear Function Approximation [50.00022394876222]
We consider learning in an adversarial Markov Decision Process (MDP) where the loss functions can change arbitrarily over $K$ episodes.
This paper provides two algorithms that improve over the best existing $\tilde{\mathcal{O}}(K^{2/3})$ regret in the same setting.
arXiv Detail & Related papers (2023-01-30T14:37:21Z) - Corruption-Robust Algorithms with Uncertainty Weighting for Nonlinear
Contextual Bandits and Markov Decision Processes [59.61248760134937]
We propose an efficient algorithm to achieve a regret of $\tilde{O}(\sqrt{T}+\zeta)$.
The proposed algorithm relies on the recently developed uncertainty-weighted least-squares regression from linear contextual bandits.
We generalize our algorithm to the episodic MDP setting and, for the first time, achieve an additive dependence on the corruption level $\zeta$.
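The summary above names "uncertainty-weighted least-squares regression". As an illustration of that generic idea only (a minimal sketch: the function name, the threshold parameter `alpha`, and the fixed number of reweighting passes are assumptions of this sketch, not details from the paper), one plausible form is:

```python
import numpy as np

def uncertainty_weighted_ridge(X, y, lam=1.0, alpha=1.0):
    """Schematic uncertainty-weighted ridge regression.

    Down-weights samples whose features have large uncertainty
    (elliptical norm under the current regularized covariance), so that
    a few corrupted observations cannot dominate the estimate.
    X: (n, d) feature matrix, y: (n,) targets.
    """
    n, d = X.shape
    weights = np.ones(n)
    theta = np.zeros(d)

    # A few reweighting passes: estimate -> uncertainty -> weights.
    for _ in range(3):
        # Weighted regularized covariance: lam*I + sum_i w_i x_i x_i^T.
        Sigma = lam * np.eye(d) + (X * weights[:, None]).T @ X
        Sigma_inv = np.linalg.inv(Sigma)
        # Elliptical-norm "uncertainty" of each sample: ||x_i||_{Sigma^{-1}}.
        unc = np.sqrt(np.einsum("ij,jk,ik->i", X, Sigma_inv, X))
        # Samples whose uncertainty exceeds the threshold alpha get shrunk.
        weights = np.minimum(1.0, alpha / np.maximum(unc, 1e-12))
        # Weighted ridge solution.
        theta = Sigma_inv @ ((X * weights[:, None]).T @ y)
    return theta, weights
```

Intuitively, observations that the current covariance deems highly uncertain, which is where adversarial corruption hurts the most, contribute less to the fit; this is loosely the intuition behind an additive rather than multiplicative dependence on the corruption level $\zeta$.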
arXiv Detail & Related papers (2022-12-12T15:04:56Z) - Revisiting Modified Greedy Algorithm for Monotone Submodular
Maximization with a Knapsack Constraint [75.85952446237599]
We show that a modified greedy algorithm can achieve an approximation factor of $0.305$.
We derive a data-dependent upper bound on the optimum.
It can also be used to significantly improve the efficiency of such algorithms as branch and bound.
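The "modified greedy algorithm" referenced above is, in its textbook form, density (gain-per-cost) greedy combined with the best single feasible element. A minimal sketch of that generic form follows (the function and parameter names are this sketch's own; the paper's refined analysis, the $0.305$ factor, and the data-dependent bound are not reproduced here):

```python
from typing import Callable, Dict, Hashable, Set

def modified_greedy(f: Callable[[Set], float],
                    cost: Dict[Hashable, float],
                    budget: float) -> Set:
    """Schematic modified greedy for monotone submodular maximization
    under a knapsack constraint: run density greedy within the budget,
    then return the better of that solution and the best single item."""
    items = set(cost)

    # Density greedy: repeatedly add the feasible item with the largest
    # marginal gain per unit cost.
    S, spent = set(), 0.0
    remaining = set(items)
    while True:
        best, best_ratio = None, 0.0
        for e in remaining:
            if spent + cost[e] > budget:
                continue
            gain = f(S | {e}) - f(S)
            ratio = gain / cost[e]
            if ratio > best_ratio:
                best, best_ratio = e, ratio
        if best is None:
            break
        S.add(best)
        spent += cost[best]
        remaining.remove(best)

    # Compare with the best feasible singleton.
    singles = [e for e in items if cost[e] <= budget]
    best_single = max(singles, key=lambda e: f({e}), default=None)
    if best_single is not None and f({best_single}) > f(S):
        return {best_single}
    return S
```

The paper's contribution, per the summary, is a sharper analysis of this kind of procedure and a data-dependent upper bound on the optimum, neither of which the sketch attempts to capture.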
arXiv Detail & Related papers (2020-08-12T15:40:21Z) - Learning Infinite-horizon Average-reward MDPs with Linear Function
Approximation [44.374427255708135]
We develop new algorithms for learning Markov Decision Processes in an infinite-horizon average-reward setting with linear function approximation.
Using the optimism principle and assuming that the MDP has a linear structure, we first propose a computationally inefficient algorithm with optimal $\widetilde{O}(\sqrt{T})$ regret.
Next, taking inspiration from adversarial linear bandits, we develop yet another efficient algorithm with $\widetilde{O}(\sqrt{T})$ regret.
arXiv Detail & Related papers (2020-07-23T08:23:44Z)