Corruption-robust exploration in episodic reinforcement learning
- URL: http://arxiv.org/abs/1911.08689v4
- Date: Wed, 1 Nov 2023 03:08:01 GMT
- Title: Corruption-robust exploration in episodic reinforcement learning
- Authors: Thodoris Lykouris, Max Simchowitz, Aleksandrs Slivkins, Wen Sun
- Abstract summary: We study multi-stage episodic reinforcement learning under adversarial corruptions in both the rewards and the transition probabilities of the underlying system.
Our framework yields efficient algorithms which attain near-optimal regret in the absence of corruptions.
Notably, our work provides the first sublinear regret guarantee which accommodates any deviation from purely i.i.d. transitions in the bandit-feedback model for episodic reinforcement learning.
- Score: 76.19192549843727
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We initiate the study of multi-stage episodic reinforcement learning under
adversarial corruptions in both the rewards and the transition probabilities of
the underlying system, extending recent results for the special case of
stochastic bandits. We provide a framework which modifies the aggressive
exploration enjoyed by existing reinforcement learning approaches based on
"optimism in the face of uncertainty", by complementing them with principles
from "action elimination". Importantly, our framework circumvents the major
challenges posed by naively applying action elimination in the RL setting, as
formalized by a lower bound we demonstrate. Our framework yields efficient
algorithms which (a) attain near-optimal regret in the absence of corruptions
and (b) adapt to unknown levels of corruption, enjoying regret guarantees which
degrade gracefully in the total corruption encountered. To showcase the
generality of our approach, we derive results for both tabular settings (where
states and actions are finite) as well as linear-function-approximation
settings (where the dynamics and rewards admit a linear underlying
representation). Notably, our work provides the first sublinear regret
guarantee which accommodates any deviation from purely i.i.d. transitions in
the bandit-feedback model for episodic reinforcement learning.
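As a rough illustration of how "optimism in the face of uncertainty" can be complemented with "action elimination", the sketch below runs optimistic tabular Q-learning on a toy episodic MDP while discarding actions whose upper confidence bound falls below the best lower confidence bound at that state and step. Everything in it (the toy environment, the Hoeffding-style bonus and its constants, the one-step backup) is an assumption made for illustration; it is not the algorithm analyzed in this paper, which additionally contends with corrupted rewards and transitions and with unknown corruption levels.
```python
# Illustrative sketch only: optimistic tabular Q-learning plus a simple
# action-elimination layer. Constants, environment, and update rule are
# assumptions, not the paper's algorithm.
import numpy as np

rng = np.random.default_rng(0)

# Toy episodic MDP: S states, A actions, H steps, random rewards/transitions.
S, A, H = 5, 3, 4
P = rng.dirichlet(np.ones(S), size=(H, S, A))   # transition kernel P[h, s, a]
R = rng.uniform(0, 1, size=(H, S, A))           # mean rewards in [0, 1]

K = 2000                                        # number of episodes
counts = np.ones((H, S, A))                     # visit counts (init 1 to avoid /0)
Q_hat = np.zeros((H, S, A))                     # empirical Q estimates
active = np.ones((H, S, A), dtype=bool)         # action sets surviving elimination

def bonus(n):
    # Hoeffding-style exploration bonus; the constant is an assumption.
    return H * np.sqrt(np.log(2 * S * A * H * K) / n)

for k in range(K):
    s = int(rng.integers(S))
    for h in range(H):
        ucb = Q_hat[h, s] + bonus(counts[h, s])
        lcb = Q_hat[h, s] - bonus(counts[h, s])
        # Action elimination: drop actions whose UCB is below the best LCB
        # among currently active actions (always keeps at least one action).
        best_lcb = lcb[active[h, s]].max()
        keep = active[h, s] & (ucb >= best_lcb)
        if keep.any():
            active[h, s] = keep
        # Optimistic (greedy-on-UCB) choice restricted to the active set.
        masked_ucb = np.where(active[h, s], ucb, -np.inf)
        a = int(masked_ucb.argmax())

        # Environment step (in the corrupted setting, the observed reward and
        # next state could be adversarially perturbed at this point).
        r = R[h, s, a] + rng.normal(0, 0.1)
        s_next = int(rng.choice(S, p=P[h, s, a]))

        # One-step optimistic backup with a simple 1/n learning rate.
        counts[h, s, a] += 1
        alpha = 1.0 / counts[h, s, a]
        if h == H - 1:
            v_next = 0.0
        else:
            nxt_ucb = Q_hat[h + 1, s_next] + bonus(counts[h + 1, s_next])
            v_next = np.where(active[h + 1, s_next], nxt_ucb, -np.inf).max()
        target = r + min(v_next, H)             # clip optimistic value at H
        Q_hat[h, s, a] += alpha * (target - Q_hat[h, s, a])
        s = s_next
```
The elimination step in this sketch is deliberately conservative: an action is dropped only once its optimistic value is dominated by another active action's pessimistic value, which is what keeps elimination compatible with optimism-driven exploration.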
Related papers
- PIPER: Primitive-Informed Preference-based Hierarchical Reinforcement Learning via Hindsight Relabeling [36.481053480535515]
We introduce PIPER: Primitive-Informed Preference-based Hierarchical reinforcement learning via Hindsight Relabeling.
Our relabeling-based approach is able to mitigate non-stationarity, which is common in existing hierarchical approaches.
In order to prevent infeasible subgoal prediction and avoid degenerate solutions, we propose primitive-informed regularization.
arXiv Detail & Related papers (2024-04-20T17:06:00Z)
- Reward Certification for Policy Smoothed Reinforcement Learning [14.804252729195513]
Reinforcement Learning (RL) has achieved remarkable success in safety-critical areas.
Recent studies have introduced "smoothed policies" in order to enhance its robustness.
It is still challenging to establish a provable guarantee to certify the bound of its total reward.
arXiv Detail & Related papers (2023-12-11T15:07:58Z)
- Enhancing Infrared Small Target Detection Robustness with Bi-Level Adversarial Framework [61.34862133870934]
We propose a bi-level adversarial framework to promote the robustness of detection in the presence of distinct corruptions.
Our scheme improves IoU by 21.96% across a wide array of corruptions and by 4.97% on the general benchmark.
arXiv Detail & Related papers (2023-09-03T06:35:07Z)
- Robust Reinforcement Learning with Distributional Risk-averse formulation [1.2891210250935146]
We approximate the Robust Reinforcement Learning constrained with a $\Phi$-divergence using an approximate Risk-Averse formulation.
We show that the classical Reinforcement Learning formulation can be robustified using standard deviation penalization of the objective.
arXiv Detail & Related papers (2022-06-14T13:33:58Z)
- Off-policy Reinforcement Learning with Optimistic Exploration and Distribution Correction [73.77593805292194]
We train a separate exploration policy to maximize an approximate upper confidence bound of the critics in an off-policy actor-critic framework.
To mitigate the off-policy-ness, we adapt the recently introduced DICE framework to learn a distribution correction ratio for off-policy actor-critic training.
arXiv Detail & Related papers (2021-10-22T22:07:51Z)
- Off-Policy Reinforcement Learning with Delayed Rewards [16.914712720033524]
In many real-world tasks, instant rewards are not readily accessible or defined immediately after the agent performs actions.
In this work, we first formally define the environment with delayed rewards and discuss the challenges raised due to the non-Markovian nature of such environments.
We introduce a general off-policy RL framework with a new Q-function formulation that can handle the delayed rewards with theoretical convergence guarantees.
arXiv Detail & Related papers (2021-06-22T15:19:48Z)
- Policy Smoothing for Provably Robust Reinforcement Learning [109.90239627115336]
We study the provable robustness of reinforcement learning against norm-bounded adversarial perturbations of the inputs.
We generate certificates that guarantee that the total reward obtained by the smoothed policy will not fall below a certain threshold under a norm-bounded adversarial perturbation of the input.
arXiv Detail & Related papers (2021-06-21T21:42:08Z)
- Learning Robust Feedback Policies from Demonstrations [9.34612743192798]
We propose and analyze a new framework to learn feedback control policies that exhibit provable guarantees on the closed-loop performance and robustness to bounded (adversarial) perturbations.
These policies are learned from expert demonstrations without any prior knowledge of the task, its cost function, and system dynamics.
arXiv Detail & Related papers (2021-03-30T19:11:05Z)
- State Augmented Constrained Reinforcement Learning: Overcoming the Limitations of Learning with Rewards [88.30521204048551]
A common formulation of constrained reinforcement learning involves multiple rewards that must individually accumulate to given thresholds.
We show a simple example in which the desired optimal policy cannot be induced by any weighted linear combination of rewards.
This work addresses this shortcoming by augmenting the state with Lagrange multipliers and reinterpreting primal-dual methods.
arXiv Detail & Related papers (2021-02-23T21:07:35Z)
- Prediction with Corrupted Expert Advice [67.67399390910381]
We prove that a variant of the classical Multiplicative Weights algorithm with decreasing step sizes achieves constant regret in a benign environment (see the sketch after this list).
Our results reveal a surprising disparity between the often comparable Follow the Regularized Leader (FTRL) and Online Mirror Descent (OMD) frameworks.
arXiv Detail & Related papers (2020-02-24T14:39:55Z)
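The last entry above refers to the classical multiplicative-weights (Hedge) update run with a decreasing step size. The snippet below is a minimal, hedged sketch of that style of update in the full-information experts setting; the step-size schedule eta_t = sqrt(log(N)/t), the uniform-random loss model, and the "benign" setup are assumptions made for illustration, and it does not reproduce that paper's specific variant, its treatment of corruptions, or its FTRL-versus-OMD analysis.
```python
# Illustrative sketch only: multiplicative weights / Hedge with a decreasing
# step size, full-information expert losses. Schedule and loss model are assumed.
import numpy as np

rng = np.random.default_rng(1)

N, T = 10, 5000                                 # experts, rounds
losses = rng.uniform(0, 1, size=(T, N))         # per-round expert losses in [0, 1]
losses[:, 0] *= 0.3                             # make expert 0 clearly best ("benign" case)

cum_loss = np.zeros(N)                          # cumulative loss of each expert
learner_loss = 0.0

for t in range(1, T + 1):
    eta_t = np.sqrt(np.log(N) / t)              # decreasing step size (assumed schedule)
    z = -eta_t * cum_loss                       # weights proportional to exp(-eta_t * cum_loss)
    p = np.exp(z - z.max())
    p /= p.sum()
    learner_loss += p @ losses[t - 1]           # expected loss of the learner this round
    cum_loss += losses[t - 1]                   # full-information feedback

regret = learner_loss - cum_loss.min()
print(f"regret vs. best expert after {T} rounds: {regret:.2f}")
```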
This list is automatically generated from the titles and abstracts of the papers on this site.