Conservative Exploration in Reinforcement Learning
- URL: http://arxiv.org/abs/2002.03218v2
- Date: Wed, 15 Jul 2020 12:51:27 GMT
- Title: Conservative Exploration in Reinforcement Learning
- Authors: Evrard Garcelon, Mohammad Ghavamzadeh, Alessandro Lazaric, Matteo
Pirotta
- Abstract summary: We introduce the notion of conservative exploration for average reward and finite horizon problems.
We present two optimistic algorithms that guarantee (w.h.p.) that the conservative constraint is never violated during learning.
- Score: 113.55554483194832
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While learning in an unknown Markov Decision Process (MDP), an agent should
trade off exploration to discover new information about the MDP, and
exploitation of the current knowledge to maximize the reward. Although the
agent will eventually learn a good or optimal policy, there is no guarantee on
the quality of the intermediate policies. This lack of control is undesired in
real-world applications where a minimum requirement is that the executed
policies are guaranteed to perform at least as well as an existing baseline. In
this paper, we introduce the notion of conservative exploration for average
reward and finite horizon problems. We present two optimistic algorithms that
guarantee (w.h.p.) that the conservative constraint is never violated during
learning. We derive regret bounds showing that being conservative does not
hinder the learning ability of these algorithms.
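The conservative constraint mentioned in the abstract is usually formalized as a per-episode budget condition relative to the baseline. The finite-horizon form below is a sketch using standard notation assumed here (policy $\pi_k$ played in episode $k$, baseline policy $\pi_b$, initial-state value $V_1^{\pi}(s_1)$, conservatism level $\alpha \in (0,1)$), not the paper's exact statement; it must hold with high probability at every episode $t$. A code-level sketch of the corresponding decision rule appears after the related-papers list.

```latex
% Sketch of the conservative exploration constraint (finite-horizon form).
% Notation assumed for illustration: \pi_k is the policy executed in episode k,
% \pi_b the baseline policy, V_1^{\pi}(s_1) the value of \pi from the initial
% state, and \alpha \in (0,1) the fraction of baseline performance that may be
% sacrificed. The condition must hold with high probability for every t.
\begin{equation*}
  \sum_{k=1}^{t} V_1^{\pi_k}(s_1) \;\geq\; (1-\alpha)\, t \, V_1^{\pi_b}(s_1),
  \qquad \forall t \geq 1 .
\end{equation*}
```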
Related papers
- Conservative Exploration for Policy Optimization via Off-Policy Policy
Evaluation [4.837737516460689]
We study the problem of conservative exploration, where the learner must guarantee that its performance is at least as good as that of a baseline policy.
We propose the first conservative provably efficient model-free algorithm for policy optimization in continuous finite-horizon problems.
arXiv Detail & Related papers (2023-12-24T10:59:32Z) - Mildly Conservative Q-Learning for Offline Reinforcement Learning [63.2183622958666]
Offline reinforcement learning (RL) is the task of learning from a static logged dataset without further interaction with the environment.
Existing approaches, which penalize unseen actions or regularize toward the behavior policy, are often too pessimistic.
We propose Mildly Conservative Q-learning (MCQ), in which out-of-distribution (OOD) actions are actively trained by assigning them proper pseudo Q-values.
arXiv Detail & Related papers (2022-06-09T19:44:35Z) - Safe Exploration for Constrained Reinforcement Learning with Provable
Guarantees [2.379828460137829]
We propose a model-based safe RL algorithm that we call the Optimistic-Pessimistic Safe Reinforcement Learning (OPSRL) algorithm.
We show that it achieves an $\tilde{\mathcal{O}}(S^2\sqrt{A H^7 K}/(\bar{C} - \bar{C}_b))$ cumulative regret without violating the safety constraints during learning.
arXiv Detail & Related papers (2021-12-01T23:21:48Z) - Beyond Value-Function Gaps: Improved Instance-Dependent Regret Bounds
for Episodic Reinforcement Learning [50.44564503645015]
We provide improved gap-dependent regret bounds for reinforcement learning in finite episodic Markov decision processes.
We prove tighter upper regret bounds for optimistic algorithms and accompany them with new information-theoretic lower bounds for a large class of MDPs.
arXiv Detail & Related papers (2021-07-02T20:36:05Z) - Safe Reinforcement Learning Using Advantage-Based Intervention [45.79740561754542]
Many sequential decision problems involve finding a policy that maximizes total reward while obeying safety constraints.
We propose a new algorithm, SAILR, that uses an intervention mechanism based on advantage functions to keep the agent safe throughout training.
Our method comes with strong guarantees on safety during both training and deployment.
arXiv Detail & Related papers (2021-06-16T20:28:56Z) - Revisiting Peng's Q($\lambda$) for Modern Reinforcement Learning [69.39357308375212]
Off-policy multi-step reinforcement learning algorithms divide into conservative and non-conservative families.
Recent studies have shown that non-conservative algorithms outperform conservative ones.
arXiv Detail & Related papers (2021-02-27T02:29:01Z) - Constrained Markov Decision Processes via Backward Value Functions [43.649330976089004]
We model the problem of learning with constraints as a Constrained Markov Decision Process.
A key contribution of our approach is to translate cumulative cost constraints into state-based constraints.
We provide theoretical guarantees under which the agent converges while ensuring safety over the course of training.
arXiv Detail & Related papers (2020-08-26T20:56:16Z) - Improved Algorithms for Conservative Exploration in Bandits [113.55554483194832]
We study the conservative learning problem in the contextual linear bandit setting and introduce a novel algorithm, the Conservative Constrained LinUCB (CLUCB2).
We derive regret bounds for CLUCB2 that match existing results and empirically show that it outperforms state-of-the-art conservative bandit algorithms in a number of synthetic and real-world problems.
arXiv Detail & Related papers (2020-02-08T19:35:01Z)
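Across these works, the conservative algorithms share a common mechanism: act on the optimistic policy only when a high-probability lower bound certifies that the baseline constraint remains satisfied, and otherwise fall back on the baseline. The Python sketch below illustrates that decision rule under the assumptions stated in the comments; it is not the exact procedure of any of the papers listed above.

```python
# Minimal sketch of the conservative decision rule shared by this family of
# algorithms (the finite-horizon algorithm in this paper, CLUCB2-style bandit
# variants, etc.). Function name, arguments, and bookkeeping are illustrative
# assumptions, not the authors' implementation.

def choose_policy(optimistic_policy, baseline_policy,
                  lcb_value, baseline_value, past_lcb_sum,
                  episode, alpha):
    """Return the policy to run in the current episode.

    lcb_value      -- high-probability lower bound on the value of the
                      optimistic policy in this episode
    baseline_value -- per-episode value of the baseline (known or lower-bounded)
    past_lcb_sum   -- sum of value lower bounds for policies already executed
    episode        -- index t of the current episode (1-based)
    alpha          -- fraction of baseline performance that may be sacrificed
    """
    # Budget imposed by the conservative constraint up to episode t:
    #   sum of executed-policy values >= (1 - alpha) * t * baseline_value.
    required = (1.0 - alpha) * episode * baseline_value
    # Act optimistically only if the constraint still holds in the worst case
    # consistent with the confidence bounds; otherwise fall back on the
    # baseline, which satisfies the constraint by definition.
    if past_lcb_sum + lcb_value >= required:
        return optimistic_policy
    return baseline_policy


# Toy usage: with a weak lower bound early on, the rule defers to the baseline.
if __name__ == "__main__":
    pick = choose_policy("pi_optimistic", "pi_baseline",
                         lcb_value=0.2, baseline_value=1.0,
                         past_lcb_sum=0.0, episode=1, alpha=0.1)
    print(pick)  # -> "pi_baseline" because 0.2 < 0.9
```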