Conservative Exploration in Reinforcement Learning
- URL: http://arxiv.org/abs/2002.03218v2
- Date: Wed, 15 Jul 2020 12:51:27 GMT
- Title: Conservative Exploration in Reinforcement Learning
- Authors: Evrard Garcelon, Mohammad Ghavamzadeh, Alessandro Lazaric, Matteo
Pirotta
- Abstract summary: We introduce the notion of conservative exploration for average reward and finite horizon problems.
We present two optimistic algorithms that guarantee (w.h.p.) that the conservative constraint is never violated during learning.
- Score: 113.55554483194832
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While learning in an unknown Markov Decision Process (MDP), an agent should
trade off exploration to discover new information about the MDP, and
exploitation of the current knowledge to maximize the reward. Although the
agent will eventually learn a good or optimal policy, there is no guarantee on
the quality of the intermediate policies. This lack of control is undesired in
real-world applications where a minimum requirement is that the executed
policies are guaranteed to perform at least as well as an existing baseline. In
this paper, we introduce the notion of conservative exploration for average
reward and finite horizon problems. We present two optimistic algorithms that
guarantee (w.h.p.) that the conservative constraint is never violated during
learning. We derive regret bounds showing that being conservative does not
hinder the learning ability of these algorithms.
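The conservative constraint mentioned in the abstract is usually formalized as a per-episode budget condition relative to the baseline. The finite-horizon form below is a sketch using standard notation assumed here (policy $\pi_k$ played in episode $k$, baseline policy $\pi_b$, initial-state value $V_1^{\pi}(s_1)$, conservatism level $\alpha \in (0,1)$), not the paper's exact statement; it must hold with high probability at every episode $t$. A code-level sketch of the corresponding decision rule appears after the related-papers list.

```latex
% Sketch of the conservative exploration constraint (finite-horizon form).
% Notation assumed for illustration: \pi_k is the policy executed in episode k,
% \pi_b the baseline policy, V_1^{\pi}(s_1) the value of \pi from the initial
% state, and \alpha \in (0,1) the fraction of baseline performance that may be
% sacrificed. The condition must hold with high probability for every t.
\begin{equation*}
  \sum_{k=1}^{t} V_1^{\pi_k}(s_1) \;\geq\; (1-\alpha)\, t \, V_1^{\pi_b}(s_1),
  \qquad \forall t \geq 1 .
\end{equation*}
```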
Related papers
- Conservative Exploration for Policy Optimization via Off-Policy Policy
Evaluation [4.837737516460689]
We study the problem of conservative exploration, where the learner must guarantee that its performance is at least as good as that of a baseline policy.
We propose the first conservative provably efficient model-free algorithm for policy optimization in continuous finite-horizon problems.
arXiv Detail & Related papers (2023-12-24T10:59:32Z) - Mildly Conservative Q-Learning for Offline Reinforcement Learning [63.2183622958666]
Offline reinforcement learning (RL) is the task of learning from a static logged dataset without further interaction with the environment.
Existing approaches, which penalize unseen actions or regularize toward the behavior policy, are often too pessimistic.
We propose Mildly Conservative Q-learning (MCQ), in which out-of-distribution (OOD) actions are actively trained by assigning them proper pseudo Q-values.
arXiv Detail & Related papers (2022-06-09T19:44:35Z) - Safe Exploration for Constrained Reinforcement Learning with Provable
Guarantees [2.379828460137829]
We propose a model-based safe RL algorithm that we call the Optimistic-Pessimistic Safe Reinforcement Learning (OPSRL) algorithm.
We show that it achieves an $\tilde{\mathcal{O}}(S^2\sqrt{A H^7 K}/(\bar{C} - \bar{C}_b))$ cumulative regret without violating the safety constraints during learning.
arXiv Detail & Related papers (2021-12-01T23:21:48Z) - Beyond Value-Function Gaps: Improved Instance-Dependent Regret Bounds
for Episodic Reinforcement Learning [50.44564503645015]
We provide improved gap-dependent regret bounds for reinforcement learning in finite episodic Markov decision processes.
We prove tighter upper regret bounds for optimistic algorithms and accompany them with new information-theoretic lower bounds for a large class of MDPs.
arXiv Detail & Related papers (2021-07-02T20:36:05Z) - Safe Reinforcement Learning Using Advantage-Based Intervention [45.79740561754542]
Many sequential decision problems involve finding a policy that maximizes total reward while obeying safety constraints.
We propose a new algorithm, SAILR, that uses an intervention mechanism based on advantage functions to keep the agent safe throughout training.
Our method comes with strong guarantees on safety during both training and deployment.
arXiv Detail & Related papers (2021-06-16T20:28:56Z) - Revisiting Peng's Q($\lambda$) for Modern Reinforcement Learning [69.39357308375212]
Off-policy multi-step reinforcement learning algorithms divide into conservative and non-conservative families.
Recent studies have shown that non-conservative algorithms outperform conservative ones.
arXiv Detail & Related papers (2021-02-27T02:29:01Z) - Constrained Markov Decision Processes via Backward Value Functions [43.649330976089004]
We model the problem of learning with constraints as a Constrained Markov Decision Process.
A key contribution of our approach is to translate cumulative cost constraints into state-based constraints.
We provide theoretical guarantees under which the agent converges while ensuring safety over the course of training.
arXiv Detail & Related papers (2020-08-26T20:56:16Z) - Improved Algorithms for Conservative Exploration in Bandits [113.55554483194832]
We study the conservative learning problem in the contextual linear bandit setting and introduce a novel algorithm, the Conservative Constrained LinUCB (CLUCB2).
We derive regret bounds for CLUCB2 that match existing results and empirically show that it outperforms state-of-the-art conservative bandit algorithms in a number of synthetic and real-world problems.
arXiv Detail & Related papers (2020-02-08T19:35:01Z)
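Across these works, the conservative algorithms share a common mechanism: act on the optimistic policy only when a high-probability lower bound certifies that the baseline constraint remains satisfied, and otherwise fall back on the baseline. The Python sketch below illustrates that decision rule under the assumptions stated in the comments; it is not the exact procedure of any of the papers listed above.

```python
# Minimal sketch of the conservative decision rule shared by this family of
# algorithms (the finite-horizon algorithm in this paper, CLUCB2-style bandit
# variants, etc.). Function name, arguments, and bookkeeping are illustrative
# assumptions, not the authors' implementation.

def choose_policy(optimistic_policy, baseline_policy,
                  lcb_value, baseline_value, past_lcb_sum,
                  episode, alpha):
    """Return the policy to run in the current episode.

    lcb_value      -- high-probability lower bound on the value of the
                      optimistic policy in this episode
    baseline_value -- per-episode value of the baseline (known or lower-bounded)
    past_lcb_sum   -- sum of value lower bounds for policies already executed
    episode        -- index t of the current episode (1-based)
    alpha          -- fraction of baseline performance that may be sacrificed
    """
    # Budget imposed by the conservative constraint up to episode t:
    #   sum of executed-policy values >= (1 - alpha) * t * baseline_value.
    required = (1.0 - alpha) * episode * baseline_value
    # Act optimistically only if the constraint still holds in the worst case
    # consistent with the confidence bounds; otherwise fall back on the
    # baseline, which satisfies the constraint by definition.
    if past_lcb_sum + lcb_value >= required:
        return optimistic_policy
    return baseline_policy


# Toy usage: with a weak lower bound early on, the rule defers to the baseline.
if __name__ == "__main__":
    pick = choose_policy("pi_optimistic", "pi_baseline",
                         lcb_value=0.2, baseline_value=1.0,
                         past_lcb_sum=0.0, episode=1, alpha=0.1)
    print(pick)  # -> "pi_baseline" because 0.2 < 0.9
```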