Conservative Exploration for Policy Optimization via Off-Policy Policy Evaluation
- URL: http://arxiv.org/abs/2312.15458v1
- Date: Sun, 24 Dec 2023 10:59:32 GMT
- Title: Conservative Exploration for Policy Optimization via Off-Policy Policy Evaluation
- Authors: Paul Daoudi, Mathias Formoso, Othman Gaizi, Achraf Azize, Evrard Garcelon
- Abstract summary: We study the problem of conservative exploration, where the learner must be able to guarantee that its performance is at least as good as a baseline policy.
We propose the first conservative provably efficient model-free algorithm for policy optimization in continuous finite-horizon problems.
- Score: 4.837737516460689
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A precondition for the deployment of a Reinforcement Learning agent to a
real-world system is to provide guarantees on the learning process. While a
learning algorithm will eventually converge to a good policy, there are no
guarantees on the performance of the exploratory policies. We study the problem
of conservative exploration, where the learner must be able to guarantee that its
performance is at least as good as a baseline policy. We propose
the first conservative provably efficient model-free algorithm for policy
optimization in continuous finite-horizon problems. We leverage importance
sampling techniques to counterfactually evaluate the conservative condition
from the data self-generated by the algorithm. We derive a regret bound and
show that (w.h.p.) the conservative constraint is never violated during
learning. Finally, we leverage these insights to build a general schema for
conservative exploration in DeepRL via off-policy policy evaluation techniques.
We show empirically the effectiveness of our methods.
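The following is a minimal sketch, not the paper's algorithm, of the core idea the abstract describes: estimate a candidate exploratory policy's value via per-trajectory importance sampling on data generated by previously executed policies, and only allow the candidate when a lower confidence bound on that estimate clears the baseline policy's value. All function names, the data layout, and the crude Hoeffding-style bound are illustrative assumptions; the paper derives its own, tighter guarantees.

```python
# Illustrative sketch of a conservative (importance-sampling) deployment check.
# Names and the concentration bound are assumptions, not the paper's construction.
import numpy as np


def is_estimates(trajectories, target_logp, behavior_logp):
    """Per-trajectory importance-sampling estimates of the target policy's return.

    trajectories: list of dicts with keys 'states', 'actions', 'rewards'.
    target_logp(s, a) / behavior_logp(s, a): log-probabilities of action a in
    state s under the candidate policy and the data-generating policy.
    """
    estimates = []
    for traj in trajectories:
        log_ratio = sum(
            target_logp(s, a) - behavior_logp(s, a)
            for s, a in zip(traj["states"], traj["actions"])
        )
        weight = np.exp(log_ratio)  # importance weight of the whole trajectory
        estimates.append(weight * sum(traj["rewards"]))
    return np.asarray(estimates)


def passes_conservative_check(trajectories, target_logp, behavior_logp,
                              baseline_value, delta=0.05):
    """Accept the candidate policy only if a lower confidence bound on its
    estimated value is at least the baseline policy's value."""
    est = is_estimates(trajectories, target_logp, behavior_logp)
    n = len(est)
    # Crude empirical-deviation bound for illustration only.
    lcb = est.mean() - est.std(ddof=1) * np.sqrt(2.0 * np.log(1.0 / delta) / n)
    return lcb >= baseline_value
```

In this sketch the counterfactual evaluation reuses the data the learner generated itself, which is the sense in which the conservative condition can be checked without ever executing an unvetted policy.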
Related papers
- SaVeR: Optimal Data Collection Strategy for Safe Policy Evaluation in Tabular MDP [9.71834921109414]
We study safe data collection for the purpose of policy evaluation in tabular Markov decision processes (MDPs)
We first show that there exists a class of intractable MDPs where no safe oracle algorithm with knowledge about problem parameters can efficiently collect data and satisfy the safety constraints.
We then introduce an algorithm SaVeR for this problem that approximates the safe oracle algorithm and bound the finite-sample mean squared error of the algorithm while ensuring it satisfies the safety constraint.
arXiv Detail & Related papers (2024-06-04T09:54:55Z) - Iteratively Refined Behavior Regularization for Offline Reinforcement Learning [57.10922880400715]
In this paper, we propose a new algorithm that substantially enhances behavior-regularization based on conservative policy iteration.
By iteratively refining the reference policy used for behavior regularization, the conservative policy update guarantees gradual improvement.
Experimental results on the D4RL benchmark indicate that our method outperforms previous state-of-the-art baselines in most tasks.
arXiv Detail & Related papers (2023-06-09T07:46:24Z) - Value Enhancement of Reinforcement Learning via Efficient and Robust Trust Region Optimization [14.028916306297928]
Reinforcement learning (RL) is a powerful machine learning technique that enables an intelligent agent to learn an optimal policy.
We propose a novel value enhancement method to improve the performance of a given initial policy computed by existing state-of-the-art RL algorithms.
arXiv Detail & Related papers (2023-01-05T18:43:40Z) - Policy learning "without" overlap: Pessimism and generalized empirical Bernstein's inequality [94.89246810243053]
This paper studies offline policy learning, which aims at utilizing observations collected a priori to learn an optimal individualized decision rule.
Existing policy learning methods rely on a uniform overlap assumption, i.e., the propensities of exploring all actions for all individual characteristics must be lower bounded.
We propose Pessimistic Policy Learning (PPL), a new algorithm that optimizes lower confidence bounds (LCBs) instead of point estimates.
arXiv Detail & Related papers (2022-12-19T22:43:08Z) - Counterfactual Learning with General Data-generating Policies [3.441021278275805]
We develop an OPE method for a class of full support and deficient support logging policies in contextual-bandit settings.
We prove that our method's prediction converges in probability to the true performance of a counterfactual policy as the sample size increases.
arXiv Detail & Related papers (2022-12-04T21:07:46Z) - Offline Reinforcement Learning with Closed-Form Policy Improvement Operators [88.54210578912554]
Behavior constrained policy optimization has been demonstrated to be a successful paradigm for tackling Offline Reinforcement Learning.
In this paper, we propose our closed-form policy improvement operators.
We empirically demonstrate their effectiveness over state-of-the-art algorithms on the standard D4RL benchmark.
arXiv Detail & Related papers (2022-11-29T06:29:26Z) - Revisiting Peng's Q($\lambda$) for Modern Reinforcement Learning [69.39357308375212]
Off-policy multi-step reinforcement learning algorithms consist of conservative and non-conservative algorithms.
Recent studies have shown that non-conservative algorithms outperform conservative ones.
arXiv Detail & Related papers (2021-02-27T02:29:01Z) - Distributionally Robust Batch Contextual Bandits [20.667213458836734]
Policy learning using historical observational data is an important problem that has found widespread applications.
Existing literature rests on the crucial assumption that the future environment where the learned policy will be deployed is the same as the past environment.
In this paper, we lift this assumption and aim to learn a distributionally robust policy with incomplete observational data.
arXiv Detail & Related papers (2020-06-10T03:11:40Z) - Conservative Exploration in Reinforcement Learning [113.55554483194832]
We introduce the notion of conservative exploration for average reward and finite horizon problems.
We present two optimistic algorithms that guarantee (w.h.p.) that the conservative constraint is never violated during learning.
arXiv Detail & Related papers (2020-02-08T19:09:51Z) - Preventing Imitation Learning with Adversarial Policy Ensembles [79.81807680370677]
Imitation learning can reproduce policies by observing experts, which poses a problem regarding policy privacy.
How can we protect against external observers cloning our proprietary policies?
We introduce a new reinforcement learning framework, where we train an ensemble of near-optimal policies.
arXiv Detail & Related papers (2020-01-31T01:57:16Z)