Safe Evaluation For Offline Learning: Are We Ready To Deploy?
- URL: http://arxiv.org/abs/2212.08302v1
- Date: Fri, 16 Dec 2022 06:43:16 GMT
- Title: Safe Evaluation For Offline Learning: Are We Ready To Deploy?
- Authors: Hager Radi, Josiah P. Hanna, Peter Stone, Matthew E. Taylor
- Abstract summary: We introduce a framework for safe evaluation of offline learning using approximate high-confidence off-policy evaluation.
A lower-bound estimate tells us how well a newly-learned target policy would perform before it is deployed in the real environment.
- Score: 47.331520779610535
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The world currently offers an abundance of data in multiple domains, from which we can learn reinforcement learning (RL) policies without further interaction with the environment. RL agents can learn offline from such data, but deploying them while they are still learning can be dangerous in domains where safety is critical. It is therefore essential to estimate how a newly-learned agent will perform if deployed in the target environment, before actually deploying it and without the risk of overestimating its true performance. To achieve this, we introduce a framework for safe evaluation of offline learning that uses approximate high-confidence off-policy evaluation (HCOPE) to estimate the performance of offline policies during learning. In our setting, we assume a source of data, which we split into a train-set, used to learn an offline policy, and a test-set, used to estimate a lower bound on the offline policy's performance via off-policy evaluation with bootstrapping. The lower-bound estimate tells us how well a newly-learned target policy would perform before it is deployed in the real environment, and therefore allows us to decide when to deploy our learned policy.
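A minimal sketch of the kind of lower-bound check the abstract describes, assuming per-trajectory importance sampling on the held-out test-set followed by a simple percentile bootstrap; the function names (`is_return`, `bootstrap_lower_bound`), the trajectory format, and the confidence level `delta` are illustrative assumptions rather than details taken from the paper.

```python
# Hedged sketch: importance-sampled returns on a held-out test set, then a
# percentile-bootstrap lower bound on the target policy's value. Names and the
# trajectory layout are illustrative, not the paper's actual interface.
import numpy as np

def is_return(trajectory, target_policy, gamma=0.99):
    """Per-trajectory importance-sampled return.

    trajectory: list of (state, action, reward, behavior_prob) tuples.
    target_policy(state, action): probability pi(a|s) under the learned policy.
    """
    rho, ret = 1.0, 0.0
    for t, (s, a, r, mu) in enumerate(trajectory):
        rho *= target_policy(s, a) / max(mu, 1e-8)  # cumulative importance weight
        ret += (gamma ** t) * r                     # discounted return
    return rho * ret

def bootstrap_lower_bound(test_trajectories, target_policy, delta=0.05,
                          n_boot=2000, seed=0):
    """Approximate (1 - delta)-confidence lower bound via percentile bootstrap."""
    rng = np.random.default_rng(seed)
    estimates = np.array([is_return(tau, target_policy) for tau in test_trajectories])
    boot_means = np.array([
        rng.choice(estimates, size=estimates.size, replace=True).mean()
        for _ in range(n_boot)
    ])
    return float(np.quantile(boot_means, delta))

# Deployment rule (per the abstract's reading): release the learned policy only
# if the lower bound exceeds a safety baseline, e.g. the behavior policy's
# average return on the same test set.
```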
Related papers
- Bayesian Design Principles for Offline-to-Online Reinforcement Learning [50.97583504192167]
Offline-to-online fine-tuning is crucial for real-world applications where exploration can be costly or unsafe.
In this paper, we tackle the dilemma of offline-to-online fine-tuning: if the agent remains pessimistic, it may fail to learn a better policy, while if it becomes optimistic directly, performance may suffer from a sudden drop.
We show that Bayesian design principles are crucial in solving such a dilemma.
arXiv Detail & Related papers (2024-05-31T16:31:07Z) - Offline Retraining for Online RL: Decoupled Policy Learning to Mitigate Exploration Bias [96.14064037614942]
Offline retraining, a policy extraction step at the end of online fine-tuning, is proposed.
An optimistic (exploration) policy is used to interact with the environment, and a separate pessimistic (exploitation) policy is trained on all the observed data for evaluation.
arXiv Detail & Related papers (2023-10-12T17:50:09Z) - Dealing with the Unknown: Pessimistic Offline Reinforcement Learning [25.30634466168587]
We propose a Pessimistic Offline Reinforcement Learning (PessORL) algorithm to actively lead the agent back to areas with which it is familiar.
We focus on problems caused by out-of-distribution (OOD) states and deliberately penalize high values at states that are absent from the training dataset (a toy sketch of this value-penalty idea appears after this list).
arXiv Detail & Related papers (2021-11-09T22:38:58Z) - Curriculum Offline Imitation Learning [72.1015201041391]
Offline reinforcement learning tasks require the agent to learn from a pre-collected dataset with no further interactions with the environment.
We propose Curriculum Offline Imitation Learning (COIL), which utilizes an experience picking strategy for imitating from adaptive neighboring policies with a higher return.
On continuous control benchmarks, we compare COIL against both imitation-based and RL-based methods, showing that it not only avoids merely learning a mediocre behavior on mixed datasets but is even competitive with state-of-the-art offline RL methods.
arXiv Detail & Related papers (2021-11-03T08:02:48Z) - Doubly Robust Interval Estimation for Optimal Policy Evaluation in Online Learning [8.736154600219685]
Policy evaluation in online learning attracts increasing attention.
Yet, such a problem is particularly challenging due to the dependent data generated in the online environment.
We develop the doubly robust interval estimation (DREAM) method to infer the value under the estimated optimal policy in online learning (a minimal doubly robust estimator sketch appears after this list).
arXiv Detail & Related papers (2021-10-29T02:38:54Z) - Multi-Objective SPIBB: Seldonian Offline Policy Improvement with Safety Constraints in Finite MDPs [71.47895794305883]
We study the problem of Safe Policy Improvement (SPI) under constraints in the offline Reinforcement Learning setting.
We present an SPI algorithm for this RL setting that takes into account the preferences of the algorithm's user for handling trade-offs between different reward signals.
arXiv Detail & Related papers (2021-05-31T21:04:21Z) - MUSBO: Model-based Uncertainty Regularized and Sample Efficient Batch Optimization for Deployment Constrained Reinforcement Learning [108.79676336281211]
Continuous deployment of new policies for data collection and online learning is either not cost-effective or impractical.
We propose a new algorithmic learning framework called Model-based Uncertainty regularized and Sample Efficient Batch Optimization (MUSBO).
Our framework discovers novel and high-quality samples for each deployment to enable efficient data collection.
arXiv Detail & Related papers (2021-02-23T01:30:55Z)
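Relating to the Pessimistic Offline Reinforcement Learning (PessORL) entry above, a toy sketch of the value-penalty idea, assuming a nearest-neighbor distance to the dataset as a crude out-of-distribution score; the function `pessimistic_values`, the distance-based OOD proxy, and the weight `beta` are stand-ins, not the paper's construction.

```python
# Illustrative only: lower value estimates at states that look unfamiliar
# relative to the offline dataset. The OOD score and penalty weight are
# assumptions for this sketch, not PessORL's actual mechanism.
import numpy as np

def pessimistic_values(values, states, dataset_states, beta=1.0):
    """Penalize value estimates in proportion to distance from the dataset.

    values: (n,) value estimates for the query states.
    states: (n, d) query states; dataset_states: (m, d) states in the dataset.
    """
    values = np.asarray(values, dtype=float)
    states = np.asarray(states, dtype=float)
    dataset_states = np.asarray(dataset_states, dtype=float)
    # Nearest-neighbor distance to the dataset as a crude OOD score.
    dists = np.linalg.norm(states[:, None, :] - dataset_states[None, :, :], axis=-1)
    ood_score = dists.min(axis=1)
    # Larger penalty at less familiar states discourages high values there.
    return values - beta * ood_score
```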
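Relating to the doubly robust interval estimation (DREAM) entry above, a minimal sketch of the standard step-wise doubly robust off-policy value estimate on which such methods build; DREAM's actual interval construction for dependent online data is more involved, and the names `pi_prob`, `q_hat`, and `v_hat` are placeholders.

```python
# Hedged sketch of a step-wise doubly robust value estimate: a model-based
# term plus an importance-weighted correction of its error at each step.
import numpy as np

def doubly_robust_value(trajectories, pi_prob, q_hat, v_hat, gamma=0.99):
    """Average of per-trajectory step-wise doubly robust estimates.

    trajectories: list of lists of (state, action, reward, behavior_prob).
    pi_prob(s, a): target-policy probability of action a in state s.
    q_hat(s, a), v_hat(s): fitted action-value and state-value models.
    """
    estimates = []
    for tau in trajectories:
        rho, dr = 1.0, 0.0
        for t, (s, a, r, mu) in enumerate(tau):
            rho_prev = rho                        # importance weight up to t-1
            rho *= pi_prob(s, a) / max(mu, 1e-8)  # importance weight up to t
            dr += (gamma ** t) * (rho_prev * v_hat(s) + rho * (r - q_hat(s, a)))
        estimates.append(dr)
    return float(np.mean(estimates))
```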