Active Offline Policy Selection
- URL: http://arxiv.org/abs/2106.10251v1
- Date: Fri, 18 Jun 2021 17:33:13 GMT
- Title: Active Offline Policy Selection
- Authors: Ksenia Konyushkova, Yutian Chen, Thomas Paine, Caglar Gulcehre, Cosmin
Paduraru, Daniel J Mankowitz, Misha Denil, Nando de Freitas
- Abstract summary: This paper addresses the problem of policy selection in domains with abundant logged data, but with a very restricted interaction budget.
Several off-policy evaluation (OPE) techniques have been proposed to assess the value of policies using only logged data.
We introduce a novel active offline policy selection problem formulation, which combines logged data and limited online interactions to identify the best policy.
- Score: 19.18251239758809
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper addresses the problem of policy selection in domains with abundant
logged data, but with a very restricted interaction budget. Solving this
problem would enable safe evaluation and deployment of offline reinforcement
learning policies in industry, robotics, and healthcare, among other domains.
Several off-policy evaluation (OPE) techniques have been proposed to assess the
value of policies using only logged data. However, a large gap remains between
OPE estimates and full online evaluation in the real environment. To reduce
this gap, we introduce a novel active offline policy selection problem
formulation, which combines logged data and limited online interactions to
identify the best policy. We rely on advances in OPE to warm-start the
evaluation. We build on Bayesian optimization to
iteratively decide which policies to evaluate in order to utilize the limited
environment interactions wisely. Because many candidate policies may be
proposed, we focus on making our approach scalable and introduce a kernel
function that models similarity between policies. We use several benchmark
environments to
show that the proposed approach improves upon state-of-the-art OPE estimates
and fully online policy evaluation with a limited budget. Additionally, we show
that each component of the proposed method is important; it works well with
varying numbers and qualities of OPE estimates, and even with a large number of
candidate policies.
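As a rough, hedged illustration of the formulation above (not the paper's exact algorithm), the sketch below runs Gaussian-process Bayesian optimization over a fixed set of candidate policies: OPE estimates act as the GP prior mean (the warm start), an RBF kernel over assumed policy feature vectors stands in for the policy-similarity kernel, and each round the policy with the highest upper confidence bound is rolled out once online. The policy featurization, kernel, and acquisition rule are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(feats, lengthscale=1.0, variance=1.0):
    # Similarity between policies, computed from feature vectors
    # (e.g. actions taken on a fixed set of probe states).
    sq = ((feats[:, None, :] - feats[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * sq / lengthscale ** 2)

def active_policy_selection(feats, ope_estimates, rollout_fn, budget,
                            noise_var=1.0, beta=2.0):
    """GP-based selection warm-started with OPE (illustrative, not the paper's code).

    feats:         (n_policies, d) policy representations (assumed available)
    ope_estimates: (n_policies,)   OPE values used as the GP prior mean
    rollout_fn:    i -> noisy return from one online episode of policy i
    budget:        number of online episodes allowed
    """
    K = rbf_kernel(feats)
    prior_var = np.diag(K).copy()
    mean, var = ope_estimates.astype(float).copy(), prior_var.copy()
    obs_idx, obs_val = [], []
    for _ in range(budget):
        # UCB acquisition: favour policies that look good and are still uncertain.
        i = int(np.argmax(mean + beta * np.sqrt(np.maximum(var, 0.0))))
        obs_idx.append(i)
        obs_val.append(rollout_fn(i))
        # Standard GP posterior update, with the OPE estimates as the prior mean.
        Kxx = K[np.ix_(obs_idx, obs_idx)] + noise_var * np.eye(len(obs_idx))
        Kx = K[:, obs_idx]
        resid = np.array(obs_val) - ope_estimates[obs_idx]
        mean = ope_estimates + Kx @ np.linalg.solve(Kxx, resid)
        var = prior_var - np.einsum("ij,ji->i", Kx, np.linalg.solve(Kxx, Kx.T))
    return int(np.argmax(mean)), mean, var
```

With a small budget the OPE prior dominates early on, and the online returns gradually correct it; the returned argmax of the posterior mean is the selected policy.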
Related papers
- OPERA: Automatic Offline Policy Evaluation with Re-weighted Aggregates of Multiple Estimators [13.408838970377035]
Offline policy evaluation (OPE) allows us to estimate a new sequential decision-making policy's performance.
We propose a new algorithm that adaptively blends a set of OPE estimators for a given dataset, without relying on an explicit estimator selection via a statistical procedure.
Our work contributes to improving ease of use for a general-purpose, estimator-agnostic, off-policy evaluation framework for offline RL.
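OPERA's actual re-weighting procedure is not reproduced here; purely as a generic illustration of blending several OPE estimators, the sketch below (names and the weighting rule are assumptions) combines per-estimator point estimates with inverse-variance weights obtained by bootstrapping the logged dataset.

```python
import numpy as np

def blended_ope(logged_data, estimators, n_boot=200, seed=0):
    """Blend several OPE estimates with inverse-variance weights (generic sketch).

    logged_data: list of logged trajectories (whatever format the estimators expect)
    estimators:  callables, each mapping a dataset to a scalar value estimate
    """
    rng = np.random.default_rng(seed)
    n = len(logged_data)
    boots = np.empty((len(estimators), n_boot))
    for b in range(n_boot):
        # Bootstrap the logged data to gauge each estimator's variability.
        idx = rng.integers(0, n, size=n)
        resample = [logged_data[i] for i in idx]
        for e, estimator in enumerate(estimators):
            boots[e, b] = estimator(resample)
    point = np.array([estimator(logged_data) for estimator in estimators])
    weights = 1.0 / (boots.var(axis=1) + 1e-8)
    weights /= weights.sum()
    return float(weights @ point)
```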
arXiv Detail & Related papers (2024-05-27T23:51:20Z)
- Conformal Off-Policy Evaluation in Markov Decision Processes [53.786439742572995]
Reinforcement Learning aims at identifying and evaluating efficient control policies from data.
Most methods for this learning task, referred to as Off-Policy Evaluation (OPE), do not come with accuracy and certainty guarantees.
We present a novel OPE method based on Conformal Prediction that outputs an interval containing the true reward of the target policy with a prescribed level of certainty.
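The paper's construction for MDPs is more involved than can be shown here; as background on the underlying tool only, the sketch below (a toy under stated assumptions, not the paper's method) computes a standard split-conformal interval from held-out residuals, the basic mechanism that yields a prescribed coverage level.

```python
import numpy as np

def split_conformal_interval(calib_residuals, point_prediction, alpha=0.1):
    """Split-conformal interval: point prediction +/- the (1 - alpha) residual quantile.

    calib_residuals:  |y - prediction| on a held-out calibration set
    point_prediction: the model's prediction for a new input
    Under exchangeability this covers the true value with probability >= 1 - alpha.
    """
    n = len(calib_residuals)
    # Finite-sample correction to the quantile level.
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(calib_residuals, level, method="higher")
    return point_prediction - q, point_prediction + q
```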
arXiv Detail & Related papers (2023-04-05T16:45:11Z)
- Offline Policy Evaluation and Optimization under Confounding [35.778917456294046]
We map out the landscape of offline policy evaluation for confounded MDPs.
We characterize settings where consistent value estimates are provably not achievable.
We present new algorithms for offline policy improvement and prove local convergence guarantees.
arXiv Detail & Related papers (2022-11-29T20:45:08Z)
- Memory-Constrained Policy Optimization [59.63021433336966]
We introduce a new constrained optimization method for policy gradient reinforcement learning.
We form a second trust region by constructing a virtual policy that represents a wide range of past policies.
We then constrain the new policy to stay close to this virtual policy, which is beneficial when the old policy performs poorly.
arXiv Detail & Related papers (2022-04-20T08:50:23Z)
- Supervised Off-Policy Ranking [145.3039527243585]
Off-policy evaluation (OPE) leverages data generated by other policies to evaluate a target policy.
We propose supervised off-policy ranking that learns a policy scoring model by correctly ranking training policies with known performance.
Our method outperforms strong baseline OPE methods in terms of both rank correlation and the performance gap between the truly best policy and the best of the top three ranked policies.
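As a generic illustration of scoring policies by learning to rank ones with known performance (not the paper's actual model or features), the sketch below fits a linear scorer with a pairwise logistic ranking loss; the policy features and the linear form are assumptions.

```python
import numpy as np

def fit_pairwise_ranker(policy_feats, known_values, epochs=500, lr=0.1):
    """Linear policy-scoring model trained with a pairwise logistic ranking loss.

    policy_feats: (n, d) features of training policies (an assumed representation)
    known_values: (n,)   their known performance
    Returns w such that policy_feats @ w orders the policies.
    """
    n, d = policy_feats.shape
    w = np.zeros(d)
    # All ordered pairs where policy i truly outperforms policy j.
    pairs = [(i, j) for i in range(n) for j in range(n)
             if known_values[i] > known_values[j]]
    for _ in range(epochs):
        grad = np.zeros(d)
        for i, j in pairs:
            diff = policy_feats[i] - policy_feats[j]
            p = 1.0 / (1.0 + np.exp(-diff @ w))   # P(i ranked above j)
            grad += (p - 1.0) * diff              # gradient of -log p
        w -= lr * grad / max(len(pairs), 1)
    return w
```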
arXiv Detail & Related papers (2021-07-03T07:01:23Z)
- Benchmarks for Deep Off-Policy Evaluation [152.28569758144022]
We present a collection of policies that can be used for benchmarking off-policy evaluation.
The goal of our benchmark is to provide a standardized measure of progress that is motivated by a set of principles.
We provide open-source access to our data and code to foster future research in this area.
arXiv Detail & Related papers (2021-03-30T18:09:33Z)
- Policy Optimization as Online Learning with Mediator Feedback [46.845765216238135]
Policy Optimization (PO) is a widely used approach to address continuous control tasks.
In this paper, we introduce the notion of mediator feedback that frames PO as an online learning problem over the policy space.
We propose an algorithm, RANDomized-exploration policy Optimization via Multiple Importance Sampling with Truncation (RANDOMIST), for regret minimization.
arXiv Detail & Related papers (2020-12-15T11:34:29Z)
- Offline Policy Selection under Uncertainty [113.57441913299868]
We consider offline policy selection as learning preferences over a set of policy prospects given a fixed experience dataset.
Access to the full belief distribution over a policy's value enables more flexible selection algorithms under a wider range of downstream evaluation metrics.
We show how BayesDICE may be used to rank policies with respect to an arbitrary downstream policy-selection metric.
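BayesDICE itself is not reimplemented here; assuming posterior samples of each candidate policy's value are already available, the sketch below illustrates how a full belief distribution supports different downstream selection metrics, e.g. ranking by posterior mean versus by the probability of being the best policy.

```python
import numpy as np

def rank_policies(value_samples, metric="prob_best"):
    """Rank candidate policies from posterior samples of their values.

    value_samples: (n_samples, n_policies) draws from the belief over each value,
                   e.g. produced by a Bayesian OPE method.
    Returns policy indices ordered from most to least preferred.
    """
    if metric == "mean":
        scores = value_samples.mean(axis=0)
    elif metric == "prob_best":
        # Fraction of posterior draws in which each policy attains the highest value.
        best = value_samples.argmax(axis=1)
        scores = np.bincount(best, minlength=value_samples.shape[1]) / len(value_samples)
    else:
        raise ValueError(f"unknown metric: {metric}")
    return np.argsort(-scores)
```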
arXiv Detail & Related papers (2020-12-12T23:09:21Z)
- A Practical Guide of Off-Policy Evaluation for Bandit Problems [13.607327477092877]
Off-policy evaluation (OPE) is the problem of estimating the value of a target policy from samples obtained via different policies.
We propose a meta-algorithm based on existing OPE estimators.
In experiments, we investigate the proposed concepts using synthetic and open real-world datasets.
arXiv Detail & Related papers (2020-10-23T15:11:19Z)
- Non-Stationary Off-Policy Optimization [50.41335279896062]
We study the novel problem of off-policy optimization in piecewise-stationary contextual bandits.
In the offline learning phase, we partition logged data into categorical latent states and learn a near-optimal sub-policy for each state.
In the online deployment phase, we adaptively switch between the learned sub-policies based on their performance.
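The offline partitioning step is omitted; as a toy illustration of the online deployment phase described above, the sketch below switches between already-learned sub-policies using sliding-window reward estimates and an upper-confidence rule. The interface and the switching rule are assumptions, not the paper's algorithm.

```python
import numpy as np
from collections import deque

def deploy_with_switching(sub_policies, run_round, horizon, window=50):
    """Online phase: switch between learned sub-policies with sliding-window UCB.

    sub_policies: list of already-learned sub-policies (callables)
    run_round:    policy -> reward obtained by deploying it for one round
    window:       only recent rewards are kept so estimates can track regime changes
    """
    k = len(sub_policies)
    recent = [deque(maxlen=window) for _ in range(k)]
    rewards = []
    for t in range(horizon):
        if t < k:
            choice = t  # try every sub-policy once before comparing them
        else:
            means = np.array([np.mean(r) for r in recent])
            counts = np.array([len(r) for r in recent])
            bonus = np.sqrt(2.0 * np.log(t + 1) / counts)
            choice = int(np.argmax(means + bonus))
        reward = run_round(sub_policies[choice])
        recent[choice].append(reward)
        rewards.append(reward)
    return rewards
```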
arXiv Detail & Related papers (2020-06-15T09:16:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.