Active Offline Policy Selection
- URL: http://arxiv.org/abs/2106.10251v1
- Date: Fri, 18 Jun 2021 17:33:13 GMT
- Title: Active Offline Policy Selection
- Authors: Ksenia Konyushkova, Yutian Chen, Thomas Paine, Caglar Gulcehre, Cosmin
Paduraru, Daniel J Mankowitz, Misha Denil, Nando de Freitas
- Abstract summary: This paper addresses the problem of policy selection in domains with abundant logged data, but with a very restricted interaction budget.
Several off-policy evaluation (OPE) techniques have been proposed to assess the value of policies using only logged data.
We introduce a novel active offline policy selection problem formulation, which combines logged data and limited online interactions to identify the best policy.
- Score: 19.18251239758809
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper addresses the problem of policy selection in domains with abundant
logged data, but with a very restricted interaction budget. Solving this
problem would enable safe evaluation and deployment of offline reinforcement
learning policies in industry, robotics, and healthcare, among other domains.
Several off-policy evaluation (OPE) techniques have been proposed to assess the
value of policies using only logged data. However, a large gap remains between
OPE estimates and full online evaluation in the real environment. To reduce
this gap, we introduce a novel active offline policy selection problem
formulation, which combines logged data and limited online interactions to
identify the best policy. We rely on advances in OPE to warm-start the
evaluation. We build on Bayesian optimization to
iteratively decide which policies to evaluate in order to utilize the limited
environment interactions wisely. Because many candidate policies may be
proposed, we focus on making our approach scalable and introduce a kernel
function that models similarity between policies. We use several benchmark
environments to
show that the proposed approach improves upon state-of-the-art OPE estimates
and fully online policy evaluation with a limited budget. Additionally, we show
that each component of the proposed method is important; it works well with
varying numbers and qualities of OPE estimates, and even with a large number of
candidate policies.
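As a rough, hedged illustration of the formulation above (not the paper's exact algorithm), the sketch below runs Gaussian-process Bayesian optimization over a fixed set of candidate policies: OPE estimates act as the GP prior mean (the warm start), an RBF kernel over assumed policy feature vectors stands in for the policy-similarity kernel, and each round the policy with the highest upper confidence bound is rolled out once online. The policy featurization, kernel, and acquisition rule are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(feats, lengthscale=1.0, variance=1.0):
    # Similarity between policies, computed from feature vectors
    # (e.g. actions taken on a fixed set of probe states).
    sq = ((feats[:, None, :] - feats[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * sq / lengthscale ** 2)

def active_policy_selection(feats, ope_estimates, rollout_fn, budget,
                            noise_var=1.0, beta=2.0):
    """GP-based selection warm-started with OPE (illustrative, not the paper's code).

    feats:         (n_policies, d) policy representations (assumed available)
    ope_estimates: (n_policies,)   OPE values used as the GP prior mean
    rollout_fn:    i -> noisy return from one online episode of policy i
    budget:        number of online episodes allowed
    """
    K = rbf_kernel(feats)
    prior_var = np.diag(K).copy()
    mean, var = ope_estimates.astype(float).copy(), prior_var.copy()
    obs_idx, obs_val = [], []
    for _ in range(budget):
        # UCB acquisition: favour policies that look good and are still uncertain.
        i = int(np.argmax(mean + beta * np.sqrt(np.maximum(var, 0.0))))
        obs_idx.append(i)
        obs_val.append(rollout_fn(i))
        # Standard GP posterior update, with the OPE estimates as the prior mean.
        Kxx = K[np.ix_(obs_idx, obs_idx)] + noise_var * np.eye(len(obs_idx))
        Kx = K[:, obs_idx]
        resid = np.array(obs_val) - ope_estimates[obs_idx]
        mean = ope_estimates + Kx @ np.linalg.solve(Kxx, resid)
        var = prior_var - np.einsum("ij,ji->i", Kx, np.linalg.solve(Kxx, Kx.T))
    return int(np.argmax(mean)), mean, var
```

With a small budget the OPE prior dominates early on, and the online returns gradually correct it; the returned argmax of the posterior mean is the selected policy.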
Related papers
- OPERA: Automatic Offline Policy Evaluation with Re-weighted Aggregates of Multiple Estimators [13.408838970377035]
Offline policy evaluation (OPE) allows us to estimate a new sequential decision-making policy's performance.
We propose a new algorithm that adaptively blends a set of OPE estimators for a given dataset, without relying on an explicit estimator selection via a statistical procedure.
Our work contributes to improving ease of use for a general-purpose, estimator-agnostic, off-policy evaluation framework for offline RL.
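OPERA's actual re-weighting procedure is not reproduced here; purely as a generic illustration of blending several OPE estimators, the sketch below (names and the weighting rule are assumptions) combines per-estimator point estimates with inverse-variance weights obtained by bootstrapping the logged dataset.

```python
import numpy as np

def blended_ope(logged_data, estimators, n_boot=200, seed=0):
    """Blend several OPE estimates with inverse-variance weights (generic sketch).

    logged_data: list of logged trajectories (whatever format the estimators expect)
    estimators:  callables, each mapping a dataset to a scalar value estimate
    """
    rng = np.random.default_rng(seed)
    n = len(logged_data)
    boots = np.empty((len(estimators), n_boot))
    for b in range(n_boot):
        # Bootstrap the logged data to gauge each estimator's variability.
        idx = rng.integers(0, n, size=n)
        resample = [logged_data[i] for i in idx]
        for e, estimator in enumerate(estimators):
            boots[e, b] = estimator(resample)
    point = np.array([estimator(logged_data) for estimator in estimators])
    weights = 1.0 / (boots.var(axis=1) + 1e-8)
    weights /= weights.sum()
    return float(weights @ point)
```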
arXiv Detail & Related papers (2024-05-27T23:51:20Z)
- Conformal Off-Policy Evaluation in Markov Decision Processes [53.786439742572995]
Reinforcement Learning aims at identifying and evaluating efficient control policies from data.
Most methods for this learning task, referred to as Off-Policy Evaluation (OPE), do not come with accuracy and certainty guarantees.
We present a novel OPE method based on Conformal Prediction that outputs an interval containing the true reward of the target policy with a prescribed level of certainty.
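The paper's construction for MDPs is more involved than can be shown here; as background on the underlying tool only, the sketch below (a toy under stated assumptions, not the paper's method) computes a standard split-conformal interval from held-out residuals, the basic mechanism that yields a prescribed coverage level.

```python
import numpy as np

def split_conformal_interval(calib_residuals, point_prediction, alpha=0.1):
    """Split-conformal interval: point prediction +/- the (1 - alpha) residual quantile.

    calib_residuals:  |y - prediction| on a held-out calibration set
    point_prediction: the model's prediction for a new input
    Under exchangeability this covers the true value with probability >= 1 - alpha.
    """
    n = len(calib_residuals)
    # Finite-sample correction to the quantile level.
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(calib_residuals, level, method="higher")
    return point_prediction - q, point_prediction + q
```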
arXiv Detail & Related papers (2023-04-05T16:45:11Z)
- Offline Policy Evaluation and Optimization under Confounding [35.778917456294046]
We map out the landscape of offline policy evaluation for confounded MDPs.
We characterize settings where consistent value estimates are provably not achievable.
We present new algorithms for offline policy improvement and prove local convergence guarantees.
arXiv Detail & Related papers (2022-11-29T20:45:08Z)
- Memory-Constrained Policy Optimization [59.63021433336966]
We introduce a new constrained optimization method for policy gradient reinforcement learning.
We form a second trust region by constructing a virtual policy that represents a wide range of past policies.
We then constrain the new policy to stay close to this virtual policy, which is beneficial when the old policy performs poorly.
arXiv Detail & Related papers (2022-04-20T08:50:23Z)
- Supervised Off-Policy Ranking [145.3039527243585]
Off-policy evaluation (OPE) leverages data generated by other policies to evaluate a target policy.
We propose supervised off-policy ranking that learns a policy scoring model by correctly ranking training policies with known performance.
Our method outperforms strong baseline OPE methods in terms of both rank correlation and the performance gap between the truly best policy and the best of the top three ranked policies.
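As a generic illustration of scoring policies by learning to rank ones with known performance (not the paper's actual model or features), the sketch below fits a linear scorer with a pairwise logistic ranking loss; the policy features and the linear form are assumptions.

```python
import numpy as np

def fit_pairwise_ranker(policy_feats, known_values, epochs=500, lr=0.1):
    """Linear policy-scoring model trained with a pairwise logistic ranking loss.

    policy_feats: (n, d) features of training policies (an assumed representation)
    known_values: (n,)   their known performance
    Returns w such that policy_feats @ w orders the policies.
    """
    n, d = policy_feats.shape
    w = np.zeros(d)
    # All ordered pairs where policy i truly outperforms policy j.
    pairs = [(i, j) for i in range(n) for j in range(n)
             if known_values[i] > known_values[j]]
    for _ in range(epochs):
        grad = np.zeros(d)
        for i, j in pairs:
            diff = policy_feats[i] - policy_feats[j]
            p = 1.0 / (1.0 + np.exp(-diff @ w))   # P(i ranked above j)
            grad += (p - 1.0) * diff              # gradient of -log p
        w -= lr * grad / max(len(pairs), 1)
    return w
```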
arXiv Detail & Related papers (2021-07-03T07:01:23Z)
- Benchmarks for Deep Off-Policy Evaluation [152.28569758144022]
We present a collection of policies that can be used for benchmarking off-policy evaluation.
The goal of our benchmark is to provide a standardized measure of progress that is motivated by a set of principles.
We provide open-source access to our data and code to foster future research in this area.
arXiv Detail & Related papers (2021-03-30T18:09:33Z)
- Policy Optimization as Online Learning with Mediator Feedback [46.845765216238135]
Policy Optimization (PO) is a widely used approach to address continuous control tasks.
In this paper, we introduce the notion of mediator feedback that frames PO as an online learning problem over the policy space.
We propose an algorithm, RANDomized-exploration policy Optimization via Multiple Importance Sampling with Truncation (RANDOMIST), for regret minimization.
arXiv Detail & Related papers (2020-12-15T11:34:29Z)
- Offline Policy Selection under Uncertainty [113.57441913299868]
We consider offline policy selection as learning preferences over a set of policy prospects given a fixed experience dataset.
Access to the full belief distribution over a policy's value enables more flexible selection algorithms under a wider range of downstream evaluation metrics.
We show how BayesDICE may be used to rank policies with respect to an arbitrary downstream policy-selection metric.
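BayesDICE itself is not reimplemented here; assuming posterior samples of each candidate policy's value are already available, the sketch below illustrates how a full belief distribution supports different downstream selection metrics, e.g. ranking by posterior mean versus by the probability of being the best policy.

```python
import numpy as np

def rank_policies(value_samples, metric="prob_best"):
    """Rank candidate policies from posterior samples of their values.

    value_samples: (n_samples, n_policies) draws from the belief over each value,
                   e.g. produced by a Bayesian OPE method.
    Returns policy indices ordered from most to least preferred.
    """
    if metric == "mean":
        scores = value_samples.mean(axis=0)
    elif metric == "prob_best":
        # Fraction of posterior draws in which each policy attains the highest value.
        best = value_samples.argmax(axis=1)
        scores = np.bincount(best, minlength=value_samples.shape[1]) / len(value_samples)
    else:
        raise ValueError(f"unknown metric: {metric}")
    return np.argsort(-scores)
```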
arXiv Detail & Related papers (2020-12-12T23:09:21Z)
- A Practical Guide of Off-Policy Evaluation for Bandit Problems [13.607327477092877]
Off-policy evaluation (OPE) is the problem of estimating the value of a target policy from samples obtained via different policies.
We propose a meta-algorithm based on existing OPE estimators.
In experiments, we investigate the proposed concepts using synthetic and open real-world datasets.
arXiv Detail & Related papers (2020-10-23T15:11:19Z)
- Non-Stationary Off-Policy Optimization [50.41335279896062]
We study the novel problem of off-policy optimization in piecewise-stationary contextual bandits.
In the offline learning phase, we partition logged data into categorical latent states and learn a near-optimal sub-policy for each state.
In the online deployment phase, we adaptively switch between the learned sub-policies based on their performance.
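The offline partitioning step is omitted; as a toy illustration of the online deployment phase described above, the sketch below switches between already-learned sub-policies using sliding-window reward estimates and an upper-confidence rule. The interface and the switching rule are assumptions, not the paper's algorithm.

```python
import numpy as np
from collections import deque

def deploy_with_switching(sub_policies, run_round, horizon, window=50):
    """Online phase: switch between learned sub-policies with sliding-window UCB.

    sub_policies: list of already-learned sub-policies (callables)
    run_round:    policy -> reward obtained by deploying it for one round
    window:       only recent rewards are kept so estimates can track regime changes
    """
    k = len(sub_policies)
    recent = [deque(maxlen=window) for _ in range(k)]
    rewards = []
    for t in range(horizon):
        if t < k:
            choice = t  # try every sub-policy once before comparing them
        else:
            means = np.array([np.mean(r) for r in recent])
            counts = np.array([len(r) for r in recent])
            bonus = np.sqrt(2.0 * np.log(t + 1) / counts)
            choice = int(np.argmax(means + bonus))
        reward = run_round(sub_policies[choice])
        recent[choice].append(reward)
        rewards.append(reward)
    return rewards
```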
arXiv Detail & Related papers (2020-06-15T09:16:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.