Benchmarks for Deep Off-Policy Evaluation
- URL: http://arxiv.org/abs/2103.16596v1
- Date: Tue, 30 Mar 2021 18:09:33 GMT
- Title: Benchmarks for Deep Off-Policy Evaluation
- Authors: Justin Fu, Mohammad Norouzi, Ofir Nachum, George Tucker, Ziyu Wang,
Alexander Novikov, Mengjiao Yang, Michael R. Zhang, Yutian Chen, Aviral
Kumar, Cosmin Paduraru, Sergey Levine, Tom Le Paine
- Abstract summary: We present a collection of policies that can be used for benchmarking off-policy evaluation.
The goal of our benchmark is to provide a standardized measure of progress that is motivated from a set of principles.
We provide open-source access to our data and code to foster future research in this area.
- Score: 152.28569758144022
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Off-policy evaluation (OPE) holds the promise of being able to leverage
large, offline datasets for both evaluating and selecting complex policies for
decision making. The ability to learn offline is particularly important in many
real-world domains, such as in healthcare, recommender systems, or robotics,
where online data collection is an expensive and potentially dangerous process.
Being able to accurately evaluate and select high-performing policies without
requiring online interaction could yield significant benefits in safety, time,
and cost for these applications. While many OPE methods have been proposed in
recent years, comparing results between papers is difficult because currently
there is a lack of a comprehensive and unified benchmark, and measuring
algorithmic progress has been challenging due to the lack of difficult
evaluation tasks. In order to address this gap, we present a collection of
policies that in conjunction with existing offline datasets can be used for
benchmarking off-policy evaluation. Our tasks include a range of challenging
high-dimensional continuous control problems, with wide selections of datasets
and policies for performing policy selection. The goal of our benchmark is to
provide a standardized measure of progress that is motivated from a set of
principles designed to challenge and test the limits of existing OPE methods.
We perform an evaluation of state-of-the-art algorithms and provide open-source
access to our data and code to foster future research in this area.
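As an illustration of how a benchmark of this kind can score an OPE method, the sketch below computes three common policy-selection metrics (absolute error, rank correlation, and regret@k) from estimated and ground-truth policy returns. This is a minimal sketch under assumed inputs, not the benchmark's released evaluation code; the function name and the exact regret@k convention are illustrative.

```python
# Illustrative policy-selection metrics for scoring an OPE method, assuming we already
# have per-policy OPE estimates and ground-truth returns. Not the benchmark's exact spec.
import numpy as np
from scipy.stats import spearmanr

def ope_selection_metrics(estimated_returns, true_returns, k=1):
    est = np.asarray(estimated_returns, dtype=float)
    true = np.asarray(true_returns, dtype=float)

    # Mean absolute error of the value estimates themselves.
    abs_error = float(np.mean(np.abs(est - true)))

    # Spearman rank correlation: does the OPE method order the policies correctly?
    rank_corr, _ = spearmanr(est, true)

    # Regret@k: return gap between the best policy overall and the best policy
    # among the top-k policies as ranked by the OPE estimates.
    top_k = np.argsort(est)[::-1][:k]
    regret_at_k = float(np.max(true) - np.max(true[top_k]))

    return {"abs_error": abs_error, "rank_correlation": float(rank_corr), f"regret@{k}": regret_at_k}

# Example: five candidate policies, ground-truth returns vs. noisy OPE estimates.
print(ope_selection_metrics(estimated_returns=[0.9, 0.4, 0.7, 0.2, 0.6],
                            true_returns=[0.8, 0.5, 0.9, 0.1, 0.6], k=1))
```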
Related papers
- An Efficient Continuous Control Perspective for Reinforcement-Learning-based Sequential Recommendation [14.506332665769746]
We propose an Efficient Continuous Control framework (ECoC).
Based on a statistically tested assumption, we first propose a novel unified action representation abstracted from normalized user and item spaces.
During this process, strategic exploration and directional control in terms of unified actions are carefully designed and crucial to final recommendation decisions.
arXiv Detail & Related papers (2024-08-15T09:26:26Z)
- OPERA: Automatic Offline Policy Evaluation with Re-weighted Aggregates of Multiple Estimators [13.408838970377035]
Offline policy evaluation (OPE) allows us to evaluate and estimate a new sequential decision-making policy's performance.
We propose a new algorithm that adaptively blends a set of OPE estimators given a dataset without relying on an explicit selection using a statistical procedure.
Our work contributes to improving ease of use for a general-purpose, estimator-agnostic, off-policy evaluation framework for offline RL.
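To make the aggregation idea concrete, here is a minimal sketch of combining several OPE estimates into one re-weighted estimate. The inverse-MSE weighting below is an illustrative stand-in, not OPERA's actual statistical procedure, and the estimator names in the example are assumptions.

```python
# Hedged sketch of re-weighted aggregation of OPE estimators. OPERA's real weighting
# comes from a statistical procedure (e.g. bootstrap-based error estimation); here we
# only illustrate the blending step with inverse-MSE weights.
import numpy as np

def blend_ope_estimates(estimates, estimated_mses):
    """estimates: per-estimator value estimates; estimated_mses: their estimated MSEs."""
    est = np.asarray(estimates, dtype=float)
    inv = 1.0 / np.asarray(estimated_mses, dtype=float)
    weights = inv / inv.sum()              # lower estimated error -> larger weight
    return float(np.dot(weights, est)), weights

# Example: three estimators (say importance sampling, FQE, doubly robust)
# with bootstrap-style MSE estimates.
value, w = blend_ope_estimates([4.8, 5.6, 5.1], [0.9, 0.2, 0.4])
print(value, w)
```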
arXiv Detail & Related papers (2024-05-27T23:51:20Z)
- When is Off-Policy Evaluation (Reward Modeling) Useful in Contextual Bandits? A Data-Centric Perspective [64.73162159837956]
Evaluating the value of a hypothetical target policy with only a logged dataset is important but challenging.
We propose DataCOPE, a data-centric framework for evaluating a target policy given a dataset.
Our empirical analysis of DataCOPE in the logged contextual bandit settings using healthcare datasets confirms its ability to evaluate both machine-learning and human expert policies.
arXiv Detail & Related papers (2023-11-23T17:13:37Z)
- Efficient Online Reinforcement Learning with Offline Data [78.92501185886569]
We show that we can simply apply existing off-policy methods to leverage offline data when learning online.
We extensively ablate these design choices, demonstrating the key factors that most affect performance.
We see that correct application of these simple recommendations can provide a 2.5x improvement over existing approaches.
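One of the design choices this line of work examines is how offline and online data are mixed within each training batch so that existing off-policy updates can consume both. A minimal sketch, where the even 50/50 split and the plain-list buffer representation are assumptions for illustration:

```python
# Illustrative mixing of offline and online transitions into a single training batch.
import random

def sample_mixed_batch(offline_data, online_buffer, batch_size=256):
    """Draw half of each batch from the offline dataset, half from online experience."""
    half = batch_size // 2
    batch = random.sample(offline_data, half) + random.sample(online_buffer, batch_size - half)
    random.shuffle(batch)
    return batch  # feed to any existing off-policy update (e.g. a SAC-style critic step)
```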
arXiv Detail & Related papers (2023-02-06T17:30:22Z)
- Offline Policy Evaluation and Optimization under Confounding [35.778917456294046]
We map out the landscape of offline policy evaluation for confounded MDPs.
We characterize settings where consistent value estimates are provably not achievable.
We present new algorithms for offline policy improvement and prove local convergence guarantees.
arXiv Detail & Related papers (2022-11-29T20:45:08Z)
- Latent-Variable Advantage-Weighted Policy Optimization for Offline RL [70.01851346635637]
Offline reinforcement learning methods hold the promise of learning policies from pre-collected datasets without the need to query the environment for new transitions.
In practice, offline datasets are often heterogeneous, i.e., collected in a variety of scenarios.
We propose to leverage latent-variable policies that can represent a broader class of policy distributions.
Our method improves over the next best-performing offline reinforcement learning methods by 49% in average performance on heterogeneous datasets.
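For context, advantage-weighted policy optimization objectives of this family weight a behavioral-cloning style log-likelihood by exponentiated advantages; the latent-variable machinery itself is not shown here. A minimal PyTorch sketch, where the temperature and the weight clipping value are assumptions:

```python
# Hedged sketch of an advantage-weighted objective; not the paper's exact method.
import torch

def advantage_weighted_loss(log_probs, advantages, temperature=1.0, max_weight=20.0):
    """log_probs: log pi(a|s) for dataset actions; advantages: estimated A(s, a)."""
    # Exponentiated advantages act as per-sample weights on a behavioral-cloning loss;
    # clipping keeps the weights from blowing up on large advantages.
    weights = torch.clamp(torch.exp(advantages / temperature), max=max_weight)
    return -(weights.detach() * log_probs).mean()
```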
arXiv Detail & Related papers (2022-03-16T21:17:03Z)
- Doubly Robust Interval Estimation for Optimal Policy Evaluation in Online Learning [8.736154600219685]
Policy evaluation in online learning has attracted increasing attention.
Yet the problem is particularly challenging due to the dependent data generated in the online environment.
We develop the doubly robust interval estimation (DREAM) method to infer the value under the estimated optimal policy in online learning.
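DREAM builds on doubly robust estimation; as background, here is a minimal sketch of the standard one-step doubly robust value estimate. It omits the paper's sequential, online-learning construction and its interval (confidence) machinery.

```python
# Standard one-step doubly robust value estimate, shown as background only.
import numpy as np

def doubly_robust_value(rewards, behavior_probs, target_probs, q_logged, q_target):
    """
    rewards:        observed rewards for the logged actions
    behavior_probs: behavior-policy probabilities of the logged actions
    target_probs:   target-policy probabilities of the logged actions
    q_logged:       reward model's prediction for each logged (context, action)
    q_target:       reward model's expected value under the target policy at each context
    """
    iw = np.asarray(target_probs, dtype=float) / np.asarray(behavior_probs, dtype=float)
    correction = iw * (np.asarray(rewards, dtype=float) - np.asarray(q_logged, dtype=float))
    return float(np.mean(np.asarray(q_target, dtype=float) + correction))
```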
arXiv Detail & Related papers (2021-10-29T02:38:54Z)
- Supervised Off-Policy Ranking [145.3039527243585]
Off-policy evaluation (OPE) leverages data generated by other policies to evaluate a target policy.
We propose supervised off-policy ranking that learns a policy scoring model by correctly ranking training policies with known performance.
Our method outperforms strong baseline OPE methods in terms of both rank correlation and the performance gap between the truly best policy and the best of the top three ranked policies.
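A minimal sketch of the supervised-ranking idea: fit a scoring model so that policies with higher known returns receive higher scores. The pairwise logistic loss below is an illustrative choice, not necessarily the paper's exact objective, and the policy-featurization step is omitted.

```python
# Illustrative pairwise ranking loss for a policy scoring model.
import torch
import torch.nn.functional as F

def pairwise_ranking_loss(scores, true_returns):
    """scores: model scores for a batch of policies; true_returns: their known performance."""
    diff = scores.unsqueeze(1) - scores.unsqueeze(0)                          # s_i - s_j
    better = (true_returns.unsqueeze(1) > true_returns.unsqueeze(0)).float()  # 1 if policy i beats j
    mask = (true_returns.unsqueeze(1) != true_returns.unsqueeze(0)).float()   # ignore tied pairs
    loss = F.binary_cross_entropy_with_logits(diff, better, reduction="none")
    return (loss * mask).sum() / mask.sum().clamp(min=1.0)
```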
arXiv Detail & Related papers (2021-07-03T07:01:23Z)
- Active Offline Policy Selection [19.18251239758809]
This paper addresses the problem of policy selection in domains with abundant logged data, but with a very restricted interaction budget.
Several off-policy evaluation (OPE) techniques have been proposed to assess the value of policies using only logged data.
We introduce a novel active offline policy selection problem formulation, which combines logged data and limited online interactions to identify the best policy.
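A minimal sketch of the active selection loop under simplifying assumptions: OPE estimates seed per-policy value estimates, and a small online budget is spent on the most promising or uncertain candidates. The actual method uses a Bayesian model over policies; the UCB-style rule and the bonus constant below are stand-ins.

```python
# Illustrative active selection loop: refine OPE estimates with a small online budget.
import numpy as np

def active_policy_selection(ope_estimates, rollout_fn, budget=20, bonus=1.0):
    n = len(ope_estimates)
    means = np.asarray(ope_estimates, dtype=float)    # OPE estimates act as the prior mean
    counts = np.ones(n)                               # treat each prior as worth one rollout
    for _ in range(budget):
        ucb = means + bonus / np.sqrt(counts)         # optimism: prefer uncertain policies
        i = int(np.argmax(ucb))
        ret = rollout_fn(i)                           # one online episode with policy i
        counts[i] += 1
        means[i] += (ret - means[i]) / counts[i]      # running-mean update
    return int(np.argmax(means))                      # best policy under the refined estimates
```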
arXiv Detail & Related papers (2021-06-18T17:33:13Z)
- MUSBO: Model-based Uncertainty Regularized and Sample Efficient Batch Optimization for Deployment Constrained Reinforcement Learning [108.79676336281211]
Continuous deployment of new policies for data collection and online learning is either cost-ineffective or impractical.
We propose a new algorithmic learning framework called Model-based Uncertainty regularized and Sample Efficient Batch Optimization.
Our framework discovers novel and high-quality samples for each deployment to enable efficient data collection.
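To illustrate what model-based uncertainty regularization can look like in code, the sketch below penalizes an imagined reward by the disagreement of an ensemble of dynamics models; the penalty form and coefficient are assumptions rather than MUSBO's exact rule.

```python
# Illustrative uncertainty penalty based on dynamics-ensemble disagreement.
import numpy as np

def uncertainty_regularized_reward(ensemble_next_states, reward, penalty_coef=1.0):
    """ensemble_next_states: (n_models, state_dim) predictions for a single transition."""
    preds = np.asarray(ensemble_next_states, dtype=float)
    # Ensemble disagreement around the mean prediction serves as an uncertainty proxy.
    disagreement = float(np.max(np.linalg.norm(preds - preds.mean(axis=0), axis=-1)))
    return reward - penalty_coef * disagreement
```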
arXiv Detail & Related papers (2021-02-23T01:30:55Z)
This list is automatically generated from the titles and abstracts of the papers on this site.