Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation
- URL: http://arxiv.org/abs/2311.18207v3
- Date: Mon, 11 Mar 2024 00:29:24 GMT
- Title: Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation
- Authors: Haruka Kiyohara, Ren Kishimoto, Kosuke Kawakami, Ken Kobayashi, Kazuhide Nakata, Yuta Saito
- Abstract summary: Off-Policy Evaluation (OPE) aims to assess the effectiveness of counterfactual policies using only offline logged data.
Existing evaluation metrics for OPE estimators primarily focus on the "accuracy" of OPE or that of downstream policy selection.
We develop a new metric, called SharpeRatio@k, which measures the risk-return tradeoff of policy portfolios formed by an OPE estimator.
- Score: 17.319113169622806
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Off-Policy Evaluation (OPE) aims to assess the effectiveness of
counterfactual policies using only offline logged data and is often used to
identify the top-k promising policies for deployment in online A/B tests.
Existing evaluation metrics for OPE estimators primarily focus on the
"accuracy" of OPE or that of downstream policy selection, neglecting
the risk-return tradeoff in the subsequent online policy deployment. To address
this issue, we draw inspiration from portfolio evaluation in finance and
develop a new metric, called SharpeRatio@k, which measures the risk-return
tradeoff of policy portfolios formed by an OPE estimator under varying online
evaluation budgets (k). We validate our metric in two example scenarios,
demonstrating its ability to effectively distinguish between low-risk and
high-risk estimators and to accurately identify the most efficient one.
The efficiency of an estimator is characterized by its capability to form the most
advantageous policy portfolios, maximizing returns while minimizing risks
during online deployment, a nuance that existing metrics typically overlook. To
facilitate a quick, accurate, and consistent evaluation of OPE via
SharpeRatio@k, we have also integrated this metric into the open-source
software SCOPE-RL (https://github.com/hakuhodo-technologies/scope-rl).
Employing SharpeRatio@k and SCOPE-RL, we conduct comprehensive benchmarking
experiments on various estimators and RL tasks, focusing on their risk-return
tradeoff. These experiments offer several interesting directions and
suggestions for future OPE research.
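For intuition, the sketch below shows how a SharpeRatio@k-style metric could be computed with NumPy. This is not SCOPE-RL's implementation: it assumes the portfolio's return is the best online value among the top-k policies ranked by the estimator, measured against the behavior policy as the risk-free baseline, and its risk is the standard deviation of those online values; all function and variable names are illustrative.

```python
import numpy as np

def sharpe_ratio_at_k(estimated_values, online_values, behavior_value, k):
    """Hedged sketch of a SharpeRatio@k-style metric.

    estimated_values: OPE estimates for each candidate policy (used only for ranking).
    online_values: (approximate) online values of the same policies.
    behavior_value: value of the currently deployed (behavior) policy,
        used here as the risk-free baseline.
    k: online evaluation budget, i.e. how many top-ranked policies get deployed.
    """
    # Form the "policy portfolio": the top-k policies ranked by the OPE estimator.
    top_k = np.argsort(estimated_values)[::-1][:k]
    portfolio_values = online_values[top_k]

    # Return: improvement of the best deployed policy over the behavior policy.
    best_return = portfolio_values.max() - behavior_value
    # Risk: dispersion of performance inside the portfolio during deployment.
    risk = portfolio_values.std()

    return best_return / risk if risk > 0 else np.inf


# Toy usage: an estimator that ranks a risky policy first gets penalized.
est = np.array([0.9, 0.8, 0.7, 0.2])    # OPE estimates
online = np.array([0.1, 0.6, 0.5, 0.4])  # actual online values
print(sharpe_ratio_at_k(est, online, behavior_value=0.3, k=3))
```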
Related papers
- Large Language Model driven Policy Exploration for Recommender Systems [50.70228564385797]
Offline RL policies trained on static user data are vulnerable to distribution shift when deployed in dynamic online environments.
Online RL-based recommender systems (RS) also face challenges in production deployment due to the risks of exposing users to untrained or unstable policies.
Large Language Models (LLMs) offer a promising solution to mimic user objectives and preferences for pre-training policies offline.
We propose an Interaction-Augmented Learned Policy (iALP) that utilizes user preferences distilled from an LLM.
arXiv Detail & Related papers (2025-01-23T16:37:44Z)
- OPERA: Automatic Offline Policy Evaluation with Re-weighted Aggregates of Multiple Estimators [13.408838970377035]
Offline policy evaluation (OPE) allows us to evaluate and estimate a new sequential decision-making policy's performance.
We propose a new algorithm that adaptively blends a set of OPE estimators given a dataset, without relying on an explicit selection using a statistical procedure (see the sketch after this list).
Our work contributes to improving ease of use for a general-purpose, estimator-agnostic, off-policy evaluation framework for offline RL.
arXiv Detail & Related papers (2024-05-27T23:51:20Z)
- Uncertainty-Aware Instance Reweighting for Off-Policy Learning [63.31923483172859]
We propose an Uncertainty-aware Inverse Propensity Score estimator (UIPS) for improved off-policy learning.
Experiment results on synthetic and three real-world recommendation datasets demonstrate the advantageous sample efficiency of the proposed UIPS estimator.
arXiv Detail & Related papers (2023-03-11T11:42:26Z)
- Improved Policy Evaluation for Randomized Trials of Algorithmic Resource Allocation [54.72195809248172]
We present a new estimator leveraging a novel concept: retrospective reshuffling of participants across experimental arms at the end of an RCT.
We prove theoretically that such an estimator is more accurate than common estimators based on sample means.
arXiv Detail & Related papers (2023-02-06T05:17:22Z)
- Off-policy evaluation for learning-to-rank via interpolating the item-position model and the position-based model [83.83064559894989]
A critical need for industrial recommender systems is the ability to evaluate recommendation policies offline, before deploying them to production.
We develop a new estimator that mitigates the problems of the two most popular off-policy estimators for rankings.
In particular, the new estimator, called INTERPOL, addresses the bias of a potentially misspecified position-based model.
arXiv Detail & Related papers (2022-10-15T17:22:30Z)
- Off-Policy Risk Assessment in Markov Decision Processes [15.225153671736201]
We develop the first doubly robust (DR) estimator for the CDF of returns in Markov decision processes (MDPs).
This estimator enjoys significantly less variance and, when the model is well specified, achieves the Cramér-Rao variance lower bound.
We derive the first minimax lower bounds for off-policy CDF and risk estimation, which match our error bounds up to a constant factor.
arXiv Detail & Related papers (2022-09-21T15:40:59Z)
- A Risk-Sensitive Approach to Policy Optimization [21.684251937825234]
Standard deep reinforcement learning (DRL) aims to maximize expected reward, considering collected experiences equally in formulating a policy.
We propose a more direct approach whereby risk-sensitive objectives, specified in terms of the cumulative distribution function (CDF) of the distribution of full-episode rewards, are optimized.
We demonstrate that the use of moderately "pessimistic" risk profiles, which emphasize scenarios where the agent performs poorly, leads to enhanced exploration and a continual focus on addressing deficiencies.
arXiv Detail & Related papers (2022-08-19T00:55:05Z)
- Data-Driven Off-Policy Estimator Selection: An Application in User Marketing on an Online Content Delivery Service [11.986224119327387]
Off-policy evaluation is essential in domains such as healthcare, marketing or recommender systems.
Many OPE methods with theoretical backing have been proposed, but it is often unclear to practitioners which estimator to use for their specific applications and purposes.
arXiv Detail & Related papers (2021-09-17T15:53:53Z)
- Evaluating the Robustness of Off-Policy Evaluation [10.760026478889664]
Off-policy Evaluation (OPE) evaluates the performance of hypothetical policies leveraging only offline log data.
It is particularly useful in applications where online interaction is high-stakes and expensive.
We develop Interpretable Evaluation for Offline Evaluation (IEOE), an experimental procedure to evaluate OPE estimators' robustness.
arXiv Detail & Related papers (2021-08-31T09:33:13Z)
- Benchmarks for Deep Off-Policy Evaluation [152.28569758144022]
We present a collection of policies that can be used for benchmarking off-policy evaluation.
The goal of our benchmark is to provide a standardized measure of progress that is motivated by a set of principles.
We provide open-source access to our data and code to foster future research in this area.
arXiv Detail & Related papers (2021-03-30T18:09:33Z)
- Reliable Off-policy Evaluation for Reinforcement Learning [53.486680020852724]
In a sequential decision-making problem, off-policy evaluation estimates the expected cumulative reward of a target policy.
We propose a novel framework that provides robust and optimistic cumulative reward estimates using one or multiple logged datasets.
arXiv Detail & Related papers (2020-11-08T23:16:19Z)
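As referenced in the OPERA entry above, one generic way to blend several OPE estimators is to weight each estimator's point estimate by its inverse estimated variance. The sketch below illustrates only that general idea under stated assumptions; it is not OPERA's actual statistical procedure, and all names are illustrative.

```python
import numpy as np

def blend_ope_estimates(per_episode_estimates):
    """Hedged, generic sketch of blending several OPE estimators into one estimate.

    per_episode_estimates: dict mapping estimator name -> array of per-episode
        value estimates, so each estimator's mean is its point estimate and the
        variance of that mean proxies its reliability.
    """
    names = list(per_episode_estimates)
    means = np.array([per_episode_estimates[n].mean() for n in names])
    variances = np.array(
        [per_episode_estimates[n].var(ddof=1) / len(per_episode_estimates[n]) for n in names]
    )

    # Precision weighting: trust low-variance estimators more.
    # (OPERA uses its own statistical procedure; this shows only the blending idea.)
    weights = 1.0 / np.maximum(variances, 1e-12)
    weights /= weights.sum()

    return float(np.dot(weights, means)), dict(zip(names, weights))


# Toy usage with two hypothetical estimators.
rng = np.random.default_rng(0)
estimates = {
    "ips": rng.normal(0.5, 0.4, size=200),  # high-variance estimator
    "dm": rng.normal(0.45, 0.1, size=200),  # low-variance (possibly biased) estimator
}
print(blend_ope_estimates(estimates))
```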