Related papers: PERRY: Policy Evaluation with Confidence Intervals using Auxiliary Data

PERRY: Policy Evaluation with Confidence Intervals using Auxiliary Data

URL: http://arxiv.org/abs/2507.20068v1
Date: Sat, 26 Jul 2025 21:51:15 GMT
Title: PERRY: Policy Evaluation with Confidence Intervals using Auxiliary Data
Authors: Aishwarya Mandyam, Jason Meng, Ge Gao, Jiankai Sun, Mac Schwager, Barbara E. Engelhardt, Emma Brunskill,
Abstract summary: Off-policy evaluation (OPE) methods aim to estimate the value of a new reinforcement learning (RL) policy prior to deployment.<n>We propose two approaches to construct valid confidence intervals for OPE when using data augmentation.
Score: 36.6443700664411
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Off-policy evaluation (OPE) methods aim to estimate the value of a new reinforcement learning (RL) policy prior to deployment. Recent advances have shown that leveraging auxiliary datasets, such as those synthesized by generative models, can improve the accuracy of these value estimates. Unfortunately, such auxiliary datasets may also be biased, and existing methods for using data augmentation for OPE in RL lack principled uncertainty quantification. In high stakes settings like healthcare, reliable uncertainty estimates are important for comparing policy value estimates. In this work, we propose two approaches to construct valid confidence intervals for OPE when using data augmentation. The first provides a confidence interval over the policy performance conditioned on a particular initial state $V^{\pi}(s_0)$-- such intervals are particularly important for human-centered applications. To do so we introduce a new conformal prediction method for high dimensional state MDPs. Second, we consider the more common task of estimating the average policy performance over many initial states; to do so we draw on ideas from doubly robust estimation and prediction powered inference. Across simulators spanning robotics, healthcare and inventory management, and a real healthcare dataset from MIMIC-IV, we find that our methods can use augmented data and still consistently produce intervals that cover the ground truth values, unlike previously proposed methods.

Related papers

Off-Policy Evaluation and Learning for the Future under Non-Stationarity [18.657003350333298]
We study the novel problem of future off-policy evaluation (F-OPE) and learning (F-OPL)<n>Our goal is often to estimate and optimize the policy value for the upcoming month using data collected by an old policy in the previous month.<n>Existing methods assume stationarity or depend on restrictive reward-modeling assumptions, leading to significant bias.
arXiv Detail & Related papers (2025-06-25T13:31:46Z)
Stratified Prediction-Powered Inference for Hybrid Language Model Evaluation [62.2436697657307]
Prediction-powered inference (PPI) is a method that improves statistical estimates based on limited human-labeled data.<n>We propose a method called Stratified Prediction-Powered Inference (StratPPI)<n>We show that the basic PPI estimates can be considerably improved by employing simple data stratification strategies.
arXiv Detail & Related papers (2024-06-06T17:37:39Z)
Off-Policy Evaluation for Large Action Spaces via Policy Convolution [60.6953713877886]
Policy Convolution family of estimators uses latent structure within actions to strategically convolve the logging and target policies. Experiments on synthetic and benchmark datasets demonstrate remarkable mean squared error (MSE) improvements when using PC.
arXiv Detail & Related papers (2023-10-24T01:00:01Z)
Uncertainty-Aware Instance Reweighting for Off-Policy Learning [63.31923483172859]
We propose a Uncertainty-aware Inverse Propensity Score estimator (UIPS) for improved off-policy learning. Experiment results on synthetic and three real-world recommendation datasets demonstrate the advantageous sample efficiency of the proposed UIPS estimator.
arXiv Detail & Related papers (2023-03-11T11:42:26Z)
Asymptotically Unbiased Off-Policy Policy Evaluation when Reusing Old Data in Nonstationary Environments [31.492146288630515]
We introduce a variant of the doubly robust (DR) estimator, called the regression-assisted DR estimator, that can incorporate the past data without introducing a large bias. We empirically show that the new estimator improves estimation for the current and future policy values, and provides a tight and valid interval estimation in several nonstationary recommendation environments.
arXiv Detail & Related papers (2023-02-23T01:17:21Z)
Off-Policy Evaluation via Adaptive Weighting with Data from Contextual Bandits [5.144809478361604]
We improve the doubly robust (DR) estimator by adaptively weighting observations to control its variance. We provide empirical evidence for our estimator's improved accuracy and inferential properties relative to existing alternatives.
arXiv Detail & Related papers (2021-06-03T17:54:44Z)
Non-asymptotic Confidence Intervals of Off-policy Evaluation: Primal and Dual Bounds [21.520045697447372]
Off-policy evaluation (OPE) is the task of estimating the expected reward of a given policy based on offline data previously collected under different policies. This work considers the problem of constructing non-asymptotic confidence intervals in infinite-horizon off-policy evaluation. We develop a practical algorithm through a primal-dual optimization-based approach.
arXiv Detail & Related papers (2021-03-09T22:31:20Z)
CoinDICE: Off-Policy Confidence Interval Estimation [107.86876722777535]
We study high-confidence behavior-agnostic off-policy evaluation in reinforcement learning. We show in a variety of benchmarks that the confidence interval estimates are tighter and more accurate than existing methods.
arXiv Detail & Related papers (2020-10-22T12:39:11Z)
Accountable Off-Policy Evaluation With Kernel Bellman Statistics [29.14119984573459]
We consider off-policy evaluation (OPE), which evaluates the performance of a new policy from observed data collected from previous experiments. Due to the limited information from off-policy data, it is highly desirable to construct rigorous confidence intervals, not just point estimation. We propose a new variational framework which reduces the problem of calculating tight confidence bounds in OPE.
arXiv Detail & Related papers (2020-08-15T07:24:38Z)
Interpretable Off-Policy Evaluation in Reinforcement Learning by Highlighting Influential Transitions [48.91284724066349]
Off-policy evaluation in reinforcement learning offers the chance of using observational data to improve future outcomes in domains such as healthcare and education. Traditional measures such as confidence intervals may be insufficient due to noise, limited data and confounding. We develop a method that could serve as a hybrid human-AI system, to enable human experts to analyze the validity of policy evaluation estimates.
arXiv Detail & Related papers (2020-02-10T00:26:43Z)

This list is automatically generated from the titles and abstracts of the papers in this site.