Predicting Long Term Sequential Policy Value Using Softer Surrogates
- URL: http://arxiv.org/abs/2412.20638v2
- Date: Mon, 03 Feb 2025 02:11:14 GMT
- Title: Predicting Long Term Sequential Policy Value Using Softer Surrogates
- Authors: Hyunji Nam, Allen Nie, Ge Gao, Vasilis Syrgkanis, Emma Brunskill
- Abstract summary: Off-policy policy evaluation estimates the outcome of a new policy using historical data collected from a different policy.
We show that our estimators can provide accurate predictions of the policy value after observing only 10% of the full horizon data.
- Abstract: Off-policy policy evaluation (OPE) estimates the outcome of a new policy using historical data collected from a different policy. However, existing OPE methods cannot handle cases in which the new policy introduces novel actions. This issue commonly occurs in real-world domains, like healthcare, as new drugs and treatments are continuously developed. Novel actions necessitate on-policy data collection, which can be burdensome and expensive if the outcome of interest takes a substantial amount of time to observe--for example, in multi-year clinical trials. This raises a key question: how can we predict the long-term outcome of a policy after observing only its short-term effects? Though in general this problem is intractable, under some surrogacy conditions the short-term on-policy data can be combined with the long-term historical data to make accurate predictions about the new policy's long-term value. In two simulated healthcare examples--HIV and sepsis management--we show that our estimators can provide accurate predictions of the policy value after observing only 10% of the full horizon data. We also provide a finite sample analysis of our doubly robust estimators.
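To make the surrogacy idea concrete, here is a minimal sketch under assumed conditions, not the paper's estimator: fit a regression from short-horizon surrogate statistics to eventual long-term returns on the long-horizon historical data, then apply that regression to the new policy's short on-policy rollouts. The synthetic data, the horizons H and h, and the rollout_returns helper are all illustrative assumptions; the paper's doubly robust estimators additionally correct this kind of direct regression, which the sketch omits.

```python
# Hedged sketch of surrogate-based long-term value prediction (synthetic data,
# not the paper's estimator). Long-horizon historical trajectories teach a map
# from short-horizon surrogates to long-term return; short on-policy rollouts
# of the new policy are then pushed through that map.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

def rollout_returns(n, horizon, drift):
    """Synthetic per-step rewards for n trajectories (hypothetical helper)."""
    return rng.normal(loc=drift, scale=1.0, size=(n, horizon))

H, h = 100, 10                                     # full vs. short observed horizon
hist_rewards = rollout_returns(500, H, drift=0.5)  # historical (behavior-policy) data
new_rewards = rollout_returns(50, h, drift=0.7)    # short on-policy data, new policy

# Surrogate feature: cumulative reward over the first h steps of each historical trajectory.
surrogates_hist = hist_rewards[:, :h].sum(axis=1)
long_term_hist = hist_rewards.sum(axis=1)          # observed long-term return

# Surrogacy-style regression: long-term return as a function of the short-term surrogate.
model = Ridge(alpha=1.0).fit(surrogates_hist[:, None], long_term_hist)

# Predict the new policy's long-term value from its short rollouts.
surrogates_new = new_rewards.sum(axis=1)
predicted_value = model.predict(surrogates_new[:, None]).mean()
print(f"predicted long-term value of the new policy: {predicted_value:.2f}")
```

In this toy setup the surrogate is a single scalar cumulative reward; richer state-based surrogates and reweighting corrections would be needed for the sequential, novel-action setting the paper studies.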
Related papers
- Short-Long Policy Evaluation with Novel Actions [26.182640173932956]
We introduce a new setting for short-long policy evaluation for sequential decision making tasks.
Our proposed methods significantly outperform prior results on simulators of HIV treatment, kidney dialysis and battery charging.
We also demonstrate that our methods can be useful for applications in AI safety by quickly identifying when a new decision policy is likely to have substantially lower performance than past policies.
arXiv Detail & Related papers (2024-07-04T06:42:21Z)
- Conformal Off-Policy Evaluation in Markov Decision Processes [53.786439742572995]
Reinforcement Learning aims at identifying and evaluating efficient control policies from data.
Most methods for this learning task, referred to as Off-Policy Evaluation (OPE), do not come with accuracy and certainty guarantees.
We present a novel OPE method based on Conformal Prediction that outputs an interval containing the true reward of the target policy with a prescribed level of certainty (see the split-conformal sketch after this list).
arXiv Detail & Related papers (2023-04-05T16:45:11Z)
- Identification of Subgroups With Similar Benefits in Off-Policy Policy Evaluation [60.71312668265873]
We develop a method to balance the need for personalization with confident predictions.
We show that our method can be used to form accurate predictions of heterogeneous treatment effects.
arXiv Detail & Related papers (2021-11-28T23:19:12Z)
- Estimating the Long-Term Effects of Novel Treatments [22.67249938461999]
Policy makers typically need to estimate the long-term effects of novel treatments.
We propose a surrogate-based approach in which the long-term effect is assumed to be channeled through a multitude of available short-term proxies.
arXiv Detail & Related papers (2021-03-15T13:56:48Z)
- Targeting for long-term outcomes [1.7205106391379026]
Decision makers often want to target interventions so as to maximize an outcome that is observed only in the long-term.
Here we build on the statistical surrogacy and policy learning literatures to impute the missing long-term outcomes.
We apply our approach in two large-scale proactive churn management experiments at The Boston Globe.
arXiv Detail & Related papers (2020-10-29T18:31:17Z)
- Provably Good Batch Reinforcement Learning Without Great Exploration [51.51462608429621]
Batch reinforcement learning (RL) is important for applying RL algorithms to many high-stakes tasks.
Recent algorithms have shown promise but can still be overly optimistic in their expected outcomes.
We show that a small modification to the Bellman optimality and evaluation back-ups, making the update more conservative, can yield much stronger guarantees.
arXiv Detail & Related papers (2020-07-16T09:25:54Z)
- Doubly Robust Off-Policy Value and Gradient Estimation for Deterministic Policies [80.42316902296832]
We study the estimation of policy value and gradient of a deterministic policy from off-policy data when actions are continuous.
In this setting, standard importance sampling and doubly robust estimators for policy value and gradient fail because the density ratio does not exist.
We propose several new doubly robust estimators based on different kernelization approaches.
arXiv Detail & Related papers (2020-06-06T15:52:05Z)
- Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation [49.502277468627035]
This paper studies the statistical theory of batch data reinforcement learning with function approximation.
We consider the off-policy evaluation problem, which is to estimate the cumulative value of a new target policy from logged history.
arXiv Detail & Related papers (2020-02-21T19:20:57Z)
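As referenced in the conformal OPE entry above, the following is a generic split-conformal sketch, not the cited paper's MDP-specific construction: given a calibration set of predicted versus realized per-trajectory returns (synthetic here), it widens a new point prediction into an interval with roughly 1 - alpha coverage. The variable names and the Gaussian calibration data are assumptions for illustration.

```python
# Generic split-conformal interval sketch (synthetic data; not the cited
# paper's MDP-specific method). Conformity scores from a calibration set give
# a quantile that widens a point prediction into a ~(1 - alpha) interval.
import numpy as np

rng = np.random.default_rng(1)
alpha = 0.1

# Calibration set: predicted vs. realized trajectory returns (synthetic).
pred_cal = rng.normal(10.0, 1.0, size=200)
true_cal = pred_cal + rng.normal(0.0, 0.5, size=200)

# Absolute-residual conformity scores and their finite-sample-corrected quantile.
scores = np.abs(true_cal - pred_cal)
n = len(scores)
q = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n, method="higher")

# Interval for a new trajectory's return around its point prediction.
pred_new = 10.3
print(f"~{100 * (1 - alpha):.0f}% interval: [{pred_new - q:.2f}, {pred_new + q:.2f}]")
```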