Short-Long Policy Evaluation with Novel Actions
- URL: http://arxiv.org/abs/2407.03674v2
- Date: Tue, 9 Jul 2024 18:05:10 GMT
- Title: Short-Long Policy Evaluation with Novel Actions
- Authors: Hyunji Alex Nam, Yash Chandak, Emma Brunskill
- Abstract summary: We introduce a new setting for short-long policy evaluation for sequential decision making tasks.
Our proposed methods significantly outperform prior results on simulators of HIV treatment, kidney dialysis and battery charging.
We also demonstrate that our methods can be useful for applications in AI safety by quickly identifying when a new decision policy is likely to have substantially lower performance than past policies.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: From incorporating LLMs in education, to identifying new drugs and improving ways to charge batteries, innovators constantly try new strategies in search of better long-term outcomes for students, patients and consumers. One major bottleneck in this innovation cycle is the amount of time it takes to observe the downstream effects of a decision policy that incorporates new interventions. The key question is whether we can quickly evaluate long-term outcomes of a new decision policy without making long-term observations. Organizations often have access to prior data about past decision policies and their outcomes, evaluated over the full horizon of interest. Motivated by this, we introduce a new setting for short-long policy evaluation for sequential decision making tasks. Our proposed methods significantly outperform prior results on simulators of HIV treatment, kidney dialysis and battery charging. We also demonstrate that our methods can be useful for applications in AI safety by quickly identifying when a new decision policy is likely to have substantially lower performance than past policies.
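The abstract describes the short-long evaluation setting only at a high level. The sketch below is a minimal, hypothetical illustration of that setting with made-up numbers: it fits a simple regression from short-horizon returns to full-horizon returns across past policies and extrapolates to a new policy. It is a naive baseline for the problem statement, not the paper's proposed estimator.

```python
# Minimal, hypothetical sketch of the short-long evaluation setting: past
# policies have both short- and full-horizon returns, while the new policy has
# only a short-horizon estimate. One simple baseline (not the paper's method)
# is to regress long-horizon return on short-horizon return across past
# policies and extrapolate for the new policy.
import numpy as np

# Hypothetical data: (short-horizon return, full-horizon return) per past policy.
past_short = np.array([2.1, 3.4, 1.8, 4.0, 2.9])
past_long = np.array([10.5, 16.2, 9.1, 19.8, 14.0])

# Fit a simple linear map short -> long on the historical policies.
slope, intercept = np.polyfit(past_short, past_long, deg=1)

def predict_long_value(new_short_return: float) -> float:
    """Predict the full-horizon value of a new policy from its short-horizon return."""
    return slope * new_short_return + intercept

new_policy_short = 3.1  # observed after only a short deployment
print(f"Predicted long-horizon value: {predict_long_value(new_policy_short):.2f}")

# A safety-style check, echoing the abstract: flag the new policy if its
# predicted long-horizon value falls well below the historical average.
if predict_long_value(new_policy_short) < past_long.mean() - past_long.std():
    print("Warning: new policy likely underperforms past policies.")
```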
Related papers
- Off-Policy Evaluation and Counterfactual Methods in Dynamic Auction Environments
Off-Policy Evaluation allows researchers to assess new policies without costly experiments, speeding up the evaluation process.
Online experimental methods, such as A/B tests, are effective but often slow, thus delaying the policy selection and optimization process.
By utilizing counterfactual estimators as a preliminary step before conducting A/B tests, we aim to streamline the evaluation process.
arXiv Detail & Related papers (2025-01-09T14:39:40Z)
- Predicting Long Term Sequential Policy Value Using Softer Surrogates
Off-policy policy evaluation estimates the outcome of a new policy using historical data collected from a different policy.
We show that our estimators can provide accurate predictions of the policy value after observing only 10% of the full-horizon data.
arXiv Detail & Related papers (2024-12-30T01:01:15Z)
- OPERA: Automatic Offline Policy Evaluation with Re-weighted Aggregates of Multiple Estimators
Offline policy evaluation (OPE) allows us to estimate a new sequential decision-making policy's performance.
We propose a new algorithm that adaptively blends a set of OPE estimators given a dataset without relying on an explicit selection using a statistical procedure.
Our work contributes a general-purpose, estimator-agnostic, easy-to-use off-policy evaluation framework for offline RL; a minimal sketch of the estimator-blending idea follows this entry.
arXiv Detail & Related papers (2024-05-27T23:51:20Z)
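A hedged sketch of the "re-weighted aggregate" idea in the OPERA entry above: combine several OPE estimates with data-driven weights instead of picking one. Inverse bootstrap variance weighting is only an illustrative choice here, not necessarily OPERA's actual statistical procedure; all data and estimators are made up.

```python
# Blend multiple OPE estimators by weighting each with its inverse bootstrap
# variance (an illustrative stand-in for a re-weighted aggregate).
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_variance(estimator, data, n_boot=200):
    """Estimate an OPE estimator's variance by bootstrapping the dataset."""
    n = len(data)
    values = [estimator(data[rng.integers(0, n, size=n)]) for _ in range(n_boot)]
    return np.var(values)

def blended_estimate(estimators, data):
    """Weight each estimator by its inverse bootstrap variance and average."""
    estimates = np.array([est(data) for est in estimators])
    weights = np.array([1.0 / (bootstrap_variance(est, data) + 1e-8) for est in estimators])
    weights /= weights.sum()
    return float(weights @ estimates)

# Toy example: two crude "estimators" over hypothetical per-episode returns.
data = rng.normal(loc=5.0, scale=2.0, size=100)
estimators = [lambda d: d.mean(), lambda d: np.median(d)]
print(f"Blended OPE estimate: {blended_estimate(estimators, data):.2f}")
```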
- Reduced-Rank Multi-objective Policy Learning and Optimization
In practice, causal researchers do not have a single outcome in mind a priori.
In government-assisted social benefit programs, policymakers collect many outcomes to understand the multidimensional nature of poverty.
We present a data-driven dimensionality-reduction methodology for multiple outcomes in the context of optimal policy learning.
arXiv Detail & Related papers (2024-04-29T08:16:30Z)
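A hedged sketch of the dimensionality-reduction idea in the entry above: when policymakers record many outcome variables, collapse them into a few indices before learning a policy. PCA is used here purely as a stand-in for the paper's reduced-rank methodology, and the data are synthetic.

```python
# Reduce many outcome measures to a few low-dimensional indices via PCA
# before any downstream policy learning.
import numpy as np

rng = np.random.default_rng(3)
n_units, n_outcomes, n_components = 500, 12, 2

# Hypothetical outcome matrix: each row is a unit, each column an outcome measure.
outcomes = rng.normal(size=(n_units, n_outcomes))

# Center and take the top principal components as low-dimensional outcome indices.
centered = outcomes - outcomes.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
indices = centered @ vt[:n_components].T  # shape: (n_units, n_components)

print(f"Reduced {n_outcomes} outcomes to {indices.shape[1]} indices per unit")
# Downstream, a policy could be learned to optimize a weighted combination of
# these indices instead of all raw outcomes.
```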
- On the Value of Myopic Behavior in Policy Reuse
Leveraging learned strategies in unfamiliar scenarios is fundamental to human intelligence.
In this work, we present a framework called Selective Myopic bEhavior Control (SMEC).
SMEC adaptively aggregates the sharable short-term behaviors of prior policies and the long-term behaviors of the task policy, leading to coordinated decisions.
arXiv Detail & Related papers (2023-05-28T03:59:37Z)
- Evaluating COVID-19 vaccine allocation policies using Bayesian $m$-top exploration
We present a novel technique for evaluating vaccine allocation strategies using a multi-armed bandit framework.
$m$-top exploration allows the algorithm to learn $m$ policies for which it expects the highest utility.
We consider the Belgian COVID-19 epidemic using the individual-based model STRIDE, where we learn a set of vaccination policies.
arXiv Detail & Related papers (2023-01-30T12:22:30Z)
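A hedged sketch of the $m$-top exploration idea in the entry above: treat each candidate vaccination policy as a bandit arm and try to identify the $m$ arms with the highest expected utility. Plain Thompson sampling with Gaussian posteriors is used as a stand-in, not the paper's exact algorithm, and the reward model is invented (the paper uses the STRIDE simulator).

```python
# Thompson-sampling stand-in for m-top identification over bandit arms.
import numpy as np

rng = np.random.default_rng(1)
true_utilities = np.array([0.2, 0.5, 0.8, 0.4, 0.9])  # hypothetical arm means
n_arms, m, n_rounds, noise_sd = len(true_utilities), 2, 2000, 1.0

# Gaussian posterior per arm (known noise variance, flat prior): track sums and counts.
sums = np.zeros(n_arms)
counts = np.zeros(n_arms)

for _ in range(n_rounds):
    # Sample a plausible mean for each arm from its posterior, then pull the best sample.
    post_mean = np.where(counts > 0, sums / np.maximum(counts, 1), 0.0)
    post_sd = noise_sd / np.sqrt(np.maximum(counts, 1))
    sampled = rng.normal(post_mean, post_sd)
    arm = int(np.argmax(sampled))
    reward = rng.normal(true_utilities[arm], noise_sd)  # simulated outcome
    sums[arm] += reward
    counts[arm] += 1

# Report the m arms with the highest posterior means as the recommended policies.
posterior_means = sums / np.maximum(counts, 1)
top_m = np.argsort(posterior_means)[-m:][::-1]
print(f"Recommended top-{m} arms: {top_m.tolist()}")
```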
- Identification of Subgroups With Similar Benefits in Off-Policy Policy Evaluation
We develop a method to balance the need for personalization with confident predictions.
We show that our method can be used to form accurate predictions of heterogeneous treatment effects.
arXiv Detail & Related papers (2021-11-28T23:19:12Z)
- Targeting for long-term outcomes
Decision makers often want to target interventions so as to maximize an outcome that is observed only in the long term.
Here we build on the statistical surrogacy and policy learning literatures to impute the missing long-term outcomes.
We apply our approach in two large-scale proactive churn management experiments at The Boston Globe.
arXiv Detail & Related papers (2020-10-29T18:31:17Z)
- Provably Good Batch Reinforcement Learning Without Great Exploration
Batch reinforcement learning (RL) is important for applying RL algorithms to many high-stakes tasks.
Recent algorithms have shown promise but can still be overly optimistic in their expected outcomes.
We show that a small modification of the Bellman optimality and evaluation back-ups toward a more conservative update can yield much stronger guarantees; a minimal sketch of such a conservative backup follows this entry.
arXiv Detail & Related papers (2020-07-16T09:25:54Z)
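A hedged sketch of a conservative Bellman backup in tabular batch RL, in the spirit of the entry above: penalize Q-values for state-action pairs with few samples in the batch so the learned policy avoids poorly covered actions. The penalty form and all constants are illustrative, not the paper's exact update.

```python
# Tabular pessimistic Q-iteration: standard Bellman target minus a
# count-based penalty for rarely observed state-action pairs.
import numpy as np

n_states, n_actions, gamma, penalty_scale = 4, 2, 0.9, 1.0

# Hypothetical batch statistics: empirical rewards, transitions, and visit counts.
rng = np.random.default_rng(2)
rewards = rng.uniform(0, 1, size=(n_states, n_actions))
transitions = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
counts = rng.integers(1, 50, size=(n_states, n_actions))

Q = np.zeros((n_states, n_actions))
for _ in range(200):
    V = Q.max(axis=1)  # greedy value of the next state
    # Conservative backup: subtract a penalty that shrinks with the visit count.
    penalty = penalty_scale / np.sqrt(counts)
    Q = rewards + gamma * (transitions @ V) - penalty

greedy_policy = Q.argmax(axis=1)
print(f"Conservative greedy policy: {greedy_policy.tolist()}")
```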
- Off-policy Policy Evaluation For Sequential Decisions Under Unobserved Confounding
We assess robustness of OPE methods under unobserved confounding.
We show that even small amounts of per-decision confounding can heavily bias OPE methods.
We propose an efficient loss-minimization-based procedure for computing worst-case bounds.
arXiv Detail & Related papers (2020-03-12T05:20:37Z)
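A hedged numeric illustration of the point in the last entry above: if an unobserved confounder perturbs each per-decision importance weight by only a small factor, the error compounds multiplicatively over the horizon, so even mild per-decision confounding can badly bias importance-sampling OPE. The numbers are made up for illustration, and this is not the paper's worst-case bound procedure.

```python
# How a small per-decision weight error compounds over a trajectory.
horizon = 50
per_step_weight_error = 1.03  # each step's importance ratio is off by only 3%

cumulative_bias_factor = per_step_weight_error ** horizon
print(f"Trajectory weight inflated by a factor of {cumulative_bias_factor:.1f}")
# With a 3% per-decision error over 50 steps, the trajectory-level importance
# weight is off by roughly 4.4x, which directly scales the OPE estimate.
# A worst-case analysis would instead bound the policy value over all
# confounding patterns within a given budget, rather than assume exact ratios.
```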
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.