Long-term Off-Policy Evaluation and Learning
- URL: http://arxiv.org/abs/2404.15691v1
- Date: Wed, 24 Apr 2024 06:59:59 GMT
- Title: Long-term Off-Policy Evaluation and Learning
- Authors: Yuta Saito, Himan Abdollahpouri, Jesse Anderton, Ben Carterette, Mounia Lalmas
- Abstract summary: Short- and long-term outcomes of an algorithm often differ, with damaging downstream effects.
It takes months or even longer to observe the long-term outcomes of interest, making the algorithm selection process unacceptably slow.
We propose a new framework called Long-term Off-Policy Evaluation (LOPE), which is based on reward function decomposition.
- Score: 21.047613223586794
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Short- and long-term outcomes of an algorithm often differ, with damaging downstream effects. A known example is a click-bait algorithm, which may increase short-term clicks but damage long-term user engagement. A possible solution to estimate the long-term outcome is to run an online experiment or A/B test for the potential algorithms, but it takes months or even longer to observe the long-term outcomes of interest, making the algorithm selection process unacceptably slow. This work thus studies the problem of feasibly yet accurately estimating the long-term outcome of an algorithm using only historical and short-term experiment data. Existing approaches to this problem either need a restrictive assumption about the short-term outcomes called surrogacy or cannot effectively use short-term outcomes, which is inefficient. Therefore, we propose a new framework called Long-term Off-Policy Evaluation (LOPE), which is based on reward function decomposition. LOPE works under a more relaxed assumption than surrogacy and effectively leverages short-term rewards to substantially reduce the variance. Synthetic experiments show that LOPE outperforms existing approaches particularly when surrogacy is severely violated and the long-term reward is noisy. In addition, real-world experiments on large-scale A/B test data collected on a music streaming platform show that LOPE can estimate the long-term outcome of actual algorithms more accurately than existing feasible methods.
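As a rough illustration of the reward-decomposition idea described in the abstract, the sketch below combines a model-based term, which predicts the long-term reward from (predicted) short-term rewards, with an importance-weighted correction on the residual that the short-term model misses. This is a minimal sketch under assumed interfaces: the helpers `g_hat` and `s_hat` and the exact form of the combination are illustrative assumptions, not the estimator defined in the paper.

```python
import numpy as np

def lope_style_value_estimate(contexts, actions, short_rewards, long_rewards,
                              pi_e, pi_0, g_hat, s_hat, n_actions):
    """Reward-decomposition OPE sketch: estimate the long-term value of pi_e.

    pi_e, pi_0: callables x -> vector of action probabilities
                (evaluation / logging policy)
    g_hat:      callable (x, s) -> predicted long-term reward given short-term
                rewards s (regression fit on historical data; hypothetical helper)
    s_hat:      callable (x, a) -> predicted short-term reward vector under
                action a (fit on short-term experiment data; hypothetical helper)
    """
    n = len(actions)
    estimates = np.empty(n)
    for i in range(n):
        x, a_logged = contexts[i], actions[i]
        # Importance weight for the logged action.
        w = pi_e(x)[a_logged] / pi_0(x)[a_logged]
        # Part of the long-term reward explained by short-term rewards,
        # averaged over the evaluation policy's action distribution.
        direct = sum(pi_e(x)[a] * g_hat(x, s_hat(x, a)) for a in range(n_actions))
        # Residual not captured by the short-term model, corrected by weighting.
        residual = w * (long_rewards[i] - g_hat(x, short_rewards[i]))
        estimates[i] = direct + residual
    return float(estimates.mean())
```

When the short-term rewards explain most of the long-term reward, the importance-weighted residual term is small, which is the intuition behind the variance reduction claimed in the abstract.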
Related papers
- Policy Learning for Balancing Short-Term and Long-Term Rewards [11.859587700058235]
This paper formalizes a new framework for learning the optimal policy, where some long-term outcomes are allowed to be missing.
We show that short-term outcomes, if associated, contribute to improving the estimator of the long-term reward.
arXiv Detail & Related papers (2024-05-06T10:09:35Z)
- Choosing a Proxy Metric from Past Experiments [54.338884612982405]
In many randomized experiments, the treatment effect of the long-term metric is often difficult or infeasible to measure.
A common alternative is to measure several short-term proxy metrics in the hope they closely track the long-term metric.
We introduce a new statistical framework to both define and construct an optimal proxy metric for use in a homogeneous population of randomized experiments.
arXiv Detail & Related papers (2023-09-14T17:43:02Z)
- B-Learner: Quasi-Oracle Bounds on Heterogeneous Causal Effects Under Hidden Confounding [51.74479522965712]
We propose a meta-learner called the B-Learner, which can efficiently learn sharp bounds on the CATE function under limits on hidden confounding.
We prove its estimates are valid, sharp, efficient, and have a quasi-oracle property with respect to the constituent estimators under more general conditions than existing methods.
arXiv Detail & Related papers (2023-04-20T18:07:19Z)
- Estimating long-term causal effects from short-term experiments and long-term observational data with unobserved confounding [5.854757988966379]
We study the identification and estimation of long-term treatment effects when both experimental and observational data are available.
Our long-term causal effect estimator is obtained by combining regression residuals with short-term experimental outcomes.
arXiv Detail & Related papers (2023-02-21T12:22:47Z)
- A Reinforcement Learning Approach to Estimating Long-term Treatment Effects [13.371851720834918]
A limitation of randomized experiments is that they do not easily extend to measure long-term effects.
We take a reinforcement learning (RL) approach that estimates the average reward in a Markov process.
Motivated by real-world scenarios where the observed state transition is nonstationary, we develop a new algorithm for a class of nonstationary problems.
arXiv Detail & Related papers (2022-10-14T05:33:19Z)
- Doubly Robust Distributionally Robust Off-Policy Evaluation and Learning [59.02006924867438]
Off-policy evaluation and learning (OPE/L) use offline observational data to make better decisions.
However, OPE/L is sensitive to distribution shifts between the environment that generated the logged data and the one where the policy is deployed; recent work proposed distributionally robust OPE/L (DROPE/L) to remedy this, but the proposal relies on inverse-propensity weighting.
We propose the first DR algorithms for DROPE/L with KL-divergence uncertainty sets.
arXiv Detail & Related papers (2022-02-19T20:00:44Z)
- Long-term Causal Inference Under Persistent Confounding via Data Combination [38.026740610259225]
We study the identification and estimation of long-term treatment effects when both experimental and observational data are available.
Since the long-term outcome is observed only after a long delay, it is not measured in the experimental data, but only recorded in the observational data.
arXiv Detail & Related papers (2022-02-15T07:44:20Z)
- Long-Term Effect Estimation with Surrogate Representation [43.932546958874696]
This work studies the problem of long-term effect estimation, where the outcome of primary interest, or primary outcome, takes months or even years to accumulate.
We propose to build connections between long-term causal inference and sequential models in machine learning.
arXiv Detail & Related papers (2020-08-19T03:16:18Z)
- Provably Efficient Reward-Agnostic Navigation with Linear Value Iteration [143.43658264904863]
We show how, under a more standard notion of low inherent Bellman error, typically employed in least-squares value-iteration-style algorithms, strong PAC guarantees can be provided on learning a near-optimal value function.
We present a computationally tractable algorithm for the reward-free setting and show how it can be used to learn a near optimal policy for any (linear) reward function.
arXiv Detail & Related papers (2020-08-18T04:34:21Z)
- Time-varying Gaussian Process Bandit Optimization with Non-constant Evaluation Time [93.6788993843846]
We propose a novel time-varying Bayesian optimization algorithm that can effectively handle the non-constant evaluation time.
Our bound shows that the pattern of the evaluation-time sequence can substantially affect the difficulty of the problem.
arXiv Detail & Related papers (2020-03-10T13:28:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.