DOLCE: Decomposing Off-Policy Evaluation/Learning into Lagged and Current Effects
- URL: http://arxiv.org/abs/2505.00961v2
- Date: Wed, 21 May 2025 11:24:02 GMT
- Title: DOLCE: Decomposing Off-Policy Evaluation/Learning into Lagged and Current Effects
- Authors: Shu Tamano, Masanori Nojima
- Abstract summary: Off-policy evaluation (OPE) and off-policy learning (OPL) for contextual bandit policies leverage historical data to evaluate and optimize a target policy. We propose DOLCE: Decomposing Off-policy evaluation/learning into Lagged and Current Effects, a novel estimator that leverages contextual information from multiple time points to decompose rewards into lagged and current effects. Our experiments demonstrate that DOLCE achieves substantial improvements in OPE and OPL, particularly as the proportion of individuals outside the common support assumption increases.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Off-policy evaluation (OPE) and off-policy learning (OPL) for contextual bandit policies leverage historical data to evaluate and optimize a target policy. Most existing OPE/OPL methods--based on importance weighting or imputation--assume common support between the target and logging policies. When this assumption is violated, these methods typically require unstable extrapolation, truncation, or conservative strategies for individuals outside the common support assumption. However, such approaches can be inadequate in settings where explicit evaluation or optimization for such individuals is required. To address this issue, we propose DOLCE: Decomposing Off-policy evaluation/learning into Lagged and Current Effects, a novel estimator that leverages contextual information from multiple time points to decompose rewards into lagged and current effects. By incorporating both past and present contexts, DOLCE effectively handles individuals who violate the common support assumption. We show that the proposed estimator is unbiased under two assumptions--local correctness and conditional independence. Our experiments demonstrate that DOLCE achieves substantial improvements in OPE and OPL, particularly as the proportion of individuals outside the common support assumption increases.
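As one plausible illustration of the decomposition idea described in the abstract (not the paper's actual algorithm, which is not specified here), the sketch below assumes an additive reward model r = g(x_past) + h(x_now, a): a lagged-effect model is fit by regression on the past context, and importance weighting is applied only to the residual "current" effect. All function names and signatures are hypothetical.

```python
import numpy as np

def dolce_style_estimate(rewards, actions, x_past, x_now,
                         pi_target, pi_logging):
    """Illustrative lagged/current decomposition (hypothetical sketch).

    Step 1: regress rewards on past-context features to obtain a
    lagged-effect model g_hat (assumed additive:
    r = g(x_past) + h(x_now, a) + noise).
    Step 2: apply inverse propensity weighting only to the residual
    current effect, then add back the average lagged effect.
    pi_target / pi_logging: callables (context, action) -> probability.
    """
    # Ordinary least squares for the lagged effect g_hat(x_past).
    X = np.column_stack([np.ones(len(x_past)), x_past])
    beta, *_ = np.linalg.lstsq(X, rewards, rcond=None)
    lagged = X @ beta                 # g_hat(x_past)
    residual = rewards - lagged       # estimated current effect

    # IPS-weight the residual using the current context only.
    w = np.array([pi_target(x, a) / max(pi_logging(x, a), 1e-12)
                  for x, a in zip(x_now, actions)])
    return float(np.mean(lagged) + np.mean(w * residual))
```

The intuition this sketch tries to capture is that the lagged term needs no importance weighting (it does not depend on the current action), so only the current effect is exposed to the common-support requirement.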
Related papers
- $\Delta\text{-}{\rm OPE}$: Off-Policy Estimation with Pairs of Policies [13.528097424046823]
We introduce $\Delta\text{-}{\rm OPE}$ methods based on the widely used Inverse Propensity Scoring estimator.
Simulated, offline, and online experiments show that our methods significantly improve performance for both evaluation and learning tasks.
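The snippet above builds on the standard Inverse Propensity Scoring estimator; a minimal self-contained version is sketched below, with policy signatures assumed for illustration.

```python
import numpy as np

def ips_value(rewards, actions, contexts, pi_target, pi_logging):
    """Plain IPS estimate of the target policy's value.

    pi_target / pi_logging: callables (context, action) -> probability.
    Valid only under common support: pi_logging(x, a) > 0 wherever
    pi_target(x, a) > 0.
    """
    # Importance weights re-weight logged rewards toward the target policy.
    w = np.array([pi_target(x, a) / pi_logging(x, a)
                  for x, a in zip(contexts, actions)])
    return float(np.mean(w * np.asarray(rewards)))
```

When support is violated (a target action the logger never takes), the corresponding weight is undefined, which is exactly the failure mode DOLCE targets.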
arXiv Detail & Related papers (2024-05-16T12:04:55Z)
- Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods significantly benefit from IS, enabling the effective reuse of previously collected samples.
However, IS is employed in RL as a passive tool for re-weighting historical samples.
We look for the best behavioral policy from which to collect samples to reduce the policy gradient variance.
arXiv Detail & Related papers (2024-05-09T09:08:09Z)
- Off-Policy Evaluation for Large Action Spaces via Policy Convolution [60.6953713877886]
The Policy Convolution (PC) family of estimators uses latent structure within actions to strategically convolve the logging and target policies.
Experiments on synthetic and benchmark datasets demonstrate remarkable mean squared error (MSE) improvements when using PC.
arXiv Detail & Related papers (2023-10-24T01:00:01Z)
- Offline Recommender System Evaluation under Unobserved Confounding [5.4208903577329375]
Off-Policy Estimation methods allow us to learn and evaluate decision-making policies from logged data.
An important assumption that makes this work is the absence of unobserved confounders.
This work aims to highlight the problems that arise when performing off-policy estimation in the presence of unobserved confounders.
arXiv Detail & Related papers (2023-09-08T09:11:26Z)
- Off-Policy Evaluation of Ranking Policies under Diverse User Behavior [25.226825574282937]
Inverse Propensity Scoring (IPS) becomes extremely inaccurate in the ranking setup due to its high variance under large action spaces.
This work explores a far more general formulation where user behavior is diverse and can vary depending on the user context.
We show that the resulting estimator, which we call Adaptive IPS (AIPS), can be unbiased under any complex user behavior.
arXiv Detail & Related papers (2023-06-26T22:31:15Z)
- Conformal Off-Policy Evaluation in Markov Decision Processes [53.786439742572995]
Reinforcement Learning aims at identifying and evaluating efficient control policies from data.
Most methods for this learning task, referred to as Off-Policy Evaluation (OPE), do not come with accuracy and certainty guarantees.
We present a novel OPE method based on Conformal Prediction that outputs an interval containing the true reward of the target policy with a prescribed level of certainty.
arXiv Detail & Related papers (2023-04-05T16:45:11Z)
- Uncertainty-Aware Instance Reweighting for Off-Policy Learning [63.31923483172859]
We propose an Uncertainty-aware Inverse Propensity Score estimator (UIPS) for improved off-policy learning.
Experiment results on synthetic and three real-world recommendation datasets demonstrate the advantageous sample efficiency of the proposed UIPS estimator.
arXiv Detail & Related papers (2023-03-11T11:42:26Z)
- Off-policy evaluation for learning-to-rank via interpolating the item-position model and the position-based model [83.83064559894989]
A critical need for industrial recommender systems is the ability to evaluate recommendation policies offline, before deploying them to production.
We develop a new estimator that mitigates the problems of the two most popular off-policy estimators for rankings.
In particular, the new estimator, called INTERPOL, addresses the bias of a potentially misspecified position-based model.
arXiv Detail & Related papers (2022-10-15T17:22:30Z)
- Model-Free and Model-Based Policy Evaluation when Causality is Uncertain [7.858296711223292]
In off-policy evaluation, there may exist unobserved variables that both impact the dynamics and are used by the unknown behavior policy.
We develop worst-case bounds to assess sensitivity to these unobserved confounders in finite horizons.
We show that a model-based approach with robust MDPs gives sharper lower bounds by exploiting domain knowledge about the dynamics.
arXiv Detail & Related papers (2022-04-02T23:40:15Z)
- Doubly Robust Off-Policy Value and Gradient Estimation for Deterministic Policies [80.42316902296832]
We study the estimation of policy value and gradient of a deterministic policy from off-policy data when actions are continuous.
In this setting, standard importance sampling and doubly robust estimators for policy value and gradient fail because the density ratio does not exist.
We propose several new doubly robust estimators based on different kernelization approaches.
arXiv Detail & Related papers (2020-06-06T15:52:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.