Off-Policy Evaluation Using Information Borrowing and Context-Based Switching
- URL: http://arxiv.org/abs/2112.09865v2
- Date: Sun, 18 Aug 2024 05:51:04 GMT
- Title: Off-Policy Evaluation Using Information Borrowing and Context-Based Switching
- Authors: Sutanoy Dasgupta, Yabo Niu, Kishan Panaganti, Dileep Kalathil, Debdeep Pati, Bani Mallick
- Abstract summary: We consider the off-policy evaluation problem in contextual bandits.
The goal is to estimate the value of a target policy using the data collected by a logging policy.
We propose a new approach called the Doubly Robust with Information borrowing and Context-based switching (DR-IC) estimator.
- Score: 10.063289291875247
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We consider the off-policy evaluation (OPE) problem in contextual bandits, where the goal is to estimate the value of a target policy using the data collected by a logging policy. Most popular approaches to the OPE are variants of the doubly robust (DR) estimator obtained by combining a direct method (DM) estimator and a correction term involving the inverse propensity score (IPS). Existing algorithms primarily focus on strategies to reduce the variance of the DR estimator arising from large IPS. We propose a new approach called the Doubly Robust with Information borrowing and Context-based switching (DR-IC) estimator that focuses on reducing both bias and variance. The DR-IC estimator replaces the standard DM estimator with a parametric reward model that borrows information from the 'closer' contexts through a correlation structure that depends on the IPS. The DR-IC estimator also adaptively interpolates between this modified DM estimator and a modified DR estimator based on a context-specific switching rule. We give provable guarantees on the performance of the DR-IC estimator. We also demonstrate the superior performance of the DR-IC estimator compared to the state-of-the-art OPE algorithms on a number of benchmark problems.
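To make the ingredients concrete, here is a minimal sketch of the standard DR estimator with a simple context-specific switch between the DM term and the full DR correction. This is an illustration only, not the paper's DR-IC estimator: the information-borrowing reward model is omitted, the switching rule here is a plain cutoff on the importance weight, and the threshold `tau` and all variable names are hypothetical.

```python
import numpy as np

def dr_with_switching(rewards, q_hat, q_hat_all, pi_target_all,
                      pi_target_taken, pi_logging_taken, tau=5.0):
    """Illustrative DR estimate with a per-context DM/DR switch.

    rewards:          (n,) observed rewards for the logged actions
    q_hat:            (n,) reward-model prediction for each logged (x_i, a_i)
    q_hat_all:        (n, k) reward-model predictions for all k actions at x_i
    pi_target_all:    (n, k) target-policy probabilities over all k actions
    pi_target_taken:  (n,) target-policy probability of the logged action
    pi_logging_taken: (n,) logging-policy probability of the logged action
    tau:              switching threshold on the importance weight (illustrative)
    """
    # Inverse propensity score (IPS) weights pi_e(a|x) / pi_b(a|x)
    w = pi_target_taken / pi_logging_taken
    # Direct method (DM) term: model-based value per context
    dm_term = (pi_target_all * q_hat_all).sum(axis=1)
    # Standard doubly robust (DR) term: DM plus weighted model residual
    dr_term = dm_term + w * (rewards - q_hat)
    # Context-specific switch: where the weight is large (high-variance
    # correction), fall back to the model-only DM term.
    per_context = np.where(w > tau, dm_term, dr_term)
    return per_context.mean()
```

With a very large `tau` this reduces to the standard DR estimate, and with `tau = 0` it reduces to the DM estimate, so the threshold trades the DM term's model bias against the variance of the IPS correction; DR-IC replaces both the reward model and the switching rule with the constructions described in the abstract.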
Related papers
- Bridging and Modeling Correlations in Pairwise Data for Direct Preference Optimization [75.1240295759264]
We propose an effective framework for Bridging and Modeling Correlations in pairwise data, named BMC.
We increase the consistency and informativeness of the pairwise preference signals through targeted modifications.
We further identify that DPO alone is insufficient to model these correlations and capture nuanced variations.
arXiv Detail & Related papers (2024-08-14T11:29:47Z)
- Doubly-Robust Off-Policy Evaluation with Estimated Logging Policy [11.16777821381608]
We introduce a novel doubly-robust (DR) off-policy estimator for Markov decision processes, DRUnknown, designed for situations where both the logging policy and the value function are unknown.
The proposed estimator first estimates the logging policy and then fits the value-function model by minimizing the estimator's variance, accounting for the effect of the estimated logging policy.
arXiv Detail & Related papers (2024-04-02T10:42:44Z)
- Marginal Density Ratio for Off-Policy Evaluation in Contextual Bandits [41.91108406329159]
Off-Policy Evaluation (OPE) in contextual bandits is crucial for assessing new policies using existing data without costly experimentation.
We introduce a new OPE estimator for contextual bandits, the Marginal Ratio (MR) estimator, which focuses on the shift in the marginal distribution of outcomes $Y$ instead of the policies themselves.
arXiv Detail & Related papers (2023-12-03T17:04:57Z)
- Off-Policy Evaluation for Large Action Spaces via Policy Convolution [60.6953713877886]
The Policy Convolution (PC) family of estimators uses latent structure within actions to strategically convolve the logging and target policies.
Experiments on synthetic and benchmark datasets demonstrate substantial mean squared error (MSE) improvements when using PC.
arXiv Detail & Related papers (2023-10-24T01:00:01Z)
- Improved Policy Evaluation for Randomized Trials of Algorithmic Resource Allocation [54.72195809248172]
We present a new estimator based on a novel concept: retrospectively reshuffling participants across experimental arms at the end of an RCT.
We prove theoretically that such an estimator is more accurate than common estimators based on sample means.
arXiv Detail & Related papers (2023-02-06T05:17:22Z)
- Off-Policy Risk Assessment in Markov Decision Processes [15.225153671736201]
We develop the first doubly robust (DR) estimator for the CDF of returns in Markov decision processes (MDPs).
This estimator enjoys significantly less variance and, when the model is well specified, achieves the Cramér-Rao variance lower bound.
We derive the first minimax lower bounds for off-policy CDF and risk estimation, which match our error bounds up to a constant factor.
arXiv Detail & Related papers (2022-09-21T15:40:59Z)
- Variance-Aware Off-Policy Evaluation with Linear Function Approximation [85.75516599931632]
We study the off-policy evaluation problem in reinforcement learning with linear function approximation.
We propose an algorithm, VA-OPE, which uses the estimated variance of the value function to reweight the Bellman residual in Fitted Q-Iteration.
arXiv Detail & Related papers (2021-06-22T17:58:46Z)
- Off-Policy Evaluation via Adaptive Weighting with Data from Contextual Bandits [5.144809478361604]
We improve the doubly robust (DR) estimator by adaptively weighting observations to control its variance.
We provide empirical evidence for our estimator's improved accuracy and inferential properties relative to existing alternatives.
arXiv Detail & Related papers (2021-06-03T17:54:44Z)
- Post-Contextual-Bandit Inference [57.88785630755165]
Contextual bandit algorithms are increasingly replacing non-adaptive A/B tests in e-commerce, healthcare, and policymaking.
They can both improve outcomes for study participants and increase the chance of identifying good, or even optimal, policies.
To support credible inference on novel interventions at the end of the study, we still need to construct valid confidence intervals on average treatment effects, subgroup effects, and the value of new policies.
arXiv Detail & Related papers (2021-06-01T12:01:51Z)
- Enhanced Doubly Robust Learning for Debiasing Post-click Conversion Rate Estimation [29.27760413892272]
Post-click conversion, as a strong signal of user preference, is valuable for building recommender systems.
Currently, most existing methods use counterfactual learning to debias recommender systems.
We propose a novel double learning approach for the More Robust Doubly Robust (MRDR) estimator, which converts the error imputation into general CVR estimation.
arXiv Detail & Related papers (2021-05-28T06:59:49Z)
- Self-supervised Representation Learning with Relative Predictive Coding [102.93854542031396]
Relative Predictive Coding (RPC) is a new contrastive representation learning objective.
RPC maintains a good balance among training stability, minibatch size sensitivity, and downstream task performance.
We empirically verify the effectiveness of RPC on benchmark vision and speech self-supervised learning tasks.
arXiv Detail & Related papers (2021-03-21T01:04:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.