Marginal Density Ratio for Off-Policy Evaluation in Contextual Bandits
- URL: http://arxiv.org/abs/2312.01457v1
- Date: Sun, 3 Dec 2023 17:04:57 GMT
- Title: Marginal Density Ratio for Off-Policy Evaluation in Contextual Bandits
- Authors: Muhammad Faaiz Taufiq, Arnaud Doucet, Rob Cornish, Jean-Francois Ton
- Abstract summary: Off-Policy Evaluation (OPE) in contextual bandits is crucial for assessing new policies using existing data without costly experimentation.
We introduce a new OPE estimator for contextual bandits, the Marginal Ratio (MR) estimator, which focuses on the shift in the marginal distribution of outcomes $Y$ instead of the policies themselves.
- Score: 41.91108406329159
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Off-Policy Evaluation (OPE) in contextual bandits is crucial for assessing
new policies using existing data without costly experimentation. However,
current OPE methods, such as Inverse Probability Weighting (IPW) and Doubly
Robust (DR) estimators, suffer from high variance, particularly in cases of low
overlap between target and behavior policies or large action and context
spaces. In this paper, we introduce a new OPE estimator for contextual bandits,
the Marginal Ratio (MR) estimator, which focuses on the shift in the marginal
distribution of outcomes $Y$ instead of the policies themselves. Through
rigorous theoretical analysis, we demonstrate the benefits of the MR estimator
compared to conventional methods like IPW and DR in terms of variance
reduction. Additionally, we establish a connection between the MR estimator and
the state-of-the-art Marginalized Inverse Propensity Score (MIPS) estimator,
proving that MR achieves the lowest variance among a generalized family of MIPS
estimators. We further illustrate the utility of the MR estimator in causal
inference settings, where it exhibits enhanced performance in estimating
Average Treatment Effects (ATE). Our experiments on synthetic and real-world
datasets corroborate our theoretical findings and highlight the practical
advantages of the MR estimator in OPE for contextual bandits.
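To make the core idea concrete, here is a minimal sketch (not the authors' implementation) contrasting the standard IPW estimator with an MR-style estimator that reweights logged outcomes by an estimated marginal outcome-density ratio, approximated here by regressing the per-sample policy ratio on the outcome $Y$. The synthetic bandit, the softmax policies, and the gradient-boosting weight model are all illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Illustrative synthetic logged-bandit data (an assumption, not the paper's setup).
n, d, n_actions = 5000, 5, 10
X = rng.normal(size=(n, d))                                                 # contexts
theta = rng.normal(size=(d, n_actions))
logits = X @ theta
pi_b = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)           # behavior policy pi_0(a|x)
pi_t = np.exp(2 * logits) / np.exp(2 * logits).sum(axis=1, keepdims=True)   # target policy pi(a|x)
A = np.array([rng.choice(n_actions, p=p) for p in pi_b])                    # logged actions
Y = logits[np.arange(n), A] + rng.normal(scale=0.5, size=n)                 # observed outcomes

# Per-sample policy (importance) ratio pi(A|X) / pi_0(A|X).
ratio = pi_t[np.arange(n), A] / pi_b[np.arange(n), A]

# Standard IPW estimate of the target policy's value.
v_ipw = np.mean(ratio * Y)

# MR-style estimate: approximate w(y) = E[pi(A|X)/pi_0(A|X) | Y = y] by
# regressing the policy ratio on Y, then reweight outcomes by w(Y).
weight_model = GradientBoostingRegressor().fit(Y.reshape(-1, 1), ratio)
w_hat = weight_model.predict(Y.reshape(-1, 1))
v_mr = np.mean(w_hat * Y)

print(f"IPW estimate: {v_ipw:.3f}   MR-style estimate: {v_mr:.3f}")
```

The intuition behind the variance reduction is that the MR weight is a conditional expectation of the IPW ratio given $Y$, so by the law of total variance it can never be more variable than the policy ratio itself.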
Related papers
- Doubly Robust Estimator for Off-Policy Evaluation with Large Action Spaces [0.951828574518325]
We study Off-Policy Evaluation in contextual bandit settings with large action spaces.
Existing benchmark estimators suffer from severe bias-variance tradeoffs.
We propose a Marginalized Doubly Robust (MDR) estimator to overcome these limitations.
arXiv Detail & Related papers (2023-08-07T10:00:07Z)
- Off-Policy Evaluation for Large Action Spaces via Conjunct Effect Modeling [30.835774920236872]
We study off-policy evaluation of contextual bandit policies for large discrete action spaces.
We propose a new estimator, called OffCEM, that is based on the conjunct effect model (CEM), a novel decomposition of the causal effect into a cluster effect and a residual effect.
Experiments demonstrate that OffCEM provides substantial improvements in OPE especially in the presence of many actions.
arXiv Detail & Related papers (2023-05-14T04:16:40Z)
- A Tale of Sampling and Estimation in Discounted Reinforcement Learning [50.43256303670011]
We present a minimax lower bound on the discounted mean estimation problem.
We show that estimating the mean by directly sampling from the discounted kernel of the Markov process brings compelling statistical properties.
arXiv Detail & Related papers (2023-04-11T09:13:17Z)
- Improved Policy Evaluation for Randomized Trials of Algorithmic Resource Allocation [54.72195809248172]
We present a new estimator leveraging our proposed novel concept, that involves retrospective reshuffling of participants across experimental arms at the end of an RCT.
We prove theoretically that such an estimator is more accurate than common estimators based on sample means.
arXiv Detail & Related papers (2023-02-06T05:17:22Z)
- Off-Policy Risk Assessment in Markov Decision Processes [15.225153671736201]
We develop the first doubly robust (DR) estimator for the CDF of returns in Markov decision processes (MDPs).
This estimator enjoys significantly less variance and, when the model is well specified, achieves the Cramer-Rao variance lower bound.
We derive the first minimax lower bounds for off-policy CDF and risk estimation, which match our error bounds up to a constant factor.
arXiv Detail & Related papers (2022-09-21T15:40:59Z)
- Monotonic Improvement Guarantees under Non-stationarity for Decentralized PPO [66.5384483339413]
We present a new monotonic improvement guarantee for optimizing decentralized policies in cooperative Multi-Agent Reinforcement Learning (MARL).
We show that a trust region constraint can be effectively enforced in a principled way by bounding independent ratios based on the number of agents in training.
arXiv Detail & Related papers (2022-01-31T20:39:48Z)
- Off-Policy Evaluation Using Information Borrowing and Context-Based Switching [10.063289291875247]
We consider the off-policy evaluation problem in contextual bandits.
The goal is to estimate the value of a target policy using the data collected by a logging policy.
We propose a new approach called the Doubly Robust with Information borrowing and Context-based switching (DR-IC) estimator.
arXiv Detail & Related papers (2021-12-18T07:38:24Z)
- Tight Mutual Information Estimation With Contrastive Fenchel-Legendre Optimization [69.07420650261649]
We introduce a novel, simple, and powerful contrastive MI estimator named FLO.
Empirically, our FLO estimator overcomes the limitations of its predecessors and learns more efficiently.
The utility of FLO is verified using an extensive set of benchmarks, which also reveals the trade-offs in practical MI estimation.
arXiv Detail & Related papers (2021-07-02T15:20:41Z)
- Minimax Off-Policy Evaluation for Multi-Armed Bandits [58.7013651350436]
We study the problem of off-policy evaluation in the multi-armed bandit model with bounded rewards.
We develop minimax rate-optimal procedures under three settings.
arXiv Detail & Related papers (2021-01-19T18:55:29Z)