Off-Policy Evaluation for Large Action Spaces via Conjunct Effect Modeling
- URL: http://arxiv.org/abs/2305.08062v2
- Date: Fri, 2 Jun 2023 20:52:40 GMT
- Title: Off-Policy Evaluation for Large Action Spaces via Conjunct Effect Modeling
- Authors: Yuta Saito, Qingyang Ren, Thorsten Joachims
- Abstract summary: We study off-policy evaluation of contextual bandit policies for large discrete action spaces.
We propose a new estimator, called OffCEM, that is based on the conjunct effect model (CEM), a novel decomposition of the causal effect into a cluster effect and a residual effect.
Experiments demonstrate that OffCEM provides substantial improvements in OPE especially in the presence of many actions.
- Score: 30.835774920236872
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study off-policy evaluation (OPE) of contextual bandit policies for large
discrete action spaces where conventional importance-weighting approaches
suffer from excessive variance. To circumvent this variance issue, we propose a
new estimator, called OffCEM, that is based on the conjunct effect model (CEM),
a novel decomposition of the causal effect into a cluster effect and a residual
effect. OffCEM applies importance weighting only to action clusters and
addresses the residual causal effect through model-based reward estimation. We
show that the proposed estimator is unbiased under a new condition, called
local correctness, which only requires that the residual-effect model preserves
the relative expected reward differences of the actions within each cluster. To
best leverage the CEM and local correctness, we also propose a new two-step
procedure for performing model-based estimation that minimizes bias in the
first step and variance in the second step. We find that the resulting OffCEM
estimator substantially improves bias and variance compared to a range of
conventional estimators. Experiments demonstrate that OffCEM provides
substantial improvements in OPE especially in the presence of many actions.
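To make the estimator's structure concrete, below is a minimal NumPy sketch of an OffCEM-style estimate on logged bandit feedback. The variable names, the array layout, and the assumption that a reward model `f_hat` and an action clustering `cluster_of` are already given are illustrative choices rather than the paper's exact recipe; in particular, the two-step procedure for fitting the regression model is not shown.

```python
import numpy as np

def offcem_estimate(actions, rewards, pi_b, pi_e, cluster_of, f_hat):
    """Sketch of an OffCEM-style policy value estimate.

    actions    : (n,) logged actions a_i sampled from the logging policy
    rewards    : (n,) observed rewards r_i
    pi_b, pi_e : (n, n_actions) logging / evaluation policy probabilities per context
    cluster_of : (n_actions,) cluster id c(a) assigned to each action
    f_hat      : (n, n_actions) reward-model predictions f_hat(x_i, a)
    """
    n = len(rewards)
    rows = np.arange(n)

    # Cluster-level importance weights pi_e(c(a_i)|x_i) / pi_b(c(a_i)|x_i),
    # obtained by marginalizing action probabilities within each cluster.
    n_clusters = int(cluster_of.max()) + 1
    onehot = np.eye(n_clusters)[cluster_of]          # (n_actions, n_clusters)
    w_cluster = (pi_e @ onehot) / (pi_b @ onehot)    # (n, n_clusters)
    w = w_cluster[rows, cluster_of[actions]]

    # Importance weighting over clusters handles the cluster effect; the
    # model-based baseline E_{a ~ pi_e(.|x_i)}[f_hat(x_i, a)] handles the
    # residual effect within clusters.
    residual = rewards - f_hat[rows, actions]
    baseline = (pi_e * f_hat).sum(axis=1)
    return float(np.mean(w * residual + baseline))
```

Per the abstract, an estimate of this form is unbiased under local correctness, i.e., when the regression model preserves the relative expected reward differences of the actions within each cluster; how the regression model is fit to satisfy this while keeping variance low is what the two-step procedure addresses.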
Related papers
- Effective Off-Policy Evaluation and Learning in Contextual Combinatorial Bandits [15.916834591090009]
We explore off-policy evaluation and learning in contextual combinatorial bandits.
This setting is widespread in fields such as recommender systems and healthcare.
We introduce the concept of a factored action space, which allows us to decompose each subset of actions into binary indicators.
arXiv Detail & Related papers (2024-08-20T21:25:04Z)
- Marginal Density Ratio for Off-Policy Evaluation in Contextual Bandits [41.91108406329159]
Off-Policy Evaluation (OPE) in contextual bandits is crucial for assessing new policies using existing data without costly experimentation.
We introduce a new OPE estimator for contextual bandits, the Marginal Ratio (MR) estimator, which focuses on the shift in the marginal distribution of outcomes $Y$ instead of the policies themselves.
arXiv Detail & Related papers (2023-12-03T17:04:57Z)
- Doubly Robust Estimator for Off-Policy Evaluation with Large Action Spaces [0.951828574518325]
We study Off-Policy Evaluation in contextual bandit settings with large action spaces.
Benchmark estimators suffer from severe bias-variance tradeoffs in this setting.
We propose a Marginalized Doubly Robust (MDR) estimator to overcome these limitations.
arXiv Detail & Related papers (2023-08-07T10:00:07Z)
- Mimicking Better by Matching the Approximate Action Distribution [48.95048003354255]
We introduce MAAD, a novel, sample-efficient on-policy algorithm for Imitation Learning from Observations.
We show that it requires considerably fewer interactions to achieve expert performance, outperforming current state-of-the-art on-policy methods.
arXiv Detail & Related papers (2023-06-16T12:43:47Z)
- Off-policy evaluation for learning-to-rank via interpolating the item-position model and the position-based model [83.83064559894989]
A critical need for industrial recommender systems is the ability to evaluate recommendation policies offline, before deploying them to production.
We develop a new estimator that mitigates the problems of the two most popular off-policy estimators for rankings.
In particular, the new estimator, called INTERPOL, addresses the bias of a potentially misspecified position-based model.
arXiv Detail & Related papers (2022-10-15T17:22:30Z)
- Domain-Specific Risk Minimization for Out-of-Distribution Generalization [104.17683265084757]
We first establish a generalization bound that explicitly considers the adaptivity gap.
We propose effective gap estimation methods for guiding the selection of a better hypothesis for the target; another method minimizes the gap directly by adapting model parameters using online target samples.
arXiv Detail & Related papers (2022-08-18T06:42:49Z)
- Off-Policy Evaluation for Large Action Spaces via Embeddings [36.42838320396534]
Off-policy evaluation (OPE) in contextual bandits has seen rapid adoption in real-world systems.
Existing OPE estimators degrade severely when the number of actions is large.
We propose a new OPE estimator that leverages marginalized importance weights when action embeddings provide structure in the action space (a minimal sketch of this weighting scheme appears after this list).
arXiv Detail & Related papers (2022-02-13T14:00:09Z)
- Deconfounding Scores: Feature Representations for Causal Effect Estimation with Weak Overlap [140.98628848491146]
We introduce deconfounding scores, which induce better overlap without biasing the target of estimation.
We show that deconfounding scores satisfy a zero-covariance condition that is identifiable in observed data.
In particular, we show that this technique could be an attractive alternative to standard regularizations.
arXiv Detail & Related papers (2021-04-12T18:50:11Z)
- Off-Policy Evaluation via the Regularized Lagrangian [110.28927184857478]
The recently proposed distribution correction estimation (DICE) family of estimators has advanced the state of the art in off-policy evaluation from behavior-agnostic data.
In this paper, we unify these estimators as regularized Lagrangians of the same linear program.
We find that dual solutions offer greater flexibility in navigating the tradeoff between stability and estimation bias, and generally provide superior estimates in practice.
arXiv Detail & Related papers (2020-07-07T13:45:56Z)
- GenDICE: Generalized Offline Estimation of Stationary Values [108.17309783125398]
We show that effective estimation can still be achieved in important applications.
Our approach is based on estimating a ratio that corrects for the discrepancy between the stationary and empirical distributions.
The resulting algorithm, GenDICE, is straightforward and effective.
arXiv Detail & Related papers (2020-02-21T00:27:52Z)
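As a companion to the entry on "Off-Policy Evaluation for Large Action Spaces via Embeddings" above, here is a minimal sketch of marginalized importance weighting over discrete action embeddings. The context-independent embedding distribution `p_e_given_a` and all variable names are simplifying assumptions for illustration, not that paper's exact estimator.

```python
import numpy as np

def mips_estimate(rewards, emb_ids, pi_b, pi_e, p_e_given_a):
    """Sketch of marginalized importance weighting with discrete action embeddings.

    rewards     : (n,) observed rewards r_i
    emb_ids     : (n,) index of the observed embedding e_i in each round
    pi_b, pi_e  : (n, n_actions) logging / evaluation policy probabilities per context
    p_e_given_a : (n_actions, n_embeddings) embedding distribution p(e | a)
    """
    rows = np.arange(len(rewards))
    # Marginal embedding probabilities under each policy:
    # p(e | x_i, pi) = sum_a pi(a | x_i) * p(e | a).
    p_e_pi_e = pi_e @ p_e_given_a
    p_e_pi_b = pi_b @ p_e_given_a
    w = p_e_pi_e[rows, emb_ids] / p_e_pi_b[rows, emb_ids]
    return float(np.mean(w * rewards))
```

When each action maps deterministically to a single embedding (for example, a cluster id), this marginal weight reduces to the cluster-level weight used in the OffCEM sketch above.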