Off-Policy Evaluation for Large Action Spaces via Embeddings
- URL: http://arxiv.org/abs/2202.06317v1
- Date: Sun, 13 Feb 2022 14:00:09 GMT
- Title: Off-Policy Evaluation for Large Action Spaces via Embeddings
- Authors: Yuta Saito and Thorsten Joachims
- Abstract summary: Off-policy evaluation (OPE) in contextual bandits has seen rapid adoption in real-world systems.
Existing OPE estimators degrade severely when the number of actions is large.
We propose a new OPE estimator that leverages marginalized importance weights when action embeddings provide structure in the action space.
- Score: 36.42838320396534
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Off-policy evaluation (OPE) in contextual bandits has seen rapid adoption in
real-world systems, since it enables offline evaluation of new policies using
only historic log data. Unfortunately, when the number of actions is large,
existing OPE estimators -- most of which are based on inverse propensity score
weighting -- degrade severely and can suffer from extreme bias and variance.
This foils the use of OPE in many applications from recommender systems to
language models. To overcome this issue, we propose a new OPE estimator that
leverages marginalized importance weights when action embeddings provide
structure in the action space. We characterize the bias, variance, and mean
squared error of the proposed estimator and analyze the conditions under which
the action embedding provides statistical benefits over conventional
estimators. In addition to the theoretical analysis, we find that the empirical
performance improvement can be substantial, enabling reliable OPE even when
existing estimators collapse due to a large number of actions.
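To make the core idea concrete, here is a minimal, self-contained sketch (not the paper's implementation) contrasting vanilla inverse propensity scoring with marginalized importance weighting. It assumes a deterministic action embedding that maps each action to one of a few discrete categories and uses synthetic, context-free policies; all variable names and the data-generating process are illustrative assumptions.

```python
import numpy as np

# Sketch of marginalized importance weighting as described in the abstract,
# assuming a deterministic embedding that maps each action to a category.
# The synthetic setup below is illustrative, not the paper's experiments.

rng = np.random.default_rng(0)
n, n_actions, n_categories = 10_000, 1_000, 20

# Deterministic embedding: each action belongs to one category.
action_to_cat = rng.integers(n_categories, size=n_actions)

# Logging (behavior) and evaluation policies, context-free for brevity.
pi_0 = rng.dirichlet(np.ones(n_actions))
pi_e = rng.dirichlet(np.ones(n_actions))

# Rewards depend only on the category (the structure MIPS can exploit).
cat_reward = rng.uniform(size=n_categories)

# Log data collected under the behavior policy.
actions = rng.choice(n_actions, size=n, p=pi_0)
rewards = rng.binomial(1, cat_reward[action_to_cat[actions]])

# Vanilla IPS: per-action weights, high variance when actions are many.
w_ips = pi_e[actions] / pi_0[actions]
v_ips = np.mean(w_ips * rewards)

# Marginalized weights: project both policies onto the embedding space.
p_cat_e = np.bincount(action_to_cat, weights=pi_e, minlength=n_categories)
p_cat_0 = np.bincount(action_to_cat, weights=pi_0, minlength=n_categories)
w_mips = p_cat_e[action_to_cat[actions]] / p_cat_0[action_to_cat[actions]]
v_mips = np.mean(w_mips * rewards)

true_value = float(pi_e @ cat_reward[action_to_cat])
print(f"true={true_value:.4f}  IPS={v_ips:.4f}  MIPS={v_mips:.4f}")
```

Because the marginalized weights live in the embedding space rather than the action space, their range (and hence the estimator's variance) scales with the number of distinct embeddings instead of the number of actions, which is the improvement the abstract describes.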
Related papers
- Causal Deepsets for Off-policy Evaluation under Spatial or Spatio-temporal Interferences [24.361550505778155]
Off-policy evaluation (OPE) is widely applied in sectors such as pharmaceuticals and e-commerce.
This paper introduces a causal deepset framework that relaxes several key structural assumptions.
We present novel algorithms that incorporate the permutation invariance (PI) assumption into OPE and thoroughly examine their theoretical foundations.
arXiv Detail & Related papers (2024-07-25T10:02:11Z)
- Off-Policy Evaluation of Slate Bandit Policies via Optimizing Abstraction [22.215852332444907]
We study the problem of slate contextual bandits where a policy selects multi-dimensional actions known as slates.
The typical Inverse Propensity Scoring (IPS) estimator suffers from substantial variance due to large action spaces.
We develop a novel estimator for OPE of slate bandits, called Latent IPS (LIPS), which defines importance weights in a low-dimensional slate abstraction space.
arXiv Detail & Related papers (2024-02-03T14:38:09Z) - Counterfactual-Augmented Importance Sampling for Semi-Offline Policy
Evaluation [13.325600043256552]
We propose a semi-offline evaluation framework, where human users provide annotations of unobserved counterfactual trajectories.
Our framework, combined with principled human-centered design of annotation solicitation, can enable the application of reinforcement learning in high-stakes domains.
arXiv Detail & Related papers (2023-10-26T04:41:19Z) - Off-Policy Evaluation for Large Action Spaces via Conjunct Effect
Modeling [30.835774920236872]
We study off-policy evaluation of contextual bandit policies for large discrete action spaces.
We propose a new estimator, called OffCEM, that is based on the conjunct effect model (CEM), a novel decomposition of the causal effect into a cluster effect and a residual effect.
Experiments demonstrate that OffCEM provides substantial improvements in OPE especially in the presence of many actions.
arXiv Detail & Related papers (2023-05-14T04:16:40Z) - Uncertainty-Aware Instance Reweighting for Off-Policy Learning [63.31923483172859]
We propose an Uncertainty-aware Inverse Propensity Score estimator (UIPS) for improved off-policy learning.
Experiments on synthetic and three real-world recommendation datasets demonstrate the advantageous sample efficiency of the proposed UIPS estimator.
arXiv Detail & Related papers (2023-03-11T11:42:26Z) - Off-policy evaluation for learning-to-rank via interpolating the
item-position model and the position-based model [83.83064559894989]
A critical need for industrial recommender systems is the ability to evaluate recommendation policies offline, before deploying them to production.
We develop a new estimator that mitigates the problems of the two most popular off-policy estimators for rankings.
In particular, the new estimator, called INTERPOL, addresses the bias of a potentially misspecified position-based model.
arXiv Detail & Related papers (2022-10-15T17:22:30Z) - Unifying Gradient Estimators for Meta-Reinforcement Learning via
Off-Policy Evaluation [53.83642844626703]
We provide a unifying framework for estimating higher-order derivatives of value functions, based on off-policy evaluation.
Our framework interprets a number of prior approaches as special cases and elucidates the bias and variance trade-off of Hessian estimates.
arXiv Detail & Related papers (2021-06-24T15:58:01Z) - Off-Policy Evaluation via the Regularized Lagrangian [110.28927184857478]
The recently proposed distribution correction estimation (DICE) family of estimators has advanced the state of the art in off-policy evaluation from behavior-agnostic data.
In this paper, we unify these estimators as regularized Lagrangians of the same linear program.
We find that dual solutions offer greater flexibility in navigating the tradeoff between stability and estimation bias, and generally provide superior estimates in practice.
arXiv Detail & Related papers (2020-07-07T13:45:56Z) - GenDICE: Generalized Offline Estimation of Stationary Values [108.17309783125398]
We show that effective estimation can still be achieved in important applications.
Our approach is based on estimating a ratio that corrects for the discrepancy between the stationary and empirical distributions.
The resulting algorithm, GenDICE, is straightforward and effective.
arXiv Detail & Related papers (2020-02-21T00:27:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.