Off-policy evaluation for learning-to-rank via interpolating the
item-position model and the position-based model
- URL: http://arxiv.org/abs/2210.09512v1
- Date: Sat, 15 Oct 2022 17:22:30 GMT
- Title: Off-policy evaluation for learning-to-rank via interpolating the
item-position model and the position-based model
- Authors: Alexander Buchholz, Ben London, Giuseppe di Benedetto, Thorsten
Joachims
- Abstract summary: A critical need for industrial recommender systems is the ability to evaluate recommendation policies offline, before deploying them to production.
We develop a new estimator that mitigates the problems of the two most popular off-policy estimators for rankings.
In particular, the new estimator, called INTERPOL, addresses the bias of a potentially misspecified position-based model.
- Score: 83.83064559894989
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A critical need for industrial recommender systems is the ability to evaluate
recommendation policies offline, before deploying them to production.
Unfortunately, widely used off-policy evaluation methods either make strong
assumptions about how users behave that can lead to excessive bias, or they
make fewer assumptions and suffer from large variance. We tackle this problem
by developing a new estimator that mitigates the problems of the two most
popular off-policy estimators for rankings, namely the position-based model and
the item-position model. In particular, the new estimator, called INTERPOL,
addresses the bias of a potentially misspecified position-based model, while
providing an adaptable bias-variance trade-off compared to the item-position
model. We provide theoretical arguments as well as empirical results that
highlight the performance of our novel estimation approach.
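The abstract describes the idea behind INTERPOL but not its formula. As a rough, hedged illustration of the underlying bias-variance interpolation only (a toy sketch, not the INTERPOL estimator itself), the snippet below blends a low-variance per-position importance-weighted estimate, in the spirit of a position-based model, with a high-variance full-ranking importance-weighted estimate; the mixing parameter `lam`, the function names, and the data layout are all assumptions.

```python
# Toy sketch (NOT the paper's INTERPOL estimator): blend a low-variance,
# assumption-heavy per-position estimate with a high-variance full-ranking
# IPS estimate via a convex mixing weight `lam`.
import numpy as np

def per_position_ips(rewards, p_target_pos, p_log_pos):
    """PBM-flavoured estimate: weight each position's reward by the ratio of
    marginal probabilities that the logged item is shown at that position."""
    w = p_target_pos / p_log_pos                # shape: (n_logs, n_positions)
    return np.mean(np.sum(w * rewards, axis=1))

def full_ranking_ips(rewards, p_target_rank, p_log_rank):
    """Weaker-assumption estimate: weight the whole slate's reward by the
    ratio of probabilities of the entire displayed ranking."""
    w = p_target_rank / p_log_rank              # shape: (n_logs,)
    return np.mean(w * np.sum(rewards, axis=1))

def blended_estimate(rewards, p_target_pos, p_log_pos,
                     p_target_rank, p_log_rank, lam=0.5):
    """Convex combination: lam=1 recovers the per-position estimate,
    lam=0 the full-ranking one."""
    return (lam * per_position_ips(rewards, p_target_pos, p_log_pos)
            + (1.0 - lam) * full_ranking_ips(rewards, p_target_rank, p_log_rank))
```

In this toy version, pushing `lam` toward 1 leans on the stronger behavioural assumptions for lower variance, while pushing it toward 0 relies on the weaker-assumption, higher-variance ranking-level weights.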
Related papers
- Optimal Baseline Corrections for Off-Policy Contextual Bandits [61.740094604552475]
We aim to learn decision policies that optimize an unbiased offline estimate of an online reward metric.
We propose a single framework built on the equivalence of common variance-reduction methods in learning scenarios.
Our framework enables us to characterize the variance-optimal unbiased estimator and provide a closed-form solution for it.
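The summary above mentions a variance-optimal unbiased estimator with a closed-form solution. The paper's exact construction is not reproduced here; the sketch below is a minimal, hedged illustration of the general idea for plain contextual-bandit IPS: subtract a constant baseline, add it back (unbiased for any fixed baseline when propensities are correct), and choose the baseline that minimises the empirical variance.

```python
# Minimal sketch (an assumption-laden illustration, not the paper's estimator):
# baseline-corrected IPS for a contextual bandit. Since importance weights have
# expectation 1, the correction is unbiased for any fixed baseline b; b is then
# set to the empirical variance minimiser b* = Cov(w*r, w) / Var(w).
import numpy as np

def baseline_corrected_ips(rewards, p_target, p_logging):
    w = p_target / p_logging                    # importance weights, E[w] = 1
    wr = w * rewards
    cov = np.cov(wr, w)                         # 2x2 empirical covariance matrix
    b_star = cov[0, 1] / max(cov[1, 1], 1e-12)  # variance-minimising baseline
    return np.mean(w * (rewards - b_star) + b_star)
```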
arXiv Detail & Related papers (2024-05-09T12:52:22Z)
- Counterfactual-Augmented Importance Sampling for Semi-Offline Policy Evaluation [13.325600043256552]
We propose a semi-offline evaluation framework, where human users provide annotations of unobserved counterfactual trajectories.
Our framework, combined with principled human-centered design of annotation solicitation, can enable the application of reinforcement learning in high-stakes domains.
arXiv Detail & Related papers (2023-10-26T04:41:19Z)
- Fine-Tuning Language Models with Advantage-Induced Policy Alignment [80.96507425217472]
We propose a novel algorithm for aligning large language models to human preferences.
We show that it consistently outperforms PPO in language tasks by a large margin.
We also provide a theoretical justification supporting the design of our loss function.
arXiv Detail & Related papers (2023-06-04T01:59:40Z)
- Uncertainty-Aware Instance Reweighting for Off-Policy Learning [63.31923483172859]
We propose an Uncertainty-aware Inverse Propensity Score (UIPS) estimator for improved off-policy learning.
Experiment results on synthetic and three real-world recommendation datasets demonstrate the advantageous sample efficiency of the proposed UIPS estimator.
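The summary names the estimator but not its form. Purely as a hypothetical illustration of the general idea of discounting samples whose propensity estimates are uncertain (this is not the UIPS estimator), one could shrink a sample's importance weight as the uncertainty of its logged propensity grows; `prop_std`, `lam`, and the shrinkage rule are invented for this sketch.

```python
# Hypothetical illustration only (NOT the UIPS estimator from the paper):
# down-weight samples whose estimated logging propensity is uncertain.
import numpy as np

def uncertainty_discounted_ips(rewards, p_target, p_logging_hat, prop_std, lam=1.0):
    w = p_target / p_logging_hat                 # plug-in importance weights
    shrink = 1.0 / (1.0 + lam * prop_std ** 2)   # more uncertainty -> smaller weight
    return np.mean(shrink * w * rewards)
```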
arXiv Detail & Related papers (2023-03-11T11:42:26Z)
- Guide the Learner: Controlling Product of Experts Debiasing Method Based on Token Attribution Similarities [17.082695183953486]
A popular workaround is to train a robust model by re-weighting training examples based on a secondary biased model.
Here, the underlying assumption is that the biased model resorts to shortcut features.
We introduce a fine-tuning strategy that incorporates the similarity between the main and biased model attribution scores in a Product of Experts loss function.
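As context for the summary above, here is a minimal sketch of the standard Product of Experts (PoE) debiasing loss that this method builds on; the paper's token-attribution-similarity weighting is not reproduced here.

```python
# Minimal sketch of the standard PoE debiasing loss: the main model's logits
# are combined with the (frozen) biased model's log-probabilities, so the main
# model receives little gradient signal on examples the biased model already
# solves via shortcut features.
import torch
import torch.nn.functional as F

def poe_loss(main_logits, bias_logits, labels):
    combined = main_logits + F.log_softmax(bias_logits.detach(), dim=-1)
    return F.cross_entropy(combined, labels)
```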
arXiv Detail & Related papers (2023-02-06T15:21:41Z)
- Quantile Off-Policy Evaluation via Deep Conditional Generative Learning [21.448553360543478]
Off-policy evaluation (OPE) is concerned with evaluating a new target policy using offline data generated by a potentially different behavior policy.
We propose a doubly-robust inference procedure for quantile OPE in sequential decision making.
We demonstrate the advantages of this proposed estimator through both simulations and a real-world dataset from a short-video platform.
arXiv Detail & Related papers (2022-12-29T22:01:43Z)
- Off-Policy Evaluation for Large Action Spaces via Embeddings [36.42838320396534]
Off-policy evaluation (OPE) in contextual bandits has seen rapid adoption in real-world systems.
Existing OPE estimators degrade severely when the number of actions is large.
We propose a new OPE estimator that leverages marginalized importance weights when action embeddings provide structure in the action space.
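As a hedged, simplified sketch of the marginalized-importance-weight idea (not the exact estimator from the paper), the weight is taken on the observed action embedding rather than on the action itself, marginalising over the actions that could have produced that embedding: w(x, e) = p(e | x, pi_target) / p(e | x, pi_logging), with p(e | x, pi) = sum_a p(e | x, a) * pi(a | x). Function names and array layout are assumptions.

```python
# Hedged sketch: marginalized importance weights over action embeddings.
import numpy as np

def marginal_embedding_weight(p_e_given_a, pi_target, pi_logging):
    """p_e_given_a: (n_actions,) probability of the logged embedding given each action;
    pi_*: (n_actions,) action probabilities for the logged context."""
    p_e_target = np.dot(p_e_given_a, pi_target)
    p_e_logging = np.dot(p_e_given_a, pi_logging)
    return p_e_target / p_e_logging

def marginalized_ips(rewards, weights):
    # Average of embedding-level weights times logged rewards.
    return np.mean(np.asarray(weights) * np.asarray(rewards))
```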
arXiv Detail & Related papers (2022-02-13T14:00:09Z)
- Doubly Robust Off-Policy Evaluation for Ranking Policies under the Cascade Behavior Model [11.101369123145588]
Off-policy evaluation for ranking policies enables performance estimation of new ranking policies using only logged data.
Previous studies introduce some assumptions on user behavior to make the item space tractable.
We propose the Cascade Doubly Robust estimator, which assumes that a user interacts with items sequentially from the top position in a ranking.
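As a hedged sketch of the cascade idea (the paper's doubly-robust correction is not reproduced here), the reward observed at each position is weighted by the probability ratio of the logged top-k prefix rather than of the entire ranking, which trims variance relative to full-ranking importance sampling; the data layout below is an assumption.

```python
# Hedged sketch: cascade-style importance weighting for ranking OPE,
# without the cited paper's doubly-robust control variate.
import numpy as np

def cascade_ips(rewards, p_target_prefix, p_log_prefix):
    """rewards, p_*_prefix: arrays of shape (n_logs, n_positions), where
    p_*_prefix[i, k] is the policy's probability of the logged top-(k+1) prefix."""
    w = p_target_prefix / p_log_prefix           # prefix-level importance weights
    return np.mean(np.sum(w * rewards, axis=1))
```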
arXiv Detail & Related papers (2022-02-03T12:42:33Z)
- Unifying Gradient Estimators for Meta-Reinforcement Learning via Off-Policy Evaluation [53.83642844626703]
We provide a unifying framework for estimating higher-order derivatives of value functions, based on off-policy evaluation.
Our framework interprets a number of prior approaches as special cases and elucidates the bias and variance trade-off of Hessian estimates.
arXiv Detail & Related papers (2021-06-24T15:58:01Z)
- On the model-based stochastic value gradient for continuous reinforcement learning [50.085645237597056]
We show that simple model-based agents can outperform state-of-the-art model-free agents in terms of both sample-efficiency and final reward.
Our findings suggest that model-based policy evaluation deserves closer attention.
arXiv Detail & Related papers (2020-08-28T17:58:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.