Doubly Robust Off-Policy Evaluation for Ranking Policies under the
Cascade Behavior Model
- URL: http://arxiv.org/abs/2202.01562v1
- Date: Thu, 3 Feb 2022 12:42:33 GMT
- Title: Doubly Robust Off-Policy Evaluation for Ranking Policies under the
Cascade Behavior Model
- Authors: Haruka Kiyohara, Yuta Saito, Tatsuya Matsuhiro, Yusuke Narita,
Nobuyuki Shimizu, Yasuo Yamamoto
- Abstract summary: Off-policy evaluation for ranking policies enables performance estimation of new ranking policies using only logged data.
Previous studies introduce some assumptions on user behavior to make the item space tractable.
We propose the Cascade Doubly Robust estimator, which builds on the cascade assumption that a user interacts with items sequentially from the top position in a ranking.
- Score: 11.101369123145588
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In real-world recommender systems and search engines, optimizing ranking
decisions to present a ranked list of relevant items is critical. Off-policy
evaluation (OPE) for ranking policies is thus gaining a growing interest
because it enables performance estimation of new ranking policies using only
logged data. Although OPE in contextual bandits has been studied extensively,
its naive application to the ranking setting faces a critical variance issue
due to the huge item space. To tackle this problem, previous studies introduce
some assumptions on user behavior to make the combinatorial item space
tractable. However, an unrealistic assumption may, in turn, cause serious bias.
Therefore, appropriately controlling the bias-variance tradeoff by imposing a
reasonable assumption is the key for success in OPE of ranking policies. To
achieve a well-balanced bias-variance tradeoff, we propose the Cascade Doubly
Robust estimator building on the cascade assumption, which assumes that a user
interacts with items sequentially from the top position in a ranking. We show
that the proposed estimator is unbiased in more cases compared to existing
estimators that make stronger assumptions. Furthermore, compared to a previous
estimator based on the same cascade assumption, the proposed estimator reduces
the variance by leveraging a control variate. Comprehensive experiments on both
synthetic and real-world data demonstrate that our estimator leads to more
accurate OPE than existing estimators in a variety of settings.
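The abstract describes the ingredients of the proposed estimator (position-wise rewards, cascade-style importance weights restricted to the top of the list, and a reward regression used as a control variate) but not the formula itself. The sketch below is an illustrative reading of that description rather than the paper's exact estimator: the interfaces `pi_b(x, prefix, a)`, `pi_e(x, prefix, a)`, and `q_hat(x, prefix_with_a)`, the uniform weighting of positions, and the naive IPS baseline shown for contrast are all assumptions made for this example.

```python
import numpy as np


def naive_ips_estimate(contexts, rankings, rewards, pi_b, pi_e):
    """Naive IPS for rankings: one importance weight for the whole ranked list,
    i.e. a product of K per-position probability ratios. With a huge item space
    this product tends to explode or collapse to zero, which is the variance
    issue the abstract refers to."""
    rankings, rewards = np.asarray(rankings), np.asarray(rewards)
    n, K = rewards.shape
    total = 0.0
    for i in range(n):
        x, w, prefix = contexts[i], 1.0, []
        for k in range(K):
            a_k = rankings[i, k]
            w *= pi_e(x, prefix, a_k) / pi_b(x, prefix, a_k)
            prefix.append(a_k)
        total += w * rewards[i].sum()  # full-ranking weight on the list reward
    return total / n


def cascade_dr_estimate(contexts, rankings, rewards, pi_b, pi_e, q_hat, candidates):
    """Cascade-style doubly robust sketch: the weight applied at position k only
    covers positions 1..k (the items a top-down, cascading user could have seen),
    and a regression model q_hat serves as a per-position control variate."""
    rankings, rewards = np.asarray(rankings), np.asarray(rewards)
    n, K = rewards.shape
    total = 0.0
    for i in range(n):
        x = contexts[i]
        w = 1.0      # importance weight of the prefix a_{1:k-1}; equals 1 at k = 1
        prefix = []
        for k in range(K):
            a_k = rankings[i, k]
            # baseline term: expected q_hat at position k under the evaluation
            # policy, weighted only by the prefix importance weight
            baseline = w * sum(
                pi_e(x, prefix, a) * q_hat(x, prefix + [a]) for a in candidates
            )
            # extend the weight to cover position k
            w *= pi_e(x, prefix, a_k) / pi_b(x, prefix, a_k)
            # importance-weighted residual plus the control-variate baseline
            total += w * (rewards[i, k] - q_hat(x, prefix + [a_k])) + baseline
            prefix = prefix + [a_k]
    return total / n
```

The intended contrast is that the weight for position k involves only the first k positions rather than the entire ranking, and the `q_hat` baseline absorbs much of the remaining variance; the precise conditions under which the paper's estimator is unbiased should be taken from the original text.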
Related papers
- Rate-Optimal Rank Aggregation with Private Pairwise Rankings [12.511220449652384]
We address the challenge of preserving privacy while ensuring the utility of rank aggregation based on pairwise rankings.
We propose adaptively debiasing the rankings obtained from the randomized response mechanism.
arXiv Detail & Related papers (2024-02-26T18:05:55Z)
- Off-Policy Evaluation for Large Action Spaces via Policy Convolution [60.6953713877886]
The Policy Convolution family of estimators uses latent structure within actions to strategically convolve the logging and target policies.
Experiments on synthetic and benchmark datasets demonstrate remarkable mean squared error (MSE) improvements when using PC.
arXiv Detail & Related papers (2023-10-24T01:00:01Z)
- Off-Policy Evaluation of Ranking Policies under Diverse User Behavior [25.226825574282937]
Inverse Propensity Scoring (IPS) becomes extremely inaccurate in the ranking setup due to its high variance under large action spaces.
This work explores a far more general formulation where user behavior is diverse and can vary depending on the user context.
We show that the resulting estimator, which we call Adaptive IPS (AIPS), can be unbiased under any complex user behavior.
arXiv Detail & Related papers (2023-06-26T22:31:15Z)
- Uncertainty-Aware Instance Reweighting for Off-Policy Learning [63.31923483172859]
We propose an Uncertainty-aware Inverse Propensity Score estimator (UIPS) for improved off-policy learning.
Experimental results on synthetic and three real-world recommendation datasets demonstrate the advantageous sample efficiency of the proposed UIPS estimator.
arXiv Detail & Related papers (2023-03-11T11:42:26Z)
- Quantile Off-Policy Evaluation via Deep Conditional Generative Learning [21.448553360543478]
Off-policy evaluation (OPE) is concerned with evaluating a new target policy using offline data generated by a potentially different behavior policy.
We propose a doubly-robust inference procedure for quantile OPE in sequential decision making.
We demonstrate the advantages of this proposed estimator through both simulations and a real-world dataset from a short-video platform.
arXiv Detail & Related papers (2022-12-29T22:01:43Z)
- Off-policy evaluation for learning-to-rank via interpolating the item-position model and the position-based model [83.83064559894989]
A critical need for industrial recommender systems is the ability to evaluate recommendation policies offline, before deploying them to production.
We develop a new estimator that mitigates the problems of the two most popular off-policy estimators for rankings.
In particular, the new estimator, called INTERPOL, addresses the bias of a potentially misspecified position-based model.
arXiv Detail & Related papers (2022-10-15T17:22:30Z)
- Control Variates for Slate Off-Policy Evaluation [112.35528337130118]
We study the problem of off-policy evaluation from batched contextual bandit data with multidimensional actions.
We obtain new estimators with risk improvement guarantees over both the PI and self-normalized PI estimators.
arXiv Detail & Related papers (2021-06-15T06:59:53Z)
- Universal Off-Policy Evaluation [64.02853483874334]
We take the first steps towards a universal off-policy estimator (UnO).
We use UnO for estimating and simultaneously bounding the mean, variance, quantiles/median, inter-quantile range, CVaR, and the entire cumulative distribution of returns.
arXiv Detail & Related papers (2021-04-26T18:54:31Z)
- SetRank: A Setwise Bayesian Approach for Collaborative Ranking from Implicit Feedback [50.13745601531148]
We propose a novel setwise Bayesian approach for collaborative ranking, namely SetRank, to accommodate the characteristics of implicit feedback in recommender systems.
Specifically, SetRank aims at maximizing the posterior probability of novel setwise preference comparisons.
We also present a theoretical analysis of SetRank showing that the bound on the excess risk can be proportional to $\sqrt{M/N}$.
arXiv Detail & Related papers (2020-02-23T06:40:48Z)