Off-Policy Evaluation of Ranking Policies under Diverse User Behavior
- URL: http://arxiv.org/abs/2306.15098v1
- Date: Mon, 26 Jun 2023 22:31:15 GMT
- Title: Off-Policy Evaluation of Ranking Policies under Diverse User Behavior
- Authors: Haruka Kiyohara, Masatoshi Uehara, Yusuke Narita, Nobuyuki Shimizu,
Yasuo Yamamoto, Yuta Saito
- Abstract summary: Inverse Propensity Scoring (IPS) becomes extremely inaccurate in the ranking setup due to its high variance under large action spaces.
This work explores a far more general formulation where user behavior is diverse and can vary depending on the user context.
We show that the resulting estimator, which we call Adaptive IPS (AIPS), can be unbiased under any complex user behavior.
- Score: 25.226825574282937
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Ranking interfaces are everywhere in online platforms. There is thus an ever
growing interest in their Off-Policy Evaluation (OPE), aiming towards an
accurate performance evaluation of ranking policies using logged data. A
de-facto approach for OPE is Inverse Propensity Scoring (IPS), which provides
an unbiased and consistent value estimate. However, it becomes extremely
inaccurate in the ranking setup due to its high variance under large action
spaces. To deal with this problem, previous studies assume either independent
or cascade user behavior, resulting in some ranking versions of IPS. While
these estimators are somewhat effective in reducing the variance, all existing
estimators apply a single universal assumption to every user, causing excessive
bias and variance. Therefore, this work explores a far more general formulation
where user behavior is diverse and can vary depending on the user context. We
show that the resulting estimator, which we call Adaptive IPS (AIPS), can be
unbiased under any complex user behavior. Moreover, AIPS achieves the minimum
variance among all unbiased estimators based on IPS. We further develop a
procedure to identify the appropriate user behavior model to minimize the mean
squared error (MSE) of AIPS in a data-driven fashion. Extensive experiments
demonstrate that the empirical accuracy improvement can be significant,
enabling effective OPE of ranking systems even under diverse user behavior.
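
To make the distinction concrete, below is a minimal Python sketch (not the authors' implementation; the propensity lookup tables `pi_e`/`pi_0`, the function names, and the `relevant_positions` argument are illustrative assumptions) of how the importance weight changes under the standard, independent, cascade, and adaptive formulations discussed in the abstract.

```python
# Hedged sketch: how the importance weight of IPS-style estimators differs
# under different user-behavior assumptions. `pi_e[k][a]` / `pi_0[k][a]` are
# assumed to give the evaluation / logging probability of showing item `a`
# at position `k`; `ranking` is the logged ranking (a list of item ids).

def ips_weight(pi_e, pi_0, ranking):
    """Standard IPS: one joint weight over the entire ranking.
    Unbiased, but the product over all positions inflates the variance."""
    w = 1.0
    for k, a in enumerate(ranking):
        w *= pi_e[k][a] / pi_0[k][a]
    return w

def iips_weight(pi_e, pi_0, ranking, k):
    """Independent-behavior IPS: the reward observed at position k is assumed
    to depend only on the item shown at position k."""
    a = ranking[k]
    return pi_e[k][a] / pi_0[k][a]

def rips_weight(pi_e, pi_0, ranking, k):
    """Cascade-behavior IPS: the reward at position k is assumed to depend on
    the items shown at positions 0..k (top-down examination)."""
    w = 1.0
    for j in range(k + 1):
        a = ranking[j]
        w *= pi_e[j][a] / pi_0[j][a]
    return w

def aips_weight(pi_e, pi_0, ranking, k, relevant_positions):
    """Adaptive IPS (AIPS): the set of positions that can influence the reward
    at position k is chosen per user context (e.g. {k} for independent users,
    {0, ..., k} for cascade users), and the weight spans only those positions."""
    w = 1.0
    for j in relevant_positions:
        a = ranking[j]
        w *= pi_e[j][a] / pi_0[j][a]
    return w
```

Under this reading, AIPS recovers standard IPS when every position matters to every user, and the independent or cascade variants when the corresponding behavior holds, which is the intuition behind the unbiasedness and minimum-variance claims in the abstract.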
Related papers
- Optimal Baseline Corrections for Off-Policy Contextual Bandits [61.740094604552475]
We aim to learn decision policies that optimize an unbiased offline estimate of an online reward metric.
We propose a single framework built on their equivalence in learning scenarios.
Our framework enables us to characterize the variance-optimal unbiased estimator and provide a closed-form solution for it (a toy sketch of the general baseline-correction idea follows this list).
arXiv Detail & Related papers (2024-05-09T12:52:22Z)
- Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods significantly benefit from IS, enabling the effective reuse of previously collected samples.
However, IS is employed in RL as a passive tool for re-weighting historical samples.
We look for the best behavioral policy from which to collect samples to reduce the policy gradient variance.
arXiv Detail & Related papers (2024-05-09T09:08:09Z)
- Uncertainty-Aware Instance Reweighting for Off-Policy Learning [63.31923483172859]
We propose an Uncertainty-aware Inverse Propensity Score estimator (UIPS) for improved off-policy learning.
Experimental results on synthetic and three real-world recommendation datasets demonstrate the advantageous sample efficiency of the proposed UIPS estimator.
arXiv Detail & Related papers (2023-03-11T11:42:26Z)
- Inverse Propensity Score based offline estimator for deterministic ranking lists using position bias [0.1269104766024433]
We present a novel way of computing IPS using a position-bias model for deterministic logging policies.
We validate this technique using two different experiments on industry-scale data.
arXiv Detail & Related papers (2022-08-31T17:32:04Z)
- Cross Pairwise Ranking for Unbiased Item Recommendation [57.71258289870123]
We develop a new learning paradigm named Cross Pairwise Ranking (CPR).
CPR achieves unbiased recommendation without knowing the exposure mechanism.
We prove in theory that this way offsets the influence of user/item propensity on the learning.
arXiv Detail & Related papers (2022-04-26T09:20:27Z)
- Doubly-Robust Estimation for Unbiased Learning-to-Rank from Position-Biased Click Feedback [13.579420996461439]
We introduce a novel DR estimator that uses the expectation of treatment per rank instead of IPS estimation.
Our results indicate it requires several orders of magnitude fewer datapoints to converge at optimal performance.
arXiv Detail & Related papers (2022-03-31T15:38:25Z)
- Doubly Robust Off-Policy Evaluation for Ranking Policies under the Cascade Behavior Model [11.101369123145588]
Off-policy evaluation for ranking policies enables performance estimation of new ranking policies using only logged data.
Previous studies introduce some assumptions on user behavior to make the item space tractable.
We propose the Cascade Doubly Robust estimator, which assumes that a user interacts with items sequentially from the top position in a ranking.
arXiv Detail & Related papers (2022-02-03T12:42:33Z)
- Correcting the User Feedback-Loop Bias for Recommendation Systems [34.44834423714441]
We propose a systematic and dynamic way to correct user feedback-loop bias in recommendation systems.
Our method includes a deep-learning component to learn each user's dynamic rating history embedding.
We empirically validated the existence of such user feedback-loop bias in real world recommendation systems.
arXiv Detail & Related papers (2021-09-13T15:02:55Z)
- Control Variates for Slate Off-Policy Evaluation [112.35528337130118]
We study the problem of off-policy evaluation from batched contextual bandit data with multidimensional actions.
We obtain new estimators with risk improvement guarantees over both the PI and self-normalized PI estimators.
arXiv Detail & Related papers (2021-06-15T06:59:53Z)
- Universal Off-Policy Evaluation [64.02853483874334]
We take the first steps towards a universal off-policy estimator (UnO).
We use UnO for estimating and simultaneously bounding the mean, variance, quantiles/median, inter-quantile range, CVaR, and the entire cumulative distribution of returns.
arXiv Detail & Related papers (2021-04-26T18:54:31Z)
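
Several of the related papers above (the baseline-correction and control-variate works in particular) revolve around the same variance-reduction device for IPS-style estimators. Below is a hedged Python sketch of self-normalized IPS and a simple additive baseline correction; the function names and the default baseline choice are illustrative assumptions, not the estimators from any single paper above.

```python
import numpy as np

def snips(rewards, weights):
    """Self-normalized IPS: divide by the empirical mean importance weight,
    trading a small bias for a typically large variance reduction."""
    rewards = np.asarray(rewards, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return float(np.sum(weights * rewards) / np.sum(weights))

def baseline_corrected_ips(rewards, weights, baseline=None):
    """IPS with an additive baseline (control variate): subtract a constant b
    inside the weighted term and add it back outside. Because importance
    weights average to one in expectation, the estimator stays unbiased for
    any constant b, and b only shifts the variance; the works cited above
    derive the variance-optimal choice in closed form."""
    rewards = np.asarray(rewards, dtype=float)
    weights = np.asarray(weights, dtype=float)
    if baseline is None:
        baseline = float(np.mean(rewards))  # illustrative default, not the optimal b
    return float(np.mean(weights * (rewards - baseline)) + baseline)
```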