Off-Policy Evaluation of Ranking Policies via Embedding-Space User Behavior Modeling
- URL: http://arxiv.org/abs/2506.00446v1
- Date: Sat, 31 May 2025 07:58:53 GMT
- Title: Off-Policy Evaluation of Ranking Policies via Embedding-Space User Behavior Modeling
- Authors: Tatsuki Takahashi, Chihiro Maru, Hiroko Shoji
- Abstract summary: Off-policy evaluation in ranking settings with large ranking action spaces is essential for assessing new recommender policies. We introduce two new assumptions: no direct effect on rankings and a user behavior model on ranking embedding spaces. We then propose the generalized marginalized inverse propensity score estimator, which has statistically desirable properties.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Off-policy evaluation (OPE) in ranking settings with large ranking action spaces, which stem from an increase in both the number of unique actions and the length of the ranking, is essential for assessing new recommender policies using only logged bandit data from previous versions. To address the high variance of existing estimators, we introduce two new assumptions: no direct effect on rankings and a user behavior model on ranking embedding spaces. We then propose the generalized marginalized inverse propensity score (GMIPS) estimator, which has statistically desirable properties relative to existing estimators. Finally, we demonstrate that GMIPS achieves the lowest MSE among the compared estimators. Notably, among the GMIPS variants, the marginalized reward interaction IPS (MRIPS) incorporates a doubly marginalized importance weight based on a cascade behavior assumption on ranking embeddings. Our experiments show that MRIPS effectively balances the bias-variance trade-off even as the ranking action space grows and the above assumptions may fail to hold.
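As a rough illustration of the mechanics described above (not the authors' code; the discrete embedding space, array shapes, and all names are assumptions made for the sketch), marginalization replaces the ratio of action probabilities with the ratio of the probabilities of the logged action's *embedding*, and the cascade variant multiplies per-position marginal ratios cumulatively down the ranking:

```python
import numpy as np

def marginal_ratio(pi_e, pi_b, phi, a_logged):
    """Marginalized importance ratio: compare the probabilities that each
    policy generates the embedding of the logged action, not the action.
    pi_e, pi_b : (A,) action probabilities for this context
    phi        : (A,) action -> discrete embedding id
    a_logged   : index of the logged action"""
    mask = (phi == phi[a_logged])          # all actions sharing the logged embedding
    return pi_e[mask].sum() / pi_b[mask].sum()

def mrips_estimate(rewards, w_pos):
    """MRIPS-style value estimate: under the cascade assumption on embeddings,
    position k is weighted by the product of per-position marginal ratios
    from the top of the ranking down to k.
    rewards, w_pos : (n, K) per-position rewards and marginal ratios"""
    cascade_w = np.cumprod(w_pos, axis=1)  # doubly marginalized cascade weights
    return (cascade_w * rewards).sum(axis=1).mean()
```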
Related papers
- In-context Ranking Preference Optimization [48.36442791241395]
We propose an In-context Ranking Preference Optimization (IRPO) framework to optimize large language models (LLMs) based on ranking lists constructed during inference. We show IRPO outperforms standard DPO approaches in ranking performance, highlighting its effectiveness in aligning LLMs with direct in-context ranking preferences.
arXiv Detail & Related papers (2025-04-21T23:06:12Z)
- Bridging and Modeling Correlations in Pairwise Data for Direct Preference Optimization [75.1240295759264]
We propose an effective framework for Bridging and Modeling Correlations in pairwise data, named BMC. We increase the consistency and informativeness of the pairwise preference signals through targeted modifications. We identify that DPO alone is insufficient to model these correlations and capture nuanced variations.
arXiv Detail & Related papers (2024-08-14T11:29:47Z)
- Off-Policy Evaluation of Slate Bandit Policies via Optimizing Abstraction [22.215852332444907]
We study the problem of slate contextual bandits where a policy selects multi-dimensional actions known as slates.
The typical Inverse Propensity Scoring (IPS) estimator suffers from substantial variance due to large action spaces.
We develop a novel estimator for OPE of slate bandits, called Latent IPS (LIPS), which defines importance weights in a low-dimensional slate abstraction space.
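A toy sketch of the contrast LIPS exploits, assuming a context-free setting and a hypothetical slate-to-abstraction map `z_of`: slate-level IPS weights are heavy-tailed, while weights defined on the low-dimensional abstraction are far better behaved:

```python
import numpy as np

rng = np.random.default_rng(0)
n_slates, n_latent, n = 10_000, 20, 5_000     # huge slate space, small abstraction
z_of = rng.integers(0, n_latent, n_slates)    # hypothetical slate -> abstraction map

pi_b = rng.dirichlet(np.ones(n_slates))       # logging policy over whole slates
pi_e = rng.dirichlet(np.ones(n_slates))       # target policy over whole slates
s = rng.choice(n_slates, size=n, p=pi_b)      # logged slates

w_ips = pi_e[s] / pi_b[s]                     # slate-level IPS weights: heavy-tailed
p_b_z = np.bincount(z_of, weights=pi_b, minlength=n_latent)
p_e_z = np.bincount(z_of, weights=pi_e, minlength=n_latent)
w_lips = p_e_z[z_of[s]] / p_b_z[z_of[s]]      # weights on the latent abstraction

print(w_ips.var(), w_lips.var())              # the latent weights vary far less
```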
arXiv Detail & Related papers (2024-02-03T14:38:09Z)
- Off-Policy Evaluation for Large Action Spaces via Policy Convolution [60.6953713877886]
The Policy Convolution (PC) family of estimators uses latent structure within actions to strategically convolve the logging and target policies.
Experiments on synthetic and benchmark datasets demonstrate remarkable mean squared error (MSE) improvements when using PC.
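A minimal sketch of the convolution idea, under our assumption (for illustration only) that it can be phrased as smoothing each policy with a row-normalized similarity kernel over action embeddings:

```python
import numpy as np

def convolve_policy(pi, emb, tau=1.0):
    """Smooth a policy with a similarity kernel over action embeddings.
    pi  : (n, A) per-round action probabilities
    emb : (A, d) action embeddings"""
    d2 = ((emb[:, None, :] - emb[None, :, :]) ** 2).sum(-1)
    K = np.exp(-d2 / tau)
    K /= K.sum(axis=1, keepdims=True)     # each row is a distribution
    return pi @ K                         # probability mass spreads to similar actions
```

Because each row of `K` sums to one, `pi @ K` is still a valid policy; the importance weight is then the ratio of the convolved propensities at the logged action.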
arXiv Detail & Related papers (2023-10-24T01:00:01Z)
- Importance-Weighted Offline Learning Done Right [16.4989952150404]
We study the problem of offline policy optimization in contextual bandit problems.
The goal is to learn a near-optimal policy based on a dataset of decision data collected by a suboptimal behavior policy.
We show that a simple alternative approach based on the "implicit exploration" estimator of Neu (2015) yields performance guarantees that are superior in nearly all respects to all previous results.
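The implicit-exploration trick itself is one line: smooth the behavior propensity in the denominator. A hedged sketch (the parameter name `gamma` is conventional, not necessarily the paper's notation):

```python
def ix_weights(pi_e_logged, pi_b_logged, gamma=0.05):
    """Implicit-exploration style weights: smoothing the behavior propensity
    by gamma caps each weight at pi_e/gamma, trading a small, controlled bias
    for much better concentration of the estimate."""
    return pi_e_logged / (pi_b_logged + gamma)
```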
arXiv Detail & Related papers (2023-09-27T16:42:10Z)
- Off-Policy Evaluation of Ranking Policies under Diverse User Behavior [25.226825574282937]
Inverse Propensity Scoring (IPS) becomes extremely inaccurate in the ranking setup due to its high variance under large action spaces.
This work explores a far more general formulation where user behavior is diverse and can vary depending on the user context.
We show that the resulting estimator, which we call Adaptive IPS (AIPS), can be unbiased under any complex user behavior.
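One way to picture the adaptivity, as a sketch rather than the paper's algorithm: a context-dependent behavior model marks which positions of the ranking the user is assumed to examine, and the importance weight multiplies per-position ratios only over those positions:

```python
import numpy as np

def aips_weight(pi_e_pos, pi_b_pos, examined):
    """Behavior-adaptive weight: per-position ratios are multiplied only over
    the positions the context-dependent behavior model says the user
    examines; unexamined positions contribute a ratio of 1.
    pi_e_pos, pi_b_pos : (K,) per-position propensities of the logged ranking
    examined           : (K,) boolean mask from the behavior model"""
    ratios = np.where(examined, pi_e_pos / pi_b_pos, 1.0)
    return ratios.prod()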
arXiv Detail & Related papers (2023-06-26T22:31:15Z)
- Off-Policy Evaluation for Large Action Spaces via Conjunct Effect Modeling [30.835774920236872]
We study off-policy evaluation of contextual bandit policies for large discrete action spaces.
We propose a new estimator, called OffCEM, that is based on the conjunct effect model (CEM), a novel decomposition of the causal effect into a cluster effect and a residual effect.
Experiments demonstrate that OffCEM provides substantial improvements in OPE especially in the presence of many actions.
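A sketch of the two-part structure (the clustering `c_of` and regression model `f_hat` are assumed given; this follows the general cluster-weight-plus-residual-regression form, not necessarily the paper's exact estimator):

```python
import numpy as np

def offcem_estimate(r, a, c_of, pi_b, pi_e, f_hat):
    """Cluster effect via marginal importance weights over action clusters,
    residual effect via the regression model f_hat.
    r, a       : (n,) logged rewards and actions
    c_of       : (A,) action -> cluster id
    pi_b, pi_e : (n, A) propensities; f_hat : (n, A) reward model"""
    n = len(r)
    w = np.empty(n)
    for i in range(n):
        same = (c_of == c_of[a[i]])                 # logged action's cluster
        w[i] = pi_e[i, same].sum() / pi_b[i, same].sum()
    dm = (pi_e * f_hat).sum(axis=1)                 # model term under the target policy
    return (w * (r - f_hat[np.arange(n), a]) + dm).mean()
```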
arXiv Detail & Related papers (2023-05-14T04:16:40Z)
- Uncertainty-Aware Instance Reweighting for Off-Policy Learning [63.31923483172859]
We propose an Uncertainty-aware Inverse Propensity Score estimator (UIPS) for improved off-policy learning.
Experimental results on synthetic and three real-world recommendation datasets demonstrate the advantageous sample efficiency of the proposed UIPS estimator.
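Illustratively, and not as the paper's closed form, the idea is to shrink weights that rest on uncertain propensity estimates:

```python
def uips_weights(w, propensity_var, lam=1.0):
    """Uncertainty-aware shrinkage (illustrative): the larger the estimated
    variance behind a weight, the more it is pulled toward zero, trading
    bias for sample efficiency."""
    return w / (1.0 + lam * propensity_var * w)
```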
arXiv Detail & Related papers (2023-03-11T11:42:26Z)
- Off-policy evaluation for learning-to-rank via interpolating the item-position model and the position-based model [83.83064559894989]
A critical need for industrial recommender systems is the ability to evaluate recommendation policies offline, before deploying them to production.
We develop a new estimator that mitigates the problems of the two most popular off-policy estimators for rankings.
In particular, the new estimator, called INTERPOL, addresses the bias of a potentially misspecified position-based model.
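As a hedged sketch of the interpolation idea (the convex-combination form and `alpha` are our simplification, not necessarily INTERPOL's exact rule):

```python
import numpy as np

def interpol_estimate(r, w_ipm, w_pbm, alpha):
    """Convex interpolation of per-position weights from an item-position
    model (higher variance, less bias) and a position-based model (lower
    variance, biased if misspecified).
    r, w_ipm, w_pbm : (n, K) arrays; alpha : scalar or (K,) in [0, 1]"""
    w = alpha * w_ipm + (1.0 - alpha) * w_pbm
    return (w * r).sum(axis=1).mean()
```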
arXiv Detail & Related papers (2022-10-15T17:22:30Z)
- Doubly Robust Off-Policy Evaluation for Ranking Policies under the Cascade Behavior Model [11.101369123145588]
Off-policy evaluation for ranking policies enables performance estimation of new ranking policies using only logged data.
Previous studies introduce some assumptions on user behavior to make the item space tractable.
We propose the Cascade Doubly Robust estimator, which assumes that a user interacts with items sequentially from the top position in a ranking.
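A sketch of the resulting estimator shape under the cascade assumption, with an assumed reward baseline `q_hat` (names and shapes are illustrative):

```python
import numpy as np

def cascade_dr(r, w_pos, q_hat, q_hat_e):
    """Cascade DR-shaped estimate: position k carries the product of
    per-position ratios down to k; the baseline q_hat is subtracted at the
    logged action and added back in expectation under the target policy.
    r, q_hat : (n, K) rewards / baseline at the logged actions
    w_pos    : (n, K) per-position ratios pi_e/pi_b
    q_hat_e  : (n, K) baseline expectation under the target policy"""
    cw = np.cumprod(w_pos, axis=1)                    # cascade weights w_{1:k}
    cw_prev = np.hstack([np.ones((len(r), 1)), cw[:, :-1]])
    return (cw * (r - q_hat) + cw_prev * q_hat_e).sum(axis=1).mean()
```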
arXiv Detail & Related papers (2022-02-03T12:42:33Z)
- Control Variates for Slate Off-Policy Evaluation [112.35528337130118]
We study the problem of off-policy evaluation from batched contextual bandit data with multidimensional actions.
We obtain new estimators with risk improvement guarantees over both the pseudoinverse (PI) and self-normalized PI estimators.
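The simplest instance of this control-variate idea is self-normalization, which divides by the empirical mean of the weights, a quantity with expectation one:

```python
def self_normalized(w, r):
    """Self-normalized estimate: dividing by the mean weight (expectation 1)
    acts as a control variate and removes weight-correlated variance."""
    return (w * r).sum() / w.sum()
```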
arXiv Detail & Related papers (2021-06-15T06:59:53Z)