Counterfactual Risk Minimization with IPS-Weighted BPR and Self-Normalized Evaluation in Recommender Systems
- URL: http://arxiv.org/abs/2509.00333v1
- Date: Sat, 30 Aug 2025 03:14:56 GMT
- Title: Counterfactual Risk Minimization with IPS-Weighted BPR and Self-Normalized Evaluation in Recommender Systems
- Authors: Rahul Raja, Arpita Vats
- Abstract summary: Inverse propensity scoring (IPS) corrects exposure bias, but it often suffers from high variance and instability. We present a simple and effective pipeline that integrates IPS-weighted training with an IPS-weighted Bayesian Personalized Ranking objective. Experiments on synthetic and MovieLens 100K data show that our approach generalizes better under unbiased exposure.
- Score: 3.5507492850515323
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Learning and evaluating recommender systems from logged implicit feedback is challenging due to exposure bias. While inverse propensity scoring (IPS) corrects this bias, it often suffers from high variance and instability. In this paper, we present a simple and effective pipeline that integrates IPS-weighted training with an IPS-weighted Bayesian Personalized Ranking (BPR) objective augmented by a Propensity Regularizer (PR). We compare Direct Method (DM), IPS, and Self-Normalized IPS (SNIPS) for offline policy evaluation, and demonstrate how IPS-weighted training improves model robustness under biased exposure. The proposed PR further mitigates variance amplification from extreme propensity weights, leading to more stable estimates. Experiments on synthetic and MovieLens 100K data show that our approach generalizes better under unbiased exposure while reducing evaluation variance compared to naive and standard IPS methods, offering practical guidance for counterfactual learning and evaluation in real-world recommendation settings.
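To make the abstract's pipeline concrete, below is a minimal sketch of its two ingredients: the IPS-weighted BPR training objective with a Propensity Regularizer, and the IPS/SNIPS value estimators compared for offline evaluation. The abstract does not specify the regularizer's functional form, so the squared-weight penalty, the weight clipping, and all names and parameters (`ips_weighted_bpr_loss`, `pr_lambda`, `w_max`) are illustrative assumptions rather than the authors' implementation.

```python
# Minimal sketch, not the authors' code. Assumptions: exposure
# propensities are known per positive interaction, and the Propensity
# Regularizer (PR) is modeled as a penalty on squared IPS weights.
import torch
import torch.nn.functional as F

def ips_weighted_bpr_loss(pos_scores: torch.Tensor,
                          neg_scores: torch.Tensor,
                          propensities: torch.Tensor,
                          pr_lambda: float = 0.1,
                          w_max: float = 50.0) -> torch.Tensor:
    """IPS-weighted BPR with a hypothetical propensity regularizer."""
    w = (1.0 / propensities).clamp(max=w_max)     # clipped inverse-propensity weights
    bpr = -F.logsigmoid(pos_scores - neg_scores)  # standard pairwise BPR term
    weighted_risk = (w * bpr).mean()              # counterfactual risk estimate
    pr = (w ** 2).mean()                          # discourages extreme weights (assumed form)
    return weighted_risk + pr_lambda * pr

def ips_value(rewards, target_probs, logging_probs):
    """Vanilla IPS estimate of the target policy's value (unbiased, high variance)."""
    w = target_probs / logging_probs
    return (w * rewards).mean()

def snips_value(rewards, target_probs, logging_probs):
    """Self-normalized IPS: normalizing by the weight sum trades a small
    bias for a substantial variance reduction under extreme weights."""
    w = target_probs / logging_probs
    return (w * rewards).sum() / w.sum()
```

The Direct Method (DM) from the paper's comparison is omitted here: it fits a reward model and averages its predictions under the target policy, so it requires a learned model rather than a reweighting of logged rewards.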
Related papers
- Unbiased Dynamic Pruning for Efficient Group-Based Policy Optimization [60.87651283510059]
Group Relative Policy Optimization (GRPO) effectively scales LLM reasoning but incurs prohibitive computational costs. We propose Dynamic Pruning Policy Optimization (DPPO), a framework that enables dynamic pruning while preserving unbiased gradient estimation. To mitigate the data sparsity induced by pruning, we introduce Dense Prompt Packing, a window-based greedy strategy.
arXiv Detail & Related papers (2026-03-04T14:48:53Z)
- Additive Control Variates Dominate Self-Normalisation in Off-Policy Evaluation [8.907440501295346]
We show that SNIPS is equivalent to using a specific -- but generally sub-optimal -- additive baseline. Our results justify shifting from self-normalisation to optimal baseline corrections for both ranking and recommendation.
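For orientation, the estimators behind this claim can be written as follows (our notation, not the paper's): vanilla IPS, its self-normalized variant, and IPS with an additive baseline $\beta$. The stated result is that SNIPS implicitly corresponds to one particular choice of $\beta$, which is generally not the variance-minimizing one.

```latex
% IPS, SNIPS, and baseline-corrected IPS, with w_i the importance
% weight of logged action a_i and r_i its observed reward.
\hat{V}_{\mathrm{IPS}}   = \frac{1}{n}\sum_{i=1}^{n} w_i r_i, \qquad
\hat{V}_{\mathrm{SNIPS}} = \frac{\sum_{i=1}^{n} w_i r_i}{\sum_{i=1}^{n} w_i}, \qquad
\hat{V}_{\beta}          = \frac{1}{n}\sum_{i=1}^{n} w_i (r_i - \beta) + \beta.
```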
arXiv Detail & Related papers (2026-02-16T16:49:23Z)
- A Step Back: Prefix Importance Ratio Stabilizes Policy Optimization [58.116300485427764]
Reinforcement learning post-training can elicit reasoning behaviors in large language models. Token-level correction often leads to unstable training dynamics when the degree of off-policyness is large. We propose a simple yet effective objective, Minimum Prefix Ratio (MinPRO).
arXiv Detail & Related papers (2026-01-30T08:47:19Z)
- BAPO: Stabilizing Off-Policy Reinforcement Learning for LLMs via Balanced Policy Optimization with Adaptive Clipping [69.74252624161652]
We propose BAlanced Policy Optimization with Adaptive Clipping (BAPO). BAPO dynamically adjusts clipping bounds to adaptively re-balance positive and negative contributions, preserve entropy, and stabilize RL optimization. On the AIME 2024 and AIME 2025 benchmarks, our 7B BAPO model surpasses open-source counterparts such as SkyWork-OR1-7B.
arXiv Detail & Related papers (2025-10-21T12:55:04Z)
- Practical Improvements of A/B Testing with Off-Policy Estimation [51.25970890274447]
We introduce a family of unbiased off-policy estimators that achieves lower variance than the standard approach. Our theoretical analysis and experimental results validate the effectiveness and practicality of the proposed method.
arXiv Detail & Related papers (2025-06-12T13:11:01Z)
- Off-Policy Evaluation of Ranking Policies via Embedding-Space User Behavior Modeling [0.0]
Off-policy evaluation in ranking settings with large action spaces is essential for assessing new recommender policies. We introduce two new assumptions: no direct effect on rankings and a user behavior model on ranking embedding spaces. We then propose the generalized marginalized inverse propensity score estimator with statistically desirable properties.
arXiv Detail & Related papers (2025-05-31T07:58:53Z)
- Multi-Preference Optimization: Generalizing DPO via Set-Level Contrasts [17.243429150450886]
We propose $\textbf{Multi-Preference Optimization (MPO)}$ to optimize over entire sets of responses. MPO employs deviation-based weighting, which emphasizes outlier responses that deviate most from the mean reward. We theoretically prove that MPO reduces alignment bias at a rate of $\mathcal{O}\!\left(\frac{1}{\sqrt{n}}\right)$ with respect to the number of responses per query.
arXiv Detail & Related papers (2024-12-05T21:50:22Z)
- Uncertainty-Penalized Direct Preference Optimization [52.387088396044206]
We develop a pessimistic framework for DPO by introducing preference uncertainty penalization schemes.
The penalization serves as a correction to the loss which attenuates the loss gradient for uncertain samples.
We show improved overall performance compared to vanilla DPO, as well as better completions on prompts from high-uncertainty chosen/rejected responses.
arXiv Detail & Related papers (2024-10-26T14:24:37Z)
- Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods significantly benefit from importance sampling (IS), enabling the effective reuse of previously collected samples.
However, IS is employed in RL as a passive tool for re-weighting historical samples.
We look for the best behavioral policy from which to collect samples to reduce the policy gradient variance.
arXiv Detail & Related papers (2024-05-09T09:08:09Z)
- Off-Policy Evaluation of Ranking Policies under Diverse User Behavior [25.226825574282937]
Inverse Propensity Scoring (IPS) becomes extremely inaccurate in the ranking setup due to its high variance under large action spaces.
This work explores a far more general formulation where user behavior is diverse and can vary depending on the user context.
We show that the resulting estimator, which we call Adaptive IPS (AIPS), can be unbiased under any complex user behavior.
arXiv Detail & Related papers (2023-06-26T22:31:15Z)
- Safe Deployment for Counterfactual Learning to Rank with Exposure-Based Risk Minimization [63.93275508300137]
We introduce a novel risk-aware Counterfactual Learning To Rank method with theoretical guarantees for safe deployment.
Our experimental results demonstrate the efficacy of our proposed method, which is effective at avoiding initial periods of bad performance when little data is available.
arXiv Detail & Related papers (2023-04-26T15:54:23Z)
- Uncertainty-Aware Instance Reweighting for Off-Policy Learning [63.31923483172859]
We propose an Uncertainty-aware Inverse Propensity Score estimator (UIPS) for improved off-policy learning.
Experimental results on synthetic and three real-world recommendation datasets demonstrate the advantageous sample efficiency of the proposed UIPS estimator.
arXiv Detail & Related papers (2023-03-11T11:42:26Z)
- On the Reuse Bias in Off-Policy Reinforcement Learning [28.29153543457396]
Reuse Bias is the bias in off-policy evaluation caused by the reuse of the replay buffer for evaluation and optimization.
We show that the off-policy evaluation and optimization of the current policy with the data from the replay buffer result in an overestimation of the objective.
We present a novel Bias-Regularized Importance Sampling (BIRIS) framework along with practical algorithms, which can alleviate the negative impact of the Reuse Bias.
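As a quick intuition for the overestimation (our toy illustration, not the paper's analysis): if the same noisy estimates are used both to score candidate policies and to select the best one, the selected score is biased upward, exactly as in a winner's curse.

```python
# Toy demonstration of reuse bias: every candidate policy truly has
# value 0, but picking the best-looking candidate on the same noisy
# data used for evaluation yields a large positive estimate.
import numpy as np

rng = np.random.default_rng(0)
n_trials, n_candidates = 1000, 10

# Noisy value estimates from the reused buffer, one row per trial.
estimates = rng.normal(0.0, 1.0, size=(n_trials, n_candidates))
selected = estimates.max(axis=1)  # optimize on the evaluation data

print(selected.mean())  # ~1.5, far above the true value of 0
```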
arXiv Detail & Related papers (2022-09-15T06:20:36Z)
- Learning to Estimate Without Bias [57.82628598276623]
The Gauss-Markov theorem states that the weighted least squares estimator is the linear minimum variance unbiased estimator (MVUE) in linear models.
In this paper, we take a first step towards extending this result to nonlinear settings via deep learning with bias constraints.
A second motivation for BCE is in applications where multiple estimates of the same unknown are averaged for improved performance.
arXiv Detail & Related papers (2021-10-24T10:23:51Z)
- Understanding the Effects of Adversarial Personalized Ranking Optimization Method on Recommendation Quality [6.197934754799158]
We model the learning characteristics of the Bayesian Personalized Ranking (BPR) and APR optimization frameworks.
We show that APR amplifies the popularity bias more than BPR, due to an unbalanced number of positive updates received from short-head items.
arXiv Detail & Related papers (2021-07-29T10:22:20Z)