Additive Control Variates Dominate Self-Normalisation in Off-Policy Evaluation
- URL: http://arxiv.org/abs/2602.14914v1
- Date: Mon, 16 Feb 2026 16:49:23 GMT
- Title: Additive Control Variates Dominate Self-Normalisation in Off-Policy Evaluation
- Authors: Olivier Jeunen, Shashank Gupta
- Abstract summary: We show that SNIPS is asymptotically equivalent to using a specific -- but generally sub-optimal -- additive baseline. Our results justify shifting from self-normalisation to optimal baseline corrections for both ranking and recommendation.
- Score: 8.907440501295346
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Off-policy evaluation (OPE) is essential for assessing ranking and recommendation systems without costly online interventions. Self-Normalised Inverse Propensity Scoring (SNIPS) is a standard tool for variance reduction in OPE, leveraging a multiplicative control variate. Recent advances in off-policy learning suggest that additive control variates (baseline corrections) may offer superior performance, yet theoretical guarantees for evaluation are lacking. This paper provides a definitive answer: we prove that $\beta^\star$-IPS, an estimator with an optimal additive baseline, asymptotically dominates SNIPS in Mean Squared Error. By analytically decomposing the variance gap, we show that SNIPS is asymptotically equivalent to using a specific -- but generally sub-optimal -- additive baseline. Our results theoretically justify shifting from self-normalisation to optimal baseline corrections for both ranking and recommendation.
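To make the contrast concrete, here is a minimal numerical sketch (not the authors' code) comparing IPS, SNIPS, and an additively baseline-corrected estimator on synthetic logged bandit data. The closed form $\beta^\star = \mathrm{Cov}(wr, w)/\mathrm{Var}(w)$ is the variance minimiser for the control-variate form used below; treating it as the paper's $\beta^\star$-IPS is an assumption on my part.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Synthetic logged bandit feedback: importance weights w = pi_e(a|x)/pi_0(a|x)
# (so E[w] = 1 under the logging policy) and binary rewards r.
w = rng.lognormal(mean=-0.5, sigma=1.0, size=n)  # E[lognormal(-0.5, 1)] = 1
r = rng.binomial(1, np.clip(0.3 + 0.05 * w, 0.0, 1.0))

ips = np.mean(w * r)                  # unbiased, high variance
snips = np.sum(w * r) / np.sum(w)     # multiplicative control variate

# Additive control variate: V_beta = mean(w*r - beta*(w - 1)); unbiased for
# any fixed beta because E[w - 1] = 0. The variance-minimising choice for
# this form is beta* = Cov(w*r, w) / Var(w), estimated here by plug-in.
cov_wr_w = np.mean(w * r * w) - np.mean(w * r) * np.mean(w)
beta_star = cov_wr_w / np.var(w)
beta_ips = np.mean(w * r - beta_star * (w - 1))

print(f"IPS={ips:.4f}  SNIPS={snips:.4f}  beta*-IPS={beta_ips:.4f}")
```

Note that estimating $\beta^\star$ from the evaluation sample itself introduces a bias that vanishes with $n$; the dominance claim concerns the asymptotic regime.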
Related papers
- Unbiased Dynamic Pruning for Efficient Group-Based Policy Optimization [60.87651283510059]
Group Relative Policy Optimization (GRPO) effectively scales LLM reasoning but incurs prohibitive computational costs. We propose Dynamic Pruning Policy Optimization (DPPO), a framework that enables dynamic pruning while preserving unbiased gradient estimation. To mitigate the data sparsity induced by pruning, we introduce Dense Prompt Packing, a window-based greedy strategy.
arXiv Detail & Related papers (2026-03-04T14:48:53Z) - Mitigating Mismatch within Reference-based Preference Optimization [55.07698254211876]
Direct Preference Optimization (DPO) has become the de facto standard for offline preference alignment of large language models. DPO weighs each update relative to a reference, which stabilizes the training by regularizing the updates within a trusted region. This reliance becomes problematic for pessimistic pairs, where the reference model prefers the rejected response. We modify DPO to treat the reference as neutral when it is pessimistic by replacing the reference margin $\Delta_{\mathrm{ref}}$ with $\max(0, \Delta_{\mathrm{ref}})$.
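A hedged PyTorch sketch of the clipped-reference idea described above; the exact placement of the clip and all names follow my reading of the summary, not the paper's code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(pi_logp_w, pi_logp_l, ref_logp_w, ref_logp_l,
             beta=0.1, clip_pessimistic_ref=True):
    """Per-pair DPO loss with an optionally clipped reference margin."""
    policy_margin = pi_logp_w - pi_logp_l   # policy log-prob margin
    ref_margin = ref_logp_w - ref_logp_l    # reference margin Delta_ref
    if clip_pessimistic_ref:
        # Pessimistic pair: the reference prefers the rejected response
        # (Delta_ref < 0); treat it as neutral via max(0, Delta_ref).
        ref_margin = torch.clamp(ref_margin, min=0.0)
    return -F.logsigmoid(beta * (policy_margin - ref_margin))
```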
arXiv Detail & Related papers (2026-02-12T12:55:51Z) - Counterfactual Risk Minimization with IPS-Weighted BPR and Self-Normalized Evaluation in Recommender Systems [3.5507492850515323]
Inverse propensity scoring (IPS) corrects exposure bias in logged interaction data, but it often suffers from high variance and instability. We present a simple and effective pipeline that integrates IPS-weighted training with an IPS-weighted Bayesian Personalized Ranking objective. Experiments on synthetic and MovieLens 100K data show that our approach generalizes better under unbiased exposure.
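A minimal sketch of what an IPS-weighted BPR objective could look like; the propensity handling and clipping threshold are illustrative assumptions, not the paper's pipeline.

```python
import torch
import torch.nn.functional as F

def ips_weighted_bpr_loss(score_pos, score_neg, propensity, clip=0.1):
    """Pairwise BPR loss, reweighted by clipped inverse propensities
    of the observed positive items to correct exposure bias."""
    w = 1.0 / torch.clamp(propensity, min=clip)  # clipping controls variance
    return (w * -F.logsigmoid(score_pos - score_neg)).mean()
```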
arXiv Detail & Related papers (2025-08-30T03:14:56Z) - Off-Policy Evaluation of Ranking Policies via Embedding-Space User Behavior Modeling [0.0]
Off-policy evaluation in ranking settings with large ranking action spaces is essential for assessing new recommender policies. We introduce two new assumptions: no direct effect on rankings and a user behavior model on ranking embedding spaces. We then propose the generalized marginalized inverse propensity score estimator with statistically desirable properties.
arXiv Detail & Related papers (2025-05-31T07:58:53Z) - Optimal Baseline Corrections for Off-Policy Contextual Bandits [61.740094604552475]
We aim to learn decision policies that optimize an unbiased offline estimate of an online reward metric.
We propose a single framework built on the equivalence of additive (baseline-correction) and multiplicative (self-normalisation) control variates in learning scenarios.
Our framework enables us to characterize the variance-optimal unbiased estimator and provide a closed-form solution for it.
arXiv Detail & Related papers (2024-05-09T12:52:22Z) - Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods significantly benefit from importance sampling (IS), enabling the effective reuse of previously collected samples.
However, IS is employed in RL as a passive tool for re-weighting historical samples.
We look for the best behavioral policy from which to collect samples to reduce the policy gradient variance.
arXiv Detail & Related papers (2024-05-09T09:08:09Z) - Off-Policy Evaluation of Slate Bandit Policies via Optimizing Abstraction [22.215852332444907]
We study the problem of slate contextual bandits where a policy selects multi-dimensional actions known as slates.
The typical Inverse Propensity Scoring (IPS) estimator suffers from substantial variance due to large action spaces.
We develop a novel estimator for OPE of slate bandits, called Latent IPS (LIPS), which defines importance weights in a low-dimensional slate abstraction space.
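An illustrative sketch of the LIPS weight construction: importance weights are defined over a learned slate abstraction $\phi(s)$ rather than the raw slate, shrinking the effective action space. The abstraction densities are hypothetical inputs here, not the paper's API.

```python
import numpy as np

def lips_estimate(rewards, p_eval_abs, p_log_abs):
    """Latent IPS: weights on abstraction probabilities, not raw slates.

    p_eval_abs[i] ~ p(phi(s_i) | x_i; pi_eval)
    p_log_abs[i]  ~ p(phi(s_i) | x_i; pi_log)
    """
    w = p_eval_abs / p_log_abs   # low-dimensional weights: far smaller
    return np.mean(w * rewards)  # variance than raw slate-level IPS
```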
arXiv Detail & Related papers (2024-02-03T14:38:09Z) - Off-Policy Evaluation of Ranking Policies under Diverse User Behavior [25.226825574282937]
Inverse Propensity Scoring (IPS) becomes extremely inaccurate in the ranking setup due to its high variance under large action spaces.
This work explores a far more general formulation where user behavior is diverse and can vary depending on the user context.
We show that the resulting estimator, which we call Adaptive IPS (AIPS), can be unbiased under any complex user behavior.
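A rough sketch of how an AIPS-style weight could adapt to user behavior: the importance weight is taken only over the ranking positions that a context-dependent behavior model marks as reward-relevant. The behavior assignment and factorised slot probabilities are my illustrative assumptions.

```python
import numpy as np

def aips_weight(pi_e_slot_probs, pi_0_slot_probs, relevant_slots):
    """Importance weight restricted to behavior-relevant slots, e.g. all
    slots for 'independent' users or a top prefix for 'cascade' users."""
    w = 1.0
    for k in relevant_slots:
        w *= pi_e_slot_probs[k] / pi_0_slot_probs[k]
    return w
```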
arXiv Detail & Related papers (2023-06-26T22:31:15Z) - The Role of Baselines in Policy Gradient Optimization [83.42050606055822]
We show that the state value baseline allows on-policy natural policy gradient (NPG) to converge to a globally optimal policy at an $O(1/t)$ rate.
We find that the primary effect of the value baseline is to reduce the aggressiveness of the updates rather than their variance.
arXiv Detail & Related papers (2023-01-16T06:28:00Z) - Control Variates for Slate Off-Policy Evaluation [112.35528337130118]
We study the problem of off-policy evaluation from batched contextual bandit data with multidimensional actions.
We obtain new estimators with risk improvement guarantees over both the pseudoinverse (PI) and self-normalized PI estimators.
arXiv Detail & Related papers (2021-06-15T06:59:53Z)