On the Reliability of Sampling Strategies in Offline Recommender Evaluation
- URL: http://arxiv.org/abs/2508.05398v1
- Date: Thu, 07 Aug 2025 13:50:05 GMT
- Title: On the Reliability of Sampling Strategies in Offline Recommender Evaluation
- Authors: Bruno L. Pereira, Alan Said, Rodrygo L. T. Santos
- Abstract summary: Offline evaluation plays a central role in benchmarking recommender systems when online testing is impractical or risky. It is susceptible to two key sources of bias: exposure bias, where users only interact with items they are shown, and sampling bias, introduced when evaluation is performed on a subset of logged items rather than the full catalog.
- Score: 3.4956406636452626
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Offline evaluation plays a central role in benchmarking recommender systems when online testing is impractical or risky. However, it is susceptible to two key sources of bias: exposure bias, where users only interact with items they are shown, and sampling bias, introduced when evaluation is performed on a subset of logged items rather than the full catalog. While prior work has proposed methods to mitigate sampling bias, these are typically assessed on fixed logged datasets rather than for their ability to support reliable model comparisons under varying exposure conditions or relative to true user preferences. In this paper, we investigate how different combinations of logging and sampling choices affect the reliability of offline evaluation. Using a fully observed dataset as ground truth, we systematically simulate diverse exposure biases and assess the reliability of common sampling strategies along four dimensions: sampling resolution (recommender model separability), fidelity (agreement with full evaluation), robustness (stability under exposure bias), and predictive power (alignment with ground truth). Our findings highlight when and how sampling distorts evaluation outcomes and offer practical guidance for selecting strategies that yield faithful and robust offline comparisons.
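To make the fidelity dimension concrete, here is a minimal, self-contained sketch (not the paper's code; the synthetic-affinity setup, constants, and function names are assumptions). It scores a handful of synthetic recommenders of decreasing quality, evaluates each with hit rate over the full catalog and over a 1-plus-99 uniformly sampled candidate set, and checks whether the sampled metric preserves the full-evaluation model ordering via Kendall's tau.

```python
# Minimal sketch (assumed setup): does a sampled metric preserve the model
# ordering produced by full-catalog evaluation?
import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(0)
n_users, n_items, n_neg, k = 500, 2000, 99, 10

# Latent "true" affinities; each user's held-out positive is their best item.
true_affinity = rng.normal(size=(n_users, n_items))
positives = true_affinity.argmax(axis=1)

def hit_rate(scores, candidates):
    """HR@k: fraction of users whose positive (stored first) is in the top-k candidates."""
    hits = 0
    for u in range(n_users):
        cand_scores = scores[u, candidates[u]]
        rank = (cand_scores > cand_scores[0]).sum()  # items scored above the positive
        hits += rank < k
    return hits / n_users

# Candidate sets: full catalog vs. positive plus n_neg uniformly sampled negatives.
full_candidates, sampled_candidates = [], []
for u in range(n_users):
    others = np.delete(np.arange(n_items), positives[u])
    full_candidates.append(np.r_[positives[u], others])
    sampled_candidates.append(np.r_[positives[u], rng.choice(others, n_neg, replace=False)])

# Synthetic "models" of decreasing quality: more noise = weaker recommender.
full_hr, sampled_hr = [], []
for noise in [0.1, 0.5, 1.0, 2.0, 4.0]:
    model_scores = true_affinity + rng.normal(scale=noise, size=true_affinity.shape)
    full_hr.append(hit_rate(model_scores, full_candidates))
    sampled_hr.append(hit_rate(model_scores, sampled_candidates))

# Fidelity check: agreement between the two model orderings.
tau, _ = kendalltau(full_hr, sampled_hr)
print("HR@10, full catalog :", np.round(full_hr, 3))
print("HR@10, sampled      :", np.round(sampled_hr, 3))
print("Kendall tau between model orderings:", round(float(tau), 3))
```

With uniform negatives and toy models of cleanly separated quality, this ordering is typically preserved (tau near 1); the paper's broader question is when realistic exposure conditions and sampling strategies break such agreement.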
Related papers
- Pairwise or Pointwise? Evaluating Feedback Protocols for Bias in LLM-Based Evaluation [57.380464382910375]
We show that the choice of feedback protocol can significantly affect evaluation reliability and induce systematic biases. In particular, we show that pairwise evaluation protocols are more vulnerable to distracted evaluation.
arXiv Detail & Related papers (2025-04-20T19:05:59Z)
- Towards Robust Offline Evaluation: A Causal and Information Theoretic Framework for Debiasing Ranking Systems [6.540293515339111]
Offline evaluation of retrieval-ranking systems is crucial for developing high-performing models. We propose a novel framework for robust offline evaluation of such systems. Our contributions include (1) a causal formulation for addressing offline evaluation biases, (2) a system-agnostic debiasing framework, and (3) empirical validation of its effectiveness.
arXiv Detail & Related papers (2025-04-04T23:52:57Z)
- Rethinking Relation Extraction: Beyond Shortcuts to Generalization with a Debiased Benchmark [53.876493664396506]
Benchmarks are crucial for evaluating machine learning algorithm performance, facilitating comparison and identifying superior solutions. This paper addresses the issue of entity bias in relation extraction tasks, where models tend to rely on entity mentions rather than context. We propose a debiased relation extraction benchmark, DREB, that breaks the pseudo-correlation between entity mentions and relation types through entity replacement. To establish a new baseline on DREB, we introduce MixDebias, a debiasing method combining data-level and model training-level techniques.
arXiv Detail & Related papers (2025-01-02T17:01:06Z)
- Debias Can be Unreliable: Mitigating Bias Issue in Evaluating Debiasing Recommendation [34.19561411584444]
The traditional evaluation scheme is not suitable for randomly-exposed datasets. We propose the Unbiased Recall Evaluation scheme, which adjusts the utilization of randomly-exposed datasets to unbiasedly estimate the true Recall performance.
arXiv Detail & Related papers (2024-09-07T12:42:58Z)
- Balancing Unobserved Confounding with a Few Unbiased Ratings in Debiased Recommendations [4.960902915238239]
We propose a theoretically guaranteed model-agnostic balancing approach that can be applied to any existing debiasing method.
The proposed approach makes full use of unbiased data by alternately correcting model parameters learned with biased data and adaptively learning balance coefficients of biased samples for further debiasing.
arXiv Detail & Related papers (2023-04-17T08:56:55Z)
- Uncertainty-Aware Instance Reweighting for Off-Policy Learning [63.31923483172859]
We propose an Uncertainty-aware Inverse Propensity Score estimator (UIPS) for improved off-policy learning.
Experimental results on synthetic and three real-world recommendation datasets demonstrate the advantageous sample efficiency of the proposed UIPS estimator.
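For context, the sketch below shows the plain inverse propensity score (IPS) estimator that UIPS builds on, applied to synthetic logged bandit feedback; it is a generic illustration under an assumed setup, not the paper's uncertainty-aware estimator.

```python
# Minimal sketch (assumed setup) of plain IPS for off-policy evaluation on
# synthetic logged bandit feedback; UIPS adds an uncertainty-aware correction
# on top of this basic reweighting idea.
import numpy as np

rng = np.random.default_rng(0)
n_logs, n_actions = 100_000, 10

# True expected reward per action (unknown to the estimator).
true_reward = rng.uniform(0.1, 0.9, size=n_actions)

# Non-uniform logging policy, so naively averaging logged rewards is biased.
logging_probs = rng.uniform(0.5, 1.5, size=n_actions)
logging_probs /= logging_probs.sum()
actions = rng.choice(n_actions, size=n_logs, p=logging_probs)
rewards = rng.binomial(1, true_reward[actions])

# Target policy to evaluate: uniform over all actions.
target_probs = np.full(n_actions, 1.0 / n_actions)

# IPS: reweight each logged reward by pi_target(a) / pi_logging(a).
weights = target_probs[actions] / logging_probs[actions]
ips_estimate = np.mean(weights * rewards)

print("true value of target policy:", round(float(true_reward.mean()), 4))
print("naive logged-reward average:", round(float(rewards.mean()), 4))
print("IPS estimate               :", round(float(ips_estimate), 4))
```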
arXiv Detail & Related papers (2023-03-11T11:42:26Z)
- Delving into Identify-Emphasize Paradigm for Combating Unknown Bias [52.76758938921129]
We propose an effective bias-conflicting scoring method (ECS) to boost the identification accuracy.
We also propose gradient alignment (GA) to balance the contributions of the mined bias-aligned and bias-conflicting samples.
Experiments are conducted on multiple datasets in various settings, demonstrating that the proposed solution can mitigate the impact of unknown biases.
arXiv Detail & Related papers (2023-02-22T14:50:24Z)
- Systematic Evaluation of Predictive Fairness [60.0947291284978]
Mitigating bias in training on biased datasets is an important open problem.
We examine the performance of various debiasing methods across multiple tasks.
We find that data conditions have a strong influence on relative model performance.
arXiv Detail & Related papers (2022-10-17T05:40:13Z)
- Holistic Approach to Measure Sample-level Adversarial Vulnerability and its Utility in Building Trustworthy Systems [17.707594255626216]
An adversarial attack perturbs an image with imperceptible noise, leading to an incorrect model prediction.
We propose a holistic approach for quantifying adversarial vulnerability of a sample by combining different perspectives.
We demonstrate that by reliably estimating adversarial vulnerability at the sample level, it is possible to develop a trustworthy system.
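As a generic illustration of the perturbation described above (not the paper's vulnerability measure), the sketch below applies a fast-gradient-sign-style step to a toy logistic-regression classifier; the model, constants, and data are assumptions.

```python
# Generic FGSM-style sketch on a toy logistic-regression "image" classifier:
# a small, sign-of-gradient perturbation pushes the prediction toward the
# wrong class. Illustration only; not the vulnerability measure from the paper.
import numpy as np

rng = np.random.default_rng(0)
d = 28 * 28                            # flattened toy "image"
w = rng.normal(size=d) / np.sqrt(d)    # scaled toy weights, pretend they are trained
x = rng.normal(size=d)                 # a clean input
y = float(w @ x > 0)                   # a label the clean model gets right

def predict(v):
    """Predicted probability of class 1 under the toy linear model."""
    return 1.0 / (1.0 + np.exp(-(w @ v)))

# Gradient of the logistic loss with respect to the input: (p - y) * w.
grad_x = (predict(x) - y) * w

# FGSM: move each pixel by a small budget eps in the sign of the input gradient.
eps = 0.1
x_adv = x + eps * np.sign(grad_x)

print("clean prediction      :", round(float(predict(x)), 3))
print("adversarial prediction:", round(float(predict(x_adv)), 3))
```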
arXiv Detail & Related papers (2022-05-05T12:36:17Z)
- On robust risk-based active-learning algorithms for enhanced decision support [0.0]
Classification models are a fundamental component of physical-asset management technologies such as structural health monitoring (SHM) systems and digital twins.
The paper proposes two novel approaches to counteract the effects of sampling bias: semi-supervised learning and discriminative classification models.
arXiv Detail & Related papers (2022-01-07T17:25:41Z)
- On conditional versus marginal bias in multi-armed bandits [105.07190334523304]
The bias of the sample means of the arms in multi-armed bandits is an important issue in adaptive data analysis.
We characterize the sign of the conditional bias of monotone functions of the rewards, including the sample mean.
Our results hold for arbitrary conditioning events and leverage natural monotonicity properties of the data collection policy.
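The effect being characterized can be reproduced with a small Monte Carlo sketch (not the paper's analysis): with two identical arms and a greedy data-collection rule, the per-arm sample means are systematically biased. The setup and constants below are assumptions.

```python
# Monte Carlo sketch: with a greedy (adaptive) data-collection policy, the
# per-arm sample means are biased even though both arms have the same true
# mean; this is the kind of bias the bandit literature characterizes.
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.0, 0.0])   # identical arms, so any systematic gap is bias
n_rounds, n_reps = 100, 5000

total_bias = np.zeros(2)
for _ in range(n_reps):
    sums = np.zeros(2)
    counts = np.zeros(2)
    for t in range(n_rounds):
        # pull each arm once, then always pull the arm with the higher sample mean
        arm = t if t < 2 else int(np.argmax(sums / counts))
        reward = rng.normal(true_means[arm], 1.0)
        sums[arm] += reward
        counts[arm] += 1
    total_bias += sums / counts - true_means

# Under this greedy rule the estimated bias is typically negative for both arms.
print("average bias of per-arm sample means:", np.round(total_bias / n_reps, 4))
```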
arXiv Detail & Related papers (2020-02-19T20:16:10Z)