Towards Robust Offline Evaluation: A Causal and Information Theoretic Framework for Debiasing Ranking Systems
- URL: http://arxiv.org/abs/2504.03997v1
- Date: Fri, 04 Apr 2025 23:52:57 GMT
- Title: Towards Robust Offline Evaluation: A Causal and Information Theoretic Framework for Debiasing Ranking Systems
- Authors: Seyedeh Baharan Khatami, Sayan Chakraborty, Ruomeng Xu, Babak Salimi
- Abstract summary: Offline evaluation of retrieval-ranking systems is crucial for developing high-performing models. We propose a novel framework for robust offline evaluation of retrieval-ranking systems. Our contributions include (1) a causal formulation for addressing offline evaluation biases, (2) a system-agnostic debiasing framework, and (3) empirical validation of its effectiveness.
- Score: 6.540293515339111
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Evaluating retrieval-ranking systems is crucial for developing high-performing models. While online A/B testing is the gold standard, its high cost and risks to user experience require effective offline methods. However, relying on historical interaction data introduces biases (such as selection, exposure, conformity, and position biases) that distort evaluation metrics. These biases are driven by the Missing-Not-At-Random (MNAR) nature of user interactions and favor popular or frequently exposed items over true user preferences. We propose a novel framework for robust offline evaluation of retrieval-ranking systems, transforming MNAR data into Missing-At-Random (MAR) through reweighting combined with black-box optimization, guided by neural estimation of information-theoretic metrics. Our contributions include (1) a causal formulation for addressing offline evaluation biases, (2) a system-agnostic debiasing framework, and (3) empirical validation of its effectiveness. This framework enables more accurate, fair, and generalizable evaluations, enhancing model assessment before deployment.
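To illustrate why reweighting matters for MNAR interaction logs, here is a minimal sketch in Python. It is not the paper's method: the abstract describes weights found through black-box optimization guided by neural estimates of information-theoretic metrics, whereas this sketch swaps in textbook self-normalized inverse-propensity scoring and assumes the exposure propensities are known. The function names, the 30% relevance rate, and the 0.8/0.2 propensities are all hypothetical values chosen for the simulation.

```python
import random


def naive_mean(observed):
    """Average reward over logged interactions only (biased under MNAR)."""
    return sum(reward for reward, _ in observed) / len(observed)


def snips_mean(observed):
    """Self-normalized inverse-propensity estimate of the mean reward."""
    weights = [1.0 / propensity for _, propensity in observed]
    weighted = sum(w * reward for w, (reward, _) in zip(weights, observed))
    return weighted / sum(weights)


def simulate(n_items=20000, seed=0):
    """Simulate a log in which relevant items are far more likely to be observed."""
    rng = random.Random(seed)
    true_total = 0.0
    observed = []
    for _ in range(n_items):
        relevant = rng.random() < 0.3          # assumed base relevance rate
        reward = 1.0 if relevant else 0.0
        # Observation probability depends on the outcome itself: the MNAR case.
        propensity = 0.8 if relevant else 0.2  # assumed known propensities
        true_total += reward
        if rng.random() < propensity:
            observed.append((reward, propensity))
    return true_total / n_items, observed


true_mean, logged = simulate()
naive = naive_mean(logged)
snips = snips_mean(logged)
print(f"true={true_mean:.3f} naive={naive:.3f} reweighted={snips:.3f}")
```

In this simulation the naive average over logged interactions overstates the true mean reward because relevant items are over-represented in the log, while the reweighted estimate lands close to the true value. In practice the propensities are unknown and must be estimated or learned, which is the harder problem the paper targets.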
Related papers
- Rethinking Relation Extraction: Beyond Shortcuts to Generalization with a Debiased Benchmark [53.876493664396506]
Benchmarks are crucial for evaluating machine learning algorithm performance, facilitating comparison and identifying superior solutions. This paper addresses the issue of entity bias in relation extraction tasks, where models tend to rely on entity mentions rather than context. We propose a debiased relation extraction benchmark DREB that breaks the pseudo-correlation between entity mentions and relation types through entity replacement. To establish a new baseline on DREB, we introduce MixDebias, a debiasing method combining data-level and model training-level techniques.
arXiv Detail & Related papers (2025-01-02T17:01:06Z) - Pessimistic Evaluation [58.736490198613154]
We argue that evaluating information access systems assumes utilitarian values not aligned with traditions of information access based on equal access.
We advocate for pessimistic evaluation of information access systems focusing on worst case utility.
arXiv Detail & Related papers (2024-10-17T15:40:09Z) - Revisiting Reciprocal Recommender Systems: Metrics, Formulation, and Method [60.364834418531366]
We propose five new evaluation metrics that comprehensively and accurately assess the performance of RRS.
We formulate the RRS from a causal perspective, modeling recommendations as bilateral interventions.
We introduce a reranking strategy to maximize matching outcomes, as measured by the proposed metrics.
arXiv Detail & Related papers (2024-08-19T07:21:02Z) - Debiased Recommendation with Noisy Feedback [41.38490962524047]
We study intersectional threats to the unbiased learning of the prediction model when the collected data are both missing not at random (MNAR) and subject to outcome measurement errors (OME).
First, we design OME-EIB, OME-IPS, and OME-DR estimators, which largely extend the existing estimators to combat OME in real-world recommendation scenarios.
arXiv Detail & Related papers (2024-06-24T23:42:18Z) - Unbiased Learning to Rank with Biased Continuous Feedback [5.561943356123711]
Unbiased learning-to-rank (LTR) algorithms are verified to model the relative relevance accurately based on noisy feedback.
To provide personalized high-quality recommendation results, recommender systems need to model both categorical and continuous biased feedback.
We introduce the pairwise trust bias to separate the position bias, trust bias, and user relevance explicitly.
Experimental results on public benchmark datasets and internal live traffic of a large-scale recommender system at Tencent News show superior performance for continuous labels.
arXiv Detail & Related papers (2023-03-08T02:14:08Z) - Exploring validation metrics for offline model-based optimisation with
diffusion models [50.404829846182764]
In model-based optimisation (MBO) we are interested in using machine learning to design candidates that maximise some measure of reward with respect to a black box function called the (ground truth) oracle.
While an approximation to the ground truth oracle can be trained and used in place of it during model validation to measure the mean reward over generated candidates, the evaluation is approximate and vulnerable to adversarial examples.
This is encapsulated under our proposed evaluation framework which is also designed to measure extrapolation.
arXiv Detail & Related papers (2022-11-19T16:57:37Z) - Off-policy evaluation for learning-to-rank via interpolating the item-position model and the position-based model [83.83064559894989]
A critical need for industrial recommender systems is the ability to evaluate recommendation policies offline, before deploying them to production.
We develop a new estimator that mitigates the problems of the two most popular off-policy estimators for rankings.
In particular, the new estimator, called INTERPOL, addresses the bias of a potentially misspecified position-based model.
arXiv Detail & Related papers (2022-10-15T17:22:30Z) - Unbiased Pairwise Learning to Rank in Recommender Systems [4.058828240864671]
Unbiased learning to rank algorithms are appealing candidates and have already been applied in many applications with single categorical labels.
We propose a novel unbiased LTR algorithm to tackle the challenges, which innovatively models position bias in the pairwise fashion.
Experimental results on public benchmark datasets and internal live traffic show that the proposed method performs better for both categorical and continuous labels.
arXiv Detail & Related papers (2021-11-25T06:04:59Z) - Correcting the User Feedback-Loop Bias for Recommendation Systems [34.44834423714441]
We propose a systematic and dynamic way to correct user feedback-loop bias in recommendation systems.
Our method includes a deep-learning component to learn each user's dynamic rating history embedding.
We empirically validated the existence of such user feedback-loop bias in real world recommendation systems.
arXiv Detail & Related papers (2021-09-13T15:02:55Z) - Modeling Online Behavior in Recommender Systems: The Importance of Temporal Context [30.894950420437926]
We show how omitting temporal context when evaluating recommender system performance leads to false confidence.
We propose a training procedure to further embed the temporal context in existing models.
Results show that including our temporal objective can improve recall@20 by up to 20%.
arXiv Detail & Related papers (2020-09-19T19:36:43Z) - Providing reliability in Recommender Systems through Bernoulli Matrix Factorization [63.732639864601914]
This paper proposes Bernoulli Matrix Factorization (BeMF) to provide both prediction values and reliability values.
BeMF acts on model-based collaborative filtering rather than on memory-based filtering.
The more reliable a prediction is, the less liable it is to be wrong.
arXiv Detail & Related papers (2020-06-05T14:24:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.