Debias Can be Unreliable: Mitigating Bias Issue in Evaluating Debiasing Recommendation
- URL: http://arxiv.org/abs/2409.04810v1
- Date: Sat, 07 Sep 2024 12:42:58 GMT
- Title: Debias Can be Unreliable: Mitigating Bias Issue in Evaluating Debiasing Recommendation
- Authors: Chengbing Wang, Wentao Shi, Jizhi Zhang, Wenjie Wang, Hang Pan, Fuli Feng,
- Abstract summary: The traditional evaluation scheme is not suitable for randomly-exposed datasets.
We propose the Unbiased Recall Evaluation scheme, which adjusts the utilization of randomly-exposed datasets to unbiasedly estimate the true Recall performance.
- Score: 34.19561411584444
- Abstract: Recent work has improved recommendation models remarkably by equipping them with debiasing methods. Due to the unavailability of fully-exposed datasets, most existing approaches resort to randomly-exposed datasets as a proxy for evaluating debiased models, employing the traditional evaluation scheme to represent the recommendation performance. However, in this study, we reveal that the traditional evaluation scheme is not suitable for randomly-exposed datasets, leading to inconsistency between the Recall performance obtained using randomly-exposed datasets and that obtained using fully-exposed datasets. Such inconsistency indicates the potential unreliability of experimental conclusions on previous debiasing techniques and calls for unbiased Recall evaluation using randomly-exposed datasets. To bridge the gap, we propose the Unbiased Recall Evaluation (URE) scheme, which adjusts the utilization of randomly-exposed datasets to unbiasedly estimate the true Recall performance on fully-exposed datasets. We provide theoretical evidence to demonstrate the rationality of URE and perform extensive experiments on real-world datasets to validate its soundness.
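The abstract does not spell out URE's formula, so the following is only a toy simulation (with made-up data and a hypothetical score model) of the inconsistency URE targets: Recall@K computed by ranking only a small randomly-exposed slice of the catalog typically disagrees with Recall@K computed over the full catalog.

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, k = 1000, 10                      # catalog size, ranking cutoff

scores   = rng.normal(size=n_items)        # a model's scores for every item
relevant = rng.random(n_items) < 0.10      # ground-truth relevance (hypothetical)
exposed  = rng.random(n_items) < 0.05      # uniform random exposure of test items

def recall_at_k(scores, rel_mask, cand_mask, k):
    """Recall@K when ranking is restricted to a candidate set."""
    cand = np.flatnonzero(cand_mask)
    top = cand[np.argsort(-scores[cand])[:k]]   # top-K within the candidates
    hits = rel_mask[top].sum()
    total = rel_mask[cand].sum()
    return hits / total if total else 0.0

# Traditional scheme: rank only the randomly-exposed items.
recall_exposed = recall_at_k(scores, relevant, exposed, k)
# Fully-exposed Recall: rank the entire catalog (what URE aims to estimate
# from the randomly-exposed data alone).
recall_full = recall_at_k(scores, relevant, np.ones(n_items, dtype=bool), k)
```

With far fewer competing candidates in the exposed slice, the first number is usually inflated relative to the second, which is the inconsistency the paper reports.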
Related papers
- OffsetBias: Leveraging Debiased Data for Tuning Evaluators [1.5790747258969664]
We qualitatively identify six types of biases inherent in various judge models.
Fine-tuning on our dataset significantly enhances the robustness of judge models against biases.
arXiv Detail & Related papers (2024-07-09T05:16:22Z)
- Debiased Recommendation with Noisy Feedback [41.38490962524047]
We study intersectional threats to the unbiased learning of the prediction model from data missing not at random (MNAR) and outcome measurement errors (OME) in the collected data.
First, we design OME-EIB, OME-IPS, and OME-DR estimators, which largely extend the existing estimators to combat OME in real-world recommendation scenarios.
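The OME-* estimators themselves are not described in this summary; as background, the plain inverse-propensity-score (IPS) estimator that such designs extend, which reweights observed losses to correct for missing-not-at-random feedback, can be sketched as follows (variable names are illustrative):

```python
import numpy as np

def ips_estimate(loss, observed, propensity):
    """IPS estimate of the average loss over ALL user-item pairs.

    loss:       per-pair loss (only meaningful where observed is True)
    observed:   boolean mask of pairs with observed feedback
    propensity: P(pair is observed); must be positive for observed pairs
    """
    # Each observed loss is up-weighted by 1/propensity, so rarely-observed
    # pairs count more; the estimate is unbiased when propensities are exact.
    return float(np.sum(loss[observed] / propensity[observed]) / loss.size)
```

Doubly robust (DR) variants combine this reweighting with an error-imputation model so that the estimate stays unbiased if either the propensities or the imputations are accurate.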
arXiv Detail & Related papers (2024-06-24T23:42:18Z)
- Towards Real World Debiasing: A Fine-grained Analysis On Spurious Correlation [17.080528126651977]
We revisit biased distributions in existing benchmarks and real-world datasets, and propose a fine-grained framework for analyzing dataset bias.
Results show that existing methods are incapable of handling real-world biases.
We propose a simple yet effective approach that can be easily applied to existing debiasing methods, named Debias in Destruction (DiD).
arXiv Detail & Related papers (2024-05-24T06:06:41Z)
- Balancing Unobserved Confounding with a Few Unbiased Ratings in Debiased Recommendations [4.960902915238239]
We propose a theoretically guaranteed model-agnostic balancing approach that can be applied to any existing debiasing method.
The proposed approach makes full use of unbiased data by alternatively correcting model parameters learned with biased data, and adaptively learning balance coefficients of biased samples for further debiasing.
arXiv Detail & Related papers (2023-04-17T08:56:55Z)
- Uncertainty-Aware Instance Reweighting for Off-Policy Learning [63.31923483172859]
We propose an Uncertainty-aware Inverse Propensity Score estimator (UIPS) for improved off-policy learning.
Experiment results on synthetic and three real-world recommendation datasets demonstrate the advantageous sample efficiency of the proposed UIPS estimator.
arXiv Detail & Related papers (2023-03-11T11:42:26Z)
- Rethinking Missing Data: Aleatoric Uncertainty-Aware Recommendation [59.500347564280204]
We propose a new Aleatoric Uncertainty-aware Recommendation (AUR) framework.
AUR consists of a new uncertainty estimator along with a normal recommender model.
As the chance of mislabeling reflects the potential of a pair, AUR makes recommendations according to the uncertainty.
arXiv Detail & Related papers (2022-09-22T04:32:51Z)
- AutoDebias: Learning to Debias for Recommendation [43.84313723394282]
We propose AutoDebias, which leverages another (small) set of uniform data to optimize the debiasing parameters.
We derive the generalization bound for AutoDebias and prove its ability to acquire the appropriate debiasing strategy.
arXiv Detail & Related papers (2021-05-10T08:03:48Z)
- Reliable Evaluations for Natural Language Inference based on a Unified Cross-dataset Benchmark [54.782397511033345]
Crowd-sourced Natural Language Inference (NLI) datasets may suffer from significant biases like annotation artifacts.
We present a new unified cross-dataset benchmark with 14 NLI datasets and re-evaluate 9 widely-used neural network-based NLI models.
Our proposed evaluation scheme and experimental baselines could provide a basis to inspire future reliable NLI research.
arXiv Detail & Related papers (2020-10-15T11:50:12Z)
- Towards Debiasing NLU Models from Unknown Biases [70.31427277842239]
NLU models often exploit biases to achieve high dataset-specific performance without properly learning the intended task.
We present a self-debiasing framework that prevents models from mainly utilizing biases without knowing them in advance.
arXiv Detail & Related papers (2020-09-25T15:49:39Z)
- Identifying Statistical Bias in Dataset Replication [102.92137353938388]
We study a replication of the ImageNet dataset on which models exhibit a significant (11-14%) drop in accuracy.
After correcting for the identified statistical bias, only an estimated $3.6\% \pm 1.5\%$ of the original $11.7\% \pm 1.0\%$ accuracy drop remains unaccounted for.
arXiv Detail & Related papers (2020-05-19T17:48:32Z)
- Adversarial Filters of Dataset Biases [96.090959788952]
Large neural models have demonstrated human-level performance on language and vision benchmarks.
Their performance degrades considerably on adversarial or out-of-distribution samples.
We propose AFLite, which adversarially filters such dataset biases.
arXiv Detail & Related papers (2020-02-10T21:59:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.