Fixing Data That Hurts Performance: Cascading LLMs to Relabel Hard Negatives for Robust Information Retrieval
- URL: http://arxiv.org/abs/2505.16967v1
- Date: Thu, 22 May 2025 17:47:57 GMT
- Title: Fixing Data That Hurts Performance: Cascading LLMs to Relabel Hard Negatives for Robust Information Retrieval
- Authors: Nandan Thakur, Crystina Zhang, Xueguang Ma, Jimmy Lin
- Abstract summary: We prune 8 out of 15 datasets from the BGE collection and increase nDCG@10 on BEIR by 1.0 point. We propose a simple, cost-effective approach using cascading LLM prompts to identify and relabel hard negatives. Similar gains are observed for rerankers fine-tuned on the relabeled data, such as Qwen2.5-3B on BEIR.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Training robust retrieval and reranker models typically relies on large-scale retrieval datasets; for example, the BGE collection contains 1.6 million query-passage pairs sourced from various data sources. However, we find that certain datasets can negatively impact model effectiveness -- pruning 8 out of 15 datasets from the BGE collection reduces the training set size by 2.35$\times$ and increases nDCG@10 on BEIR by 1.0 point. This motivates a deeper examination of training data quality, with a particular focus on "false negatives", where relevant passages are incorrectly labeled as irrelevant. We propose a simple, cost-effective approach using cascading LLM prompts to identify and relabel hard negatives. Experimental results show that relabeling false negatives with true positives improves both E5 (base) and Qwen2.5-7B retrieval models by 0.7-1.4 nDCG@10 on BEIR and by 1.7-1.8 nDCG@10 on zero-shot AIR-Bench evaluation. Similar gains are observed for rerankers fine-tuned on the relabeled data, such as Qwen2.5-3B on BEIR. The reliability of the cascading design is further supported by human annotation results, where we find judgment by GPT-4o shows much higher agreement with humans than GPT-4o-mini.
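The abstract describes a cascade in which a cheap LLM judge screens hard negatives and only uncertain or flagged pairs are escalated to a stronger, more human-aligned judge (GPT-4o in the paper). A minimal sketch of that control flow, using hypothetical `cheap_judge`/`strong_judge` callables and a `threshold` parameter that are not the authors' exact prompts or settings:

```python
def cascade_relabel(query, passage, cheap_judge, strong_judge, threshold=0.5):
    """Relabel one hard negative via a two-stage LLM cascade (sketch).

    cheap_judge / strong_judge: callables returning a relevance score
    in [0, 1] (e.g. a cheap model like GPT-4o-mini first, then GPT-4o).
    """
    cheap_score = cheap_judge(query, passage)
    if cheap_score < threshold:
        # Cheap judge agrees the passage is irrelevant: keep the negative
        # label without paying for the stronger model.
        return "negative"
    # Escalate only flagged pairs to the costlier, more reliable judge.
    strong_score = strong_judge(query, passage)
    return "positive" if strong_score >= threshold else "negative"
```

The design choice is purely economic: the strong model is invoked only on the small fraction of pairs the cheap model considers possibly relevant, so relabeling a large collection stays affordable.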
Related papers
- Harnessing Negative Signals: Reinforcement Distillation from Teacher Data for LLM Reasoning [21.70706473875226]
We propose Reinforcement Distillation (REDI), a two-stage framework. Stage 1 learns from positive traces via Supervised Fine-Tuning (SFT). Stage 2 further refines the model using both positive and negative traces through our proposed REDI objective. Our empirical evaluations demonstrate REDI's superiority over baseline Rejection Sampling SFT or SFT combined with DPO/SimPO on mathematical reasoning tasks.
arXiv Detail & Related papers (2025-05-30T17:47:17Z) - Evaluating the Impact of Data Cleaning on the Quality of Generated Pull Request Descriptions [2.2134505920972547]
Pull Requests (PRs) are central to collaborative coding. Many PRs are incomplete, uninformative, or have out-of-context content. This study examines the prevalence of "noisy" PRs and evaluates their impact on description generation models.
arXiv Detail & Related papers (2025-05-02T08:58:42Z) - SoTA with Less: MCTS-Guided Sample Selection for Data-Efficient Visual Reasoning Self-Improvement [100.85923086072204]
ThinkLite-VL improves the average performance of Qwen2.5-VL-7B-Instruct by 7%, using only 11k training samples with no knowledge distillation. Our code, data, and model are available at https://github.com/si0wang/ThinkLite-VL.
arXiv Detail & Related papers (2025-04-10T17:49:05Z) - Segmentation Dataset for Reinforced Concrete Construction [4.32009010195029]
This paper provides a dataset of 14,805 RGB images with segmentation labels for autonomous robotic inspection of reinforced concrete defects. YOLOv8L-seg performs best, achieving a validation mIOU score of up to 0.59. Lack of publicly available data is identified as a significant contributor to false negatives.
arXiv Detail & Related papers (2024-07-12T15:53:15Z) - SAP: Corrective Machine Unlearning with Scaled Activation Projection for Label Noise Robustness [9.080678336379528]
We introduce Scaled Activation Projection (SAP), a novel SVD-based corrective machine unlearning algorithm. SAP mitigates label noise by identifying a small subset of trusted samples using cross-entropy loss. We observe generalization improvements of 2.31% for a Vision Transformer model trained on naturally corrupted Clothing1M.
arXiv Detail & Related papers (2024-03-13T15:32:08Z) - CLIP the Bias: How Useful is Balancing Data in Multimodal Learning? [72.19502317793133]
We study the effectiveness of data-balancing for mitigating biases in contrastive language-image pretraining (CLIP).
We present a novel algorithm, called Multi-Modal Moment Matching (M4), designed to reduce both representation and association biases.
arXiv Detail & Related papers (2024-03-07T14:43:17Z) - gSASRec: Reducing Overconfidence in Sequential Recommendation Trained with Negative Sampling [67.71952251641545]
We show that models trained with negative sampling tend to overestimate the probabilities of positive interactions.
We propose a novel Generalised Binary Cross-Entropy Loss function (gBCE) and theoretically prove that it can mitigate overconfidence.
We show through detailed experiments on three datasets that gSASRec does not exhibit the overconfidence problem.
arXiv Detail & Related papers (2023-08-14T14:56:40Z) - Rethinking InfoNCE: How Many Negative Samples Do You Need? [54.146208195806636]
We study how many negative samples are optimal for InfoNCE in different scenarios via a semi-quantitative theoretical framework.
We estimate the optimal negative sampling ratio using the $K$ value that maximizes the training effectiveness function.
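The InfoNCE analysis above concerns the optimal number of negatives; for reference, a minimal numpy sketch of the InfoNCE loss for a single query, assuming cosine similarity and a temperature `tau` (an illustrative setup, not the paper's exact framework):

```python
import numpy as np

def info_nce(query, positive, negatives, tau=0.07):
    """InfoNCE loss for one query: negative log-softmax probability of
    the positive among {positive} + negatives, scaled by temperature."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    # Logits: positive similarity first, then all negatives.
    logits = np.array([cos(query, positive)] +
                      [cos(query, n) for n in negatives]) / tau
    logits -= logits.max()  # numerical stability before exponentiating
    probs = np.exp(logits) / np.exp(logits).sum()
    return -float(np.log(probs[0]))
```

Adding more negatives enlarges the softmax denominator, so the loss (and its gradient signal) grows with the sampling ratio, which is the quantity the paper's training effectiveness function trades off.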
arXiv Detail & Related papers (2021-05-27T08:38:29Z) - TACRED Revisited: A Thorough Evaluation of the TACRED Relation Extraction Task [80.38130122127882]
TACRED is one of the largest, most widely used crowdsourced datasets in Relation Extraction (RE).
In this paper, we investigate the questions: Have we reached a performance ceiling or is there still room for improvement?
We find that label errors account for 8% absolute F1 test error, and that more than 50% of the examples need to be relabeled.
arXiv Detail & Related papers (2020-04-30T15:07:37Z) - Reinforced Negative Sampling over Knowledge Graph for Recommendation [106.07209348727564]
We develop a new negative sampling model, Knowledge Graph Policy Network (kgPolicy), which works as a reinforcement learning agent to explore high-quality negatives.
kgPolicy navigates from the target positive interaction, adaptively receives knowledge-aware negative signals, and ultimately yields a potential negative item to train the recommender.
arXiv Detail & Related papers (2020-03-12T12:44:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.