Fixing Data That Hurts Performance: Cascading LLMs to Relabel Hard Negatives for Robust Information Retrieval
- URL: http://arxiv.org/abs/2505.16967v1
- Date: Thu, 22 May 2025 17:47:57 GMT
- Title: Fixing Data That Hurts Performance: Cascading LLMs to Relabel Hard Negatives for Robust Information Retrieval
- Authors: Nandan Thakur, Crystina Zhang, Xueguang Ma, Jimmy Lin
- Abstract summary: We prune 8 out of 15 datasets from the BGE collection and increase nDCG@10 on BEIR by 1.0 point. We propose a simple, cost-effective approach using cascading LLM prompts to identify and relabel hard negatives. Similar gains are observed for rerankers fine-tuned on the relabeled data, such as Qwen2.5-3B on BEIR.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Training robust retrieval and reranker models typically relies on large-scale retrieval datasets; for example, the BGE collection contains 1.6 million query-passage pairs sourced from various data sources. However, we find that certain datasets can negatively impact model effectiveness -- pruning 8 out of 15 datasets from the BGE collection reduces the training set size by 2.35$\times$ and increases nDCG@10 on BEIR by 1.0 point. This motivates a deeper examination of training data quality, with a particular focus on "false negatives", where relevant passages are incorrectly labeled as irrelevant. We propose a simple, cost-effective approach using cascading LLM prompts to identify and relabel hard negatives. Experimental results show that relabeling false negatives with true positives improves both E5 (base) and Qwen2.5-7B retrieval models by 0.7-1.4 nDCG@10 on BEIR and by 1.7-1.8 nDCG@10 on zero-shot AIR-Bench evaluation. Similar gains are observed for rerankers fine-tuned on the relabeled data, such as Qwen2.5-3B on BEIR. The reliability of the cascading design is further supported by human annotation results, where we find judgment by GPT-4o shows much higher agreement with humans than GPT-4o-mini.
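The abstract describes a cascade in which a cheap LLM judge screens hard negatives and only uncertain or flagged pairs are escalated to a stronger, more human-aligned judge (GPT-4o in the paper). A minimal sketch of that control flow, using hypothetical `cheap_judge`/`strong_judge` callables and a `threshold` parameter that are not the authors' exact prompts or settings:

```python
def cascade_relabel(query, passage, cheap_judge, strong_judge, threshold=0.5):
    """Relabel one hard negative via a two-stage LLM cascade (sketch).

    cheap_judge / strong_judge: callables returning a relevance score
    in [0, 1] (e.g. a cheap model like GPT-4o-mini first, then GPT-4o).
    """
    cheap_score = cheap_judge(query, passage)
    if cheap_score < threshold:
        # Cheap judge agrees the passage is irrelevant: keep the negative
        # label without paying for the stronger model.
        return "negative"
    # Escalate only flagged pairs to the costlier, more reliable judge.
    strong_score = strong_judge(query, passage)
    return "positive" if strong_score >= threshold else "negative"
```

The design choice is purely economic: the strong model is invoked only on the small fraction of pairs the cheap model considers possibly relevant, so relabeling a large collection stays affordable.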
Related papers
- Harnessing Negative Signals: Reinforcement Distillation from Teacher Data for LLM Reasoning [21.70706473875226]
We propose Reinforcement Distillation (REDI), a two-stage framework. Stage 1 learns from positive traces via Supervised Fine-Tuning (SFT). Stage 2 further refines the model using both positive and negative traces through our proposed REDI objective. Our empirical evaluations demonstrate REDI's superiority over baseline Rejection Sampling SFT or SFT combined with DPO/SimPO on mathematical reasoning tasks.
arXiv Detail & Related papers (2025-05-30T17:47:17Z) - Evaluating the Impact of Data Cleaning on the Quality of Generated Pull Request Descriptions [2.2134505920972547]
Pull Requests (PRs) are central to collaborative coding. Many PRs are incomplete, uninformative, or have out-of-context content. This study examines the prevalence of "noisy" PRs and evaluates their impact on description generation models.
arXiv Detail & Related papers (2025-05-02T08:58:42Z) - SoTA with Less: MCTS-Guided Sample Selection for Data-Efficient Visual Reasoning Self-Improvement [100.85923086072204]
ThinkLite-VL improves the average performance of Qwen2.5-VL-7B-Instruct by 7%, using only 11k training samples with no knowledge distillation. Our code, data, and model are available at https://github.com/si0wang/ThinkLite-VL.
arXiv Detail & Related papers (2025-04-10T17:49:05Z) - Segmentation Dataset for Reinforced Concrete Construction [4.32009010195029]
This paper provides a dataset of 14,805 RGB images with segmentation labels for autonomous robotic inspection of reinforced concrete defects. YOLOv8L-seg performs best, achieving a validation mIOU score of up to 0.59. Lack of publicly available data is identified as a significant contributor to false negatives.
arXiv Detail & Related papers (2024-07-12T15:53:15Z) - SAP: Corrective Machine Unlearning with Scaled Activation Projection for Label Noise Robustness [9.080678336379528]
We introduce Scaled Activation Projection (SAP), a novel SVD-based corrective machine unlearning algorithm. SAP mitigates label noise by identifying a small subset of trusted samples using cross-entropy loss. We observe generalization improvements of 2.31% for a Vision Transformer model trained on naturally corrupted Clothing1M.
arXiv Detail & Related papers (2024-03-13T15:32:08Z) - CLIP the Bias: How Useful is Balancing Data in Multimodal Learning? [72.19502317793133]
We study the effectiveness of data-balancing for mitigating biases in contrastive language-image pretraining (CLIP).
We present a novel algorithm, called Multi-Modal Moment Matching (M4), designed to reduce both representation and association biases.
arXiv Detail & Related papers (2024-03-07T14:43:17Z) - gSASRec: Reducing Overconfidence in Sequential Recommendation Trained with Negative Sampling [67.71952251641545]
We show that models trained with negative sampling tend to overestimate the probabilities of positive interactions.
We propose a novel Generalised Binary Cross-Entropy Loss function (gBCE) and theoretically prove that it can mitigate overconfidence.
We show through detailed experiments on three datasets that gSASRec does not exhibit the overconfidence problem.
arXiv Detail & Related papers (2023-08-14T14:56:40Z) - Rethinking InfoNCE: How Many Negative Samples Do You Need? [54.146208195806636]
We study how many negative samples are optimal for InfoNCE in different scenarios via a semi-quantitative theoretical framework.
We estimate the optimal negative sampling ratio using the $K$ value that maximizes the training effectiveness function.
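The InfoNCE analysis above concerns the optimal number of negatives; for reference, a minimal numpy sketch of the InfoNCE loss for a single query, assuming cosine similarity and a temperature `tau` (an illustrative setup, not the paper's exact framework):

```python
import numpy as np

def info_nce(query, positive, negatives, tau=0.07):
    """InfoNCE loss for one query: negative log-softmax probability of
    the positive among {positive} + negatives, scaled by temperature."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    # Logits: positive similarity first, then all negatives.
    logits = np.array([cos(query, positive)] +
                      [cos(query, n) for n in negatives]) / tau
    logits -= logits.max()  # numerical stability before exponentiating
    probs = np.exp(logits) / np.exp(logits).sum()
    return -float(np.log(probs[0]))
```

Adding more negatives enlarges the softmax denominator, so the loss (and its gradient signal) grows with the sampling ratio, which is the quantity the paper's training effectiveness function trades off.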
arXiv Detail & Related papers (2021-05-27T08:38:29Z) - TACRED Revisited: A Thorough Evaluation of the TACRED Relation Extraction Task [80.38130122127882]
TACRED is one of the largest, most widely used crowdsourced datasets in Relation Extraction (RE).
In this paper, we investigate the questions: Have we reached a performance ceiling or is there still room for improvement?
We find that label errors account for 8% absolute F1 test error, and that more than 50% of the examples need to be relabeled.
arXiv Detail & Related papers (2020-04-30T15:07:37Z) - Reinforced Negative Sampling over Knowledge Graph for Recommendation [106.07209348727564]
We develop a new negative sampling model, Knowledge Graph Policy Network (kgPolicy), which works as a reinforcement learning agent to explore high-quality negatives.
kgPolicy navigates from the target positive interaction, adaptively receives knowledge-aware negative signals, and ultimately yields a potential negative item to train the recommender.
arXiv Detail & Related papers (2020-03-12T12:44:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.