Related papers: Are LLMs Better than Reported? Detecting Label Errors and Mitigating Their Effect on Model Performance

Are LLMs Better than Reported? Detecting Label Errors and Mitigating Their Effect on Model Performance

URL: http://arxiv.org/abs/2410.18889v2
Date: Fri, 12 Sep 2025 17:18:27 GMT
Title: Are LLMs Better than Reported? Detecting Label Errors and Mitigating Their Effect on Model Performance
Authors: Omer Nahum, Nitay Calderon, Orgad Keller, Idan Szpektor, Roi Reichart,
Abstract summary: Large language models (LLMs) offer new opportunities to enhance the annotation process.<n>We compare expert, crowd-sourced, and LLM-based annotations in terms of the agreement, label quality, and efficiency.<n>Our findings reveal a substantial number of label errors, which, when corrected, a significant upward shift in reported model performance.
Score: 28.524573212179124
License: http://creativecommons.org/licenses/by/4.0/
Abstract: NLP benchmarks rely on standardized datasets for training and evaluating models and are crucial for advancing the field. Traditionally, expert annotations ensure high-quality labels; however, the cost of expert annotation does not scale well with the growing demand for larger datasets required by modern models. While crowd-sourcing provides a more scalable solution, it often comes at the expense of annotation precision and consistency. Recent advancements in large language models (LLMs) offer new opportunities to enhance the annotation process, particularly for detecting label errors in existing datasets. In this work, we consider the recent approach of LLM-as-a-judge, leveraging an ensemble of LLMs to flag potentially mislabeled examples. We conduct a case study on four factual consistency datasets from the TRUE benchmark, spanning diverse NLP tasks, and on SummEval, which uses Likert-scale ratings of summary quality across multiple dimensions. We empirically analyze the labeling quality of existing datasets and compare expert, crowd-sourced, and LLM-based annotations in terms of the agreement, label quality, and efficiency, demonstrating the strengths and limitations of each annotation method. Our findings reveal a substantial number of label errors, which, when corrected, induce a significant upward shift in reported model performance. This suggests that many of the LLMs' so-called mistakes are due to label errors rather than genuine model failures. Additionally, we discuss the implications of mislabeled data and propose methods to mitigate them in training to improve performance.

Related papers

Multi-Agent VLMs Guided Self-Training with PNU Loss for Low-Resource Offensive Content Detection [32.68131638705225]
We propose a self-training framework that leverages abundant unlabeled data through collaborative pseudo-labeling.<n>Our method iteratively assigns pseudo-labels to unlabeled instances with the support of Multi-Agent Vision-Language Models.<n>Experiments on benchmark datasets demonstrate that our framework substantially outperforms baselines under limited supervision.
arXiv Detail & Related papers (2025-11-14T08:03:35Z)
Human-Corrected Labels Learning: Enhancing Labels Quality via Human Correction of VLMs Discrepancies [6.58446551781724]
We propose Human-Corrected Labels (HCLs), a novel setting that efficient human correction for VLM-generated noisy labels.<n>HCLs deploys human correction only for instances with VLM discrepancies, achieving both higher-quality annotations and reduced labor costs.<n>Our approach achieves superior classification performance and is robust to label noise, validating the effectiveness of HCL in practical weak supervision scenarios.
arXiv Detail & Related papers (2025-11-12T07:38:19Z)
EVADE: LLM-Based Explanation Generation and Validation for Error Detection in NLI [36.91800117379075]
EVADE is a framework for generating and validating explanations to detect errors using large language models.<n>HLV arises when multiple labels are valid for the same instance, making it difficult to separate annotation errors from plausible variation.
arXiv Detail & Related papers (2025-11-12T03:49:05Z)
Verifying the Verifiers: Unveiling Pitfalls and Potentials in Fact Verifiers [59.168391398830515]
We evaluate 12 pre-trained LLMs and one specialized fact-verifier, using a collection of examples from 14 fact-checking benchmarks.<n>We highlight the importance of addressing annotation errors and ambiguity in datasets.<n> frontier LLMs with few-shot in-context examples, often overlooked in previous works, achieve top-tier performance.
arXiv Detail & Related papers (2025-06-16T10:32:10Z)
Hey, That's My Data! Label-Only Dataset Inference in Large Language Models [63.35066172530291]
CatShift is a label-only dataset-inference framework.<n>It capitalizes on catastrophic forgetting: the tendency of an LLM to overwrite previously learned knowledge when exposed to new data.
arXiv Detail & Related papers (2025-06-06T13:02:59Z)
Prompt Candidates, then Distill: A Teacher-Student Framework for LLM-driven Data Annotation [35.1208076670736]
We propose a novel candidate annotation paradigm wherein large language models are encouraged to output all possible labels when incurring uncertainty.<n>To ensure unique labels are provided for downstream tasks, we develop a teacher-student framework CanDist that distills candidate annotations with a Small Language Model.
arXiv Detail & Related papers (2025-06-04T11:42:37Z)
LLMs as Data Annotators: How Close Are We to Human Performance [47.61698665650761]
Manual annotation of data is labor-intensive, time-consuming, and costly. In-context learning (ICL) in which some examples related to the task are given in the prompt can lead to inefficiencies and suboptimal model performance. This paper presents experiments comparing several LLMs, considering different embedding models, across various datasets for the Named Entity Recognition (NER) task.
arXiv Detail & Related papers (2025-04-21T11:11:07Z)
Teaching Your Models to Understand Code via Focal Preference Alignment [70.71693365502212]
In existing approaches, a set of n candidate solutions is evaluated based on test case success rates.<n>Because this approach aligns entire failing code blocks rather than pinpointing specific errors, it lacks the granularity necessary to capture meaningful error-correction relationships.<n>We propose Target-DPO, a new preference alignment framework that mimics human iterative debug to refine Code LLMs.
arXiv Detail & Related papers (2025-03-04T16:56:34Z)
Forget What You Know about LLMs Evaluations - LLMs are Like a Chameleon [11.753349115726952]
Large language models (LLMs) often appear to excel on public benchmarks, but these high scores may mask an overreliance on dataset-specific surface cues. We introduce the Chameleon Benchmark Overfit Detector (C-BOD), a meta-evaluation framework that distorts benchmark prompts. By rephrasing inputs while preserving semantic content and labels, C-BOD exposes whether a model's performance is driven by memorized patterns.
arXiv Detail & Related papers (2025-02-11T10:43:36Z)
Learning with Less: Knowledge Distillation from Large Language Models via Unlabeled Data [54.934578742209716]
In real-world NLP applications, Large Language Models (LLMs) offer promising solutions due to their extensive training on vast datasets. LLKD is an adaptive sample selection method that incorporates signals from both the teacher and student. Our comprehensive experiments show that LLKD achieves superior performance across various datasets with higher data efficiency.
arXiv Detail & Related papers (2024-11-12T18:57:59Z)
Zero-to-Strong Generalization: Eliciting Strong Capabilities of Large Language Models Iteratively without Gold Labels [75.77877889764073]
Large Language Models (LLMs) have demonstrated remarkable performance through supervised fine-tuning or in-context learning using gold labels. This study explores whether solely utilizing unlabeled data can elicit strong model capabilities. We propose a new paradigm termed zero-to-strong generalization.
arXiv Detail & Related papers (2024-09-19T02:59:44Z)
Self-training Large Language Models through Knowledge Detection [26.831873737733737]
Large language models (LLMs) often necessitate extensive labeled datasets and training compute to achieve impressive performance across downstream tasks. This paper explores a self-training paradigm, where the LLM autonomously curates its own labels and selectively trains on unknown data samples. Empirical evaluations demonstrate significant improvements in reducing hallucination in generation across multiple subjects.
arXiv Detail & Related papers (2024-06-17T07:25:09Z)
Uncertainty Aware Learning for Language Model Alignment [97.36361196793929]
We propose uncertainty-aware learning (UAL) to improve the model alignment of different task scenarios. We implement UAL in a simple fashion -- adaptively setting the label smoothing value of training according to the uncertainty of individual samples. Experiments on widely used benchmarks demonstrate that our UAL significantly and consistently outperforms standard supervised fine-tuning.
arXiv Detail & Related papers (2024-06-07T11:37:45Z)
Advancing Anomaly Detection: Non-Semantic Financial Data Encoding with LLMs [49.57641083688934]
We introduce a novel approach to anomaly detection in financial data using Large Language Models (LLMs) embeddings. Our experiments demonstrate that LLMs contribute valuable information to anomaly detection as our models outperform the baselines.
arXiv Detail & Related papers (2024-06-05T20:19:09Z)
From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning [52.257422715393574]
We introduce a self-guided methodology for Large Language Models (LLMs) to autonomously discern and select cherry samples from open-source datasets. Our key innovation, the Instruction-Following Difficulty (IFD) metric, emerges as a pivotal metric to identify discrepancies between a model's expected responses and its intrinsic generation capability.
arXiv Detail & Related papers (2023-08-23T09:45:29Z)
AQuA: A Benchmarking Tool for Label Quality Assessment [16.83510474053401]
Recent studies have found datasets widely used to train and evaluate machine learning models to have pervasive labeling errors. We propose a benchmarking environment AQuA to rigorously evaluate methods that enable machine learning in the presence of label noise.
arXiv Detail & Related papers (2023-06-15T19:42:11Z)
CELDA: Leveraging Black-box Language Model as Enhanced Classifier without Labels [14.285609493077965]
Clustering-enhanced Linear Discriminative Analysis, a novel approach that improves the text classification accuracy with a very weak-supervision signal. Our framework draws a precise decision boundary without accessing weights or gradients of the LM model or data labels.
arXiv Detail & Related papers (2023-06-05T08:35:31Z)
Ground Truth Inference for Weakly Supervised Entity Matching [76.6732856489872]
We propose a simple but powerful labeling model for weak supervision tasks. We then tailor the labeling model specifically to the task of entity matching. We show that our labeling model results in a 9% higher F1 score on average than the best existing method.
arXiv Detail & Related papers (2022-11-13T17:57:07Z)

This list is automatically generated from the titles and abstracts of the papers in this site.