Related papers: Are LLMs Better than Reported? Detecting Label Errors and Mitigating Their Effect on Model Performance

Are LLMs Better than Reported? Detecting Label Errors and Mitigating Their Effect on Model Performance

URL: http://arxiv.org/abs/2410.18889v1
Date: Thu, 24 Oct 2024 16:27:03 GMT
Title: Are LLMs Better than Reported? Detecting Label Errors and Mitigating Their Effect on Model Performance
Authors: Omer Nahum, Nitay Calderon, Orgad Keller, Idan Szpektor, Roi Reichart,
Abstract summary: Large language models (LLMs) offer new opportunities to enhance the annotation process. We compare expert, crowd-sourced, and our LLM-based annotations in terms of agreement, label quality, and efficiency. Our findings reveal a substantial number of label errors, which, when corrected, induce a significant upward shift in reported model performance.
Score: 21.926934384262594
License: http://creativecommons.org/licenses/by/4.0/
Abstract: NLP benchmarks rely on standardized datasets for training and evaluating models and are crucial for advancing the field. Traditionally, expert annotations ensure high-quality labels; however, the cost of expert annotation does not scale well with the growing demand for larger datasets required by modern models. While crowd-sourcing provides a more scalable solution, it often comes at the expense of annotation precision and consistency. Recent advancements in large language models (LLMs) offer new opportunities to enhance the annotation process, particularly for detecting label errors in existing datasets. In this work, we consider the recent approach of LLM-as-a-judge, leveraging an ensemble of LLMs to flag potentially mislabeled examples. Through a case study of four datasets from the TRUE benchmark, covering different tasks and domains, we empirically analyze the labeling quality of existing datasets, and compare expert, crowd-sourced, and our LLM-based annotations in terms of agreement, label quality, and efficiency, demonstrating the strengths and limitations of each annotation method. Our findings reveal a substantial number of label errors, which, when corrected, induce a significant upward shift in reported model performance. This suggests that many of the LLMs so-called mistakes are due to label errors rather than genuine model failures. Additionally, we discuss the implications of mislabeled data and propose methods to mitigate them in training to improve model performance.

Related papers

LLMs as Data Annotators: How Close Are We to Human Performance [47.61698665650761]
Manual annotation of data is labor-intensive, time-consuming, and costly. In-context learning (ICL) in which some examples related to the task are given in the prompt can lead to inefficiencies and suboptimal model performance. This paper presents experiments comparing several LLMs, considering different embedding models, across various datasets for the Named Entity Recognition (NER) task.
arXiv Detail & Related papers (2025-04-21T11:11:07Z)
Forget What You Know about LLMs Evaluations - LLMs are Like a Chameleon [11.753349115726952]
Large language models (LLMs) often appear to excel on public benchmarks, but these high scores may mask an overreliance on dataset-specific surface cues. We introduce the Chameleon Benchmark Overfit Detector (C-BOD), a meta-evaluation framework that distorts benchmark prompts. By rephrasing inputs while preserving semantic content and labels, C-BOD exposes whether a model's performance is driven by memorized patterns.
arXiv Detail & Related papers (2025-02-11T10:43:36Z)
Learning with Less: Knowledge Distillation from Large Language Models via Unlabeled Data [54.934578742209716]
In real-world NLP applications, Large Language Models (LLMs) offer promising solutions due to their extensive training on vast datasets. LLKD is an adaptive sample selection method that incorporates signals from both the teacher and student. Our comprehensive experiments show that LLKD achieves superior performance across various datasets with higher data efficiency.
arXiv Detail & Related papers (2024-11-12T18:57:59Z)
Zero-to-Strong Generalization: Eliciting Strong Capabilities of Large Language Models Iteratively without Gold Labels [75.77877889764073]
Large Language Models (LLMs) have demonstrated remarkable performance through supervised fine-tuning or in-context learning using gold labels. This study explores whether solely utilizing unlabeled data can elicit strong model capabilities. We propose a new paradigm termed zero-to-strong generalization.
arXiv Detail & Related papers (2024-09-19T02:59:44Z)
Self-training Large Language Models through Knowledge Detection [26.831873737733737]
Large language models (LLMs) often necessitate extensive labeled datasets and training compute to achieve impressive performance across downstream tasks. This paper explores a self-training paradigm, where the LLM autonomously curates its own labels and selectively trains on unknown data samples. Empirical evaluations demonstrate significant improvements in reducing hallucination in generation across multiple subjects.
arXiv Detail & Related papers (2024-06-17T07:25:09Z)
Uncertainty Aware Learning for Language Model Alignment [97.36361196793929]
We propose uncertainty-aware learning (UAL) to improve the model alignment of different task scenarios. We implement UAL in a simple fashion -- adaptively setting the label smoothing value of training according to the uncertainty of individual samples. Experiments on widely used benchmarks demonstrate that our UAL significantly and consistently outperforms standard supervised fine-tuning.
arXiv Detail & Related papers (2024-06-07T11:37:45Z)
Advancing Anomaly Detection: Non-Semantic Financial Data Encoding with LLMs [49.57641083688934]
We introduce a novel approach to anomaly detection in financial data using Large Language Models (LLMs) embeddings. Our experiments demonstrate that LLMs contribute valuable information to anomaly detection as our models outperform the baselines.
arXiv Detail & Related papers (2024-06-05T20:19:09Z)
From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning [52.257422715393574]
We introduce a self-guided methodology for Large Language Models (LLMs) to autonomously discern and select cherry samples from open-source datasets. Our key innovation, the Instruction-Following Difficulty (IFD) metric, emerges as a pivotal metric to identify discrepancies between a model's expected responses and its intrinsic generation capability.
arXiv Detail & Related papers (2023-08-23T09:45:29Z)
AQuA: A Benchmarking Tool for Label Quality Assessment [16.83510474053401]
Recent studies have found datasets widely used to train and evaluate machine learning models to have pervasive labeling errors. We propose a benchmarking environment AQuA to rigorously evaluate methods that enable machine learning in the presence of label noise.
arXiv Detail & Related papers (2023-06-15T19:42:11Z)
CELDA: Leveraging Black-box Language Model as Enhanced Classifier without Labels [14.285609493077965]
Clustering-enhanced Linear Discriminative Analysis, a novel approach that improves the text classification accuracy with a very weak-supervision signal. Our framework draws a precise decision boundary without accessing weights or gradients of the LM model or data labels.
arXiv Detail & Related papers (2023-06-05T08:35:31Z)
Ground Truth Inference for Weakly Supervised Entity Matching [76.6732856489872]
We propose a simple but powerful labeling model for weak supervision tasks. We then tailor the labeling model specifically to the task of entity matching. We show that our labeling model results in a 9% higher F1 score on average than the best existing method.
arXiv Detail & Related papers (2022-11-13T17:57:07Z)

This list is automatically generated from the titles and abstracts of the papers in this site.