Enhancing LLM-Based Data Annotation with Error Decomposition
- URL: http://arxiv.org/abs/2601.11920v1
- Date: Sat, 17 Jan 2026 05:43:17 GMT
- Title: Enhancing LLM-Based Data Annotation with Error Decomposition
- Authors: Zhen Xu, Vedant Khatri, Yijun Dai, Xiner Liu, Siyan Li, Xuanming Zhang, Renzhe Yu
- Abstract summary: Large language models offer a scalable alternative to human coding for data annotation tasks. Their performance on subjective annotation tasks is less consistent and more prone to errors. We propose a diagnostic evaluation paradigm that incorporates a human-in-the-loop step to separate task-inherent ambiguity from model-driven inaccuracies.
- Score: 6.6544828402388445
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models offer a scalable alternative to human coding for data annotation tasks, enabling the scale-up of research across data-intensive domains. While LLMs already achieve near-human accuracy on objective annotation tasks, their performance on subjective annotation tasks, such as those involving psychological constructs, is less consistent and more prone to errors. Standard evaluation practices typically collapse all annotation errors into a single alignment metric, but this simplified approach may obscure different kinds of errors that affect final analytical conclusions in different ways. Here, we propose a diagnostic evaluation paradigm that incorporates a human-in-the-loop step to separate task-inherent ambiguity from model-driven inaccuracies and to assess annotation quality in terms of potential downstream impacts. We refine this paradigm for ordinal annotation tasks, which are common in subjective annotation. The refined paradigm includes: (1) a diagnostic taxonomy that categorizes LLM annotation errors along two dimensions: source (model-specific vs. task-inherent) and type (boundary ambiguity vs. conceptual misidentification); (2) a lightweight human annotation test to estimate task-inherent ambiguity from LLM annotations; and (3) a computational method to decompose observed LLM annotation errors following our taxonomy. We validate this paradigm on four educational annotation tasks, demonstrating both its conceptual validity and practical utility. Theoretically, our work provides empirical evidence for why excessively high alignment is unrealistic in certain annotation tasks and why single alignment metrics inadequately reflect the quality of LLM annotations. In practice, our paradigm can serve as a low-cost diagnostic tool that assesses the suitability of a given task for LLM annotation and provides actionable insights for further technical optimization.
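The abstract names the decomposition's two axes but not its mechanics, so the following is a minimal sketch of what such a decomposition could look like for an ordinal scale, assuming adjacent-category disagreements count as boundary ambiguity, larger jumps as conceptual misidentification, and the human-human disagreement rate from the lightweight re-annotation test caps the task-inherent share. Every function name and rule here is an illustrative assumption, not the paper's actual method.

```python
from collections import Counter

def decompose_errors(llm_labels, gold_labels, human_disagree_rate):
    """Decompose ordinal annotation errors along the two assumed axes.

    Type:   |pred - gold| == 1 -> "boundary" (adjacent-category confusion)
            |pred - gold| >  1 -> "conceptual" (misidentified construct)
    Source: the human-human disagreement rate from a small re-annotated
            sample caps the task-inherent share; error beyond that cap
            is attributed to the model.
    """
    n = len(llm_labels)
    types = Counter()
    for pred, gold in zip(llm_labels, gold_labels):
        if pred != gold:
            types["boundary" if abs(pred - gold) == 1 else "conceptual"] += 1

    total_error_rate = sum(types.values()) / n
    task_inherent = min(total_error_rate, human_disagree_rate)
    return {
        "boundary_rate": types["boundary"] / n,
        "conceptual_rate": types["conceptual"] / n,
        "task_inherent_rate": task_inherent,
        "model_specific_rate": total_error_rate - task_inherent,
    }

# Toy run: a 1-4 ordinal scale with 10% human-human disagreement.
print(decompose_errors([1, 2, 2, 4, 3], [1, 3, 2, 1, 3], human_disagree_rate=0.10))
```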
Related papers
- Are Multimodal Large Language Models Good Annotators for Image Tagging? [62.01475514488922]
This paper aims to analyze the gap between MLLM-generated and human annotations. We propose TagLLM, a novel framework for image tagging that aims to narrow this gap.
arXiv Detail & Related papers (2026-02-24T14:53:16Z) - Fine-Grained Detection of Context-Grounded Hallucinations Using LLMs [16.173245551933178]
Context-grounded hallucinations are cases where model outputs contain information not verifiable against the source text. We study the applicability of LLMs for localizing such hallucinations, as a more practical alternative to existing complex evaluation pipelines.
arXiv Detail & Related papers (2025-09-26T17:03:24Z) - Can Reasoning Help Large Language Models Capture Human Annotator Disagreement? [84.32752330104775]
- Can Reasoning Help Large Language Models Capture Human Annotator Disagreement? [84.32752330104775]
Variation in human annotation (i.e., disagreements) is common in NLP. We evaluate the influence of different reasoning settings on Large Language Model disagreement modeling. Surprisingly, our results show that RLVR-style reasoning degrades performance in disagreement modeling.
arXiv Detail & Related papers (2025-06-24T09:49:26Z) - Prompt Candidates, then Distill: A Teacher-Student Framework for LLM-driven Data Annotation [35.1208076670736]
- Prompt Candidates, then Distill: A Teacher-Student Framework for LLM-driven Data Annotation [35.1208076670736]
We propose a novel candidate annotation paradigm wherein large language models are encouraged to output all possible labels when incurring uncertainty. To ensure unique labels are provided for downstream tasks, we develop a teacher-student framework CanDist that distills candidate annotations with a Small Language Model.
arXiv Detail & Related papers (2025-06-04T11:42:37Z) - Utility-Focused LLM Annotation for Retrieval and Retrieval-Augmented Generation [96.18720164390699]
- Utility-Focused LLM Annotation for Retrieval and Retrieval-Augmented Generation [96.18720164390699]
This paper explores the use of large language models (LLMs) for annotating document utility in training retrieval and retrieval-augmented generation (RAG) systems. Our results show that LLM-generated annotations enhance out-of-domain retrieval performance and improve RAG outcomes compared to models trained solely on human annotations or downstream QA metrics.
arXiv Detail & Related papers (2025-04-07T16:05:52Z) - Error Classification of Large Language Models on Math Word Problems: A Dynamically Adaptive Framework [79.40678802098026]
Math Word Problems serve as a crucial benchmark for evaluating Large Language Models' reasoning abilities. Current error classification methods rely on static and predefined categories. We propose Error-Aware Prompting (EAP) that incorporates common error patterns as explicit guidance.
arXiv Detail & Related papers (2025-01-26T16:17:57Z) - To Err Is Human; To Annotate, SILICON? Reducing Measurement Error in LLM Annotation [11.470318058523466]
- To Err Is Human; To Annotate, SILICON? Reducing Measurement Error in LLM Annotation [11.470318058523466]
Large Language Models (LLMs) promise a cost-effective, scalable alternative to human annotation. We develop the SILICON methodology to systematically reduce measurement error from LLM annotation. Our evidence indicates that reducing each error source is necessary, and that SILICON supports rigorous annotation in management research.
arXiv Detail & Related papers (2024-12-19T02:21:41Z) - Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
Inspired by the principles in subjective question correction, we propose a new evaluation method, SQC-Score.
Results on three information extraction tasks show that SQC-Score is preferred by human annotators over the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z) - Understanding and Mitigating Classification Errors Through Interpretable
Token Patterns [58.91023283103762]
Characterizing errors in easily interpretable terms gives insight into whether a classifier is prone to making systematic errors.
We propose to discover those patterns of tokens that distinguish correct and erroneous predictions.
We show that our method, Premise, performs well in practice.
arXiv Detail & Related papers (2023-11-18T00:24:26Z) - CoAnnotating: Uncertainty-Guided Work Allocation between Human and Large
- CoAnnotating: Uncertainty-Guided Work Allocation between Human and Large Language Models for Data Annotation [94.59630161324013]
We propose CoAnnotating, a novel paradigm for Human-LLM co-annotation of unstructured texts at scale.
Our empirical study shows CoAnnotating to be an effective means of allocating work, achieving up to a 21% performance improvement over a random baseline across different datasets.
arXiv Detail & Related papers (2023-10-24T08:56:49Z) - ReEval: Automatic Hallucination Evaluation for Retrieval-Augmented Large Language Models via Transferable Adversarial Attacks [91.55895047448249]
- ReEval: Automatic Hallucination Evaluation for Retrieval-Augmented Large Language Models via Transferable Adversarial Attacks [91.55895047448249]
This paper presents ReEval, an LLM-based framework using prompt chaining to perturb the original evidence for generating new test cases.
We implement ReEval using ChatGPT and evaluate the resulting variants of two popular open-domain QA datasets.
Our generated data is human-readable and useful for triggering hallucinations in large language models.
arXiv Detail & Related papers (2023-10-19T06:37:32Z) - Discover, Explanation, Improvement: An Automatic Slice Detection
Framework for Natural Language Processing [72.14557106085284]
Slice detection models (SDMs) automatically identify underperforming groups of datapoints.
This paper proposes a benchmark named "Discover, Explain, Improve (DEIM)" for classification NLP tasks.
Our evaluation shows that Edisa can accurately select error-prone datapoints with informative semantic features.
arXiv Detail & Related papers (2022-11-08T19:00:00Z) - Revisiting Unsupervised Meta-Learning: Amplifying or Compensating for
- Revisiting Unsupervised Meta-Learning: Amplifying or Compensating for the Characteristics of Few-Shot Tasks [30.893785366366078]
We develop a practical approach towards few-shot image classification, where a visual recognition system is constructed with limited data.
We find that the base class set labels are not necessary, and discriminative embeddings could be meta-learned in an unsupervised manner.
Experiments on few-shot learning benchmarks verify that our approaches outperform previous methods by a margin of 4-10%.
arXiv Detail & Related papers (2020-11-30T10:08:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.