ActiveAED: A Human in the Loop Improves Annotation Error Detection
- URL: http://arxiv.org/abs/2305.20045v1
- Date: Wed, 31 May 2023 17:18:47 GMT
- Title: ActiveAED: A Human in the Loop Improves Annotation Error Detection
- Authors: Leon Weber and Barbara Plank
- Abstract summary: Even widely-used benchmark datasets contain a substantial number of erroneous annotations.
We propose ActiveAED, an AED method that can detect errors more accurately by repeatedly querying a human for error corrections in its prediction loop.
We evaluate ActiveAED on eight datasets spanning five different tasks and find that it leads to improvements over the state of the art on seven of them, with gains of up to six percentage points in average precision.
- Score: 22.61786427296688
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Manually annotated datasets are crucial for training and evaluating Natural
Language Processing models. However, recent work has discovered that even
widely-used benchmark datasets contain a substantial number of erroneous
annotations. This problem has been addressed with Annotation Error Detection
(AED) models, which can flag such errors for human re-annotation. However, even
though many of these AED methods assume a final curation step in which a human
annotator decides whether the annotation is erroneous, they have been developed
as static models without any human-in-the-loop component. In this work, we
propose ActiveAED, an AED method that can detect errors more accurately by
repeatedly querying a human for error corrections in its prediction loop. We
evaluate ActiveAED on eight datasets spanning five different tasks and find
that it leads to improvements over the state of the art on seven of them, with
gains of up to six percentage points in average precision.
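The abstract gives only a high-level picture of the method, but the loop it describes (score candidate annotation errors, ask a human to verify the top-ranked ones, and feed the corrections back before the next round of predictions) can be sketched as below. This is a minimal illustrative sketch under that reading, not the authors' implementation; the function name, its arguments, and the batching and feedback stubs are all hypothetical.

```python
import numpy as np

def human_in_the_loop_aed(scores, human_says_error, budget, batch_size=10):
    """Toy sketch of a human-in-the-loop annotation-error-detection loop.

    scores           -- per-instance error scores from any AED model
    human_says_error -- callable standing in for the human annotator;
                        returns True if instance i is annotated wrongly
    budget           -- total number of instances the human will inspect
    """
    scores = np.asarray(scores, dtype=float)
    checked, confirmed_errors = set(), set()
    while len(checked) < budget:
        # Rank the still-unchecked instances by current error score.
        ranked = [int(i) for i in np.argsort(-scores) if int(i) not in checked]
        batch = ranked[:batch_size]
        if not batch:
            break
        for i in batch:
            checked.add(i)
            if human_says_error(i):  # human confirms an annotation error
                confirmed_errors.add(i)
        # Placeholder for the feedback step: a real system would fold the
        # human corrections back into the model and re-score the remaining
        # instances before the next query round.
    return confirmed_errors
```

With a perfect scorer the human would only ever see true errors; in practice the value of the loop is that each round of corrections sharpens the ranking from which the next batch is drawn.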
Related papers
- TGEA: An Error-Annotated Dataset and Benchmark Tasks for Text Generation from Pretrained Language Models [57.758735361535486]
TGEA is an error-annotated dataset for text generation from pretrained language models (PLMs).
We create an error taxonomy to cover 24 types of errors occurring in PLM-generated sentences.
This is the first dataset with comprehensive annotations for PLM-generated texts.
arXiv Detail & Related papers (2025-03-06T09:14:02Z) - Subtle Errors Matter: Preference Learning via Error-injected Self-editing [59.405145971637204]
We propose a novel preference learning framework called eRror-Injected Self-Editing (RISE).
RISE injects predefined subtle errors into partial tokens of correct solutions to construct hard pairs for error mitigation.
Experiments validate the effectiveness of RISE, with preference learning on Qwen2-7B-Instruct yielding notable improvements of 3.0% on GSM8K and 7.9% on MATH.
arXiv Detail & Related papers (2024-10-09T07:43:38Z) - XAL: EXplainable Active Learning Makes Classifiers Better Low-resource Learners [71.8257151788923]
We propose a novel Explainable Active Learning framework (XAL) for low-resource text classification.
XAL encourages classifiers to justify their inferences and delve into unlabeled data for which they cannot provide reasonable explanations.
Experiments on six datasets show that XAL achieves consistent improvement over 9 strong baselines.
arXiv Detail & Related papers (2023-10-09T08:07:04Z) - Donkii: Can Annotation Error Detection Methods Find Errors in
Instruction-Tuning Datasets? [29.072740239139087]
We present the first benchmark for Annotation Error Detection (AED) on instruction-tuning data: DONKII.
We find that all three datasets contain clear errors, which sometimes propagate directly into instruction-tuned LLMs.
Our results show that the choice of the right AED method and model size is crucial, and we derive practical recommendations for how to use AED methods to clean instruction-tuning data.
arXiv Detail & Related papers (2023-09-04T15:34:02Z) - MISMATCH: Fine-grained Evaluation of Machine-generated Text with
Mismatch Error Types [68.76742370525234]
We propose a new evaluation scheme to model human judgments in 7 NLP tasks, based on the fine-grained mismatches between a pair of texts.
Inspired by the recent efforts in several NLP tasks for fine-grained evaluation, we introduce a set of 13 mismatch error types.
We show that the mismatch errors between the sentence pairs on the held-out datasets from 7 NLP tasks align well with the human evaluation.
arXiv Detail & Related papers (2023-06-18T01:38:53Z) - Contrastive Error Attribution for Finetuned Language Models [35.80256755393739]
Noisy and misannotated data is a core cause of hallucinations and unfaithful outputs in Natural Language Generation (NLG) tasks.
We introduce a framework to identify and remove low-quality training instances that lead to undesirable outputs.
We show that existing approaches for error tracing, such as gradient-based influence measures, do not perform reliably for detecting faithfulness errors.
arXiv Detail & Related papers (2022-12-21T02:28:07Z) - Improving Named Entity Recognition in Telephone Conversations via
Effective Active Learning with Human in the Loop [2.1004132913758267]
We present an active learning framework that leverages human-in-the-loop learning to identify data samples from the annotated dataset for re-annotation.
By re-annotating only about 6% of the training instances in the whole dataset, the F1 score for a certain entity type can be significantly improved by about 25%.
arXiv Detail & Related papers (2022-11-02T17:55:04Z) - Annotation Error Detection: Analyzing the Past and Present for a More
Coherent Future [63.99570204416711]
We reimplement 18 methods for detecting potential annotation errors and evaluate them on 9 English datasets.
We define a uniform evaluation setup including a new formalization of the annotation error detection task.
We release our datasets and implementations in an easy-to-use and open source software package.
arXiv Detail & Related papers (2022-06-05T22:31:45Z) - Imputation-Free Learning from Incomplete Observations [73.15386629370111]
We introduce the importance-guided stochastic gradient descent (IGSGD) method to train models to perform inference from inputs containing missing values without imputation.
We employ reinforcement learning (RL) to adjust the gradients used to train the models via back-propagation.
Our imputation-free predictions outperform the traditional two-step imputation-based predictions using state-of-the-art imputation methods.
arXiv Detail & Related papers (2021-07-05T12:44:39Z) - TACRED Revisited: A Thorough Evaluation of the TACRED Relation
Extraction Task [80.38130122127882]
TACRED is one of the largest, most widely used crowdsourced datasets in Relation Extraction (RE).
In this paper, we investigate the question: Have we reached a performance ceiling, or is there still room for improvement?
We find that label errors account for 8% absolute F1 test error, and that more than 50% of the examples need to be relabeled.
arXiv Detail & Related papers (2020-04-30T15:07:37Z) - Effect of Annotation Errors on Drone Detection with YOLOv3 [14.519138724931446]
In this work, different types of annotation errors for the object detection problem are simulated, and the performance of a popular state-of-the-art object detector, YOLOv3, is examined.
Some inevitable annotation errors in the CVPR-2020 Anti-UAV Challenge dataset are also examined in this manner.
arXiv Detail & Related papers (2020-04-02T15:06:14Z)