Annotation Error Detection: Analyzing the Past and Present for a More
Coherent Future
- URL: http://arxiv.org/abs/2206.02280v1
- Date: Sun, 5 Jun 2022 22:31:45 GMT
- Title: Annotation Error Detection: Analyzing the Past and Present for a More
Coherent Future
- Authors: Jan-Christoph Klie, Bonnie Webber, Iryna Gurevych
- Abstract summary: We reimplement 18 methods for detecting potential annotation errors and evaluate them on 9 English datasets.
We define a uniform evaluation setup including a new formalization of the annotation error detection task.
We release our datasets and implementations in an easy-to-use and open source software package.
- Score: 63.99570204416711
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Annotated data is an essential ingredient in natural language processing for
training and evaluating machine learning models. It is therefore very desirable
for the annotations to be of high quality. Recent work, however, has shown that
several popular datasets contain a surprising amount of annotation errors or
inconsistencies. To alleviate this issue, many methods for annotation error
detection have been devised over the years. While researchers show that their
approaches work well on their newly introduced datasets, they rarely compare
their methods to previous work or on the same datasets. This raises strong
concerns about methods' general performance and makes it difficult to assess
their strengths and weaknesses. We therefore reimplement 18 methods for detecting
potential annotation errors and evaluate them on 9 English datasets for text
classification as well as token and span labeling. In addition, we define a
uniform evaluation setup including a new formalization of the annotation error
detection task, evaluation protocol and general best practices. To facilitate
future research and reproducibility, we release our datasets and
implementations in an easy-to-use and open source software package.
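As a concrete illustration of the task described above, the snippet below sketches one simple family of detectors: train a classifier with cross-validation and flag instances whose gold label disagrees with the out-of-fold prediction (sometimes called retagging or classification-based detection). It is a minimal sketch using scikit-learn and toy data, not the paper's released implementation or evaluation setup.

```python
# Minimal sketch of a classification-based annotation error detector:
# flag instances whose gold label disagrees with an out-of-fold prediction.
# Illustrative only -- not the paper's released implementation.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline

# Toy annotated corpus; in practice this would be a real labeled dataset.
texts = [
    "the movie was wonderful", "terrible acting and a dull plot",
    "an absolute delight", "i regret watching this",
    "great soundtrack and pacing", "boring from start to finish",
    "a charming, funny film", "the worst sequel in years",
]
gold = ["pos", "neg", "pos", "neg", "neg", "neg", "pos", "neg"]  # index 4 is deliberately mislabeled

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))

# Out-of-fold predictions, so every instance is scored by a model that never saw it.
predicted = cross_val_predict(model, texts, gold, cv=2)

# Disagreement between gold label and prediction marks a potential annotation error.
for text, g, p in zip(texts, gold, predicted):
    if g != p:
        print(f"possible error: {text!r} annotated as {g}, model predicts {p}")
```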
Related papers
- Less for More: Enhancing Preference Learning in Generative Language Models with Automated Self-Curation of Training Corpora [4.008122785948581]
Ambiguity in language presents challenges in developing more capable language models.
We introduce a self-curation method that preprocesses annotated datasets by leveraging proxy models trained directly on these datasets.
Our method enhances preference learning by automatically detecting and removing ambiguous annotations within the dataset.
arXiv Detail & Related papers (2024-08-23T02:27:14Z)
- GPTs Are Multilingual Annotators for Sequence Generation Tasks [11.59128394819439]
This study proposes an autonomous annotation method that utilizes large language models.
We demonstrate that the proposed method is not only cost-efficient but also applicable to low-resource language annotation.
arXiv Detail & Related papers (2024-02-08T09:44:02Z)
- Capturing Perspectives of Crowdsourced Annotators in Subjective Learning Tasks [9.110872603799839]
Supervised classification heavily depends on datasets annotated by humans.
In subjective tasks such as toxicity classification, these annotations often exhibit low agreement among raters.
In this work, we propose Annotator Aware Representations for Texts (AART) for subjective classification tasks.
arXiv Detail & Related papers (2023-11-16T10:18:32Z)
- Deep Active Learning with Noisy Oracle in Object Detection [5.5165579223151795]
We propose a composite active learning framework including a label review module for deep object detection.
We show that using part of the annotation budget to partially correct noisy annotations in the active dataset leads to early improvements in model performance.
In our experiments, we achieve improvements of up to 4.5 mAP points in object detection performance by incorporating label reviews at an equal annotation budget.
arXiv Detail & Related papers (2023-09-30T13:28:35Z)
- An Embarrassingly Simple Approach to Semi-Supervised Few-Shot Learning [58.59343434538218]
We propose a simple but quite effective approach to predict accurate negative pseudo-labels of unlabeled data from an indirect learning perspective.
Our approach can be implemented in just a few lines of code using only off-the-shelf operations (a generic sketch of negative pseudo-labeling appears after this list).
arXiv Detail & Related papers (2022-09-28T02:11:34Z)
- Assisted Text Annotation Using Active Learning to Achieve High Quality with Little Effort [9.379650501033465]
We propose a tool that enables researchers to create large, high-quality, annotated datasets with only a few manual annotations.
We combine an active learning (AL) approach with a pre-trained language model to semi-automatically identify annotation categories.
Our preliminary results show that employing AL strongly reduces the number of annotations for correct classification of even complex and subtle frames.
arXiv Detail & Related papers (2021-12-15T13:14:58Z)
- Combining Feature and Instance Attribution to Detect Artifacts [62.63504976810927]
We propose methods to facilitate identification of training data artifacts.
We show that this proposed training-feature attribution approach can be used to uncover artifacts in training data.
We execute a small user study to evaluate whether these methods are useful to NLP researchers in practice.
arXiv Detail & Related papers (2021-07-01T09:26:13Z)
- Towards Good Practices for Efficiently Annotating Large-Scale Image Classification Datasets [90.61266099147053]
We investigate efficient annotation strategies for collecting multi-class classification labels for a large collection of images.
We propose modifications and best practices aimed at minimizing human labeling effort.
Simulated experiments on a 125k-image subset of ImageNet100 show that it can be annotated to 80% top-1 accuracy with 0.35 annotations per image on average.
arXiv Detail & Related papers (2021-04-26T16:29:32Z)
- Competency Problems: On Finding and Removing Artifacts in Language Data [50.09608320112584]
We argue that for complex language understanding tasks, all simple feature correlations are spurious.
We theoretically analyze the difficulty of creating data for competency problems when human bias is taken into account.
arXiv Detail & Related papers (2021-04-17T21:34:10Z)
- Active Learning for Coreference Resolution using Discrete Annotation [76.36423696634584]
We improve upon pairwise annotation for active learning in coreference resolution.
We ask annotators to identify mention antecedents if a presented mention pair is deemed not coreferent.
In experiments with existing benchmark coreference datasets, we show that the signal from this additional question leads to significant performance gains per human-annotation hour.
arXiv Detail & Related papers (2020-04-28T17:17:11Z)
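The sketch below, referenced from the semi-supervised few-shot learning entry above, illustrates the general idea of negative pseudo-labeling: rather than guessing which class an unlabeled example belongs to, record classes it is confidently not and penalize the model for assigning them probability. It is a generic illustration with simulated class probabilities, not that paper's exact indirect-learning procedure.

```python
# Generic sketch of negative pseudo-labeling for semi-supervised learning.
# Illustrative only -- not the exact procedure of the cited paper.
import numpy as np

rng = np.random.default_rng(0)
num_classes = 5

# Stand-in for a model's class probabilities on a batch of unlabeled examples.
probs = rng.dirichlet(np.ones(num_classes), size=4)

# Negative pseudo-labels: classes whose predicted probability is very low.
threshold = 0.05
negative_labels = [np.flatnonzero(p < threshold) for p in probs]

def negative_pseudo_label_loss(p, neg_classes):
    """Complementary loss: push the probability of negated classes toward zero,
    i.e. maximize log(1 - p_c) for every negative pseudo-label c."""
    if len(neg_classes) == 0:
        return 0.0
    return float(-np.log(1.0 - p[neg_classes]).mean())

for p, neg in zip(probs, negative_labels):
    print("negative labels:", neg.tolist(),
          "loss:", round(negative_pseudo_label_loss(p, neg), 4))
```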