Analyzing Dataset Annotation Quality Management in the Wild
- URL: http://arxiv.org/abs/2307.08153v4
- Date: Sat, 9 Mar 2024 14:18:41 GMT
- Title: Analyzing Dataset Annotation Quality Management in the Wild
- Authors: Jan-Christoph Klie, Richard Eckart de Castilho, Iryna Gurevych
- Abstract summary: Even popular datasets used to train and evaluate state-of-the-art models contain a non-negligible amount of erroneous annotations, biases, or artifacts.
While practices and guidelines regarding dataset creation projects exist, large-scale analysis has yet to be performed on how quality management is conducted.
- Score: 63.07224587146207
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Data quality is crucial for training accurate, unbiased, and trustworthy
machine learning models as well as for their correct evaluation. Recent works,
however, have shown that even popular datasets used to train and evaluate
state-of-the-art models contain a non-negligible amount of erroneous
annotations, biases, or artifacts. While practices and guidelines regarding
dataset creation projects exist, to our knowledge, large-scale analysis has yet
to be performed on how quality management is conducted when creating natural
language datasets and whether these recommendations are followed. Therefore, we
first survey and summarize recommended quality management practices for dataset
creation as described in the literature and provide suggestions for applying
them. Then, we compile a corpus of 591 scientific publications introducing text
datasets and annotate it for quality-related aspects, such as annotator
management, agreement, adjudication, or data validation. Using these
annotations, we then analyze how quality management is conducted in practice. A
majority of the annotated publications apply good or excellent quality
management. However, we deem the effort of 30% of the works as only subpar.
Our analysis also shows common errors, especially when using inter-annotator
agreement and computing annotation error rates.
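One pitfall the abstract highlights concerns inter-annotator agreement. As a purely illustrative sketch (not code from the paper), the snippet below contrasts raw percentage agreement with chance-corrected Cohen's kappa for two annotators on invented labels.
```python
# Illustrative sketch (not from the paper): raw percentage agreement vs.
# chance-corrected Cohen's kappa for two annotators. Labels are invented.
from collections import Counter

def percentage_agreement(a, b):
    """Fraction of items on which both annotators chose the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    p_observed = percentage_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    p_chance = sum((counts_a[label] / n) * (counts_b[label] / n)
                   for label in set(a) | set(b))
    return (p_observed - p_chance) / (1 - p_chance)

# Skewed label distribution: raw agreement looks high, kappa is much lower.
annotator_1 = ["O", "O", "O", "O", "O", "O", "O", "O", "PER", "LOC"]
annotator_2 = ["O", "O", "O", "O", "O", "O", "O", "PER", "O", "LOC"]
print(percentage_agreement(annotator_1, annotator_2))  # 0.8
print(cohens_kappa(annotator_1, annotator_2))          # ~0.41
```
In general, raw agreement alone overstates reliability under skewed label distributions, and settings with more than two annotators or missing annotations call for measures such as Fleiss' kappa or Krippendorff's alpha.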
Related papers
- QuRating: Selecting High-Quality Data for Training Language Models [64.83332850645074]
We introduce QuRating, a method for selecting pre-training data that can capture human intuitions about data quality.
In this paper, we investigate four qualities - writing style, required expertise, facts & trivia, and educational value.
We train a QuRater model to learn scalar ratings from pairwise judgments, and use it to annotate a 260B training corpus with quality ratings for each of the four criteria.
arXiv Detail & Related papers (2024-02-15T06:36:07Z)
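To make the QuRating entry above concrete: learning scalar ratings from pairwise judgments can be done with a Bradley-Terry-style model. The sketch below is a minimal illustration of that idea in plain Python, not the QuRater implementation; the documents and judgments are invented.
```python
# Illustrative sketch only (not the QuRater implementation): fitting scalar
# quality ratings from pairwise judgments with a Bradley-Terry-style model,
# trained by simple stochastic gradient ascent on the log-likelihood.
import math

# Each tuple (i, j) records a judgment that document i is higher quality than j.
pairwise_wins = [(0, 1), (0, 2), (1, 2), (0, 1), (2, 1)]
num_docs = 3
ratings = [0.0] * num_docs  # one scalar quality rating per document

learning_rate = 0.1
for _ in range(500):
    for winner, loser in pairwise_wins:
        # P(winner preferred over loser) under a Bradley-Terry model.
        p_win = 1.0 / (1.0 + math.exp(ratings[loser] - ratings[winner]))
        # Gradient ascent step on log P(observed judgment).
        ratings[winner] += learning_rate * (1.0 - p_win)
        ratings[loser] -= learning_rate * (1.0 - p_win)

print(ratings)  # higher value = higher estimated quality
```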
- One-Shot Learning as Instruction Data Prospector for Large Language Models [108.81681547472138]
Nuggets uses one-shot learning to select high-quality instruction data from extensive datasets.
We show that instruction tuning with the top 1% of examples curated by Nuggets substantially outperforms conventional methods employing the entire dataset.
arXiv Detail & Related papers (2023-12-16T03:33:12Z)
- A Novel Metric for Measuring Data Quality in Classification Applications (extended version) [0.0]
We introduce and explain a novel metric to measure data quality.
This metric is based on the correlated evolution between the classification performance and the deterioration of data.
We provide an interpretation of each criterion and examples of assessment levels.
arXiv Detail & Related papers (2023-12-13T11:20:09Z)
- Collect, Measure, Repeat: Reliability Factors for Responsible AI Data Collection [8.12993269922936]
We argue that data collection for AI should be performed in a responsible manner.
We propose a Responsible AI (RAI) methodology designed to guide the data collection with a set of metrics.
arXiv Detail & Related papers (2023-08-22T18:01:27Z)
- Annotation Error Detection: Analyzing the Past and Present for a More Coherent Future [63.99570204416711]
We reimplement 18 methods for detecting potential annotation errors and evaluate them on 9 English datasets.
We define a uniform evaluation setup including a new formalization of the annotation error detection task.
We release our datasets and implementations in an easy-to-use and open source software package.
arXiv Detail & Related papers (2022-06-05T22:31:45Z)
- Competency Problems: On Finding and Removing Artifacts in Language Data [50.09608320112584]
We argue that for complex language understanding tasks, all simple feature correlations are spurious.
We theoretically analyze the difficulty of creating data for competency problems when human bias is taken into account.
arXiv Detail & Related papers (2021-04-17T21:34:10Z)
- Critical analysis on the reproducibility of visual quality assessment using deep features [6.746400031322727]
Data used to train supervised machine learning models are commonly split into independent training, validation, and test sets.
This paper illustrates that complex data leakage cases have occurred in the no-reference image and video quality assessment literature.
arXiv Detail & Related papers (2020-09-10T09:51:18Z)
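Relating to the reproducibility entry above: one way such leakage arises is when different distorted versions of the same source content end up in both the training and test sets. The sketch below shows a group-aware split under that assumption; it is illustrative only, with invented record fields, and is not code from the paper.
```python
# Illustrative sketch only: a group-aware split that keeps all variants of the
# same source content in a single fold, avoiding content-overlap leakage.
import random

def group_split(items, group_of, test_fraction=0.2, seed=0):
    """Split items so that no group appears in both the train and test sets."""
    groups = sorted({group_of(item) for item in items})
    random.Random(seed).shuffle(groups)
    n_test_groups = max(1, int(len(groups) * test_fraction))
    test_groups = set(groups[:n_test_groups])
    train = [item for item in items if group_of(item) not in test_groups]
    test = [item for item in items if group_of(item) in test_groups]
    return train, test

# Hypothetical records: (source_content_id, distortion_type, quality_score).
records = [("src1", "blur", 3.2), ("src1", "jpeg", 2.8),
           ("src2", "blur", 4.1), ("src3", "noise", 3.9)]
train_set, test_set = group_split(records, group_of=lambda r: r[0])
```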
- Summary-Source Proposition-level Alignment: Task, Datasets and Supervised Baseline [94.0601799665342]
Aligning sentences in a reference summary with their counterparts in source documents was shown as a useful auxiliary summarization task.
We propose establishing summary-source alignment as an explicit task, while introducing two major novelties.
We create a novel training dataset for proposition-level alignment, derived automatically from available summarization evaluation data.
We present a supervised proposition alignment baseline model, showing improved alignment-quality over the unsupervised approach.
arXiv Detail & Related papers (2020-09-01T17:27:12Z)
- Learning from Imperfect Annotations [15.306536555936692]
Many machine learning systems today are trained on large amounts of human-annotated data.
We propose a new end-to-end framework that enables us to merge the aggregation step with model training.
We show accuracy gains of up to 25% over the current state-of-the-art approaches for aggregating annotations.
arXiv Detail & Related papers (2020-04-07T15:21:08Z)
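For the "Learning from Imperfect Annotations" entry above: the conventional pipeline it improves on first aggregates redundant crowd labels (for example by majority vote) and then trains on the aggregated labels. A minimal majority-vote sketch is given below as an illustration of that baseline step, not of the paper's end-to-end framework; the labels are invented.
```python
# Illustrative sketch only: majority-vote aggregation of noisy crowd labels,
# the separate aggregation step that end-to-end approaches aim to fold into
# model training.
from collections import Counter

def majority_vote(labels_per_item):
    """Pick the most frequent label for each item; ties resolved arbitrarily."""
    return [Counter(labels).most_common(1)[0][0] for labels in labels_per_item]

# Each inner list holds the labels three hypothetical annotators gave one item.
crowd_labels = [
    ["spam", "spam", "ham"],
    ["ham", "ham", "ham"],
    ["spam", "ham", "spam"],
]
print(majority_vote(crowd_labels))  # ['spam', 'ham', 'spam']
```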