Related papers: Donkii: Can Annotation Error Detection Methods Find Errors in Instruction-Tuning Datasets?

Donkii: Can Annotation Error Detection Methods Find Errors in Instruction-Tuning Datasets?

URL: http://arxiv.org/abs/2309.01669v2
Date: Thu, 22 Feb 2024 09:16:47 GMT
Title: Donkii: Can Annotation Error Detection Methods Find Errors in Instruction-Tuning Datasets?
Authors: Leon Weber-Genzel and Robert Litschko and Ekaterina Artemova and Barbara Plank
Abstract summary: We present a first and novel benchmark for Error Detection (AED) on instruction tuning data: DONKII. We find that all three datasets contain clear errors, which sometimes propagate directly into instruction-tuned LLMs. Our results show that the choice of the right AED method and model size is indeed crucial and derive practical recommendations for how to use AED methods to clean instruction-tuning data.
Score: 29.072740239139087
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Instruction tuning has become an integral part of training pipelines for Large Language Models (LLMs) and has been shown to yield strong performance gains. In an orthogonal line of research, Annotation Error Detection (AED) has emerged as a tool for detecting quality problems in gold standard labels. So far, however, the application of AED methods has been limited to classification tasks. It is an open question how well AED methods generalize to language generation settings, which are becoming more widespread via LLMs. In this paper, we present a first and novel benchmark for AED on instruction tuning data: DONKII. It comprises three instruction-tuning datasets enriched with error annotations by experts and semi-automatic methods. We also provide a novel taxonomy of error types for instruction-tuning data. We find that all three datasets contain clear errors, which sometimes propagate directly into instruction-tuned LLMs. We propose four AED baselines for the generative setting and evaluate them extensively on the newly introduced dataset. Our results show that the choice of the right AED method and model size is indeed crucial and derive practical recommendations for how to use AED methods to clean instruction-tuning data.

Related papers

ZeroED: Hybrid Zero-shot Error Detection through Large Language Model Reasoning [45.352592886478774]
We propose ZeroED, a novel hybrid zero-shot error detection framework. ZeroED operates in four steps, i.e., feature representation, error labeling, training data construction, and detector training. Experiments show ZeroED substantially outperforms state-of-the-art methods by a maximum 30% improvement in F1 score and up to 90% token cost reduction.
arXiv Detail & Related papers (2025-04-06T10:28:41Z)
Subtle Errors Matter: Preference Learning via Error-injected Self-editing [59.405145971637204]
We propose a novel preference learning framework called eRror-Injected Self-Editing (RISE) RISE injects predefined subtle errors into pivotal tokens in reasoning or steps to construct hard pairs for error mitigation. Experiments validate the effectiveness of RISE, with preference learning on Qwen2-7B-Instruct yielding notable improvements of 3.0% on GSM8K and 7.9% on MATH with only 4.5K training samples.
arXiv Detail & Related papers (2024-10-09T07:43:38Z)
Non-instructional Fine-tuning: Enabling Instruction-Following Capabilities in Pre-trained Language Models without Instruction-Following Data [51.34222224728979]
We propose a novel approach that uses the first half of a random text from OpenWebText as the instruction and GPT-3.5-turbo or GPT-4-turbo to complete the text as the response. Despite the data being "non-instructional", we found that pre-trained LLMs fine-tuned on this data can gain instruction-following capabilities.
arXiv Detail & Related papers (2024-08-27T01:21:53Z)
Text3DAug -- Prompted Instance Augmentation for LiDAR Perception [1.1633929083694388]
LiDAR data of urban scenarios poses unique challenges, such as heterogeneous characteristics and inherent class imbalance. We propose Text3DAug, a novel approach leveraging generative models for instance augmentation. Text3DAug does not depend on labeled data and is the first of its kind to generate instances and annotations from text.
arXiv Detail & Related papers (2024-08-26T13:16:03Z)
Can LLMs Separate Instructions From Data? And What Do We Even Mean By That? [60.50127555651554]
Large Language Models (LLMs) show impressive results in numerous practical applications, but they lack essential safety features. This makes them vulnerable to manipulations such as indirect prompt injections and generally unsuitable for safety-critical tasks. We introduce a formal measure for instruction-data separation and an empirical variant that is calculable from a model's outputs.
arXiv Detail & Related papers (2024-03-11T15:48:56Z)
ExaRanker-Open: Synthetic Explanation for IR using Open-Source LLMs [60.81649785463651]
We introduce ExaRanker-Open, where we adapt and explore the use of open-source language models to generate explanations. Our findings reveal that incorporating explanations consistently enhances neural rankers, with benefits escalating as the LLM size increases.
arXiv Detail & Related papers (2024-02-09T11:23:14Z)
Described Object Detection: Liberating Object Detection with Flexible Expressions [19.392927971139652]
We advance Open-Vocabulary object Detection (OVD) and Referring Expression (REC) to Described Object Detection (DOD) In this paper, we advance them to a more practical setting called Described Object Detection (DOD) by expanding category names to flexible language expressions for OVD. This dataset features flexible language expressions, whether short category names or long descriptions, and annotating all described objects on all images without omission.
arXiv Detail & Related papers (2023-07-24T14:06:54Z)
ActiveAED: A Human in the Loop Improves Annotation Error Detection [22.61786427296688]
Even widely-used benchmark datasets contain substantial number of erroneous annotations. We propose ActiveAED, an AED method that can detect errors more accurately by repeatedly querying a human for error corrections in its prediction loop. We evaluate ActiveAED on eight datasets spanning five different tasks and find that it leads to improvements over the state of the art on seven of them, with gains of up to six percentage points in average precision.
arXiv Detail & Related papers (2023-05-31T17:18:47Z)
Dynosaur: A Dynamic Growth Paradigm for Instruction-Tuning Data Curation [92.2167864437497]
We propose Dynosaur, a dynamic growth paradigm for the automatic curation of instruction-tuning data. Based on the metadata of existing datasets, we use LLMs to automatically construct instruction-tuning data by identifying relevant data fields and generating appropriate instructions. By leveraging the existing annotated datasets, Dynosaur offers several advantages: 1) it reduces the API cost for generating instructions; 2) it provides high-quality data for instruction tuning; and 3) it supports the continuous improvement of models by generating instruction-tuning data when a new annotated dataset becomes available.
arXiv Detail & Related papers (2023-05-23T17:56:26Z)
Annotation Error Detection: Analyzing the Past and Present for a More Coherent Future [63.99570204416711]
We reimplement 18 methods for detecting potential annotation errors and evaluate them on 9 English datasets. We define a uniform evaluation setup including a new formalization of the annotation error detection task. We release our datasets and implementations in an easy-to-use and open source software package.
arXiv Detail & Related papers (2022-06-05T22:31:45Z)
DAGA: Data Augmentation with a Generation Approach for Low-resource Tagging Tasks [88.62288327934499]
We propose a novel augmentation method with language models trained on the linearized labeled sentences. Our method is applicable to both supervised and semi-supervised settings.
arXiv Detail & Related papers (2020-11-03T07:49:15Z)

This list is automatically generated from the titles and abstracts of the papers in this site.