Multi-News+: Cost-efficient Dataset Cleansing via LLM-based Data Annotation
- URL: http://arxiv.org/abs/2404.09682v3
- Date: Tue, 24 Sep 2024 02:35:41 GMT
- Title: Multi-News+: Cost-efficient Dataset Cleansing via LLM-based Data Annotation
- Authors: Juhwan Choi, Jungmin Yun, Kyohoon Jin, YoungBin Kim,
- Abstract summary: We present a case study that extends the application of LLM-based data annotation to enhance the quality of existing datasets through a cleansing strategy.
Specifically, we leverage approaches such as chain-of-thought and majority voting to imitate human annotation and classify unrelated documents from the Multi-News dataset.
- Score: 9.497148303350697
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The quality of the dataset is crucial for ensuring optimal performance and reliability of downstream task models. However, datasets often contain noisy data inadvertently included during the construction process. Numerous attempts have been made to correct this issue through human annotators. However, hiring and managing human annotators is expensive and time-consuming. As an alternative, recent studies are exploring the use of large language models (LLMs) for data annotation. In this study, we present a case study that extends the application of LLM-based data annotation to enhance the quality of existing datasets through a cleansing strategy. Specifically, we leverage approaches such as chain-of-thought and majority voting to imitate human annotation and classify unrelated documents from the Multi-News dataset, which is widely used for the multi-document summarization task. Through our proposed cleansing method, we introduce an enhanced Multi-News+. By employing LLMs for data cleansing, we demonstrate an efficient and effective approach to improving dataset quality without relying on expensive human annotation efforts.
Related papers
- Star-Agents: Automatic Data Optimization with LLM Agents for Instruction Tuning [71.2981957820888]
We propose a novel Star-Agents framework, which automates the enhancement of data quality across datasets.
The framework initially generates diverse instruction data with multiple LLM agents through a bespoke sampling method.
The generated data undergo a rigorous evaluation using a dual-model method that assesses both difficulty and quality.
arXiv Detail & Related papers (2024-11-21T02:30:53Z) - The Promises and Pitfalls of LLM Annotations in Dataset Labeling: a Case Study on Media Bias Detection [23.378592856800168]
Large Language Models (LLMs) can be used to automate the annotation process.
This study investigates whether LLMs are viable for annotating the complex task of media bias detection.
We create annolexical, the first large-scale dataset for media bias classification.
arXiv Detail & Related papers (2024-11-17T14:14:36Z) - Efficient Alignment of Large Language Models via Data Sampling [0.4915744683251149]
We propose an information theory-based methodology for efficient alignment by identifying a small high quality subset.
We find that the model aligned using our proposed methodology outperforms other sampling methods and performs comparable to the model aligned with the full dataset.
arXiv Detail & Related papers (2024-11-15T19:36:15Z) - Forewarned is Forearmed: Leveraging LLMs for Data Synthesis through Failure-Inducing Exploration [90.41908331897639]
Large language models (LLMs) have significantly benefited from training on diverse, high-quality task-specific data.
We present a novel approach, ReverseGen, designed to automatically generate effective training samples.
arXiv Detail & Related papers (2024-10-22T06:43:28Z) - SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts.
We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM.
We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
arXiv Detail & Related papers (2024-07-16T04:41:58Z) - How to Train Data-Efficient LLMs [56.41105687693619]
We study data-efficient approaches for pre-training language models (LLMs)
We find that Ask-LLM and Density sampling are the best methods in their respective categories.
In our comparison of 19 samplers, involving hundreds of evaluation tasks and pre-training runs, we find that Ask-LLM and Density are the best methods in their respective categories.
arXiv Detail & Related papers (2024-02-15T02:27:57Z) - Making Large Language Models Better Data Creators [22.0882632635255]
Large language models (LLMs) have advanced the state-of-the-art in NLP significantly.
deploying them for downstream applications is still challenging due to cost, responsiveness, control, or concerns around privacy and security.
We propose a unified data creation pipeline that requires only a single format example.
arXiv Detail & Related papers (2023-10-31T01:08:34Z) - LLMaAA: Making Large Language Models as Active Annotators [32.57011151031332]
We propose LLMaAA, which takes large language models as annotators and puts them into an active learning loop to determine what to annotate efficiently.
We conduct experiments and analysis on two classic NLP tasks, named entity recognition and relation extraction.
With LLMaAA, task-specific models trained from LLM-generated labels can outperform the teacher within only hundreds of annotated examples.
arXiv Detail & Related papers (2023-10-30T14:54:15Z) - MLLM-DataEngine: An Iterative Refinement Approach for MLLM [62.30753425449056]
We propose a novel closed-loop system that bridges data generation, model training, and evaluation.
Within each loop, the MLLM-DataEngine first analyze the weakness of the model based on the evaluation results.
For targeting, we propose an Adaptive Bad-case Sampling module, which adjusts the ratio of different types of data.
For quality, we resort to GPT-4 to generate high-quality data with each given data type.
arXiv Detail & Related papers (2023-08-25T01:41:04Z) - AnnoLLM: Making Large Language Models to Be Better Crowdsourced Annotators [98.11286353828525]
GPT-3.5 series models have demonstrated remarkable few-shot and zero-shot ability across various NLP tasks.
We propose AnnoLLM, which adopts a two-step approach, explain-then-annotate.
We build the first conversation-based information retrieval dataset employing AnnoLLM.
arXiv Detail & Related papers (2023-03-29T17:03:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.