Under the Surface: Tracking the Artifactuality of LLM-Generated Data
- URL: http://arxiv.org/abs/2401.14698v2
- Date: Tue, 30 Jan 2024 05:36:06 GMT
- Title: Under the Surface: Tracking the Artifactuality of LLM-Generated Data
- Authors: Debarati Das, Karin De Langis, Anna Martin-Boyle, Jaehyung Kim, Minhwa
Lee, Zae Myung Kim, Shirley Anugrah Hayati, Risako Owan, Bin Hu, Ritik
Parkar, Ryan Koo, Jonginn Park, Aahan Tyagi, Libby Ferland, Sanjali Roy,
Vincent Liu, and Dongyeop Kang
- Abstract summary: This work delves into the expanding role of large language models (LLMs) in generating artificial data.
To the best of our knowledge, this is the first study to aggregate various types of LLM-generated text data.
Despite artificial data's capability to match human performance, this paper reveals significant hidden disparities.
- Score: 21.002983022237604
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This work delves into the expanding role of large language models (LLMs) in
generating artificial data. LLMs are increasingly employed to create a variety
of outputs, including annotations, preferences, instruction prompts, simulated
dialogues, and free text. As these forms of LLM-generated data often intersect
in their application, they exert mutual influence on each other and raise
significant concerns about the quality and diversity of the artificial data
incorporated into training cycles, leading to an artificial data ecosystem. To
the best of our knowledge, this is the first study to aggregate various types
of LLM-generated text data, from more tightly constrained data like "task
labels" to more lightly constrained "free-form text". We then stress test the
quality and implications of LLM-generated artificial data, comparing it with
human data across various existing benchmarks. Despite artificial data's
capability to match human performance, this paper reveals significant hidden
disparities, especially in complex tasks where LLMs often miss the nuanced
understanding of intrinsic human-generated content. This study critically
examines diverse LLM-generated data and emphasizes the need for ethical
practices in data creation and when using LLMs. It highlights the LLMs'
shortcomings in replicating human traits and behaviors, underscoring the
importance of addressing biases and artifacts produced in LLM-generated content
for future research and development. All data and code are available on our
project page.
Related papers
- SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts.
We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM.
We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
arXiv Detail & Related papers (2024-07-16T04:41:58Z)
- Regurgitative Training: The Value of Real Data in Training Large Language Models [1.2815904071470703]
We evaluate the implications of "regurgitative training" on LLM performance.
We find strong evidence that regurgitative training clearly handicaps the performance of LLMs.
We propose and evaluate three different strategies to mitigate the performance loss of regurgitative training.
arXiv Detail & Related papers (2024-07-03T18:42:55Z)
- DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning Graph [70.79413606968814]
We introduce Dynamic Evaluation of LLMs via Adaptive Reasoning Graph Evolvement (DARG) to dynamically extend current benchmarks with controlled complexity and diversity.
Specifically, we first extract the reasoning graphs of data points in current benchmarks and then perturb the reasoning graphs to generate novel testing data.
Such newly generated test samples can have different levels of complexity while maintaining linguistic diversity similar to the original benchmarks.
arXiv Detail & Related papers (2024-06-25T04:27:53Z)
- Cross-Data Knowledge Graph Construction for LLM-enabled Educational Question-Answering System: A Case Study at HCMUT [2.8000537365271367]
Large language models (LLMs) have emerged as a vibrant research topic.
LLMs face challenges in remembering events, incorporating new information, and addressing domain-specific issues or hallucinations.
This article proposes a method for automatically constructing a Knowledge Graph from multiple data sources.
arXiv Detail & Related papers (2024-04-14T16:34:31Z)
- CodecLM: Aligning Language Models with Tailored Synthetic Data [51.59223474427153]
We introduce CodecLM, a framework for adaptively generating high-quality synthetic data for instruction-following abilities.
We first encode seed instructions into metadata, which are concise keywords generated on-the-fly to capture the target instruction distribution.
We also introduce Self-Rubrics and Contrastive Filtering during decoding to tailor data-efficient samples.
arXiv Detail & Related papers (2024-04-08T21:15:36Z)
- Unsupervised Information Refinement Training of Large Language Models for Retrieval-Augmented Generation [128.01050030936028]
We propose an information refinement training method named InFO-RAG.
InFO-RAG is low-cost and general across various tasks.
It improves the performance of LLaMA2 by an average of 9.39% relative points.
arXiv Detail & Related papers (2024-02-28T08:24:38Z)
- Curated LLM: Synergy of LLMs and Data Curation for tabular augmentation in low-data regimes [57.62036621319563]
We introduce CLLM, which leverages the prior knowledge of Large Language Models (LLMs) for data augmentation in the low-data regime.
We demonstrate the superior performance of CLLM in the low-data regime compared to conventional generators.
arXiv Detail & Related papers (2023-12-19T12:34:46Z)
- LLM-in-the-loop: Leveraging Large Language Model for Thematic Analysis [18.775126929754833]
Thematic analysis (TA) has been widely used for analyzing qualitative data in many disciplines and fields.
Human coders develop and deepen their data interpretation and coding over multiple iterations, making TA labor-intensive and time-consuming.
We propose a human-LLM collaboration framework (i.e., LLM-in-the-loop) to conduct TA with in-context learning (ICL).
arXiv Detail & Related papers (2023-10-23T17:05:59Z)
- CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages [86.90220551111096]
Training datasets for large language models (LLMs) are often not fully disclosed.
We present CulturaX, a substantial multilingual dataset with 6.3 trillion tokens in 167 languages.
arXiv Detail & Related papers (2023-09-17T23:49:10Z)
- Large Language Models as Data Preprocessors [10.914067455923847]
Large Language Models (LLMs), typified by OpenAI's GPT series and Meta's LLaMA variants, have marked a significant advancement in artificial intelligence.
This study expands on the applications of LLMs, exploring their potential in data preprocessing.
We propose an LLM-based framework for data preprocessing, which integrates cutting-edge prompt engineering techniques.
arXiv Detail & Related papers (2023-08-30T23:28:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.