Under the Surface: Tracking the Artifactuality of LLM-Generated Data
- URL: http://arxiv.org/abs/2401.14698v2
- Date: Tue, 30 Jan 2024 05:36:06 GMT
- Title: Under the Surface: Tracking the Artifactuality of LLM-Generated Data
- Authors: Debarati Das, Karin De Langis, Anna Martin-Boyle, Jaehyung Kim, Minhwa Lee, Zae Myung Kim, Shirley Anugrah Hayati, Risako Owan, Bin Hu, Ritik Parkar, Ryan Koo, Jonginn Park, Aahan Tyagi, Libby Ferland, Sanjali Roy, Vincent Liu, and Dongyeop Kang
- Abstract summary: This work delves into the expanding role of large language models (LLMs) in generating artificial data.
To the best of our knowledge, this is the first study to aggregate various types of LLM-generated text data.
Despite artificial data's capability to match human performance, this paper reveals significant hidden disparities.
- Score: 21.002983022237604
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This work delves into the expanding role of large language models (LLMs) in
generating artificial data. LLMs are increasingly employed to create a variety
of outputs, including annotations, preferences, instruction prompts, simulated
dialogues, and free text. As these forms of LLM-generated data often intersect
in their application, they exert mutual influence on each other and raise
significant concerns about the quality and diversity of the artificial data
incorporated into training cycles, leading to an artificial data ecosystem. To
the best of our knowledge, this is the first study to aggregate various types
of LLM-generated text data, from more tightly constrained data like "task
labels" to more lightly constrained "free-form text". We then stress test the
quality and implications of LLM-generated artificial data, comparing it with
human data across various existing benchmarks. Despite artificial data's
capability to match human performance, this paper reveals significant hidden
disparities, especially in complex tasks where LLMs often miss the nuanced
understanding of intrinsic human-generated content. This study critically
examines diverse LLM-generated data and emphasizes the need for ethical
practices in data creation and when using LLMs. It highlights the LLMs'
shortcomings in replicating human traits and behaviors, underscoring the
importance of addressing biases and artifacts produced in LLM-generated content
for future research and development. All data and code are available on our
project page.
Related papers
- Forewarned is Forearmed: Leveraging LLMs for Data Synthesis through Failure-Inducing Exploration [90.41908331897639]
Large language models (LLMs) have significantly benefited from training on diverse, high-quality task-specific data.
We present a novel approach, ReverseGen, designed to automatically generate effective training samples.
arXiv Detail & Related papers (2024-10-22T06:43:28Z)
- Beyond Binary: Towards Fine-Grained LLM-Generated Text Detection via Role Recognition and Involvement Measurement [51.601916604301685]
Large language models (LLMs) generate content that can undermine trust in online discourse.
Current methods often focus on binary classification, failing to address the complexities of real-world scenarios like human-AI collaboration.
To move beyond binary classification and address these challenges, we propose a new paradigm for detecting LLM-generated content.
arXiv Detail & Related papers (2024-10-18T08:14:10Z)
- LLM-PBE: Assessing Data Privacy in Large Language Models [111.58198436835036]
Large Language Models (LLMs) have become integral to numerous domains, significantly advancing applications in data management, mining, and analysis.
Despite the critical nature of this issue, there has been no existing literature to offer a comprehensive assessment of data privacy risks in LLMs.
Our paper introduces LLM-PBE, a toolkit crafted specifically for the systematic evaluation of data privacy risks in LLMs.
arXiv Detail & Related papers (2024-08-23T01:37:29Z)
- SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts.
We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM.
We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
arXiv Detail & Related papers (2024-07-16T04:41:58Z)
- Regurgitative Training: The Value of Real Data in Training Large Language Models [1.2815904071470703]
We evaluate the implications of "regurgitative training" on LLM performance.
We find strong evidence that regurgitative training clearly handicaps the performance of LLMs.
We propose and evaluate three different strategies to mitigate the performance loss of regurgitative training.
arXiv Detail & Related papers (2024-07-03T18:42:55Z)
- DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning Graph [70.79413606968814]
We introduce Dynamic Evaluation of LLMs via Adaptive Reasoning Graph Evolvement (DARG) to dynamically extend current benchmarks with controlled complexity and diversity.
Specifically, we first extract the reasoning graphs of data points in current benchmarks and then perturb the reasoning graphs to generate novel testing data.
Such newly generated test samples can have different levels of complexity while maintaining linguistic diversity similar to the original benchmarks.
arXiv Detail & Related papers (2024-06-25T04:27:53Z)
- Cross-Data Knowledge Graph Construction for LLM-enabled Educational Question-Answering System: A Case Study at HCMUT [2.8000537365271367]
Large language models (LLMs) have emerged as a vibrant research topic.
LLMs face challenges in remembering events, incorporating new information, and addressing domain-specific issues or hallucinations.
This article proposes a method for automatically constructing a Knowledge Graph from multiple data sources.
arXiv Detail & Related papers (2024-04-14T16:34:31Z)
- LLM-in-the-loop: Leveraging Large Language Model for Thematic Analysis [18.775126929754833]
Thematic analysis (TA) has been widely used for analyzing qualitative data in many disciplines and fields.
Human coders develop and deepen their data interpretation and coding over multiple iterations, making TA labor-intensive and time-consuming.
We propose a human-LLM collaboration framework (i.e., LLM-in-the-loop) to conduct TA with in-context learning (ICL).
arXiv Detail & Related papers (2023-10-23T17:05:59Z)
- CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages [86.90220551111096]
Training datasets for large language models (LLMs) are often not fully disclosed.
We present CulturaX, a substantial multilingual dataset with 6.3 trillion tokens in 167 languages.
arXiv Detail & Related papers (2023-09-17T23:49:10Z)
- Large Language Models as Data Preprocessors [9.99065004972981]
Large Language Models (LLMs) have marked a significant advancement in artificial intelligence.
This study explores their potential in data preprocessing, a critical stage in data mining and analytics applications.
We propose an LLM-based framework for data preprocessing, which integrates cutting-edge prompt engineering techniques.
arXiv Detail & Related papers (2023-08-30T23:28:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.