Related papers: Augmenting Human-Annotated Training Data with Large Language Model Generation and Distillation in Open-Response Assessment

Augmenting Human-Annotated Training Data with Large Language Model Generation and Distillation in Open-Response Assessment

URL: http://arxiv.org/abs/2501.09126v1
Date: Wed, 15 Jan 2025 20:13:46 GMT
Title: Augmenting Human-Annotated Training Data with Large Language Model Generation and Distillation in Open-Response Assessment
Authors: Conrad Borchers, Danielle R. Thomas, Jionghao Lin, Ralph Abboud, Kenneth R. Koedinger,
Abstract summary: Large Language Models (LLMs) can help automate text classification tasks at low cost and scale.<n>By contrast, human coding is generally more reliable but expensive to procure at scale.<n>We propose a hybrid solution to leverage the strengths of both.
Score: 4.788487793976781
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models (LLMs) like GPT-4o can help automate text classification tasks at low cost and scale. However, there are major concerns about the validity and reliability of LLM outputs. By contrast, human coding is generally more reliable but expensive to procure at scale. In this study, we propose a hybrid solution to leverage the strengths of both. We combine human-coded data and synthetic LLM-produced data to fine-tune a classical machine learning classifier, distilling both into a smaller BERT model. We evaluate our method on a human-coded test set as a validity measure for LLM output quality. In three experiments, we systematically vary LLM-generated samples' size, variety, and consistency, informed by best practices in LLM tuning. Our findings indicate that augmenting datasets with synthetic samples improves classifier performance, with optimal results achieved at an 80% synthetic to 20% human-coded data ratio. Lower temperature settings of 0.3, corresponding to less variability in LLM generations, produced more stable improvements but also limited model learning from augmented samples. In contrast, higher temperature settings (0.7 and above) introduced greater variability in performance estimates and, at times, lower performance. Hence, LLMs may produce more uniform output that classifiers overfit to earlier or produce more diverse output that runs the risk of deteriorating model performance through information irrelevant to the prediction task. Filtering out inconsistent synthetic samples did not enhance performance. We conclude that integrating human and LLM-generated data to improve text classification models in assessment offers a scalable solution that leverages both the accuracy of human coding and the variety of LLM outputs.

Related papers

Synthesizing and Adapting Error Correction Data for Mobile Large Language Model Applications [5.713077600587505]
We use large language models (LLMs) to synthesize a high-quality dataset of error correction pairs.<n>We then adapt the synthetic data distribution to match the mobile application domain by reweighting the samples.<n>We present best practices for mixing our synthetic data with other data sources to improve model performance on error correction in both offline evaluation and production live A/B testing.
arXiv Detail & Related papers (2025-05-24T03:27:20Z)
Few-shot LLM Synthetic Data with Distribution Matching [37.55363714371521]
Large language models (LLMs) produce high-quality synthetic data to enhance the performance of smaller models. LLMs-generated synthetic data often differs from the real data in key language attributes. We introduce SynAlign: a synthetic data generation and filtering framework based on key attribute distribution matching.
arXiv Detail & Related papers (2025-02-09T16:43:32Z)
Learning to Verify Summary Facts with Fine-Grained LLM Feedback [15.007479147796403]
Training automatic summary fact verifiers often faces the challenge of a lack of human-labeled data. We introduce FineSumFact, a large-scale dataset containing fine-grained factual feedback on summaries.
arXiv Detail & Related papers (2024-12-14T05:28:44Z)
Data Quality Enhancement on the Basis of Diversity with Large Language Models for Text Classification: Uncovered, Difficult, and Noisy [5.225010551503337]
This paper proposes a data quality enhancement (DQE) method for text classification based on large language models (LLMs) Experimental results demonstrate that our method effectively enhances the performance of LLMs in text classification tasks. Our method has achieved state-of-the-art performance in several open-source classification tasks.
arXiv Detail & Related papers (2024-12-09T15:28:39Z)
Star-Agents: Automatic Data Optimization with LLM Agents for Instruction Tuning [71.2981957820888]
We propose a novel Star-Agents framework, which automates the enhancement of data quality across datasets. The framework initially generates diverse instruction data with multiple LLM agents through a bespoke sampling method. The generated data undergo a rigorous evaluation using a dual-model method that assesses both difficulty and quality.
arXiv Detail & Related papers (2024-11-21T02:30:53Z)
Entropy Law: The Story Behind Data Compression and LLM Performance [115.70395740286422]
We find that model performance is negatively correlated to the compression ratio of training data, which usually yields a lower training loss. Based on the findings of the entropy law, we propose a quite efficient and universal data selection method. We also present an interesting application of entropy law that can detect potential performance risks at the beginning of model training.
arXiv Detail & Related papers (2024-07-09T08:14:29Z)
Regurgitative Training: The Value of Real Data in Training Large Language Models [1.2815904071470703]
We evaluate the implications of "regurgitative training" on LLM performance. We find strong evidence that regurgitative training clearly handicaps the performance of LLMs. We propose and evaluate three different strategies to mitigate the performance loss of regurgitative training.
arXiv Detail & Related papers (2024-07-03T18:42:55Z)
DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning Graph [70.79413606968814]
We introduce Dynamic Evaluation of LLMs via Adaptive Reasoning Graph Evolvement (DARG) to dynamically extend current benchmarks with controlled complexity and diversity. Specifically, we first extract the reasoning graphs of data points in current benchmarks and then perturb the reasoning graphs to generate novel testing data. Such newly generated test samples can have different levels of complexity while maintaining linguistic diversity similar to the original benchmarks.
arXiv Detail & Related papers (2024-06-25T04:27:53Z)
Illuminating Blind Spots of Language Models with Targeted Agent-in-the-Loop Synthetic Data [9.982616173090264]
Language models (LMs) have achieved impressive accuracy across a variety of tasks but remain vulnerable to high-confidence misclassifications (UUs) UUs cluster into blind spots in the feature space, leading to significant risks in high-stakes applications. We propose a novel approach to address blind spot mitigation through the use of intelligent agents as teachers to characterize UU-type errors.
arXiv Detail & Related papers (2024-03-26T16:49:25Z)
Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models [52.98743860365194]
We propose a new fine-tuning method called Self-Play fIne-tuNing (SPIN) At the heart of SPIN lies a self-play mechanism, where the LLM refines its capability by playing against instances of itself. This sheds light on the promise of self-play, enabling the achievement of human-level performance in LLMs without the need for expert opponents.
arXiv Detail & Related papers (2024-01-02T18:53:13Z)
MLLM-DataEngine: An Iterative Refinement Approach for MLLM [62.30753425449056]
We propose a novel closed-loop system that bridges data generation, model training, and evaluation. Within each loop, the MLLM-DataEngine first analyze the weakness of the model based on the evaluation results. For targeting, we propose an Adaptive Bad-case Sampling module, which adjusts the ratio of different types of data. For quality, we resort to GPT-4 to generate high-quality data with each given data type.
arXiv Detail & Related papers (2023-08-25T01:41:04Z)
From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning [52.257422715393574]
We introduce a self-guided methodology for Large Language Models (LLMs) to autonomously discern and select cherry samples from open-source datasets. Our key innovation, the Instruction-Following Difficulty (IFD) metric, emerges as a pivotal metric to identify discrepancies between a model's expected responses and its intrinsic generation capability.
arXiv Detail & Related papers (2023-08-23T09:45:29Z)

This list is automatically generated from the titles and abstracts of the papers in this site.