Synthetic Data Generation in Low-Resource Settings via Fine-Tuning of
Large Language Models
- URL: http://arxiv.org/abs/2310.01119v2
- Date: Mon, 8 Jan 2024 13:09:24 GMT
- Title: Synthetic Data Generation in Low-Resource Settings via Fine-Tuning of
Large Language Models
- Authors: Jean Kaddour, Qi Liu
- Abstract summary: Large language models can generalize to novel downstream tasks with relatively few labeled examples.
Alternatively, smaller models can solve specific tasks if fine-tuned with enough labeled examples.
We study generating synthetic fine-tuning data with fine-tuned teacher LLMs to improve the downstream performance of much smaller models.
- Score: 15.991777903345575
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The in-context learning ability of large language models (LLMs) enables them
to generalize to novel downstream tasks with relatively few labeled examples.
However, they require enormous computational resources to be deployed.
Alternatively, smaller models can solve specific tasks if fine-tuned with
enough labeled examples. These examples, however, are expensive to obtain. In
pursuit of the best of both worlds, we study generating synthetic fine-tuning
data with fine-tuned teacher LLMs to improve the downstream performance of much
smaller models. Across four text classification and two text generation tasks,
we find that both data generation and data annotation dramatically improve the
respective downstream model's performance, sometimes requiring only a small
fraction of the original training dataset.
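The recipe the abstract describes can be illustrated with a short sketch: a teacher LM generates class-conditioned synthetic examples, and a much smaller student classifier is then fine-tuned on them. This is a minimal illustration, not the paper's code: the model names, prompt, and label set are assumptions, and the teacher fine-tuning and annotation steps the paper also studies are omitted for brevity.

```python
# Minimal sketch (not the paper's exact pipeline): a teacher LM generates
# class-conditioned synthetic examples, then a much smaller classifier is
# fine-tuned on them. Model names, prompt, and labels are assumptions.
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorWithPadding, Trainer, TrainingArguments,
                          pipeline)
from datasets import Dataset

LABELS = ["negative", "positive"]  # assumed binary sentiment task

# 1) Teacher: in the paper the teacher LLM is first fine-tuned on the few
#    available labeled examples; here an off-the-shelf LM is used for brevity.
teacher = pipeline("text-generation", model="gpt2")

texts, labels = [], []
for label_id, label_name in enumerate(LABELS):
    prompt = f"Write a {label_name} movie review:\n"
    outputs = teacher(prompt, max_new_tokens=40, do_sample=True,
                      temperature=0.9, num_return_sequences=8)
    for out in outputs:
        texts.append(out["generated_text"][len(prompt):].strip())
        labels.append(label_id)

synthetic = Dataset.from_dict({"text": texts, "label": labels})

# 2) Student: a much smaller model fine-tuned only on the synthetic data.
student_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(student_name)
student = AutoModelForSequenceClassification.from_pretrained(
    student_name, num_labels=len(LABELS))

tokenized = synthetic.map(lambda ex: tokenizer(ex["text"], truncation=True),
                          batched=True)

trainer = Trainer(
    model=student,
    args=TrainingArguments(output_dir="student-ckpt", num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=tokenized,
    data_collator=DataCollatorWithPadding(tokenizer),
)
trainer.train()
```

In the paper, the teacher is fine-tuned before generation and is also used to annotate unlabeled text; both variants are studied, and only the generation path is sketched here.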
Related papers
- Generating Realistic Tabular Data with Large Language Models [49.03536886067729]
Large language models (LLMs) have been used for diverse tasks, but they often fail to capture the correct correlation between the features and the target variable.
We propose an LLM-based method with three key improvements that correctly captures the ground-truth feature-class correlation in the real data.
Our experiments show that our method significantly outperforms 10 SOTA baselines on 20 datasets in downstream tasks.
arXiv Detail & Related papers (2024-10-29T04:14:32Z)
- Little Giants: Synthesizing High-Quality Embedding Data at Scale [71.352883755806]
We introduce SPEED, a framework that aligns open-source small models to efficiently generate large-scale embedding data.
SPEED uses less than 1/10 of the GPT API calls while outperforming the state-of-the-art embedding model E5_mistral when both are trained solely on their synthetic data.
arXiv Detail & Related papers (2024-10-24T10:47:30Z)
- Forewarned is Forearmed: Leveraging LLMs for Data Synthesis through Failure-Inducing Exploration [90.41908331897639]
Large language models (LLMs) have significantly benefited from training on diverse, high-quality task-specific data.
We present a novel approach, ReverseGen, designed to automatically generate effective training samples.
arXiv Detail & Related papers (2024-10-22T06:43:28Z)
- Unlocking the Potential of Model Merging for Low-Resource Languages [66.7716891808697]
Adapting large language models to new languages typically involves continual pre-training (CT) followed by supervised fine-tuning (SFT).
We propose model merging as an alternative for low-resource languages, combining models with distinct capabilities into a single model without additional training.
Experiments based on Llama-2-7B demonstrate that model merging effectively endows LLMs for low-resource languages with task-solving abilities, outperforming CT-then-SFT in scenarios with extremely scarce data.
arXiv Detail & Related papers (2024-07-04T15:14:17Z)
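The merging idea summarized above can be illustrated with the simplest scheme, element-wise weight averaging of two checkpoints that share an architecture. The paper's actual merging method may be more sophisticated, and the donor checkpoint path below is a hypothetical placeholder; only the Llama-2-7B base is grounded in the summary.

```python
# Minimal sketch of naive weight averaging, one common model-merging scheme;
# the paper may use a different method. The donor path is a placeholder.
import torch
from transformers import AutoModelForCausalLM

BASE = "meta-llama/Llama-2-7b-hf"              # general-purpose base (as in the experiments)
DONOR = "path/to/low-resource-language-model"  # hypothetical language-adapted checkpoint
ALPHA = 0.5                                    # interpolation weight

base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.float16)
donor = AutoModelForCausalLM.from_pretrained(DONOR, torch_dtype=torch.float16)

donor_state = donor.state_dict()
merged_state = {}
for name, base_param in base.state_dict().items():
    # Element-wise linear interpolation of corresponding parameters
    # (both checkpoints must share the same architecture).
    merged_state[name] = (1 - ALPHA) * base_param + ALPHA * donor_state[name]

base.load_state_dict(merged_state)
base.save_pretrained("merged-model")
```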
- Retrieval-Augmented Data Augmentation for Low-Resource Domain Tasks [66.87070857705994]
In low-resource settings, only a small number of seed samples is available for data augmentation.
We propose a novel method that augments training data by incorporating a wealth of examples from other datasets.
This approach ensures that the generated data is not only relevant but also more diverse than what could be achieved using the limited seed data alone.
arXiv Detail & Related papers (2024-02-21T02:45:46Z)
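As a rough illustration of retrieval-based augmentation (not necessarily the paper's method), one can score an external pool of examples by similarity to the seed data and add the nearest neighbors to the training set; all texts below are made-up placeholders and TF-IDF stands in for whatever retriever the paper uses.

```python
# Rough illustration of retrieval-based augmentation (not the paper's method):
# retrieve pool examples most similar to the seed data and add them to the
# training set. All texts are placeholders.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

seed_texts = ["the battery drains quickly", "screen is too dim outdoors"]
external_pool = [
    "the phone overheats while charging",
    "great pasta recipe for weeknights",
    "speaker volume is far too low",
    "how to train a puppy to sit",
]

vectorizer = TfidfVectorizer().fit(seed_texts + external_pool)
seed_vecs = vectorizer.transform(seed_texts)
pool_vecs = vectorizer.transform(external_pool)

# Score each pool example by its best similarity to any seed example,
# then keep the top-k as additional training data.
scores = cosine_similarity(pool_vecs, seed_vecs).max(axis=1)
top_k = 2
augmented = [external_pool[i] for i in np.argsort(scores)[::-1][:top_k]]
print(augmented)  # the two domain-relevant complaints are retrieved
```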
- A synthetic data approach for domain generalization of NLI models [13.840374911669167]
Natural Language Inference (NLI) remains an important benchmark task for LLMs.
We show that synthetic high-quality datasets can adapt NLI models for zero-shot use in downstream applications.
Models trained on this data achieve the best generalization to completely new downstream test settings.
arXiv Detail & Related papers (2024-02-19T18:55:16Z)
- Let's Synthesize Step by Step: Iterative Dataset Synthesis with Large Language Models by Extrapolating Errors from Small Models [69.76066070227452]
*Data Synthesis* is a promising way to train a small model with very little labeled data.
We propose *Synthesis Step by Step* (**S3**), a data synthesis framework that shrinks the distribution gap between synthetic and real data.
Our approach improves the performance of a small model by iteratively reducing that gap.
arXiv Detail & Related papers (2023-10-20T17:14:25Z)
- Generating Datasets with Pretrained Language Models [12.919486518128734]
We show how large language models can be leveraged to obtain high-quality embeddings without requiring labeled data, finetuning or modifications to the pretraining objective.
We utilize the generative abilities of PLMs to generate entire datasets of labeled text pairs from scratch, which can then be used for regular finetuning of much smaller models.
arXiv Detail & Related papers (2021-04-15T15:51:41Z)
- Generation-Distillation for Efficient Natural Language Understanding in Low-Data Settings [5.929956715430167]
Transfer learning with large-scale language models (LM) has led to dramatic performance improvements across a broad range of natural language understanding tasks.
The size and memory footprint of these large LMs makes them difficult to deploy in many scenarios.
Recent research points to knowledge distillation as a potential solution, showing that when training data for a given task is abundant, it is possible to distill a large (teacher) LM into a small task-specific (student) network with minimal loss of performance.
arXiv Detail & Related papers (2020-01-25T08:20:46Z)
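The distillation step mentioned above can be sketched with the standard temperature-scaled objective that mixes a KL term against the teacher's softened outputs with the usual cross-entropy on hard labels. This is a generic distillation loss, not necessarily the paper's exact generation-distillation objective; all tensors below are toy placeholders.

```python
# Generic temperature-scaled distillation loss (a sketch, not necessarily the
# paper's exact generation-distillation objective): the student matches the
# teacher's softened output distribution in addition to the hard labels.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: KL divergence between softened teacher and student outputs.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: ordinary cross-entropy on the task labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy usage with random logits for a 3-class task.
student_logits = torch.randn(4, 3)
teacher_logits = torch.randn(4, 3)
labels = torch.tensor([0, 2, 1, 0])
print(distillation_loss(student_logits, teacher_logits, labels))
```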