FuseGen: PLM Fusion for Data-generation based Zero-shot Learning
- URL: http://arxiv.org/abs/2406.12527v1
- Date: Tue, 18 Jun 2024 11:55:05 GMT
- Title: FuseGen: PLM Fusion for Data-generation based Zero-shot Learning
- Authors: Tianyuan Zou, Yang Liu, Peng Li, Jianqing Zhang, Jingjing Liu, Ya-Qin Zhang,
- Abstract summary: FuseGen is a novel data generation-based zero-shot learning framework.
It introduces a new criteria for subset selection from synthetic datasets.
The chosen subset provides in-context feedback to each PLM, enhancing dataset quality.
- Score: 18.51772808242954
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Data generation-based zero-shot learning, although effective in training Small Task-specific Models (STMs) via synthetic datasets generated by Pre-trained Language Models (PLMs), is often limited by the low quality of such synthetic datasets. Previous solutions have primarily focused on single PLM settings, where synthetic datasets are typically restricted to specific sub-spaces and often deviate from real-world distributions, leading to severe distribution bias. To mitigate such bias, we propose FuseGen, a novel data generation-based zero-shot learning framework that introduces a new criteria for subset selection from synthetic datasets via utilizing multiple PLMs and trained STMs. The chosen subset provides in-context feedback to each PLM, enhancing dataset quality through iterative data generation. Trained STMs are then used for sample re-weighting as well, further improving data quality. Extensive experiments across diverse tasks demonstrate that FuseGen substantially outperforms existing methods, highly effective in boosting STM performance in a PLM-agnostic way. Code is provided in https://github.com/LindaLydia/FuseGen.
Related papers
- Few-shot LLM Synthetic Data with Distribution Matching [37.55363714371521]
Large language models (LLMs) produce high-quality synthetic data to enhance the performance of smaller models.
LLMs-generated synthetic data often differs from the real data in key language attributes.
We introduce SynAlign: a synthetic data generation and filtering framework based on key attribute distribution matching.
arXiv Detail & Related papers (2025-02-09T16:43:32Z) - SampleLLM: Optimizing Tabular Data Synthesis in Recommendations [46.689486044254544]
Tabular data synthesis is crucial in machine learning, yet existing general methods are highly data-dependent and often fall short in recommender systems.
This limitation arises from their difficulty in capturing complex distributions and understanding feature relationships from sparse and limited data.
We propose a novel two-stage framework named SampleLLM to improve the quality of LLM-based data synthesis for recommendation tasks.
arXiv Detail & Related papers (2025-01-27T15:12:27Z) - Evaluating Language Models as Synthetic Data Generators [74.80905172696366]
AgoraBench is a benchmark that provides standardized settings and metrics to evaluate LMs' data generation abilities.
Through synthesizing 1.26 million training instances using 6 LMs and training 99 student models, we uncover key insights about LMs' data generation capabilities.
arXiv Detail & Related papers (2024-12-04T19:20:32Z) - Generating Realistic Tabular Data with Large Language Models [49.03536886067729]
Large language models (LLM) have been used for diverse tasks, but do not capture the correct correlation between the features and the target variable.
We propose a LLM-based method with three important improvements to correctly capture the ground-truth feature-class correlation in the real data.
Our experiments show that our method significantly outperforms 10 SOTA baselines on 20 datasets in downstream tasks.
arXiv Detail & Related papers (2024-10-29T04:14:32Z) - Forewarned is Forearmed: Leveraging LLMs for Data Synthesis through Failure-Inducing Exploration [90.41908331897639]
Large language models (LLMs) have significantly benefited from training on diverse, high-quality task-specific data.
We present a novel approach, ReverseGen, designed to automatically generate effective training samples.
arXiv Detail & Related papers (2024-10-22T06:43:28Z) - Entropy Law: The Story Behind Data Compression and LLM Performance [115.70395740286422]
We find that model performance is negatively correlated to the compression ratio of training data, which usually yields a lower training loss.
Based on the findings of the entropy law, we propose a quite efficient and universal data selection method.
We also present an interesting application of entropy law that can detect potential performance risks at the beginning of model training.
arXiv Detail & Related papers (2024-07-09T08:14:29Z) - Differentially Private Tabular Data Synthesis using Large Language Models [6.6376578496141585]
This paper introduces DP-LLMTGen -- a novel framework for differentially private tabular data synthesis.
DP-LLMTGen models sensitive datasets using a two-stage fine-tuning procedure.
It generates synthetic data through sampling the fine-tuned LLMs.
arXiv Detail & Related papers (2024-06-03T15:43:57Z) - Retrieval-Augmented Data Augmentation for Low-Resource Domain Tasks [66.87070857705994]
In low-resource settings, the amount of seed data samples to use for data augmentation is very small.
We propose a novel method that augments training data by incorporating a wealth of examples from other datasets.
This approach can ensure that the generated data is not only relevant but also more diverse than what could be achieved using the limited seed data alone.
arXiv Detail & Related papers (2024-02-21T02:45:46Z) - Curated LLM: Synergy of LLMs and Data Curation for tabular augmentation in low-data regimes [57.62036621319563]
We introduce CLLM, which leverages the prior knowledge of Large Language Models (LLMs) for data augmentation in the low-data regime.
We demonstrate the superior performance of CLLM in the low-data regime compared to conventional generators.
arXiv Detail & Related papers (2023-12-19T12:34:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.