Related papers: FuseGen: PLM Fusion for Data-generation based Zero-shot Learning

FuseGen: PLM Fusion for Data-generation based Zero-shot Learning

URL: http://arxiv.org/abs/2406.12527v1
Date: Tue, 18 Jun 2024 11:55:05 GMT
Title: FuseGen: PLM Fusion for Data-generation based Zero-shot Learning
Authors: Tianyuan Zou, Yang Liu, Peng Li, Jianqing Zhang, Jingjing Liu, Ya-Qin Zhang,
Abstract summary: FuseGen is a novel data generation-based zero-shot learning framework. It introduces a new criteria for subset selection from synthetic datasets. The chosen subset provides in-context feedback to each PLM, enhancing dataset quality.
Score: 18.51772808242954
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Data generation-based zero-shot learning, although effective in training Small Task-specific Models (STMs) via synthetic datasets generated by Pre-trained Language Models (PLMs), is often limited by the low quality of such synthetic datasets. Previous solutions have primarily focused on single PLM settings, where synthetic datasets are typically restricted to specific sub-spaces and often deviate from real-world distributions, leading to severe distribution bias. To mitigate such bias, we propose FuseGen, a novel data generation-based zero-shot learning framework that introduces a new criteria for subset selection from synthetic datasets via utilizing multiple PLMs and trained STMs. The chosen subset provides in-context feedback to each PLM, enhancing dataset quality through iterative data generation. Trained STMs are then used for sample re-weighting as well, further improving data quality. Extensive experiments across diverse tasks demonstrate that FuseGen substantially outperforms existing methods, highly effective in boosting STM performance in a PLM-agnostic way. Code is provided in https://github.com/LindaLydia/FuseGen.

Related papers

Few-shot LLM Synthetic Data with Distribution Matching [37.55363714371521]
Large language models (LLMs) produce high-quality synthetic data to enhance the performance of smaller models. LLMs-generated synthetic data often differs from the real data in key language attributes. We introduce SynAlign: a synthetic data generation and filtering framework based on key attribute distribution matching.
arXiv Detail & Related papers (2025-02-09T16:43:32Z)
SampleLLM: Optimizing Tabular Data Synthesis in Recommendations [46.689486044254544]
Tabular data synthesis is crucial in machine learning, yet existing general methods are highly data-dependent and often fall short in recommender systems. This limitation arises from their difficulty in capturing complex distributions and understanding feature relationships from sparse and limited data. We propose a novel two-stage framework named SampleLLM to improve the quality of LLM-based data synthesis for recommendation tasks.
arXiv Detail & Related papers (2025-01-27T15:12:27Z)
Evaluating Language Models as Synthetic Data Generators [74.80905172696366]
AgoraBench is a benchmark that provides standardized settings and metrics to evaluate LMs' data generation abilities. Through synthesizing 1.26 million training instances using 6 LMs and training 99 student models, we uncover key insights about LMs' data generation capabilities.
arXiv Detail & Related papers (2024-12-04T19:20:32Z)
Generating Realistic Tabular Data with Large Language Models [49.03536886067729]
Large language models (LLM) have been used for diverse tasks, but do not capture the correct correlation between the features and the target variable. We propose a LLM-based method with three important improvements to correctly capture the ground-truth feature-class correlation in the real data. Our experiments show that our method significantly outperforms 10 SOTA baselines on 20 datasets in downstream tasks.
arXiv Detail & Related papers (2024-10-29T04:14:32Z)
Forewarned is Forearmed: Leveraging LLMs for Data Synthesis through Failure-Inducing Exploration [90.41908331897639]
Large language models (LLMs) have significantly benefited from training on diverse, high-quality task-specific data. We present a novel approach, ReverseGen, designed to automatically generate effective training samples.
arXiv Detail & Related papers (2024-10-22T06:43:28Z)
Efficacy of Synthetic Data as a Benchmark [3.2968976262860408]
We investigate the effectiveness of generating synthetic data through large language models (LLMs) Our experiments show that while synthetic data can effectively capture performance of various methods for simpler tasks, it falls short for more complex tasks like named entity recognition. We propose a new metric called the bias factor, which evaluates the biases introduced when the same LLM is used to both generate benchmarking data and to perform the tasks.
arXiv Detail & Related papers (2024-09-18T13:20:23Z)
Entropy Law: The Story Behind Data Compression and LLM Performance [115.70395740286422]
We find that model performance is negatively correlated to the compression ratio of training data, which usually yields a lower training loss. Based on the findings of the entropy law, we propose a quite efficient and universal data selection method. We also present an interesting application of entropy law that can detect potential performance risks at the beginning of model training.
arXiv Detail & Related papers (2024-07-09T08:14:29Z)
Differentially Private Tabular Data Synthesis using Large Language Models [6.6376578496141585]
This paper introduces DP-LLMTGen -- a novel framework for differentially private tabular data synthesis. DP-LLMTGen models sensitive datasets using a two-stage fine-tuning procedure. It generates synthetic data through sampling the fine-tuned LLMs.
arXiv Detail & Related papers (2024-06-03T15:43:57Z)
Retrieval-Augmented Data Augmentation for Low-Resource Domain Tasks [66.87070857705994]
In low-resource settings, the amount of seed data samples to use for data augmentation is very small. We propose a novel method that augments training data by incorporating a wealth of examples from other datasets. This approach can ensure that the generated data is not only relevant but also more diverse than what could be achieved using the limited seed data alone.
arXiv Detail & Related papers (2024-02-21T02:45:46Z)
Improving Text Embeddings with Large Language Models [59.930513259982725]
We introduce a novel and simple method for obtaining high-quality text embeddings using only synthetic data and less than 1k training steps. We leverage proprietary LLMs to generate diverse synthetic data for hundreds of thousands of text embedding tasks across 93 languages. Experiments demonstrate that our method achieves strong performance on highly competitive text embedding benchmarks without using any labeled data.
arXiv Detail & Related papers (2023-12-31T02:13:18Z)
Curated LLM: Synergy of LLMs and Data Curation for tabular augmentation in low-data regimes [57.62036621319563]
We introduce CLLM, which leverages the prior knowledge of Large Language Models (LLMs) for data augmentation in the low-data regime. We demonstrate the superior performance of CLLM in the low-data regime compared to conventional generators.
arXiv Detail & Related papers (2023-12-19T12:34:46Z)
ProGen: Progressive Zero-shot Dataset Generation via In-context Feedback [21.168991554983815]
We propose a progressive zero-shot dataset generation framework, ProGen, to guide the generation of new training data. We show ProGen achieves on-par or superior performance with only 1% synthetic dataset size.
arXiv Detail & Related papers (2022-10-22T02:07:10Z)

This list is automatically generated from the titles and abstracts of the papers in this site.