Related papers: Generating Realistic Tabular Data with Large Language Models

URL: http://arxiv.org/abs/2410.21717v1
Date: Tue, 29 Oct 2024 04:14:32 GMT
Title: Generating Realistic Tabular Data with Large Language Models
Authors: Dang Nguyen, Sunil Gupta, Kien Do, Thin Nguyen, Svetha Venkatesh
Abstract summary: Large language models (LLMs) have been used for tabular data generation, but existing methods do not capture the correct correlation between the features and the target variable. We propose an LLM-based method with three important improvements to correctly capture the ground-truth feature-class correlation in the real data. Our experiments show that our method significantly outperforms 10 SOTA baselines on 20 datasets in downstream tasks.
Score: 49.03536886067729
License: http://creativecommons.org/licenses/by/4.0/
Abstract: While most generative models show achievements in image data generation, few are developed for tabular data generation. Recently, due to the success of large language models (LLMs) in diverse tasks, they have also been used for tabular data generation. However, these methods do not capture the correct correlation between the features and the target variable, hindering their applications in downstream predictive tasks. To address this problem, we propose an LLM-based method with three important improvements to correctly capture the ground-truth feature-class correlation in the real data. First, we propose a novel permutation strategy for the input data in the fine-tuning phase. Second, we propose a feature-conditional sampling approach to generate synthetic samples. Finally, we generate the labels by constructing prompts based on the generated samples to query our fine-tuned LLM. Our extensive experiments show that our method significantly outperforms 10 SOTA baselines on 20 datasets in downstream tasks. It also produces highly realistic synthetic samples in terms of quality and diversity. More importantly, classifiers trained with our synthetic data can even compete with classifiers trained with the original data on half of the benchmark datasets, which is a significant achievement in tabular data generation.
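Illustration: the abstract does not spell out the permutation strategy or the prompt format, so the following is only a minimal sketch of how the first and third steps might look. It assumes rows are serialized as "feature is value" text, that the permutation shuffles feature order while keeping the target variable last, and that labels are obtained by prompting with a generated sample that stops just before the label. The function names `encode_row` and `label_prompt`, the example column names, and the omitted call to a fine-tuned LLM are all hypothetical.

```python
import random


def encode_row(row, target, rng):
    """Serialize one table row as text (assumed format: 'name is value')."""
    features = [name for name in row if name != target]
    rng.shuffle(features)  # a fresh random feature order for each training example
    parts = [f"{name} is {row[name]}" for name in features]
    # Assumption: the target is never permuted and always comes last,
    # so the model sees every feature before it predicts the class.
    parts.append(f"{target} is {row[target]}")
    return ", ".join(parts)


def label_prompt(sample, target):
    """Build a prompt from a generated sample that ends right before the label."""
    parts = [f"{name} is {value}" for name, value in sample.items()]
    return ", ".join(parts) + f", {target} is"


rng = random.Random(0)
row = {"age": 39, "education": "Bachelors", "hours": 40, "income": ">50K"}
text = encode_row(row, "income", rng)

# A synthetic sample without a label; a fine-tuned LLM (not shown here)
# would be asked to complete the prompt with the class value.
prompt = label_prompt({"age": 52, "education": "Masters", "hours": 45}, "income")
print(text)
print(prompt)
```

Under these assumptions, keeping the target at the end of every training string means the label is always conditioned on the full set of features, which is one way the feature-class correlation could be preserved.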

Related papers

Less is More: Adaptive Coverage for Synthetic Training Data [20.136698279893857]
This study introduces a novel sampling algorithm, based on the maximum coverage problem, to select a representative subset from a synthetically generated dataset. Our results demonstrate that training a classifier on this contextually sampled subset achieves superior performance compared to training on the entire dataset.
arXiv Detail & Related papers (2025-04-20T06:45:16Z)
Assessing Generative Models for Structured Data [0.0]
This paper introduces rigorous methods for assessing synthetic data against real data by looking at inter-column dependencies within the data. We find that large language models (GPT-2), both when queried via few-shot prompting, and when fine-tuned, and GAN (CTGAN) models do not produce data with dependencies that mirror the original real data.
arXiv Detail & Related papers (2025-03-26T18:19:05Z)
Scaling Laws of Synthetic Data for Language Models [132.67350443447611]
We introduce SynthLLM, a scalable framework that transforms pre-training corpora into diverse, high-quality synthetic datasets. Our approach achieves this by automatically extracting and recombining high-level concepts across multiple documents using a graph algorithm.
arXiv Detail & Related papers (2025-03-25T11:07:12Z)
Few-shot LLM Synthetic Data with Distribution Matching [37.55363714371521]
Large language models (LLMs) produce high-quality synthetic data to enhance the performance of smaller models. LLM-generated synthetic data often differs from the real data in key language attributes. We introduce SynAlign: a synthetic data generation and filtering framework based on key attribute distribution matching.
arXiv Detail & Related papers (2025-02-09T16:43:32Z)
BARE: Leveraging Base Language Models for Few-Shot Synthetic Data Generation [71.46236155101032]
Current data generation methods rely on seed sets containing tens of thousands of examples to prompt instruction-tuned models. We show that when working with only a few seed examples, instruction-tuned models produce insufficient diversity for downstream tasks. We propose Base-Refine, a novel two-stage method that combines the diversity of base models with the quality assurance of instruction-tuned models.
arXiv Detail & Related papers (2025-02-03T00:12:40Z)
Forewarned is Forearmed: Leveraging LLMs for Data Synthesis through Failure-Inducing Exploration [90.41908331897639]
Large language models (LLMs) have significantly benefited from training on diverse, high-quality task-specific data. We present a novel approach, ReverseGen, designed to automatically generate effective training samples.
arXiv Detail & Related papers (2024-10-22T06:43:28Z)
Efficacy of Synthetic Data as a Benchmark [3.2968976262860408]
We investigate the effectiveness of generating synthetic data through large language models (LLMs). Our experiments show that while synthetic data can effectively capture performance of various methods for simpler tasks, it falls short for more complex tasks like named entity recognition. We propose a new metric called the bias factor, which evaluates the biases introduced when the same LLM is used to both generate benchmarking data and to perform the tasks.
arXiv Detail & Related papers (2024-09-18T13:20:23Z)
Retrieval-Augmented Data Augmentation for Low-Resource Domain Tasks [66.87070857705994]
In low-resource settings, the amount of seed data samples to use for data augmentation is very small. We propose a novel method that augments training data by incorporating a wealth of examples from other datasets. This approach can ensure that the generated data is not only relevant but also more diverse than what could be achieved using the limited seed data alone.
arXiv Detail & Related papers (2024-02-21T02:45:46Z)
TarGEN: Targeted Data Generation with Large Language Models [51.87504111286201]
TarGEN is a multi-step prompting strategy for generating high-quality synthetic datasets. We augment TarGEN with a method known as self-correction empowering LLMs to rectify inaccurately labeled instances. A comprehensive analysis of the synthetic dataset compared to the original dataset reveals similar or higher levels of dataset complexity and diversity.
arXiv Detail & Related papers (2023-10-27T03:32:17Z)
Large Language Model as Attributed Training Data Generator: A Tale of Diversity and Bias [92.41919689753051]
Large language models (LLMs) have been recently leveraged as training data generators for various natural language processing (NLP) tasks. We investigate training data generation with diversely attributed prompts, which have the potential to yield diverse and attributed generated data. We show that attributed prompts outperform simple class-conditional prompts in terms of the resulting model's performance.
arXiv Detail & Related papers (2023-06-28T03:31:31Z)
Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task. We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z)
ProGen: Progressive Zero-shot Dataset Generation via In-context Feedback [21.168991554983815]
We propose a progressive zero-shot dataset generation framework, ProGen, to guide the generation of new training data. We show ProGen achieves on-par or superior performance with only 1% synthetic dataset size.
arXiv Detail & Related papers (2022-10-22T02:07:10Z)
Language Models are Realistic Tabular Data Generators [15.851912974874116]
We propose GReaT (Generation of Realistic Tabular data), which exploits an auto-regressive generative large language model (LLM) to sample synthetic and yet highly realistic data. We demonstrate the effectiveness of the proposed approach in a series of experiments that quantify the validity and quality of the produced data samples from multiple angles.
arXiv Detail & Related papers (2022-10-12T15:03:28Z)
Generating Data to Mitigate Spurious Correlations in Natural Language Inference Datasets [27.562256973255728]
Natural language processing models often exploit spurious correlations between task-independent features and labels in datasets to perform well only within the distributions they are trained on. We propose to tackle this problem by generating a debiased version of a dataset, which can then be used to train a debiased, off-the-shelf model. Our approach consists of 1) a method for training data generators to generate high-quality, label-consistent data samples; and 2) a filtering mechanism for removing data points that contribute to spurious correlations.
arXiv Detail & Related papers (2022-03-24T09:08:05Z)

This list is automatically generated from the titles and abstracts of the papers on this site.