Language Models are Realistic Tabular Data Generators
- URL: http://arxiv.org/abs/2210.06280v2
- Date: Sat, 22 Apr 2023 10:03:23 GMT
- Title: Language Models are Realistic Tabular Data Generators
- Authors: Vadim Borisov, Kathrin Seßler, Tobias Leemann, Martin Pawelczyk,
Gjergji Kasneci
- Abstract summary: We propose GReaT (Generation of Realistic Tabular data), which exploits an auto-regressive generative large language model (LLM) to sample synthetic and yet highly realistic data.
We demonstrate the effectiveness of the proposed approach in a series of experiments that quantify the validity and quality of the produced data samples from multiple angles.
- Score: 15.851912974874116
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Tabular data is among the oldest and most ubiquitous forms of data. However,
the generation of synthetic samples with the original data's characteristics
remains a significant challenge for tabular data. While many generative models
from the computer vision domain, such as variational autoencoders or generative
adversarial networks, have been adapted for tabular data generation, less
research has been directed towards recent transformer-based large language
models (LLMs), which are also generative in nature. To this end, we propose
GReaT (Generation of Realistic Tabular data), which exploits an auto-regressive
generative LLM to sample synthetic and yet highly realistic tabular data.
Furthermore, GReaT can model tabular data distributions by conditioning on any
subset of features; the remaining features are sampled without additional
overhead. We demonstrate the effectiveness of the proposed approach in a series
of experiments that quantify the validity and quality of the produced data
samples from multiple angles. We find that GReaT maintains state-of-the-art
performance across numerous real-world and synthetic data sets with
heterogeneous feature types coming in various sizes.
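The core idea behind GReaT is to serialize each table row as a natural-language sentence (with randomly permuted feature order), fine-tune an LLM on these sentences, and then sample new rows by prompting with any known subset of features. A minimal sketch of this textual encoding is shown below; the function names and the exact "feature is value" serialization template are illustrative assumptions, not the authors' released API:

```python
import random

def encode_row(row, rng=None):
    """Serialize one tabular record as a sentence of
    '<feature> is <value>' clauses. Randomly permuting the clause
    order (as GReaT does) teaches the LLM order-independent
    conditioning on arbitrary feature subsets."""
    items = list(row.items())
    if rng is not None:
        rng.shuffle(items)
    return ", ".join(f"{key} is {value}" for key, value in items)

def decode_row(text):
    """Parse a generated sentence back into a feature dict."""
    row = {}
    for clause in text.split(", "):
        key, _, value = clause.partition(" is ")
        row[key] = value
    return row

def conditional_prompt(known_features):
    """Build a sampling prompt from any known feature subset;
    a fine-tuned LLM would complete the remaining clauses, which
    decode_row then parses back into a full record."""
    return encode_row(known_features) + ", "
```

With this encoding, conditioning on a feature subset reduces to ordinary left-to-right text completion, which is why the remaining features can be sampled without additional overhead.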
Related papers
- Generating Realistic Tabular Data with Large Language Models [49.03536886067729]
Large language models (LLMs) have been used for diverse tasks, but they do not capture the correct correlation between the features and the target variable.
We propose an LLM-based method with three important improvements to correctly capture the ground-truth feature-class correlation in the real data.
Our experiments show that our method significantly outperforms 10 SOTA baselines on 20 datasets in downstream tasks.
arXiv Detail & Related papers (2024-10-29T04:14:32Z)
- CTSyn: A Foundational Model for Cross Tabular Data Generation [9.568990880984813]
Cross-Table Synthesizer (CTSyn) is a diffusion-based foundational model tailored for tabular data generation.
CTSyn significantly outperforms existing table synthesizers in utility and diversity.
It also uniquely enhances the performance of downstream machine learning models beyond what is achievable with real data.
arXiv Detail & Related papers (2024-06-07T04:04:21Z)
- Differentially Private Tabular Data Synthesis using Large Language Models [6.6376578496141585]
This paper introduces DP-LLMTGen -- a novel framework for differentially private tabular data synthesis.
DP-LLMTGen models sensitive datasets using a two-stage fine-tuning procedure.
It generates synthetic data by sampling from the fine-tuned LLMs.
arXiv Detail & Related papers (2024-06-03T15:43:57Z)
- Generative Modeling for Tabular Data via Penalized Optimal Transport Network [2.0319002824093015]
The Wasserstein generative adversarial network (WGAN) is a notable improvement in generative modeling.
We propose POTNet, a generative deep neural network based on a novel, robust, and interpretable marginally-penalized Wasserstein (MPW) loss.
arXiv Detail & Related papers (2024-02-16T05:27:05Z)
- Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A Comprehensive Benchmark [56.8042116967334]
Synthetic data serves as an alternative for training machine learning models.
However, ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task.
This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
arXiv Detail & Related papers (2023-10-25T20:32:02Z)
- CasTGAN: Cascaded Generative Adversarial Network for Realistic Tabular Data Synthesis [0.4999814847776097]
Generative adversarial networks (GANs) have drawn considerable attention in recent years for their proven capability in generating synthetic data.
The validity of the synthetic data and the underlying privacy concerns represent major challenges which are not sufficiently addressed.
arXiv Detail & Related papers (2023-07-01T16:52:18Z)
- Large Language Model as Attributed Training Data Generator: A Tale of Diversity and Bias [92.41919689753051]
Large language models (LLMs) have been recently leveraged as training data generators for various natural language processing (NLP) tasks.
We investigate training data generation with diversely attributed prompts, which have the potential to yield diverse and attributed generated data.
We show that attributed prompts outperform simple class-conditional prompts in terms of the resulting model's performance.
arXiv Detail & Related papers (2023-06-28T03:31:31Z)
- Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z)
- Comparing Test Sets with Item Response Theory [53.755064720563]
We evaluate 29 datasets using predictions from 18 pretrained Transformer models on individual test examples.
We find that Quoref, HellaSwag, and MC-TACO are best suited for distinguishing among state-of-the-art models.
We also observe that the span selection task format, which is used for QA datasets like QAMR or SQuAD2.0, is effective in differentiating between strong and weak models.
arXiv Detail & Related papers (2021-06-01T22:33:53Z)
- Partially Conditioned Generative Adversarial Networks [75.08725392017698]
Generative Adversarial Networks (GANs) let one synthesise artificial datasets by implicitly modelling the underlying probability distribution of a real-world training dataset.
With the introduction of Conditional GANs and their variants, these methods were extended to generating samples conditioned on ancillary information available for each sample within the dataset.
In this work, we argue that standard Conditional GANs are not suitable for such a task and propose a new Adversarial Network architecture and training strategy.
arXiv Detail & Related papers (2020-07-06T15:59:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.