Related papers: TabMT: Generating tabular data with masked transformers

TabMT: Generating tabular data with masked transformers

URL: http://arxiv.org/abs/2312.06089v1
Date: Mon, 11 Dec 2023 03:28:11 GMT
Title: TabMT: Generating tabular data with masked transformers
Authors: Manbir S Gulati, Paul F Roysdon
Abstract summary: Masked Transformers are incredibly effective as generative models and classifiers. This work contributes to the exploration of transformer-based models in synthetic data generation for diverse application domains.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Autoregressive and Masked Transformers are incredibly effective as generative models and classifiers. While these models are most prevalent in NLP, they also exhibit strong performance in other domains, such as vision. This work contributes to the exploration of transformer-based models in synthetic data generation for diverse application domains. In this paper, we present TabMT, a novel Masked Transformer design for generating synthetic tabular data. TabMT effectively addresses the unique challenges posed by heterogeneous data fields and is natively able to handle missing data. Our design leverages improved masking techniques to allow for generation and demonstrates state-of-the-art performance from extremely small to extremely large tabular datasets. We evaluate TabMT for privacy-focused applications and find that it is able to generate high quality data with superior privacy tradeoffs.

Related papers

LLM-TabFlow: Synthetic Tabular Data Generation with Inter-column Logical Relationship Preservation [49.898152180805454]
This study is the first to explicitly address inter-column relationship preservation in synthetic tabular data generation. LLM-TabFlow is a novel approach that captures complex inter-column relationships and compress data, while using Score-based Diffusion to model the distribution of the compressed data in latent space. Our results show that LLM-TabFlow outperforms all baselines, fully preserving inter-column relationships while achieving the best balance between data fidelity, utility, and privacy.
arXiv Detail & Related papers (2025-03-04T00:47:52Z)
Generating Realistic Tabular Data with Large Language Models [49.03536886067729]
Large language models (LLM) have been used for diverse tasks, but do not capture the correct correlation between the features and the target variable. We propose a LLM-based method with three important improvements to correctly capture the ground-truth feature-class correlation in the real data. Our experiments show that our method significantly outperforms 10 SOTA baselines on 20 datasets in downstream tasks.
arXiv Detail & Related papers (2024-10-29T04:14:32Z)
TabDiff: a Multi-Modal Diffusion Model for Tabular Data Generation [91.50296404732902]
We introduce TabDiff, a joint diffusion framework that models all multi-modal distributions of tabular data in one model. Our key innovation is the development of a joint continuous-time diffusion process for numerical and categorical data. TabDiff achieves superior average performance over existing competitive baselines, with up to $22.5%$ improvement over the state-of-the-art model on pair-wise column correlation estimations.
arXiv Detail & Related papers (2024-10-27T22:58:47Z)
A Survey on Deep Tabular Learning [0.0]
Tabular data presents unique challenges for deep learning due to its heterogeneous nature and lack of spatial structure. This survey reviews the evolution of deep learning models for Tabular data, from early fully connected networks (FCNs) to advanced architectures like TabNet, SAINT, TabTranSELU, and MambaNet.
arXiv Detail & Related papers (2024-10-15T20:08:08Z)
TAEGAN: Generating Synthetic Tabular Data For Data Augmentation [13.612237747184363]
Tabular Auto-Encoder Generative Adversarial Network (TAEGAN) is an improved GAN-based framework for generating high-quality tabular data. TAEGAN employs a masked auto-encoder as the generator, which for the first time introduces the power of self-supervised pre-training.
arXiv Detail & Related papers (2024-10-02T18:33:06Z)
DataGen: Unified Synthetic Dataset Generation via Large Language Models [88.16197692794707]
DataGen is a comprehensive framework designed to produce diverse, accurate, and highly controllable datasets.<n>To augment data diversity, DataGen incorporates an attribute-guided generation module and a group checking feature.<n>Extensive experiments demonstrate the superior quality of data generated by DataGen.
arXiv Detail & Related papers (2024-06-27T07:56:44Z)
Deep Learning with Tabular Data: A Self-supervised Approach [0.0]
We have used a self-supervised learning approach in this study. The aim is to find the most effective TabTransformer model representation of categorical and numerical features. The research has presented with a novel approach by creating various variants of TabTransformer model.
arXiv Detail & Related papers (2024-01-26T23:12:41Z)
Training-Free Generalization on Heterogeneous Tabular Data via Meta-Representation [67.30538142519067]
We propose Tabular data Pre-Training via Meta-representation (TabPTM) A deep neural network is then trained to associate these meta-representations with dataset-specific classification confidences. Experiments validate that TabPTM achieves promising performance in new datasets, even under few-shot scenarios.
arXiv Detail & Related papers (2023-10-31T18:03:54Z)
Exploring the Benefits of Differentially Private Pre-training and Parameter-Efficient Fine-tuning for Table Transformers [56.00476706550681]
Table Transformer (TabTransformer) is a state-of-the-art neural network model, while Differential Privacy (DP) is an essential component to ensure data privacy. In this paper, we explore the benefits of combining these two aspects together in the scenario of transfer learning.
arXiv Detail & Related papers (2023-09-12T19:08:26Z)
Generating tabular datasets under differential privacy [0.0]
We introduce Differential Privacy (DP) into the training process of deep neural networks. This creates a trade-off between the quality and privacy of the resulting data. We implement novel end-to-end models that leverage attention mechanisms.
arXiv Detail & Related papers (2023-08-28T16:35:43Z)
Generative Table Pre-training Empowers Models for Tabular Prediction [71.76829961276032]
We propose TapTap, the first attempt that leverages table pre-training to empower models for tabular prediction. TapTap can generate high-quality synthetic tables to support various applications, including privacy protection, low resource regime, missing value imputation, and imbalanced classification. It can be easily combined with various backbone models, including LightGBM, Multilayer Perceptron (MLP) and Transformer.
arXiv Detail & Related papers (2023-05-16T06:37:38Z)
Language Models are Realistic Tabular Data Generators [15.851912974874116]
We propose GReaT (Generation of Realistic Tabular data), which exploits an auto-regressive generative large language model (LLMs) to sample synthetic and yet highly realistic data. We demonstrate the effectiveness of the proposed approach in a series of experiments that quantify the validity and quality of the produced data samples from multiple angles.
arXiv Detail & Related papers (2022-10-12T15:03:28Z)
TRoVE: Transforming Road Scene Datasets into Photorealistic Virtual Environments [84.6017003787244]
This work proposes a synthetic data generation pipeline to address the difficulties and domain-gaps present in simulated datasets. We show that using annotations and visual cues from existing datasets, we can facilitate automated multi-modal data generation.
arXiv Detail & Related papers (2022-08-16T20:46:08Z)

This list is automatically generated from the titles and abstracts of the papers in this site.