REaLTabFormer: Generating Realistic Relational and Tabular Data using
Transformers
- URL: http://arxiv.org/abs/2302.02041v1
- Date: Sat, 4 Feb 2023 00:32:50 GMT
- Title: REaLTabFormer: Generating Realistic Relational and Tabular Data using
Transformers
- Authors: Aivin V. Solatorio and Olivier Dupriez
- Abstract summary: We introduce REaLTabFormer (Realistic Relational and Tabular Transformer), a synthetic data generation model.
It first creates a parent table using an autoregressive GPT-2 model, then generates the relational dataset conditioned on the parent table using a sequence-to-sequence model.
Experiments using real-world datasets show that REaLTabFormer captures the relational structure better than a baseline model.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Tabular data is a common form of organizing data. Multiple models are
available to generate synthetic tabular datasets where observations are
independent, but few have the ability to produce relational datasets. Modeling
relational data is challenging as it requires modeling both a "parent" table
and its relationships across tables. We introduce REaLTabFormer (Realistic
Relational and Tabular Transformer), a tabular and relational synthetic data
generation model. It first creates a parent table using an autoregressive GPT-2
model, then generates the relational dataset conditioned on the parent table
using a sequence-to-sequence (Seq2Seq) model. We implement target masking to
prevent data copying and propose the $Q_{\delta}$ statistic and statistical
bootstrapping to detect overfitting. Experiments using real-world datasets show
that REaLTabFormer captures the relational structure better than a baseline
model. REaLTabFormer also achieves state-of-the-art results on prediction
tasks, "out-of-the-box", for large non-relational datasets without needing
fine-tuning.
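A reference implementation is distributed as the realtabformer Python package (pip install realtabformer). Below is a minimal sketch of the two-stage workflow the abstract describes, based on the project README; parameter names and defaults may differ across package versions, and "parent.csv", "child.csv", "unique_id", and the idXXXX experiment directory are placeholders:

```python
# pip install realtabformer
# Minimal two-stage sketch based on the project README; parameter
# names may vary across package versions.
import pandas as pd
from realtabformer import REaLTabFormer

parent_df = pd.read_csv("parent.csv")  # one row per parent entity
child_df = pd.read_csv("child.csv")    # many rows per parent entity
join_on = "unique_id"                  # key column shared by both tables

# Stage 1: autoregressive (GPT-2) model for the parent table.
# The key column is dropped so it is not modeled as a data column.
parent_model = REaLTabFormer(model_type="tabular")
parent_model.fit(parent_df.drop(columns=[join_on]))
parent_model.save("rtf_parent/")  # creates an experiment dir rtf_parent/idXXXX/

# Stage 2: sequence-to-sequence model for the child table,
# conditioned on the trained parent model from stage 1.
child_model = REaLTabFormer(
    model_type="relational",
    parent_realtabformer_path="rtf_parent/idXXXX",  # saved experiment dir
    output_max_length=None,
)
child_model.fit(df=child_df, in_df=parent_df, join_on=join_on)

# Sampling: draw synthetic parents first, then generate each
# parent's related child rows conditioned on it.
parent_samples = parent_model.sample(n_samples=len(parent_df))
parent_samples.index.name = join_on
parent_samples = parent_samples.reset_index()

child_samples = child_model.sample(
    input_unique_ids=parent_samples[join_on],
    input_df=parent_samples.drop(columns=[join_on]),
    gen_batch=64,
)
```

The parent model is the plain autoregressive (GPT-2-style) generator; the child model is the Seq2Seq component that reads an encoded parent row and emits that parent's related child rows, which is how the cross-table structure is preserved. Per the abstract, target masking and the bootstrapped $Q_{\delta}$ check are applied during training to keep the model from copying training rows.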
Related papers
- Diffusion-nested Auto-Regressive Synthesis of Heterogeneous Tabular Data [56.48119008663155]
This paper proposes a Diffusion-nested Autoregressive model (TabDAR) to address the challenges of generating heterogeneous tabular data.
We conduct extensive experiments on ten datasets with distinct properties, and the proposed TabDAR outperforms previous state-of-the-art methods by 18% to 45% on eight metrics across three distinct aspects.
arXiv Detail & Related papers (2024-10-28T20:49:26Z)
- TabDiff: a Multi-Modal Diffusion Model for Tabular Data Generation [91.50296404732902]
We introduce TabDiff, a joint diffusion framework that models all multi-modal distributions of tabular data in one model.
Our key innovation is the development of a joint continuous-time diffusion process for numerical and categorical data.
TabDiff achieves superior average performance over existing competitive baselines, with up to 22.5% improvement over the state-of-the-art model on pairwise column correlation estimation.
arXiv Detail & Related papers (2024-10-27T22:58:47Z)
- TabSketchFM: Sketch-based Tabular Representation Learning for Data Discovery over Data Lakes [25.169832192255956]
We present TabSketchFM, a neural tabular model for data discovery over data lakes.
We finetune the pretrained model for identifying unionable, joinable, and subset table pairs.
Our results demonstrate significant improvements in F1 scores for search compared to state-of-the-art techniques.
arXiv Detail & Related papers (2024-06-28T17:28:53Z)
- LaTable: Towards Large Tabular Models [63.995130144110156]
Tabular generative foundation models are hard to build due to the heterogeneous feature spaces of different datasets.
LaTable is a novel diffusion model that addresses these challenges and can be trained across different datasets.
We find that LaTable outperforms baselines on in-distribution generation, and that finetuning LaTable can generate out-of-distribution datasets better with fewer samples.
arXiv Detail & Related papers (2024-06-25T16:03:50Z)
- Training-Free Generalization on Heterogeneous Tabular Data via Meta-Representation [67.30538142519067]
We propose Tabular data Pre-Training via Meta-representation (TabPTM).
A deep neural network is then trained to associate these meta-representations with dataset-specific classification confidences.
Experiments validate that TabPTM achieves promising performance in new datasets, even under few-shot scenarios.
arXiv Detail & Related papers (2023-10-31T18:03:54Z)
- Retrieval-Based Transformer for Table Augmentation [14.460363647772745]
We introduce a novel approach toward automatic data wrangling.
We aim to address table augmentation tasks, including row/column population and data imputation.
Our model consistently and substantially outperforms both supervised statistical methods and the current state-of-the-art transformer-based models.
arXiv Detail & Related papers (2023-06-20T18:51:21Z)
- Generative Table Pre-training Empowers Models for Tabular Prediction [71.76829961276032]
We propose TapTap, the first attempt that leverages table pre-training to empower models for tabular prediction.
TapTap can generate high-quality synthetic tables to support various applications, including privacy protection, low resource regime, missing value imputation, and imbalanced classification.
It can be easily combined with various backbone models, including LightGBM, Multilayer Perceptron (MLP) and Transformer.
arXiv Detail & Related papers (2023-05-16T06:37:38Z)
- Row Conditional-TGAN for generating synthetic relational databases [0.0]
We propose the Row Conditional-Tabular Generative Adversarial Network (RC-TGAN) to support modeling and synthesizing relational databases.
The RC-TGAN models relationship information between tables by incorporating conditional data of parent rows into the design of the child table's GAN; a generic sketch of this conditioning pattern appears after this list.
arXiv Detail & Related papers (2022-11-14T18:14:18Z)
- Generative Modeling of Complex Data [8.201100713224003]
This paper puts forward a generic framework to synthesize more complex data structures with composite and nested types.
Results on standard benchmark datasets show that this implementation consistently outperforms current state-of-the-art models.
arXiv Detail & Related papers (2022-02-04T14:17:26Z)
- GraPPa: Grammar-Augmented Pre-Training for Table Semantic Parsing [117.98107557103877]
We present GraPPa, an effective pre-training approach for table semantic parsing.
We construct synthetic question-SQL pairs over high-quality tables via a synchronous context-free grammar.
To maintain the model's ability to represent real-world data, we also include masked language modeling.
arXiv Detail & Related papers (2020-09-29T08:17:58Z)
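As noted in the RC-TGAN entry above, the core conditioning idea is to feed features of the parent row into the child table's generator. The sketch below is a generic, hypothetical illustration of that pattern, not the authors' implementation; the class name, layer sizes, and feature dimensions are all assumptions:

```python
# Hypothetical sketch of parent-row conditioning in a child-table GAN
# (illustrative only; not the RC-TGAN authors' code).
import torch
import torch.nn as nn

class ChildRowGenerator(nn.Module):
    """Generates a child-table row conditioned on its parent row's features."""

    def __init__(self, noise_dim: int, parent_dim: int, child_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(noise_dim + parent_dim, 128),  # latent noise + parent features
            nn.ReLU(),
            nn.Linear(128, child_dim),               # synthetic child-row features
        )

    def forward(self, noise: torch.Tensor, parent_row: torch.Tensor) -> torch.Tensor:
        # Concatenating the parent row with the noise ties every generated
        # child row to a specific parent, carrying the cross-table
        # relationship into the child table.
        return self.net(torch.cat([noise, parent_row], dim=-1))

# Usage: generate one synthetic child row for each of 32 parent rows.
gen = ChildRowGenerator(noise_dim=16, parent_dim=8, child_dim=12)
parent_rows = torch.randn(32, 8)      # stand-in for encoded parent rows
noise = torch.randn(32, 16)
child_rows = gen(noise, parent_rows)  # -> tensor of shape (32, 12)
```

Because the parent row is concatenated with the noise vector, the generator learns child-row distributions per parent context rather than a single marginal child distribution.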
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.