TabuLa: Harnessing Language Models for Tabular Data Synthesis
- URL: http://arxiv.org/abs/2310.12746v1
- Date: Thu, 19 Oct 2023 13:50:56 GMT
- Title: TabuLa: Harnessing Language Models for Tabular Data Synthesis
- Authors: Zilong Zhao, Robert Birke and Lydia Chen
- Abstract summary: We develop Tabula, a new type of data synthesizer based on the language model structure.
We show that Tabula reduces training time per epoch by 46.2% on average compared to the current state-of-the-art LLM-based algorithm.
We also propose a token sequence compression strategy to significantly reduce training time while preserving the quality of synthetic data.
- Score: 5.102332247789348
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Given the ubiquitous use of tabular data in industries and the growing
concerns in data privacy and security, tabular data synthesis emerges as a
critical research area. The recent state-of-the-art methods show that large
language models (LLMs) can be adopted to generate realistic tabular data. As
LLMs pre-process tabular data as full text, they have the advantage of avoiding
the curse of dimensionality associated with one-hot encoding high-dimensional
data. However, their long training time and limited reusability on new tasks
prevent them from replacing existing tabular generative models. In this paper,
we propose Tabula, a tabular data synthesizer based on the language model
structure. Through Tabula, we demonstrate the inherent limitation of employing
pre-trained language models designed for natural language processing (NLP) in
the context of tabular data synthesis. Our investigation delves into the
development of a dedicated foundational model tailored specifically for tabular
data synthesis. Additionally, we propose a token sequence compression strategy
to significantly reduce training time while preserving the quality of synthetic
data. Extensive experiments on six datasets demonstrate that using a language
model structure without loading the well-trained model weights yields a better
starting model for tabular data synthesis. Moreover, the Tabula model,
previously trained on other tabular data, serves as an excellent foundation
model for new tabular data synthesis tasks. Additionally, the token sequence
compression method substantially reduces the model's training time. Results
show that Tabula reduces training time per epoch by 46.2% on average compared
to the current state-of-the-art LLM-based algorithm while consistently
achieving higher synthetic data utility.
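The approach described above can be sketched briefly. The Python example below is a minimal illustration, not the authors' implementation: the function names, the "column is value" phrasing, and the compressed "column:value" format are assumptions made only to show (1) how a table row can be serialized as plain text so that a language model avoids one-hot encoding high-dimensional columns, and (2) how a terser serialization shortens the token sequence, which is the intuition behind a token sequence compression strategy.

# Hypothetical sketch: serialize tabular rows as text for a language model.
# Names and formats are illustrative assumptions, not Tabula's actual code.

def row_to_text(row: dict) -> str:
    """Verbose serialization, e.g. 'age is 39, workclass is Private, income is <=50K'."""
    return ", ".join(f"{col} is {val}" for col, val in row.items())

def row_to_compressed_text(row: dict) -> str:
    """Compact serialization that drops filler tokens, yielding a shorter token sequence."""
    return " ".join(f"{col}:{val}" for col, val in row.items())

if __name__ == "__main__":
    record = {"age": 39, "workclass": "Private", "income": "<=50K"}
    print(row_to_text(record))             # longer token sequence
    print(row_to_compressed_text(record))  # compressed token sequence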
Related papers
- Generating Realistic Tabular Data with Large Language Models [49.03536886067729]
Large language models (LLMs) have been used for diverse tasks but do not capture the correct correlation between features and the target variable.
We propose an LLM-based method with three important improvements to correctly capture the ground-truth feature-class correlation in the real data.
Our experiments show that our method significantly outperforms 10 SOTA baselines on 20 datasets in downstream tasks.
arXiv Detail & Related papers (2024-10-29T04:14:32Z) - LaTable: Towards Large Tabular Models [63.995130144110156]
Tabular generative foundation models are hard to build due to the heterogeneous feature spaces of different datasets.
LaTable is a novel diffusion model that addresses these challenges and can be trained across different datasets.
We find that LaTable outperforms baselines on in-distribution generation, and that finetuning LaTable can generate out-of-distribution datasets better with fewer samples.
arXiv Detail & Related papers (2024-06-25T16:03:50Z) - Why Tabular Foundation Models Should Be a Research Priority [65.75744962286538]
Tabular data is the dominant modality in many fields, yet it is given hardly any research attention and significantly lags behind in terms of scale and power.
We believe the time is now to start developing tabular foundation models, or what we coin a Large Tabular Model (LTM).
arXiv Detail & Related papers (2024-05-02T10:05:16Z) - AutoDiff: combining Auto-encoder and Diffusion model for tabular data synthesizing [12.06889830487286]
Diffusion models have become a main paradigm for synthetic data generation in modern machine learning.
In this paper, we leverage the power of diffusion models for generating synthetic tabular data.
The resulting synthetic tables show good statistical fidelity to the real data and perform well in downstream machine learning tasks.
arXiv Detail & Related papers (2023-10-24T03:15:19Z) - Let's Synthesize Step by Step: Iterative Dataset Synthesis with Large Language Models by Extrapolating Errors from Small Models [69.76066070227452]
*Data Synthesis* is a promising way to train a small model with very little labeled data.
We propose *Synthesis Step by Step* (**S3**), a data synthesis framework that shrinks this distribution gap.
Our approach improves the performance of a small model by reducing the gap between the synthetic dataset and the real data.
arXiv Detail & Related papers (2023-10-20T17:14:25Z) - Generating tabular datasets under differential privacy [0.0]
We introduce Differential Privacy (DP) into the training process of deep neural networks.
This creates a trade-off between the quality and privacy of the resulting data.
We implement novel end-to-end models that leverage attention mechanisms.
arXiv Detail & Related papers (2023-08-28T16:35:43Z) - Privately generating tabular data using language models [80.67328256105891]
Privately generating synthetic data from a table is an important building block of a privacy-first world.
We propose and investigate a simple approach of treating each row in a table as a sentence and training a language model with differential privacy.
arXiv Detail & Related papers (2023-06-07T21:53:14Z) - Generative Table Pre-training Empowers Models for Tabular Prediction [71.76829961276032]
We propose TapTap, the first attempt that leverages table pre-training to empower models for tabular prediction.
TapTap can generate high-quality synthetic tables to support various applications, including privacy protection, low resource regime, missing value imputation, and imbalanced classification.
It can be easily combined with various backbone models, including LightGBM, Multilayer Perceptron (MLP) and Transformer.
arXiv Detail & Related papers (2023-05-16T06:37:38Z) - Leveraging Data Recasting to Enhance Tabular Reasoning [21.970920861791015]
Prior work has mostly relied on two data generation strategies.
The first is human annotation, which yields linguistically diverse data but is difficult to scale.
The second is synthetic generation, which is scalable and cost-effective but lacks inventiveness.
arXiv Detail & Related papers (2022-11-23T00:04:57Z) - Tabular Transformers for Modeling Multivariate Time Series [30.717890753132824]
Tabular datasets are ubiquitous in data science applications. Given their importance, it seems natural to apply state-of-the-art deep learning algorithms in order to fully unlock their potential.
Here we propose neural network models that represent tabular time series and can leverage their hierarchical structure.
We demonstrate our models on two datasets: a synthetic credit card transaction dataset, where the learned representations are used for fraud detection and synthetic data generation, and on a real pollution dataset, where the learned encodings are used to predict atmospheric pollutant concentrations.
arXiv Detail & Related papers (2020-11-03T16:58:08Z)