TabuLa: Harnessing Language Models for Tabular Data Synthesis
- URL: http://arxiv.org/abs/2310.12746v1
- Date: Thu, 19 Oct 2023 13:50:56 GMT
- Title: TabuLa: Harnessing Language Models for Tabular Data Synthesis
- Authors: Zilong Zhao, Robert Birke and Lydia Chen
- Abstract summary: We develop Tabula, a new type of data synthesizer based on the language model structure.
We show that Tabula reduces training time per epoch by 46.2% on average compared to the current LLM-based state-of-the-art algorithm.
We also propose a token sequence compression strategy to significantly reduce training time while preserving the quality of synthetic data.
- Score: 5.102332247789348
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract:   Given the ubiquitous use of tabular data in industries and the growing
concerns in data privacy and security, tabular data synthesis emerges as a
critical research area. The recent state-of-the-art methods show that large
language models (LLMs) can be adopted to generate realistic tabular data. As
LLMs pre-process tabular data as full text, they have the advantage of avoiding
the curse of dimensionality associated with one-hot encoding high-dimensional
data. However, their long training time and limited re-usability on new tasks
prevent them from replacing existing tabular generative models. In this paper,
we propose Tabula, a tabular data synthesizer based on the language model
structure. Through Tabula, we demonstrate the inherent limitation of employing
pre-trained language models designed for natural language processing (NLP) in
the context of tabular data synthesis. Our investigation delves into the
development of a dedicated foundational model tailored specifically for tabular
data synthesis. Additionally, we propose a token sequence compression strategy
to significantly reduce training time while preserving the quality of synthetic
data. Extensive experiments on six datasets demonstrate that using a language
model structure without loading the well-trained model weights yields a better
starting model for tabular data synthesis. Moreover, the Tabula model,
previously trained on other tabular data, serves as an excellent foundation
model for new tabular data synthesis tasks. Additionally, the token sequence
compression method substantially reduces the model's training time. Results
show that Tabula reduces training time per epoch by 46.2% on average compared
to the current LLM-based state-of-the-art algorithm and consistently achieves
even higher synthetic data utility.
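
The abstract says that LLM-based synthesizers process each table row as full text and that shortening the token sequence reduces training time. The sketch below illustrates that general idea only. The "column is value" template, the short column codes (`c0`, `c1`, ...), and the function names are assumptions made for illustration and are not taken from the paper, and whitespace-separated words serve as a crude stand-in for tokenizer tokens.

```python
# Illustrative sketch only: the templates and helper names below are
# hypothetical, not the paper's actual encoding.

def row_to_text(row):
    """Serialize one table row as a sentence of 'column is value' clauses."""
    return ", ".join(f"{col} is {val}" for col, val in row.items())


def build_codes(columns):
    """Map each column name to a short code such as 'c0', 'c1', ..."""
    return {col: f"c{i}" for i, col in enumerate(columns)}


def compress_row(row, codes):
    """Serialize a row with short column codes and no 'is' connector."""
    return ", ".join(f"{codes[col]} {val}" for col, val in row.items())


def word_count(text):
    """Crude proxy for sequence length: whitespace-separated words."""
    return len(text.split())


if __name__ == "__main__":
    row = {"age": 39, "education level": "Bachelors", "hours per week": 40}
    codes = build_codes(row.keys())
    full = row_to_text(row)
    short = compress_row(row, codes)
    print(full, "->", word_count(full), "words")
    print(short, "->", word_count(short), "words")
```

A real tokenizer would split these strings differently, so the word counts only indicate the direction of the saving, not its size.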
 
      
Related papers
- TableDreamer: Progressive and Weakness-guided Data Synthesis from Scratch for Table Instruction Tuning [18.178908245791582]
 TableDreamer is a progressive and weakness-guided data synthesis framework for table instruction tuning.
It boosts the average accuracy of Llama3.1-8B-instruct by 11.62% (49.07% to 60.69%) with 27K GPT-4o synthetic data.
It outperforms state-of-the-art data synthesis baselines which use more training data.
 arXiv  Detail & Related papers  (2025-06-10T09:57:59Z)
- LLM-TabFlow: Synthetic Tabular Data Generation with Inter-column Logical Relationship Preservation [49.898152180805454]
 This study is the first to explicitly address inter-column relationship preservation in synthetic tabular data generation.
LLM-TabFlow is a novel approach that captures complex inter-column relationships and compresses data, while using score-based diffusion to model the distribution of the compressed data in latent space.
Our results show that LLM-TabFlow outperforms all baselines, fully preserving inter-column relationships while achieving the best balance between data fidelity, utility, and privacy.
 arXiv  Detail & Related papers  (2025-03-04T00:47:52Z)
- Tabby: Tabular Data Synthesis with Language Models [11.309789039228496]
 Tabby is a simple but powerful post-training modification to the standard Transformer language model architecture.
We show that Tabby results in data quality near or equal to that of real data.
 arXiv  Detail & Related papers  (2025-03-04T00:32:15Z)
- Transformers Boost the Performance of Decision Trees on Tabular Data across Sample Sizes [135.68092471784516]
 We propose a simple and lightweight approach for fusing large language models and gradient-boosted decision trees.
We name our fusion methods LLM-Boost and PFN-Boost, respectively.
We demonstrate state-of-the-art performance against numerous baselines and ensembling algorithms.
 arXiv  Detail & Related papers  (2025-02-04T19:30:41Z)
- Generating Realistic Tabular Data with Large Language Models [49.03536886067729]
 Large language models (LLMs) have been used for diverse tasks, but do not capture the correct correlation between the features and the target variable.
We propose an LLM-based method with three important improvements to correctly capture the ground-truth feature-class correlation in the real data.
Our experiments show that our method significantly outperforms 10 SOTA baselines on 20 datasets in downstream tasks.
 arXiv  Detail & Related papers  (2024-10-29T04:14:32Z)
- TabDPT: Scaling Tabular Foundation Models [20.00390825519329]
 We show how to harness the power of real data to improve performance and generalization.
Our model achieves state-of-the-art performance on the CC18 (classification) and CTR23 (regression) benchmarks.
 TabDPT also demonstrates strong scaling as both model size and amount of available data increase.
 arXiv  Detail & Related papers  (2024-10-23T18:00:00Z)
- TabReD: Analyzing Pitfalls and Filling the Gaps in Tabular Deep Learning Benchmarks [30.922069185335246]
 We find two common characteristics of tabular data in typical industrial applications that are underrepresented in the datasets usually used for evaluation in the literature.
A considerable portion of datasets in production settings stem from extensive data acquisition and feature engineering pipelines.
This can have an impact on the absolute and relative number of predictive, uninformative, and correlated features compared to academic datasets.
 arXiv  Detail & Related papers  (2024-06-27T17:55:31Z)
- LaTable: Towards Large Tabular Models [63.995130144110156]
 Tabular generative foundation models are hard to build due to the heterogeneous feature spaces of different datasets.
LaTable is a novel diffusion model that addresses these challenges and can be trained across different datasets.
We find that LaTable outperforms baselines on in-distribution generation, and that finetuning LaTable can generate out-of-distribution datasets better with fewer samples.
 arXiv  Detail & Related papers  (2024-06-25T16:03:50Z)
- Why Tabular Foundation Models Should Be a Research Priority [65.75744962286538]
 Tabular data is the dominant modality in many fields, yet it is given hardly any research attention and significantly lags behind in terms of scale and power.
We believe the time is now to start developing tabular foundation models, or what we coin a Large Tabular Model (LTM).
 arXiv  Detail & Related papers  (2024-05-02T10:05:16Z)
- Rethinking Pre-Training in Tabular Data: A Neighborhood Embedding Perspective [71.45945607871715]
 We propose Tabular data Pre-Training via Meta-representation (TabPTM).
The core idea is to embed data instances into a shared feature space, where each instance is represented by its distance to a fixed number of nearest neighbors and their labels.
Extensive experiments on 101 datasets confirm TabPTM's effectiveness in both classification and regression tasks, with and without fine-tuning.
 arXiv  Detail & Related papers  (2023-10-31T18:03:54Z)
- AutoDiff: combining Auto-encoder and Diffusion model for tabular data synthesizing [12.06889830487286]
 Diffusion models have become a main paradigm for synthetic data generation in modern machine learning.
In this paper, we leverage the power of diffusion models for generating synthetic tabular data.
The resulting synthetic tables show good statistical fidelity to the real data and perform well in downstream machine learning tasks.
 arXiv  Detail & Related papers  (2023-10-24T03:15:19Z)
- Let's Synthesize Step by Step: Iterative Dataset Synthesis with Large Language Models by Extrapolating Errors from Small Models [69.76066070227452]
 *Data Synthesis* is a promising way to train a small model with very little labeled data.
We propose *Synthesis Step by Step* (**S3**), a data synthesis framework that shrinks this distribution gap.
Our approach improves the performance of a small model by reducing the gap between the synthetic dataset and the real data.
 arXiv  Detail & Related papers  (2023-10-20T17:14:25Z)
- Generating tabular datasets under differential privacy [0.0]
 We introduce Differential Privacy (DP) into the training process of deep neural networks.
This creates a trade-off between the quality and privacy of the resulting data.
We implement novel end-to-end models that leverage attention mechanisms.
 arXiv  Detail & Related papers  (2023-08-28T16:35:43Z)
- Privately generating tabular data using language models [80.67328256105891]
 Privately generating synthetic data from a table is an important brick of a privacy-first world.
We propose and investigate a simple approach of treating each row in a table as a sentence and training a language model with differential privacy.
 arXiv  Detail & Related papers  (2023-06-07T21:53:14Z)
- Generative Table Pre-training Empowers Models for Tabular Prediction [71.76829961276032]
 We propose TapTap, the first attempt that leverages table pre-training to empower models for tabular prediction.
TapTap can generate high-quality synthetic tables to support various applications, including privacy protection, low resource regime, missing value imputation, and imbalanced classification.
It can be easily combined with various backbone models, including LightGBM, Multilayer Perceptron (MLP) and Transformer.
 arXiv  Detail & Related papers  (2023-05-16T06:37:38Z)
- Leveraging Data Recasting to Enhance Tabular Reasoning [21.970920861791015]
 Prior work has mostly relied on two data generation strategies.
The first is human annotation, which yields linguistically diverse data but is difficult to scale.
The second category for creation is synthetic generation, which is scalable and cost effective but lacks inventiveness.
 arXiv  Detail & Related papers  (2022-11-23T00:04:57Z)
- PTab: Using the Pre-trained Language Model for Modeling Tabular Data [5.791972449406902]
 Recent studies show that neural-based models are effective in learning contextual representation for Tabular data.
We propose a novel framework PTab, using the Pre-trained language model to model Tabular data.
Our method has achieved a better average AUC score in supervised settings compared to the state-of-the-art baselines.
 arXiv  Detail & Related papers  (2022-09-15T08:58:42Z)
- Tabular Transformers for Modeling Multivariate Time Series [30.717890753132824]
 Tabular datasets are ubiquitous in data science applications. Given their importance, it seems natural to apply state-of-the-art deep learning algorithms in order to fully unlock their potential.
Here we propose neural network models that represent tabular time series that can leverage their hierarchical structure.
We demonstrate our models on two datasets: a synthetic credit card transaction dataset, where the learned representations are used for fraud detection and synthetic data generation, and on a real pollution dataset, where the learned encodings are used to predict atmospheric pollutant concentrations.
 arXiv  Detail & Related papers  (2020-11-03T16:58:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.