TabuLa: Harnessing Language Models for Tabular Data Synthesis
- URL: http://arxiv.org/abs/2310.12746v1
- Date: Thu, 19 Oct 2023 13:50:56 GMT
- Title: TabuLa: Harnessing Language Models for Tabular Data Synthesis
- Authors: Zilong Zhao, Robert Birke and Lydia Chen
- Abstract summary: We develop Tabula, a new type of data synthesizer based on the language model structure.
We show that Tabula reduces training time per epoch by 46.2% on average compared to the current LLM-based state-of-the-art algorithm.
We also propose a token sequence compression strategy to significantly reduce training time while preserving the quality of synthetic data.
- Score: 5.102332247789348
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Given the ubiquitous use of tabular data in industries and the growing
concerns in data privacy and security, tabular data synthesis emerges as a
critical research area. The recent state-of-the-art methods show that large
language models (LLMs) can be adopted to generate realistic tabular data. As
LLMs pre-process tabular data as full text, they have the advantage of avoiding
the curse of dimensionality associated with one-hot encoding high-dimensional
data. However, their long training time and limited reusability on new tasks
prevent them from replacing existing tabular generative models. In this paper,
we propose Tabula, a tabular data synthesizer based on the language model
structure. Through Tabula, we demonstrate the inherent limitation of employing
pre-trained language models designed for natural language processing (NLP) in
the context of tabular data synthesis. Our investigation delves into the
development of a dedicated foundational model tailored specifically for tabular
data synthesis. Additionally, we propose a token sequence compression strategy
to significantly reduce training time while preserving the quality of synthetic
data. Extensive experiments on six datasets demonstrate that using a language
model structure without loading the well-trained model weights yields a better
starting model for tabular data synthesis. Moreover, the Tabula model,
previously trained on other tabular data, serves as an excellent foundation
model for new tabular data synthesis tasks. Additionally, the token sequence
compression method substantially reduces the model's training time. Results
show that Tabula reduces training time per epoch by 46.2% on average compared to
the current LLM-based state-of-the-art algorithm and consistently achieves even
higher synthetic data utility.
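A minimal sketch of the abstract's central claim, namely that a GPT-style model structure initialized from scratch rather than loaded with NLP pre-trained weights is the better starting point for tabular synthesis, is given below. The row textualization format, toy columns, and single gradient step are illustrative assumptions; the paper's token sequence compression and exact training recipe are not reproduced here.

```python
# Minimal sketch, assuming a Hugging Face transformers setup. The key point
# from the abstract: build the GPT-2 *architecture* from a config (random
# initialization) instead of loading pre-trained NLP weights, then train it
# as a causal LM over table rows rendered as text. Column names and the
# "col is value" textualization are illustrative assumptions.
import pandas as pd
from transformers import GPT2Config, GPT2LMHeadModel, GPT2TokenizerFast

def row_to_text(row: pd.Series) -> str:
    # e.g. "age is 39, workclass is Private"
    return ", ".join(f"{col} is {val}" for col, val in row.items())

df = pd.DataFrame({"age": [39, 50], "workclass": ["Private", "Self-emp"]})  # toy table
texts = [row_to_text(r) for _, r in df.iterrows()]

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")  # tokenizer only, no model weights
tokenizer.pad_token = tokenizer.eos_token

config = GPT2Config(vocab_size=len(tokenizer))
model = GPT2LMHeadModel(config)  # randomly initialized, not from_pretrained

batch = tokenizer(texts, return_tensors="pt", padding=True)
out = model(**batch, labels=batch["input_ids"])  # causal LM loss over textualized rows
out.loss.backward()  # one illustrative step; a real run loops with an optimizer and sampler
```

Sampling synthetic rows would then amount to calling model.generate on a column-name prompt and parsing the generated text back into table cells.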
Related papers
- Transformers Boost the Performance of Decision Trees on Tabular Data across Sample Sizes [135.68092471784516]
We propose a simple and lightweight approach for fusing large language models and gradient-boosted decision trees.
We name our fusion methods LLM-Boost and PFN-Boost, respectively.
We demonstrate state-of-the-art performance against numerous baselines and ensembling algorithms.
arXiv Detail & Related papers (2025-02-04T19:30:41Z) - Generating Realistic Tabular Data with Large Language Models [49.03536886067729]
Large language models (LLMs) have been used for diverse tasks, but they do not capture the correct correlation between the features and the target variable.
We propose an LLM-based method with three important improvements to correctly capture the ground-truth feature-class correlation in the real data.
Our experiments show that our method significantly outperforms 10 SOTA baselines on 20 datasets in downstream tasks.
arXiv Detail & Related papers (2024-10-29T04:14:32Z) - TabDPT: Scaling Tabular Foundation Models [20.00390825519329]
We show how to harness the power of real data to improve performance and generalization.
Our model achieves state-of-the-art performance on the CC18 (classification) and CTR23 (regression) benchmarks.
TabDPT also demonstrates strong scaling as both model size and amount of available data increase.
arXiv Detail & Related papers (2024-10-23T18:00:00Z) - TabReD: Analyzing Pitfalls and Filling the Gaps in Tabular Deep Learning Benchmarks [30.922069185335246]
We find two common characteristics of tabular data in typical industrial applications that are underrepresented in the datasets usually used for evaluation in the literature.
A considerable portion of datasets in production settings stem from extensive data acquisition and feature engineering pipelines.
This can have an impact on the absolute and relative number of predictive, uninformative, and correlated features compared to academic datasets.
arXiv Detail & Related papers (2024-06-27T17:55:31Z) - LaTable: Towards Large Tabular Models [63.995130144110156]
Tabular generative foundation models are hard to build due to the heterogeneous feature spaces of different datasets.
LaTable is a novel diffusion model that addresses these challenges and can be trained across different datasets.
We find that LaTable outperforms baselines on in-distribution generation, and that finetuning LaTable can generate out-of-distribution datasets better with fewer samples.
arXiv Detail & Related papers (2024-06-25T16:03:50Z) - Mixture of In-Context Prompters for Tabular PFNs [33.76194735049027]
MIXTUREPFN is the Condorcet winner across 36 diverse datasets against 19 strong deep learning and tree-based baselines.
It achieves the highest mean rank among Top-10 aforementioned algorithms with statistical significance.
arXiv Detail & Related papers (2024-05-25T09:47:59Z) - Rethinking Pre-Training in Tabular Data: A Neighborhood Embedding Perspective [71.45945607871715]
We propose Tabular data Pre-Training via Meta-representation (TabPTM)
The core idea is to embed data instances into a shared feature space, where each instance is represented by its distances to a fixed number of nearest neighbors and those neighbors' labels (see the sketch after this list).
Extensive experiments on 101 datasets confirm TabPTM's effectiveness in both classification and regression tasks, with and without fine-tuning.
arXiv Detail & Related papers (2023-10-31T18:03:54Z) - Generative Table Pre-training Empowers Models for Tabular Prediction [71.76829961276032]
We propose TapTap, the first attempt that leverages table pre-training to empower models for tabular prediction.
TapTap can generate high-quality synthetic tables to support various applications, including privacy protection, low resource regime, missing value imputation, and imbalanced classification.
It can be easily combined with various backbone models, including LightGBM, Multilayer Perceptron (MLP) and Transformer.
arXiv Detail & Related papers (2023-05-16T06:37:38Z) - PTab: Using the Pre-trained Language Model for Modeling Tabular Data [5.791972449406902]
Recent studies show that neural-based models are effective in learning contextual representations for tabular data.
We propose PTab, a novel framework that uses a pre-trained language model to model tabular data.
Our method achieves a better average AUC score in supervised settings compared to state-of-the-art baselines.
arXiv Detail & Related papers (2022-09-15T08:58:42Z) - TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data [113.29476656550342]
We present TaBERT, a pretrained LM that jointly learns representations for NL sentences and tables.
TaBERT is trained on a large corpus of 26 million tables and their English contexts.
Implementation of the model will be available at http://fburl.com/TaBERT.
arXiv Detail & Related papers (2020-05-17T17:26:40Z)
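Returning to the TabPTM entry above, its neighborhood meta-representation can be sketched in a few lines; the value of k, the distance metric, and the scikit-learn helper are assumptions for illustration rather than TabPTM's actual implementation.

```python
# Minimal sketch of the neighborhood meta-representation summarized for TabPTM
# above: each instance is described by its distances to its k nearest training
# neighbors together with those neighbors' labels. k, the metric, and the
# scikit-learn helper are illustrative assumptions.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def meta_representation(X_train, y_train, X_query, k=5):
    """For each query row, return the distances to its k nearest training
    neighbors concatenated with those neighbors' labels."""
    nn = NearestNeighbors(n_neighbors=k).fit(X_train)
    dists, idx = nn.kneighbors(X_query)   # each of shape (n_query, k)
    return np.concatenate([dists, y_train[idx]], axis=1)

# Toy usage: 20 training rows with 4 numeric features and binary labels.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(20, 4)), rng.integers(0, 2, size=20)
X_query = rng.normal(size=(3, 4))
print(meta_representation(X_train, y_train, X_query).shape)  # (3, 10): k distances + k labels
```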