TabTreeFormer: Tabular Data Generation Using Hybrid Tree-Transformer
- URL: http://arxiv.org/abs/2501.01216v6
- Date: Fri, 16 May 2025 11:34:38 GMT
- Title: TabTreeFormer: Tabular Data Generation Using Hybrid Tree-Transformer
- Authors: Jiayu Li, Bingyin Zhao, Zilong Zhao, Uzair Javaid, Kevin Yee, Biplab Sikdar
- Abstract summary: TabTreeFormer is a hybrid transformer architecture that integrates inductive biases of tree-based models. We show that TabTreeFormer consistently outperforms baselines in utility, fidelity, and privacy metrics with competitive efficiency.
- Score: 14.330758748478281
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformers have shown impressive results in tabular data generation. However, they lack domain-specific inductive biases which are critical for preserving the intrinsic characteristics of tabular data. They also suffer from poor scalability and efficiency due to quadratic computational complexity. In this paper, we propose TabTreeFormer, a hybrid transformer architecture that integrates inductive biases of tree-based models (i.e., non-smoothness and non-rotational invariance) to effectively handle the discrete and weakly correlated features in tabular datasets. To improve numerical fidelity and capture multimodal distributions, we introduce a novel tokenizer that learns token sequences based on the complexity of tabular values. This reduces vocabulary size and sequence length, yielding more compact and efficient representations without sacrificing performance. We evaluate TabTreeFormer on nine diverse datasets, benchmarking against eight generative models. We show that TabTreeFormer consistently outperforms baselines in utility, fidelity, and privacy metrics with competitive efficiency. Notably, in scenarios prioritizing data utility over privacy and efficiency, the best variant of TabTreeFormer delivers a 44% performance gain relative to its baseline variant.
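The tokenizer described in the abstract lends itself to a compact illustration. Below is a minimal sketch of a complexity-aware numeric tokenizer: each value maps to a coarse token (which mode of the column's distribution it falls in) plus a fine token (its position within that mode), so vocabulary size and sequence length stay small while multimodal structure is preserved. The two-level K-means scheme and the cluster counts are illustrative assumptions, not the paper's exact algorithm.
```python
import numpy as np
from sklearn.cluster import KMeans

class ComplexityAwareTokenizer:
    """Encode a numeric column as (coarse, fine) token pairs."""
    def __init__(self, n_coarse=16, n_fine=16):
        self.n_coarse, self.n_fine = n_coarse, n_fine

    def fit(self, column):
        x = np.asarray(column, dtype=float).reshape(-1, 1)
        k = min(self.n_coarse, len(np.unique(x)))   # coarse clusters ~ modes
        self.coarse = KMeans(n_clusters=k, n_init=10).fit(x)
        self.fine = {}
        for c in range(k):
            member = x[self.coarse.labels_ == c]
            kf = min(self.n_fine, len(np.unique(member)))  # within-mode detail
            self.fine[c] = KMeans(n_clusters=kf, n_init=10).fit(member)
        return self

    def encode(self, value):
        v = np.array([[float(value)]])
        c = int(self.coarse.predict(v)[0])
        f = int(self.fine[c].predict(v)[0])
        return c, f  # two tokens per cell; vocab size is only n_coarse + n_fine

rng = np.random.default_rng(0)
bimodal = np.concatenate([rng.normal(0, 1, 500), rng.normal(50, 5, 500)])
tok = ComplexityAwareTokenizer().fit(bimodal)
print(tok.encode(49.3))
```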
Related papers
- RO-FIGS: Efficient and Expressive Tree-Based Ensembles for Tabular Data [10.610270769561811]
Tree-based models are robust to uninformative features and can accurately capture non-smooth, complex decision boundaries.
We propose Random Oblique Fast Interpretable Greedy-Tree Sums (RO-FIGS).
RO-FIGS builds on Fast Interpretable Greedy-Tree Sums and extends it by learning trees with oblique or multivariate splits.
We evaluate RO-FIGS on 22 real-world datasets, demonstrating superior performance and much smaller models than other tree- and neural-network-based methods.
arXiv Detail & Related papers (2025-04-09T14:35:24Z)
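To make the oblique-split idea above concrete, here is a toy contrast between an axis-aligned split and an oblique (multivariate) split; the weights and threshold are made up for illustration, not learned by the RO-FIGS procedure.
```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))  # six rows, three features

# Axis-aligned split: route on a single feature against a threshold.
left_axis = X[:, 0] <= 0.2

# Oblique split: route on a linear combination of several features.
w, t = np.array([0.7, -0.4, 0.1]), 0.2
left_oblique = X @ w <= t

print(left_axis, left_oblique)
```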
- TabRep: Training Tabular Diffusion Models with a Simple and Effective Continuous Representation [16.907006955584343]
Diffusion models have been the predominant generative models for tabular data.
We present TabRep, a tabular diffusion architecture trained with a unified continuous representation.
Our results showcase that TabRep achieves superior performance across a broad suite of evaluations.
arXiv Detail & Related papers (2025-04-07T07:44:27Z)
- Efficient Autoregressive Shape Generation via Octree-Based Adaptive Tokenization [68.07464514094299]
Existing methods encode all shapes into a fixed-size token sequence, disregarding the inherent variations in scale and complexity across 3D data.
We introduce Octree-based Adaptive Tokenization, a novel framework that adjusts the dimension of latent representations according to shape complexity.
Our approach reduces token counts by 50% compared to fixed-size methods while maintaining comparable visual quality.
arXiv Detail & Related papers (2025-04-03T17:57:52Z)
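The adaptive-tokenization idea above can be sketched with a plain octree: cells containing more geometry are subdivided further, so complex shapes receive more leaf cells (and hence more tokens) than simple ones. The point threshold and depth cap are assumptions; the paper's learned latent representations are not modeled here.
```python
import numpy as np

def octree_cells(points, lo, hi, max_points=32, depth=0, max_depth=6):
    """Return leaf cells; each leaf would be encoded as one latent token."""
    inside = points[np.all((points >= lo) & (points < hi), axis=1)]
    if len(inside) <= max_points or depth == max_depth:
        return [(lo, hi)]
    mid = (lo + hi) / 2
    cells = []
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                corner = np.array([dx, dy, dz])
                clo = np.where(corner == 0, lo, mid)
                chi = np.where(corner == 0, mid, hi)
                cells += octree_cells(inside, clo, chi, max_points, depth + 1, max_depth)
    return cells

pts = np.random.default_rng(1).random((2000, 3))  # toy point cloud in [0,1)^3
print(len(octree_cells(pts, np.zeros(3), np.ones(3))))  # token count varies with density
```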
- How Well Does Your Tabular Generator Learn the Structure of Tabular Data? [10.974400005358193]
In this paper, we introduce TabStruct, a novel evaluation benchmark that positions structural fidelity as a core evaluation dimension.
We show that structural fidelity offers a task-independent, domain-agnostic evaluation dimension.
arXiv Detail & Related papers (2025-03-12T14:54:58Z)
- CtrTab: Tabular Data Synthesis with High-Dimensional and Limited Data [16.166752861658953]
When the data dimensionality increases, existing models tend to degenerate and may perform even worse than simpler, non-diffusion-based models.
This is because limited training samples in high-dimensional space often hinder generative models from capturing the distribution accurately.
We propose CtrTab to improve the performance of diffusion-based generative models in high-dimensional, low-data scenarios.
arXiv Detail & Related papers (2025-03-09T05:01:56Z)
- A Closer Look at TabPFN v2: Strength, Limitation, and Extension [51.08999772842298]
Tabular Prior-data Fitted Network v2 (TabPFN v2) achieves unprecedented in-context learning accuracy across multiple datasets.
In this paper, we evaluate TabPFN v2 on over 300 datasets, confirming its exceptional generalization capabilities on small- to medium-scale tasks.
arXiv Detail & Related papers (2025-02-24T17:38:42Z)
- Transformers Boost the Performance of Decision Trees on Tabular Data across Sample Sizes [135.68092471784516]
We propose a simple and lightweight approach for fusing large language models and gradient-boosted decision trees.
We name our fusion methods LLM-Boost and PFN-Boost, respectively.
We demonstrate state-of-the-art performance against numerous baselines and ensembling algorithms.
arXiv Detail & Related papers (2025-02-04T19:30:41Z)
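The summary above does not spell out the fusion mechanism, but one common pattern for combining a pretrained predictor with gradient-boosted trees is residual stacking: the trees are fit on what the prior model fails to explain. The sketch below shows that generic pattern with a synthetic stand-in for the LLM's predictions; it is an assumption about the shape of the approach, not the paper's recipe.
```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
y = X[:, 0] * 2 + np.sin(X[:, 1]) + rng.normal(scale=0.1, size=500)

prior = X[:, 0] * 1.5            # stand-in for an LLM's zero-shot prediction
residual = y - prior             # what the prior fails to explain
gbdt = GradientBoostingRegressor().fit(X, residual)

fused = prior + gbdt.predict(X)  # fused prediction
print(np.mean((fused - y) ** 2) < np.mean((prior - y) ** 2))  # True
```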
- Structural Entropy Guided Probabilistic Coding [52.01765333755793]
We propose a novel structural entropy-guided probabilistic coding model, named SEPC.
We incorporate the relationship between latent variables into the optimization by proposing a structural entropy regularization loss.
Experimental results across 12 natural language understanding tasks, including both classification and regression tasks, demonstrate the superior performance of SEPC.
arXiv Detail & Related papers (2024-12-12T00:37:53Z)
- Diffusion-nested Auto-Regressive Synthesis of Heterogeneous Tabular Data [56.48119008663155]
This paper proposes a Diffusion-nested Autoregressive model (TabDAR) to address the challenges of synthesizing heterogeneous tabular data.
We conduct extensive experiments on ten datasets with distinct properties, and the proposed TabDAR outperforms previous state-of-the-art methods by 18% to 45% on eight metrics across three distinct aspects.
arXiv Detail & Related papers (2024-10-28T20:49:26Z)
- TabDiff: a Multi-Modal Diffusion Model for Tabular Data Generation [91.50296404732902]
We introduce TabDiff, a joint diffusion framework that models all multi-modal distributions of tabular data in one model.
Our key innovation is the development of a joint continuous-time diffusion process for numerical and categorical data.
TabDiff achieves superior average performance over existing competitive baselines, with up to 22.5% improvement over the state-of-the-art model on pair-wise column correlation estimations.
arXiv Detail & Related papers (2024-10-27T22:58:47Z)
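A joint continuous-time process over mixed-type rows can be sketched as follows: at time t, numerical entries are interpolated toward Gaussian noise while categorical entries are independently masked with probability t. The linear schedules and the stand-in mask token are illustrative assumptions; TabDiff's learnable noise schedules and reverse (denoising) process are in the paper.
```python
import numpy as np

MASK = -1  # stand-in mask token id

def forward_noise(num, cat, t, rng):
    """t in [0, 1]: 0 = clean data, 1 = pure noise / fully masked."""
    noisy_num = np.sqrt(1 - t) * num + np.sqrt(t) * rng.normal(size=num.shape)
    mask = rng.random(cat.shape) < t
    noisy_cat = np.where(mask, MASK, cat)
    return noisy_num, noisy_cat

rng = np.random.default_rng(0)
num = rng.normal(size=(4, 2))            # numerical columns
cat = rng.integers(0, 5, size=(4, 3))    # categorical columns (token ids)
print(forward_noise(num, cat, t=0.5, rng=rng))
```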
- A Survey on Deep Tabular Learning [0.0]
Tabular data presents unique challenges for deep learning due to its heterogeneous nature and lack of spatial structure.
This survey reviews the evolution of deep learning models for tabular data, from early fully connected networks (FCNs) to advanced architectures like TabNet, SAINT, TabTranSELU, and MambaNet.
arXiv Detail & Related papers (2024-10-15T20:08:08Z)
- Unmasking Trees for Tabular Data [0.0]
We present UnmaskingTrees, a simple method for tabular imputation (and generation) employing gradient-boosted decision trees.
To solve the conditional generation subproblem, we propose BaltoBot, which fits a balanced tree of boosted tree classifiers.
Unlike older methods, it requires no parametric assumption on the conditional distribution, accommodating features with multimodal distributions.
We finally consider our two approaches as meta-algorithms, demonstrating in-context learning-based generative modeling with TabPFN.
arXiv Detail & Related papers (2024-07-08T04:15:43Z)
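The conditional-generation subproblem above can be sketched with off-the-shelf boosted trees: model each column's conditional distribution, given the columns before it, as a classifier over quantile bins, then sample column by column. The flat per-column classifier below is a simplified stand-in for BaltoBot's balanced tree of boosted classifiers, and the uniform within-bin draw is an assumption.
```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
X[:, 2] += 2 * X[:, 0]  # column 2 depends on column 0

n_bins = 8
edges = [np.quantile(X[:, j], np.linspace(0, 1, n_bins + 1)) for j in range(3)]
models = []
for j in range(1, 3):  # model column j given columns < j
    bins = np.digitize(X[:, j], edges[j][1:-1])  # bin ids in 0..n_bins-1
    models.append(HistGradientBoostingClassifier().fit(X[:, :j], bins))

def sample_row():
    row = [rng.choice(X[:, 0])]  # bootstrap the first column empirically
    for j, m in enumerate(models, start=1):
        p = m.predict_proba(np.array(row).reshape(1, -1))[0]
        b = rng.choice(m.classes_, p=p)          # sample a bin id
        row.append(rng.uniform(edges[j][b], edges[j][b + 1]))  # draw within bin
    return row

print(sample_row())
```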
- An improved tabular data generator with VAE-GMM integration [9.4491536689161]
We propose a novel Variational Autoencoder (VAE)-based model that addresses limitations of current approaches.
Inspired by the TVAE model, our approach incorporates a Bayesian Gaussian Mixture model (BGM) within the VAE architecture.
We thoroughly validate our model on three real-world datasets with mixed data types, including two medically relevant ones.
arXiv Detail & Related papers (2024-04-12T12:31:06Z)
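The role of the Bayesian Gaussian Mixture can be illustrated with the mode-specific normalization used in the TVAE line of work this paper builds on: a numeric value is encoded as the mode it belongs to plus its standardized offset within that mode. How the BGM is wired into the VAE itself is the paper's contribution and is not shown; the component count and 4-sigma scaling follow CTGAN conventions and are assumptions here.
```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
col = np.concatenate([rng.normal(0, 1, 500), rng.normal(10, 0.5, 500)])  # bimodal

bgm = BayesianGaussianMixture(n_components=10, weight_concentration_prior=1e-3)
bgm.fit(col.reshape(-1, 1))  # unused components get near-zero weight

def encode(v):
    mode = int(bgm.predict([[v]])[0])
    mu = bgm.means_[mode, 0]
    sigma = np.sqrt(bgm.covariances_[mode, 0, 0])
    return mode, (v - mu) / (4 * sigma)  # CTGAN-style scaling into roughly [-1, 1]

def decode(mode, z):
    mu = bgm.means_[mode, 0]
    sigma = np.sqrt(bgm.covariances_[mode, 0, 0])
    return z * 4 * sigma + mu

m, z = encode(9.7)
print(m, z, decode(m, z))
```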
- Making Pre-trained Language Models Great on Tabular Prediction [50.70574370855663]
Transfer learning with deep neural networks (DNNs) has made significant progress in image and language processing.
We present TP-BERTa, a specifically pre-trained LM for tabular data prediction.
A novel relative magnitude tokenization converts scalar numerical feature values to finely discrete, high-dimensional tokens, and an intra-feature attention approach integrates feature values with the corresponding feature names.
arXiv Detail & Related papers (2024-03-04T08:38:56Z)
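Relative magnitude tokenization, as described above, discretizes each scalar into one of a fixed set of magnitude bins learned from training quantiles, so the language model sees a discrete token alongside the feature-name tokens. The bin count and the pairing format below are illustrative assumptions.
```python
import numpy as np

class MagnitudeTokenizer:
    def __init__(self, n_bins=16):
        self.n_bins = n_bins

    def fit(self, column):
        qs = np.linspace(0, 1, self.n_bins + 1)[1:-1]
        self.edges = np.quantile(column, qs)  # per-feature bin boundaries
        return self

    def tokenize(self, value):
        return int(np.digitize(value, self.edges))  # magnitude token id in 0..n_bins-1

ages = np.random.default_rng(0).integers(18, 90, size=1000)
tok = MagnitudeTokenizer().fit(ages)
# A cell "age = 42" would surface as feature-name tokens plus one magnitude token:
print(["age"], tok.tokenize(42))
```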
- In-Context Data Distillation with TabPFN [11.553950697974825]
In-context data distillation (ICD) is a novel methodology that effectively eliminates TabPFN's data-size constraints by optimizing its context.
ICD efficiently enables TabPFN to handle significantly larger datasets with a fixed memory budget, improving TabPFN's quadratic memory complexity but at the cost of a linear number of tuning steps.
arXiv Detail & Related papers (2024-02-10T15:23:45Z)
- Efficient Nonparametric Tensor Decomposition for Binary and Count Data [27.02813234958821]
We propose ENTED, an Efficient Nonparametric TEnsor Decomposition for binary and count tensors.
arXiv Detail & Related papers (2024-01-15T14:27:03Z)
- Convergent Boosted Smoothing for Modeling Graph Data with Tabular Node Features [46.052312251801]
We propose a framework for iterating boosting with graph propagation steps.
Our approach is anchored in a principled meta loss function.
Across a variety of non-iid graph datasets, our method achieves comparable or superior performance.
arXiv Detail & Related papers (2021-10-26T04:53:12Z)
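The iterate-boost-then-propagate loop above can be sketched directly: fit boosted trees on node features against the current residual, smooth predictions over the row-normalized adjacency, and repeat. The fixed mixing weight and three rounds below are assumptions standing in for the paper's principled meta-loss formulation.
```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 5))
A = (rng.random((n, n)) < 0.05).astype(float)
A = np.maximum(A, A.T)                          # undirected adjacency
P = A / np.maximum(A.sum(1, keepdims=True), 1)  # row-normalized propagation
y = X[:, 0] + rng.normal(scale=0.1, size=n)

pred, alpha = np.zeros(n), 0.5
for _ in range(3):
    booster = GradientBoostingRegressor().fit(X, y - pred)  # boost on residuals
    pred = pred + booster.predict(X)
    pred = (1 - alpha) * pred + alpha * P @ pred            # graph propagation step
print(np.mean((pred - y) ** 2))
```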
- nnFormer: Interleaved Transformer for Volumetric Segmentation [50.10441845967601]
We introduce nnFormer, a powerful segmentation model with an interleaved architecture based on an empirical combination of self-attention and convolution.
nnFormer achieves substantial improvements over previous transformer-based methods on two commonly used datasets, Synapse and ACDC.
arXiv Detail & Related papers (2021-09-07T17:08:24Z)
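A toy interleaved block conveys the architectural idea above: a 3D convolution extracts local volumetric features, then self-attention over the flattened voxel tokens adds global context, with a residual connection and layer norm. Real nnFormer uses volume-based windowed attention inside a full encoder-decoder, so this is a shape sketch only.
```python
import torch
import torch.nn as nn

class InterleavedBlock(nn.Module):
    def __init__(self, channels=32, heads=4):
        super().__init__()
        self.conv = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x):                           # x: (B, C, D, H, W)
        x = torch.relu(self.conv(x))                # local volumetric features
        b, c, d, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)       # (B, D*H*W, C)
        attended, _ = self.attn(tokens, tokens, tokens)
        tokens = self.norm(tokens + attended)       # global context, residual
        return tokens.transpose(1, 2).reshape(b, c, d, h, w)

print(InterleavedBlock()(torch.randn(1, 32, 8, 8, 8)).shape)
```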