TabRep: Training Tabular Diffusion Models with a Simple and Effective Continuous Representation
- URL: http://arxiv.org/abs/2504.04798v4
- Date: Thu, 01 May 2025 13:02:06 GMT
- Title: TabRep: Training Tabular Diffusion Models with a Simple and Effective Continuous Representation
- Authors: Jacob Si, Zijing Ou, Mike Qu, Zhengrui Xiang, Yingzhen Li
- Abstract summary: Diffusion models have been the predominant generative model for tabular data generation. We present TabRep, a tabular diffusion architecture trained with a unified continuous representation. Our results showcase that TabRep achieves superior performance across a broad suite of evaluations.
- Score: 16.907006955584343
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Diffusion models have been the predominant generative model for tabular data generation. However, they face the conundrum of modeling under a separate versus a unified data representation. The former encounters the challenge of jointly modeling all multi-modal distributions of tabular data in one model, while the latter alleviates this by learning a single representation for all features but currently leverages sparse, suboptimal encoding heuristics and incurs additional computation costs. In this work, we address the latter by presenting TabRep, a tabular diffusion architecture trained with a unified continuous representation. To motivate the design of our representation, we provide geometric insights into how the data manifold affects diffusion models. The key attributes of our representation are its density, its flexibility to provide ample separability for nominal features, and its ability to preserve intrinsic relationships. Ultimately, TabRep provides a simple yet effective approach for training tabular diffusion models under a continuous data manifold. Our results showcase that TabRep achieves superior performance across a broad suite of evaluations. It is the first to synthesize tabular data that exceeds the downstream quality of the original datasets while preserving privacy and remaining computationally efficient.
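As a concrete illustration of the unified-continuous-representation idea, the sketch below encodes a small mixed-type table into a single continuous tensor (z-scored numerical columns plus one-hot categorical columns mapped to {-1, +1}) and trains a toy epsilon-prediction denoiser on it. The specific encoding, network, and hyperparameters are illustrative assumptions, not TabRep's actual design.

```python
# Hedged sketch: encode a mixed-type table into one continuous tensor and train a
# toy DDPM-style denoiser on it. The encoding below (z-scored numerics + one-hot
# categoricals mapped to {-1, +1}) is an illustrative assumption, not TabRep's
# actual representation.
import numpy as np
import torch
import torch.nn as nn

def encode_table(x_num, x_cat, cat_cards):
    """x_num: (N, d_num) floats; x_cat: (N, d_cat) ints; cat_cards: categories per column."""
    num = (x_num - x_num.mean(0)) / (x_num.std(0) + 1e-8)          # z-score numerics
    cats = [2.0 * np.eye(k)[x_cat[:, j]] - 1.0                     # one-hot -> {-1, +1}
            for j, k in enumerate(cat_cards)]
    return torch.tensor(np.concatenate([num] + cats, axis=1), dtype=torch.float32)

rng = np.random.default_rng(0)
x0 = encode_table(rng.normal(size=(512, 2)),                       # 2 numeric columns
                  rng.integers(0, 3, size=(512, 1)), [3])          # 1 categorical column

T = 100
alphas_bar = torch.cumprod(1.0 - torch.linspace(1e-4, 0.02, T), dim=0)
denoiser = nn.Sequential(nn.Linear(x0.shape[1] + 1, 128), nn.ReLU(),
                         nn.Linear(128, x0.shape[1]))
opt = torch.optim.Adam(denoiser.parameters(), lr=1e-3)

for step in range(200):                                            # short demo loop
    t = torch.randint(0, T, (x0.shape[0],))
    noise = torch.randn_like(x0)
    a = alphas_bar[t].unsqueeze(1)
    xt = a.sqrt() * x0 + (1 - a).sqrt() * noise                    # forward diffusion
    pred = denoiser(torch.cat([xt, t.unsqueeze(1).float() / T], dim=1))
    loss = ((pred - noise) ** 2).mean()                            # epsilon-prediction loss
    opt.zero_grad(); loss.backward(); opt.step()
```

Because every feature lives in one continuous space, a single Gaussian diffusion process suffices; how the categorical encoding trades off density against separability is exactly the design question the paper addresses.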
Related papers
- Representation Learning for Tabular Data: A Comprehensive Survey [23.606506938919605]
Tabular data, structured as rows and columns, is among the most prevalent data types in machine learning classification and regression applications.
Deep Neural Networks (DNNs) have recently demonstrated promising results through their capability of representation learning.
We organize existing methods into three main categories according to their generalization capabilities.
arXiv Detail & Related papers (2025-04-17T17:58:23Z)
- CtrTab: Tabular Data Synthesis with High-Dimensional and Limited Data [16.166752861658953]
When the data dimensionality increases, existing models tend to degenerate and may perform even worse than simpler, non-diffusion-based models.
This is because limited training samples in high-dimensional space often hinder generative models from capturing the distribution accurately.
We propose CtrTab to improve the performance of diffusion-based generative models in high-dimensional, low-data scenarios.
arXiv Detail & Related papers (2025-03-09T05:01:56Z)
- A Closer Look at TabPFN v2: Strength, Limitation, and Extension [51.08999772842298]
Tabular Prior-data Fitted Network v2 (TabPFN v2) achieves unprecedented in-context learning accuracy across multiple datasets.
In this paper, we evaluate TabPFN v2 on over 300 datasets, confirming its exceptional generalization capabilities on small- to medium-scale tasks.
arXiv Detail & Related papers (2025-02-24T17:38:42Z)
- TabDiff: a Mixed-type Diffusion Model for Tabular Data Generation [91.50296404732902]
We introduce TabDiff, a joint diffusion framework that models all mixed-type distributions of tabular data in one model.
Our key innovation is the development of a joint continuous-time diffusion process for numerical and categorical data.
TabDiff achieves superior average performance over existing competitive baselines, with up to 22.5% improvement over the state-of-the-art model on pair-wise column correlation estimations.
arXiv Detail & Related papers (2024-10-27T22:58:47Z)
- Distribution-Aware Data Expansion with Diffusion Models [55.979857976023695]
We propose DistDiff, a training-free data expansion framework based on the distribution-aware diffusion model.
DistDiff consistently enhances accuracy across a diverse range of datasets compared to models trained solely on original data.
arXiv Detail & Related papers (2024-03-11T14:07:53Z)
- Debiasing Multimodal Models via Causal Information Minimization [65.23982806840182]
We study bias arising from confounders in a causal graph for multimodal data.
Robust predictive features contain diverse information that helps a model generalize to out-of-distribution data.
We use these features as confounder representations and use them via methods motivated by causal theory to remove bias from models.
arXiv Detail & Related papers (2023-11-28T16:46:14Z)
- MissDiff: Training Diffusion Models on Tabular Data with Missing Values [29.894691645801597]
This work presents a unified and principled diffusion-based framework for learning from data with missing values.
We first observe that the widely adopted "impute-then-generate" pipeline may lead to a biased learning objective.
We prove the proposed method is consistent in learning the score of data distributions, and the proposed training objective serves as an upper bound for the negative likelihood in certain cases.
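A minimal sketch of the masked-objective idea summarized above (not MissDiff's actual code): the denoising loss is evaluated only at observed entries, so imputed filler values never enter the training signal. The denoiser signature and the masking convention are assumptions.

```python
# Hedged sketch of a masked denoising objective for data with missing values:
# the squared error is averaged only over observed entries. This illustrates the
# idea summarized above; it is not MissDiff's actual implementation.
import torch

def masked_denoising_loss(denoiser, x0, observed, alphas_bar):
    """x0: (N, D) with arbitrary filler at missing positions;
    observed: (N, D) float mask, 1.0 = observed, 0.0 = missing."""
    T = alphas_bar.shape[0]
    t = torch.randint(0, T, (x0.shape[0],))
    noise = torch.randn_like(x0)
    a = alphas_bar[t].unsqueeze(1)
    xt = a.sqrt() * x0 + (1 - a).sqrt() * noise      # forward diffusion
    pred = denoiser(xt, t)                           # assumed signature: denoiser(x_t, t)
    per_entry = (pred - noise) ** 2
    return (per_entry * observed).sum() / observed.sum().clamp(min=1.0)
```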
arXiv Detail & Related papers (2023-07-02T03:49:47Z)
- TabDDPM: Modelling Tabular Data with Diffusion Models [33.202222842342465]
We introduce TabDDPM -- a diffusion model that can be universally applied to any dataset and handles any type of feature.
We extensively evaluate TabDDPM on a wide set of benchmarks and demonstrate its superiority over existing GAN/VAE alternatives.
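For intuition on how a per-feature-type diffusion model handles mixed data, here is a hedged sketch of the two forward (noising) steps such models build on: Gaussian diffusion for numerical columns and multinomial diffusion for categorical columns. The function names and simplified schedule handling are assumptions, not TabDDPM's implementation.

```python
# Hedged sketch of per-feature-type forward noising (Gaussian for numerics,
# multinomial for categoricals), simplified from the processes TabDDPM builds on.
import torch
import torch.nn.functional as F

def noise_numeric(x0_num, alpha_bar_t):
    """Gaussian forward step at cumulative signal level alpha_bar_t (a float in (0, 1))."""
    noise = torch.randn_like(x0_num)
    return (alpha_bar_t ** 0.5) * x0_num + ((1 - alpha_bar_t) ** 0.5) * noise

def noise_categorical(x0_onehot, alpha_bar_t):
    """Multinomial forward step: keep the class with prob. alpha_bar_t, otherwise
    resample uniformly over the K categories. x0_onehot: (N, K) one-hot floats."""
    K = x0_onehot.shape[1]
    probs = alpha_bar_t * x0_onehot + (1 - alpha_bar_t) / K
    idx = torch.multinomial(probs, num_samples=1).squeeze(1)
    return F.one_hot(idx, K).float()
```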
arXiv Detail & Related papers (2022-09-30T12:26:14Z)
- Generative Models as Distributions of Functions [72.2682083758999]
Generative models are typically trained on grid-like data such as images.
In this paper, we abandon discretized grids and instead parameterize individual data points by continuous functions.
arXiv Detail & Related papers (2021-02-09T11:47:55Z)
- Relation-Guided Representation Learning [53.60351496449232]
We propose a new representation learning method that explicitly models and leverages sample relations.
Our framework preserves the relations between samples well.
By seeking to embed samples into a subspace, we show that our method can address the large-scale and out-of-sample problems.
arXiv Detail & Related papers (2020-07-11T10:57:45Z)