Cascaded Flow Matching for Heterogeneous Tabular Data with Mixed-Type Features
- URL: http://arxiv.org/abs/2601.22816v1
- Date: Fri, 30 Jan 2026 10:42:10 GMT
- Title: Cascaded Flow Matching for Heterogeneous Tabular Data with Mixed-Type Features
- Authors: Markus Mueller, Kathrin Gruber, Dennis Fok
- Abstract summary: We advance the state-of-the-art in diffusion models for tabular data with a cascaded approach. Low-resolution representation of numerical features accounts for discrete outcomes, such as missing or inflated values. Results indicate that our model generates significantly more realistic samples and captures distributional details more accurately.
- Score: 5.620334754517149
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Advances in generative modeling have recently been adapted to tabular data containing discrete and continuous features. However, generating mixed-type features that combine discrete states with an otherwise continuous distribution in a single feature remains challenging. We advance the state-of-the-art in diffusion models for tabular data with a cascaded approach. We first generate a low-resolution version of a tabular data row, that is, the collection of the purely categorical features and a coarse categorical representation of numerical features. Next, this information is leveraged in the high-resolution flow matching model via a novel guided conditional probability path and data-dependent coupling. The low-resolution representation of numerical features explicitly accounts for discrete outcomes, such as missing or inflated values, and therewith enables a more faithful generation of mixed-type features. We formally prove that this cascade tightens the transport cost bound. The results indicate that our model generates significantly more realistic samples and captures distributional details more accurately, for example, the detection score increases by 40%.
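The "coarse categorical representation of numerical features" described in the abstract can be illustrated with a small sketch. This is only one plausible reading, not the authors' implementation: the function name `coarsen_feature`, the parameters `inflated_values` and `n_bins`, and the choice of rank-based bins are all assumptions made for illustration. Missing values and inflated values (for example, an excess of exact zeros) each get their own category, and the remaining values are grouped into roughly equal-frequency bins.

```python
import math
from bisect import bisect_right


def coarsen_feature(values, inflated_values=(0.0,), n_bins=4):
    """Map a mixed-type numerical column to coarse integer categories.

    Hypothetical helper (not from the paper). Code 0 marks missing values
    (NaN), codes 1..K mark the K inflated values, and the remaining codes
    index approximately equal-frequency bins of all other values.
    """
    inflated = list(inflated_values)
    offset = 1 + len(inflated)

    # Values that are neither missing nor inflated define the bin edges.
    regular = sorted(
        v for v in values if not math.isnan(v) and v not in inflated
    )
    # Interior edges taken at evenly spaced ranks of the regular values.
    edges = (
        [regular[(i * len(regular)) // n_bins] for i in range(1, n_bins)]
        if regular
        else []
    )

    codes = []
    for v in values:
        if math.isnan(v):
            codes.append(0)  # missing value
        elif v in inflated:
            codes.append(1 + inflated.index(v))  # inflated discrete state
        else:
            codes.append(offset + bisect_right(edges, v))  # coarse bin
    return codes


column = [float("nan"), 0.0, 0.0, 1.0, 2.0, 3.0, 4.0]
print(coarsen_feature(column, inflated_values=(0.0,), n_bins=2))
# [0, 1, 1, 2, 2, 3, 3]
```

In a cascade of the kind the abstract describes, codes like these would be generated first together with the purely categorical features, and a second model would then fill in the exact numerical value conditioned on the code.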
Related papers
- Diffusion-Driven High-Dimensional Variable Selection [6.993247097440294]
We propose a resample-aggregate framework that exploits diffusion models' ability to generate high-fidelity synthetic data.
We show that the proposed method is selection consistent under mild assumptions.
Our method advances variable selection methodology and broadens the toolkit for interpretable, statistically rigorous analysis.
arXiv Detail & Related papers (2025-08-19T14:54:20Z)
- TabRep: Training Tabular Diffusion Models with a Simple and Effective Continuous Representation [16.907006955584343]
Diffusion models have been the predominant generative model for data generation.
We present TabRep, a training architecture trained with a unified continuous representation.
Our results showcase that TabRep achieves superior performance across a broad suite of evaluations.
arXiv Detail & Related papers (2025-04-07T07:44:27Z)
- TabDiff: a Mixed-type Diffusion Model for Tabular Data Generation [91.50296404732902]
We introduce TabDiff, a joint diffusion framework that models all mixed-type distributions of tabular data in one model.
Our key innovation is the development of a joint continuous-time diffusion process for numerical and categorical data.
TabDiff achieves superior average performance over existing competitive baselines, with up to 22.5% improvement over the state-of-the-art model on pair-wise column correlation estimations.
arXiv Detail & Related papers (2024-10-27T22:58:47Z)
- Discrete Flow Matching [74.04153927689313]
We present a novel discrete flow paradigm designed specifically for generating discrete data.
Our approach is capable of generating high-quality discrete data in a non-autoregressive fashion.
arXiv Detail & Related papers (2024-07-22T12:33:27Z)
- Continuous Diffusion for Mixed-Type Tabular Data [2.7992435001846827]
We propose CDTD, a Continuous Diffusion model for mixed-type Tabular Data.
It is based on a novel combination of score matching and score enforcing a unified continuous noise distribution for both continuous and categorical features.
Our experimental results show that CDTD consistently outperforms state-of-the-art benchmark models.
arXiv Detail & Related papers (2023-12-16T12:21:03Z)
- Generalization Bound for Diffusion Models using Random Features [0.0]
We present a diffusion model-inspired deep random feature model that is interpretable.
We derive generalization bounds between the distribution of sampled data and the true distribution using properties of score matching.
We validate our findings by generating samples on the fashion MNIST dataset and instrumental audio data.
arXiv Detail & Related papers (2023-10-06T17:59:05Z)
- ChiroDiff: Modelling chirographic data with Diffusion Models [132.5223191478268]
We introduce a powerful model class, namely "Denoising Diffusion Probabilistic Models" (DDPMs), for chirographic data.
Our model, named "ChiroDiff", is non-autoregressive; it learns to capture holistic concepts and therefore remains resilient to higher temporal sampling rates.
arXiv Detail & Related papers (2023-04-07T15:17:48Z)
- Score Approximation, Estimation and Distribution Recovery of Diffusion Models on Low-Dimensional Data [68.62134204367668]
This paper studies score approximation, estimation, and distribution recovery of diffusion models, when data are supported on an unknown low-dimensional linear subspace.
We show that with a properly chosen neural network architecture, the score function can be both accurately approximated and efficiently estimated.
The generated distribution based on the estimated score function captures the data geometric structures and converges to a close vicinity of the data distribution.
arXiv Detail & Related papers (2023-02-14T17:02:35Z)
- Breaking the Spurious Causality of Conditional Generation via Fairness Intervention with Corrective Sampling [77.15766509677348]
Conditional generative models often inherit spurious correlations from the training dataset.
This can result in label-conditional distributions that are imbalanced with respect to another latent attribute.
We propose a general two-step strategy to mitigate this issue.
arXiv Detail & Related papers (2022-12-05T08:09:33Z)
- Score-based Continuous-time Discrete Diffusion Models [102.65769839899315]
We extend diffusion models to discrete variables by introducing a Markov jump process where the reverse process denoises via a continuous-time Markov chain.
We show that an unbiased estimator can be obtained by simply matching the conditional marginal distributions.
We demonstrate the effectiveness of the proposed method on a set of synthetic and real-world music and image benchmarks.
arXiv Detail & Related papers (2022-11-30T05:33:29Z)
- Good Classifiers are Abundant in the Interpolating Regime [64.72044662855612]
We develop a methodology to compute precisely the full distribution of test errors among interpolating classifiers.
We find that test errors tend to concentrate around a small typical value $\varepsilon^*$, which deviates substantially from the test error of the worst-case interpolating model.
Our results show that the usual style of analysis in statistical learning theory may not be fine-grained enough to capture the good generalization performance observed in practice.
arXiv Detail & Related papers (2020-06-22T21:12:31Z)