Flow Matching for Tabular Data Synthesis
- URL: http://arxiv.org/abs/2512.00698v1
- Date: Sun, 30 Nov 2025 02:18:04 GMT
- Title: Flow Matching for Tabular Data Synthesis
- Authors: Bahrul Ilmi Nasution, Floor Eijkelboom, Mark Elliot, Richard Allmendinger, Christian A. Naesseth,
- Abstract summary: Flow matching is an important tool for privacy-preserving data sharing.<n>This paper compares flow matching with a state-of-the-art diffusion method.<n>We find that flow matching, particularly TabbyFlow, outperforms diffusion baselines.
- Score: 6.009900118732673
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Synthetic data generation is an important tool for privacy-preserving data sharing. While diffusion models have set recent benchmarks, flow matching (FM) offers a promising alternative. This paper presents different ways to implement flow matching for tabular data synthesis. We provide a comprehensive empirical study that compares flow matching (FM and variational FM) with a state-of-the-art diffusion method (TabDDPM and TabSyn) in tabular data synthesis. We evaluate both the standard Optimal Transport (OT) and the Variance Preserving (VP) probability paths, and also compare deterministic and stochastic samplers -- something possible when learning to generate using \textit{variational} flow matching -- characterising the empirical relationship between data utility and privacy risk. Our key findings reveal that flow matching, particularly TabbyFlow, outperforms diffusion baselines. Flow matching methods also achieves better performance with remarkably low function evaluations ($\leq$ 100 steps), offering a substantial computational advantage. The choice of probability path is also crucial, as using the OT path demonstrates superior performance, while VP has potential for producing synthetic data with lower disclosure risk. Lastly, our results show that making flows stochastic not only preserves marginal distributions but, in some instances, enables the generation of high utility synthetic data with reduced disclosure risk.
Related papers
- GeodesicNVS: Probability Density Geodesic Flow Matching for Novel View Synthesis [54.39598154430305]
We propose a Data-to-Data Flow Matching framework that learns deterministic transformations directly between paired views.<n>PDG-FM constrains flow trajectories using geodesic interpolants derived from probability density metrics of pretrained diffusion models.<n>These results highlight the advantages of incorporating data-dependent geometric regularization into deterministic flow matching for consistent novel view generation.
arXiv Detail & Related papers (2026-03-01T09:30:11Z) - Path-Guided Flow Matching for Dataset Distillation [9.761850986508895]
We propose the first flow matching-based framework for generative distillation, which enables fast deterministic synthesis by solving an ODE in a few steps.<n>We develop a continuous path-to-prototype guidance algorithm for ODE-consistent path control, which allows trajectories to reliably land on assigned prototypes.
arXiv Detail & Related papers (2026-02-05T12:52:32Z) - Exponential Family Variational Flow Matching for Tabular Data Generation [10.161936647987517]
We develop TabbyFlow, a variational Flow Matching (VFM) method for tabular data generation.<n>We introduce Exponential Family Variational Flow Matching (EF-VFM), which represents heterogeneous data types.<n>We also establish a connection between variational flow matching and generalized flow matching objectives based on Bregman divergences.
arXiv Detail & Related papers (2025-06-06T10:07:48Z) - LLM-TabLogic: Preserving Inter-Column Logical Relationships in Synthetic Tabular Data via Prompt-Guided Latent Diffusion [49.898152180805454]
Synthetic datasets must maintain domain-specific logical consistency.<n>Existing generative models often overlook these inter-column relationships.<n>This study presents the first method to effectively preserve inter-column relationships without requiring domain knowledge.
arXiv Detail & Related papers (2025-03-04T00:47:52Z) - TabDiff: a Mixed-type Diffusion Model for Tabular Data Generation [91.50296404732902]
We introduce TabDiff, a joint diffusion framework that models all mixed-type distributions of tabular data in one model.<n>Our key innovation is the development of a joint continuous-time diffusion process for numerical and categorical data.<n>TabDiff achieves superior average performance over existing competitive baselines, with up to $22.5%$ improvement over the state-of-the-art model on pair-wise column correlation estimations.
arXiv Detail & Related papers (2024-10-27T22:58:47Z) - Multi-objective evolutionary GAN for tabular data synthesis [0.873811641236639]
Synthetic data has a key role to play in data sharing by statistical agencies and other generators of statistical data products.
This paper proposes a smart MO evolutionary conditional GAN (SMOE-CTGAN) for synthetic data.
Our results indicate that SMOE-CTGAN is able to discover synthetic datasets with different risk and utility levels for multiple national census datasets.
arXiv Detail & Related papers (2024-04-15T23:07:57Z) - Balanced Mixed-Type Tabular Data Synthesis with Diffusion Models [14.651592234678722]
Current diffusion models tend to inherit bias in the training dataset and generate biased synthetic data.<n>We introduce a novel model that incorporates sensitive guidance to generate fair synthetic data with balanced joint distributions of the target label and sensitive attributes.<n>Our method effectively mitigates bias in training data while maintaining the quality of the generated samples.
arXiv Detail & Related papers (2024-04-12T06:08:43Z) - Theoretical Insights for Diffusion Guidance: A Case Study for Gaussian
Mixture Models [59.331993845831946]
Diffusion models benefit from instillation of task-specific information into the score function to steer the sample generation towards desired properties.
This paper provides the first theoretical study towards understanding the influence of guidance on diffusion models in the context of Gaussian mixture models.
arXiv Detail & Related papers (2024-03-03T23:15:48Z) - FedTabDiff: Federated Learning of Diffusion Probabilistic Models for
Synthetic Mixed-Type Tabular Data Generation [5.824064631226058]
We introduce textitFederated Tabular Diffusion (FedTabDiff) for generating high-fidelity mixed-type tabular data without centralized access to the original datasets.
FedTabDiff realizes a decentralized learning scheme that permits multiple entities to collaboratively train a generative model while respecting data privacy and locality.
Experimental evaluations on real-world financial and medical datasets attest to the framework's capability to produce synthetic data that maintains high fidelity, utility, privacy, and coverage.
arXiv Detail & Related papers (2024-01-11T21:17:50Z) - MargCTGAN: A "Marginally'' Better CTGAN for the Low Sample Regime [63.851085173614]
MargCTGAN adds feature matching of de-correlated marginals, which results in a consistent improvement in downstream utility as well as statistical properties of the synthetic data.
arXiv Detail & Related papers (2023-07-16T10:28:49Z) - Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z) - Semi-Supervised Learning with Normalizing Flows [54.376602201489995]
FlowGMM is an end-to-end approach to generative semi supervised learning with normalizing flows.
We show promising results on a wide range of applications, including AG-News and Yahoo Answers text data.
arXiv Detail & Related papers (2019-12-30T17:36:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.