A Comprehensive Survey of Synthetic Tabular Data Generation
- URL: http://arxiv.org/abs/2504.16506v2
- Date: Sat, 10 May 2025 06:10:06 GMT
- Title: A Comprehensive Survey of Synthetic Tabular Data Generation
- Authors: Ruxue Shi, Yili Wang, Mengnan Du, Xu Shen, Xin Wang,
- Abstract summary: Tabular data is one of the most prevalent and critical data formats across diverse real-world applications. It is often constrained by challenges such as data scarcity, privacy concerns, and class imbalance. Synthetic data generation has emerged as a promising solution, leveraging generative models to learn the distribution of real datasets.
- Score: 27.112327373017457
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Tabular data remains one of the most prevalent and critical data formats across diverse real-world applications. However, its effective use in machine learning (ML) is often constrained by challenges such as data scarcity, privacy concerns, and class imbalance. Synthetic data generation has emerged as a promising solution, leveraging generative models to learn the distribution of real datasets and produce high-fidelity, privacy-preserving samples. Various generative paradigms have been explored, including energy-based models (EBMs), variational autoencoders (VAEs), generative adversarial networks (GANs), large language models (LLMs), and diffusion models. While several surveys have investigated synthetic tabular data generation, most focus on narrow subdomains or specific generative methods, such as GANs, diffusion models, or privacy-preserving techniques. This limited scope often results in fragmented insights, lacking a comprehensive synthesis that bridges diverse approaches. In particular, recent advances driven by LLMs and diffusion-based models remain underexplored. This gap hinders a holistic understanding of the field's evolution, methodological interplay, and open challenges. To address this, our survey provides a unified and systematic review of synthetic tabular data generation. Our contributions are threefold: (1) we propose a comprehensive taxonomy that organizes existing methods into traditional approaches, diffusion-based methods, and LLM-based models, and provide an in-depth comparative analysis; (2) we detail the complete pipeline for synthetic tabular data generation, including data synthesis, post-processing, and evaluation; (3) we identify major challenges, explore real-world applications, and outline open research questions and future directions to guide future work in this rapidly evolving area.
Related papers
- Synthetic Tabular Data Generation: A Comparative Survey for Modern Techniques [6.744437741221969]
As privacy regulations become more stringent and access to real-world data becomes increasingly constrained, synthetic data generation has emerged as a vital solution. This review prioritizes the actionable goals that drive synthetic data creation, including conditional generation and risk-sensitive modeling.
arXiv Detail & Related papers (2025-07-15T14:57:23Z) - Anomaly Detection and Generation with Diffusion Models: A Survey [51.61574868316922]
Anomaly detection (AD) plays a pivotal role across diverse domains, including cybersecurity, finance, healthcare, and industrial manufacturing. Recent advancements in deep learning, specifically diffusion models (DMs), have sparked significant interest. This survey aims to guide researchers and practitioners in leveraging DMs for innovative AD solutions across diverse applications.
arXiv Detail & Related papers (2025-06-11T03:29:18Z) - An Empirical Study of Validating Synthetic Data for Text-Based Person Retrieval [51.10419281315848]
We conduct an empirical study to explore the potential of synthetic data for Text-Based Person Retrieval (TBPR) research. We propose an inter-class image generation pipeline, in which an automatic prompt construction strategy is introduced. We develop an intra-class image augmentation pipeline, in which the generative AI models are applied to further edit the images.
arXiv Detail & Related papers (2025-03-28T06:18:15Z) - Empowering Time Series Analysis with Synthetic Data: A Survey and Outlook in the Era of Foundation Models [104.17057231661371]
Time series analysis is crucial for understanding dynamics of complex systems.
Recent advances in foundation models have led to task-agnostic Time Series Foundation Models (TSFMs) and Large Language Model-based Time Series Models (TSLLMs).
Their success depends on large, diverse, and high-quality datasets, which are challenging to build due to regulatory, diversity, quality, and quantity constraints.
This survey provides a comprehensive review of synthetic data for TSFMs and TSLLMs, analyzing data generation strategies, their role in model pretraining, fine-tuning, and evaluation, and identifying future research directions.
arXiv Detail & Related papers (2025-03-14T13:53:46Z) - A Survey on Tabular Data Generation: Utility, Alignment, Fidelity, Privacy, and Beyond [53.56796220109518]
Different use cases demand synthetic data to comply with different requirements to be useful in practice. Four types of requirements are reviewed: utility of the synthetic data, alignment of the synthetic data with domain-specific knowledge, statistical fidelity of the synthetic data distribution compared to the real data distribution, and privacy-preserving capabilities. We discuss future directions for the field, along with opportunities to improve the current evaluation methods.
arXiv Detail & Related papers (2025-03-07T21:47:11Z) - LLM-TabFlow: Synthetic Tabular Data Generation with Inter-column Logical Relationship Preservation [49.898152180805454]
This study is the first to explicitly address inter-column relationship preservation in synthetic tabular data generation. LLM-TabFlow is a novel approach that captures complex inter-column relationships and compresses data, while using Score-based Diffusion to model the distribution of the compressed data in latent space. Our results show that LLM-TabFlow outperforms all baselines, fully preserving inter-column relationships while achieving the best balance between data fidelity, utility, and privacy.
arXiv Detail & Related papers (2025-03-04T00:47:52Z) - Towards Data-Centric AI: A Comprehensive Survey of Traditional, Reinforcement, and Generative Approaches for Tabular Data Transformation [37.43210238341124]
This survey examines the key aspects of data-centric AI, emphasizing feature selection and feature generation as essential techniques for data space refinement. We provide a systematic review of feature selection methods, which identify and retain the most relevant data attributes, and feature generation approaches, which create new features to simplify the capture of complex data patterns.
arXiv Detail & Related papers (2025-01-17T21:05:09Z) - Second FRCSyn-onGoing: Winning Solutions and Post-Challenge Analysis to Improve Face Recognition with Synthetic Data [104.30479583607918]
The 2nd FRCSyn-onGoing challenge is based on the 2nd Face Recognition Challenge in the Era of Synthetic Data (FRCSyn), originally launched at CVPR 2024. We focus on exploring the use of synthetic data both individually and in combination with real data to solve current challenges in face recognition.
arXiv Detail & Related papers (2024-12-02T11:12:01Z) - Exploring the Landscape for Generative Sequence Models for Specialized Data Synthesis [0.0]
This paper introduces a novel approach that leverages three generative models of varying complexity to synthesize Malicious Network Traffic.
Our approach transforms numerical data into text, re-framing data generation as a language modeling task.
Our method surpasses state-of-the-art generative models in producing high-fidelity synthetic data.
arXiv Detail & Related papers (2024-11-04T09:51:10Z) - Tabular Data Synthesis with Differential Privacy: A Survey [24.500349285858597]
Data sharing is a prerequisite for collaborative innovation, enabling organizations to leverage diverse datasets for deeper insights.
Data synthesis tackles this by generating artificial datasets that preserve the statistical characteristics of real data.
Differentially private data synthesis has emerged as a promising approach to privacy-aware data sharing.
arXiv Detail & Related papers (2024-11-04T06:32:48Z) - Artificial Inductive Bias for Synthetic Tabular Data Generation in Data-Scarce Scenarios [8.062368743143388]
We propose a novel methodology for generating realistic and reliable synthetic data with Deep Generative Models (DGMs) in limited real-data environments.
Our approach proposes several ways to generate an artificial inductive bias in a DGM through transfer learning and meta-learning techniques.
We validate our approach using two state-of-the-art DGMs, namely, a Variational Autoencoder and a Generative Adversarial Network, to show that our artificial inductive bias fuels superior synthetic data quality.
arXiv Detail & Related papers (2024-07-03T12:53:42Z) - On LLMs-Driven Synthetic Data Generation, Curation, and Evaluation: A Survey [26.670507323784616]
Large Language Models (LLMs) offer a data-centric solution to alleviate the limitations of real-world data with synthetic data generation.
This paper provides an organization of relevant studies based on a generic workflow of synthetic data generation.
arXiv Detail & Related papers (2024-06-14T07:47:09Z) - Differentially Private Tabular Data Synthesis using Large Language Models [6.6376578496141585]
This paper introduces DP-LLMTGen -- a novel framework for differentially private tabular data synthesis.
DP-LLMTGen models sensitive datasets using a two-stage fine-tuning procedure.
It generates synthetic data through sampling the fine-tuned LLMs.
arXiv Detail & Related papers (2024-06-03T15:43:57Z) - A Comprehensive Survey on Data Augmentation [55.355273602421384]
Data augmentation is a technique that generates high-quality artificial data by manipulating existing data samples. Existing literature surveys only focus on a certain type of specific modality data. We propose a more enlightening taxonomy that encompasses data augmentation techniques for different common data modalities.
arXiv Detail & Related papers (2024-05-15T11:58:08Z) - Data Augmentation in Human-Centric Vision [54.97327269866757]
This survey presents a comprehensive analysis of data augmentation techniques in human-centric vision tasks.
It delves into a wide range of research areas including person ReID, human parsing, human pose estimation, and pedestrian detection.
Our work categorizes data augmentation methods into two main types: data generation and data perturbation.
arXiv Detail & Related papers (2024-03-13T16:05:18Z) - Comprehensive Exploration of Synthetic Data Generation: A Survey [4.485401662312072]
This work surveys 417 Synthetic Data Generation models over the last decade.
The findings reveal increased model performance and complexity, with neural network-based approaches prevailing.
Computer vision dominates, with GANs as primary generative models, while diffusion models, transformers, and RNNs compete.
arXiv Detail & Related papers (2024-01-04T20:23:51Z) - Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A
Comprehensive Benchmark [56.8042116967334]
Synthetic data serves as an alternative in training machine learning models.
Ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task.
This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
arXiv Detail & Related papers (2023-10-25T20:32:02Z) - Deep Generative Models, Synthetic Tabular Data, and Differential
Privacy: An Overview and Synthesis [2.8391355909797644]
This article provides a comprehensive synthesis of the recent developments in synthetic data generation via deep generative models.
We specifically outline the importance of synthetic data generation in the context of privacy-sensitive data.
arXiv Detail & Related papers (2023-07-28T09:17:03Z) - SoK: Privacy-Preserving Data Synthesis [72.92263073534899]
This paper focuses on privacy-preserving data synthesis (PPDS) by providing a comprehensive overview, analysis, and discussion of the field.
We put forth a master recipe that unifies two prominent strands of research in PPDS: statistical methods and deep learning (DL)-based methods.
arXiv Detail & Related papers (2023-07-05T08:29:31Z) - TSGM: A Flexible Framework for Generative Modeling of Synthetic Time Series [61.436361263605114]
Time series data are often scarce or highly sensitive, which precludes the sharing of data between researchers and industrial organizations.
We introduce Time Series Generative Modeling (TSGM), an open-source framework for the generative modeling of synthetic time series.
arXiv Detail & Related papers (2023-05-19T10:11:21Z) - Machine Learning for Synthetic Data Generation: A Review [23.073056971997715]
This paper reviews existing studies that employ machine learning models for the purpose of generating synthetic data.
The review encompasses various perspectives, starting with the applications of synthetic data generation, spanning computer vision, speech, natural language processing, healthcare, and business domains.
The paper also addresses the crucial aspects of privacy and fairness concerns related to synthetic data generation.
arXiv Detail & Related papers (2023-02-08T13:59:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.