Synthetic Tabular Data Generation: A Comparative Survey for Modern Techniques
- URL: http://arxiv.org/abs/2507.11590v1
- Date: Tue, 15 Jul 2025 14:57:23 GMT
- Title: Synthetic Tabular Data Generation: A Comparative Survey for Modern Techniques
- Authors: Raju Challagundla, Mohsen Dorodchi, Pu Wang, Minwoo Lee
- Abstract summary: As privacy regulations become more stringent and access to real-world data becomes increasingly constrained, synthetic data generation has emerged as a vital solution. This review prioritizes the actionable goals that drive synthetic data creation, including conditional generation and risk-sensitive modeling.
- Score: 6.744437741221969
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As privacy regulations become more stringent and access to real-world data becomes increasingly constrained, synthetic data generation has emerged as a vital solution, especially for tabular datasets, which are central to domains like finance, healthcare and the social sciences. This survey presents a comprehensive and focused review of recent advances in synthetic tabular data generation, emphasizing methods that preserve complex feature relationships, maintain statistical fidelity, and satisfy privacy requirements. A key contribution of this work is the introduction of a novel taxonomy based on practical generation objectives, including intended downstream applications, privacy guarantees, and data utility, directly informing methodological design and evaluation strategies. Therefore, this review prioritizes the actionable goals that drive synthetic data creation, including conditional generation and risk-sensitive modeling. Additionally, the survey proposes a benchmark framework to align technical innovation with real-world demands. By bridging theoretical foundations with practical deployment, this work serves as both a roadmap for future research and a guide for implementing synthetic tabular data in privacy-critical environments.
Related papers
- Synthetic Tabular Data: Methods, Attacks and Defenses [12.374541748245843]
Synthetic data is often positioned as a solution to replace sensitive fixed-size datasets with a source of unlimited matching data, freed from privacy concerns. There has been much progress in synthetic data generation over the last decade, leveraging corresponding advances in machine learning and data analytics.
arXiv Detail & Related papers (2025-06-06T14:16:57Z)
- A Comprehensive Survey of Synthetic Tabular Data Generation [31.576625554168473]
Tabular data is one of the most prevalent and important data formats in real-world applications such as healthcare, finance, and education. This survey aims to provide researchers and practitioners with a holistic understanding of the field.
arXiv Detail & Related papers (2025-04-23T08:33:34Z)
- A Survey on Tabular Data Generation: Utility, Alignment, Fidelity, Privacy, and Beyond [53.56796220109518]
Different use cases demand synthetic data to comply with different requirements to be useful in practice. Four types of requirements are reviewed: utility of the synthetic data, alignment of the synthetic data with domain-specific knowledge, statistical fidelity of the synthetic data distribution compared to the real data distribution, and privacy-preserving capabilities. We discuss future directions for the field, along with opportunities to improve the current evaluation methods.
arXiv Detail & Related papers (2025-03-07T21:47:11Z)
- Opinion: Revisiting synthetic data classifications from a privacy perspective [42.12937192948916]
Synthetic data is emerging as a cost-effective solution to meet the increasing data demands of AI development. Traditional classification of synthetic data types does not reflect the ever-increasing number of methods to generate synthetic data. We make a case for an alternative approach to grouping synthetic data types that better reflects privacy perspectives.
arXiv Detail & Related papers (2025-03-05T13:54:13Z)
- LLM-TabFlow: Synthetic Tabular Data Generation with Inter-column Logical Relationship Preservation [49.898152180805454]
This study is the first to explicitly address inter-column relationship preservation in synthetic tabular data generation. LLM-TabFlow is a novel approach that captures complex inter-column relationships and compresses the data, while using score-based diffusion to model the distribution of the compressed data in latent space. Our results show that LLM-TabFlow outperforms all baselines, fully preserving inter-column relationships while achieving the best balance between data fidelity, utility, and privacy.
arXiv Detail & Related papers (2025-03-04T00:47:52Z)
- Second FRCSyn-onGoing: Winning Solutions and Post-Challenge Analysis to Improve Face Recognition with Synthetic Data [104.30479583607918]
The 2nd FRCSyn-onGoing challenge is based on the 2nd Face Recognition Challenge in the Era of Synthetic Data (FRCSyn), originally launched at CVPR 2024. We focus on exploring the use of synthetic data both individually and in combination with real data to solve current challenges in face recognition.
arXiv Detail & Related papers (2024-12-02T11:12:01Z)
- Best Practices and Lessons Learned on Synthetic Data [83.63271573197026]
The success of AI models relies on the availability of large, diverse, and high-quality datasets.
Synthetic data has emerged as a promising solution by generating artificial data that mimics real-world patterns.
arXiv Detail & Related papers (2024-04-11T06:34:17Z)
- A primer on synthetic health data [0.2770822269241974]
Recent advances in deep generative models have greatly expanded the potential to create realistic synthetic health datasets.
These synthetic datasets aim to preserve the characteristics, patterns, and overall scientific conclusions without disclosing patient identity or sensitive information.
However, many questions and challenges remain, including how to consistently evaluate a synthetic dataset's similarity and predictive utility.
arXiv Detail & Related papers (2024-01-31T08:13:35Z)
- Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A Comprehensive Benchmark [56.8042116967334]
Synthetic data serves as an alternative in training machine learning models.
Ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task.
This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
arXiv Detail & Related papers (2023-10-25T20:32:02Z)
- Statistical properties and privacy guarantees of an original distance-based fully synthetic data generation method [0.0]
This work shows the technical feasibility of generating publicly releasable synthetic data using a multi-step framework.
By successfully assessing the quality of data produced using a novel multi-step synthetic data generation framework, we showed the technical and conceptual soundness of the Open-CESP initiative.
arXiv Detail & Related papers (2023-10-10T12:29:57Z)
- Beyond Privacy: Navigating the Opportunities and Challenges of Synthetic Data [91.52783572568214]
Synthetic data may become a dominant force in the machine learning world, promising a future where datasets can be tailored to individual needs.
We discuss which fundamental challenges the community needs to overcome for wider relevance and application of synthetic data.
arXiv Detail & Related papers (2023-04-07T16:38:40Z)