Related papers: Generating Realistic Synthetic Relational Data through Graph Variational Autoencoders

URL: http://arxiv.org/abs/2211.16889v1
Date: Wed, 30 Nov 2022 10:40:44 GMT
Title: Generating Realistic Synthetic Relational Data through Graph Variational Autoencoders
Authors: Ciro Antonio Mami, Andrea Coser, Eric Medvet, Alexander T.P. Boudewijn, Marco Volpe, Michael Whitworth, Borut Svara, Gabriele Sgroi, Daniele Panfilo, Sebastiano Saccani
Abstract summary: We combine the variational autoencoder framework with graph neural networks to generate realistic synthetic relational databases. The results indicate that real databases' structures are accurately preserved in the resulting synthetic datasets.
Score: 47.89542334125886
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Synthetic data generation has recently gained widespread attention as a more reliable alternative to traditional data anonymization. The involved methods are originally developed for image synthesis. Hence, their application to the typically tabular and relational datasets from healthcare, finance and other industries is non-trivial. While substantial research has been devoted to the generation of realistic tabular datasets, the study of synthetic relational databases is still in its infancy. In this paper, we combine the variational autoencoder framework with graph neural networks to generate realistic synthetic relational databases. We then apply the obtained method to two publicly available databases in computational experiments. The results indicate that real databases' structures are accurately preserved in the resulting synthetic datasets, even for large datasets with advanced data types.

Related papers

PluRel: Synthetic Data unlocks Scaling Laws for Relational Foundation Models [51.42043158297229]
We introduce PluRel, a framework to synthesize multi-tabular relational databases from scratch. In a step-by-step fashion, PluRel models (1) schemas with directed graphs, (2) inter-table primary-foreign key connectivity with bipartite graphs, and (3) feature distributions in tables via conditional causal mechanisms.
arXiv Detail & Related papers (2026-02-03T21:35:18Z)
Generative Correlation Manifolds: Generating Synthetic Data with Preserved Higher-Order Correlations [4.551615447454767]
We introduce Generative Correlation Manifolds (GCM), a computationally efficient method for generating synthetic data. We argue that this method provides a new approach to synthetic data generation with potential applications in privacy-preserving data sharing, robust model training, and simulation.
arXiv Detail & Related papers (2025-10-24T16:15:53Z)
Understanding the Influence of Synthetic Data for Text Embedders [52.04771455432998]
We first reproduce and publicly release the synthetic data proposed by Wang et al. We critically examine where exactly synthetic data improves model generalization. Our findings highlight the limitations of current synthetic data approaches for building general-purpose embedders.
arXiv Detail & Related papers (2025-09-07T19:28:52Z)
RelDiff: Relational Data Generative Modeling with Graph-Based Diffusion Models [83.6013616017646]
RelDiff is a novel diffusion generative model that synthesizes complete relational databases by explicitly modeling their foreign key graph structure. RelDiff consistently outperforms prior methods in producing realistic and coherent synthetic relational databases.
arXiv Detail & Related papers (2025-05-31T21:01:02Z)
Assessing Generative Models for Structured Data [0.0]
This paper introduces rigorous methods for assessing synthetic data against real data by looking at inter-column dependencies within the data. We find that large language models (GPT-2), both when queried via few-shot prompting, and when fine-tuned, and GAN (CTGAN) models do not produce data with dependencies that mirror the original real data.
arXiv Detail & Related papers (2025-03-26T18:19:05Z)
LLM-TabFlow: Synthetic Tabular Data Generation with Inter-column Logical Relationship Preservation [49.898152180805454]
This study is the first to explicitly address inter-column relationship preservation in synthetic tabular data generation. LLM-TabFlow is a novel approach that captures complex inter-column relationships and compresses data, while using score-based diffusion to model the distribution of the compressed data in latent space. Our results show that LLM-TabFlow outperforms all baselines, fully preserving inter-column relationships while achieving the best balance between data fidelity, utility, and privacy.
arXiv Detail & Related papers (2025-03-04T00:47:52Z)
Evaluating Inter-Column Logical Relationships in Synthetic Tabular Data Generation [49.898152180805454]
This paper proposes three evaluation metrics designed to assess the preservation of logical relationships. We validate these metrics by assessing the performance of both classical and state-of-the-art generation methods on a real-world industrial dataset.
arXiv Detail & Related papers (2025-02-06T13:13:26Z)
Exploring the Landscape for Generative Sequence Models for Specialized Data Synthesis [0.0]
This paper introduces a novel approach that leverages three generative models of varying complexity to synthesize Malicious Network Traffic. Our approach transforms numerical data into text, re-framing data generation as a language modeling task. Our method surpasses state-of-the-art generative models in producing high-fidelity synthetic data.
arXiv Detail & Related papers (2024-11-04T09:51:10Z)
Generating Realistic Tabular Data with Large Language Models [49.03536886067729]
Large language models (LLMs) have been used for diverse tasks, but do not capture the correct correlation between the features and the target variable. We propose an LLM-based method with three important improvements to correctly capture the ground-truth feature-class correlation in the real data. Our experiments show that our method significantly outperforms 10 SOTA baselines on 20 datasets in downstream tasks.
arXiv Detail & Related papers (2024-10-29T04:14:32Z)
IRG: Generating Synthetic Relational Databases using Deep Learning with Insightful Relational Understanding [13.724085637262654]
We propose an incremental generator (IRG) that successfully handles ubiquitous real-life situations. IRG ensures the preservation of relational schema integrity and offers a deep understanding of relationships beyond direct ancestors and descendants. Experiments on three open-source real-life relational datasets in different fields at different scales demonstrate IRG's advantage in maintaining the synthetic data's relational schema validity and data fidelity and utility.
arXiv Detail & Related papers (2023-12-23T07:47:58Z)
Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A Comprehensive Benchmark [56.8042116967334]
Synthetic data serves as an alternative in training machine learning models. Ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task. This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
arXiv Detail & Related papers (2023-10-25T20:32:02Z)
Generating tabular datasets under differential privacy [0.0]
We introduce Differential Privacy (DP) into the training process of deep neural networks. This creates a trade-off between the quality and privacy of the resulting data. We implement novel end-to-end models that leverage attention mechanisms.
arXiv Detail & Related papers (2023-08-28T16:35:43Z)
CasTGAN: Cascaded Generative Adversarial Network for Realistic Tabular Data Synthesis [0.4999814847776097]
Generative adversarial networks (GANs) have drawn considerable attention in recent years for their proven capability in generating synthetic data. The validity of the synthetic data and the underlying privacy concerns represent major challenges which are not sufficiently addressed.
arXiv Detail & Related papers (2023-07-01T16:52:18Z)
Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task. We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z)
Beyond Privacy: Navigating the Opportunities and Challenges of Synthetic Data [91.52783572568214]
Synthetic data may become a dominant force in the machine learning world, promising a future where datasets can be tailored to individual needs. We discuss which fundamental challenges the community needs to overcome for wider relevance and application of synthetic data.
arXiv Detail & Related papers (2023-04-07T16:38:40Z)
Importance of Synthesizing High-quality Data for Text-to-SQL Parsing [71.02856634369174]
State-of-the-art text-to-SQL algorithms did not further improve on popular benchmarks when trained with augmented synthetic data. We propose a novel framework that incorporates key relationships from schema, imposes strong typing, and conducts schema-weighted column sampling.
arXiv Detail & Related papers (2022-12-17T02:53:21Z)
Causal-TGAN: Generating Tabular Data Using Causal Generative Adversarial Networks [7.232789848964222]
We propose a causal model named Causal Tabular Generative Neural Network (Causal-TGAN) to generate synthetic data. Experiments on both simulated datasets and real datasets demonstrate the better performance of our method.
arXiv Detail & Related papers (2021-04-21T17:59:41Z)
Improving Text Relationship Modeling with Artificial Data [0.07614628596146598]
We apply and evaluate a synthetic data approach to relationship classification in digital libraries. We find that for classification on whole-part relationships between books, synthetic data improves a deep neural network classifier by 91%.
arXiv Detail & Related papers (2020-10-27T22:04:54Z)

This list is automatically generated from the titles and abstracts of the papers on this site.