IRG: Generating Synthetic Relational Databases using Deep Learning with Insightful Relational Understanding
- URL: http://arxiv.org/abs/2312.15187v2
- Date: Mon, 30 Dec 2024 02:41:33 GMT
- Title: IRG: Generating Synthetic Relational Databases using Deep Learning with Insightful Relational Understanding
- Authors: Jiayu Li, Zilong Zhao, Vikram Chundawat, Biplab Sikdar, Y. C. Tay
- Abstract summary: We propose the incremental relational generator (IRG), which handles ubiquitous real-life situations such as composite and potentially overlapping primary or foreign keys and sequential tables.
IRG preserves relational schema integrity and offers a deep contextual understanding of relationships beyond direct ancestors and descendants.
Experiments on three open-source real-life relational datasets in different fields at different scales demonstrate IRG's advantage in maintaining the synthetic data's relational schema validity and data fidelity and utility.
- Score: 13.724085637262654
- License:
- Abstract: Synthetic data has numerous applications, including but not limited to software testing at scale, privacy-preserving data sharing to enable smoother collaboration between stakeholders, and data augmentation for analytical and machine learning tasks. Relational databases, which are commonly used by corporations, governments, and financial institutions, present unique challenges for synthetic data generation due to their complex structures. Existing synthetic relational database generation approaches often assume idealized scenarios, such as every table having a perfect primary key column without composite and potentially overlapping primary or foreign key constraints, and fail to account for the sequential nature of certain tables. In this paper, we propose the incremental relational generator (IRG), which successfully handles these ubiquitous real-life situations. IRG ensures the preservation of relational schema integrity, offers a deep contextual understanding of relationships beyond direct ancestors and descendants, leverages the power of newly designed deep neural networks, and scales efficiently to handle larger datasets, a combination never achieved in previous work. Experiments on three open-source real-life relational datasets in different fields at different scales demonstrate IRG's advantage in maintaining the synthetic data's relational schema validity and data fidelity and utility.
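To make "relational schema validity" with composite and overlapping keys concrete, here is a minimal sketch of the kind of check a synthetic database has to pass. It is not IRG's implementation: the tables (`students`, `courses`, `enrollments`, `submissions`) and the helper functions are hypothetical and exist only for illustration. The example shows a row that passes every single-column foreign key check yet violates the composite foreign key.

```python
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, object]
Table = List[Row]
Key = Tuple[object, ...]


def key_of(row: Row, columns: Sequence[str]) -> Key:
    """Project a row onto a (possibly composite) key."""
    return tuple(row[c] for c in columns)


def primary_key_violations(table: Table, pk: Sequence[str]) -> List[Key]:
    """Return key values that appear more than once in the table."""
    seen = set()
    duplicates = []
    for row in table:
        k = key_of(row, pk)
        if k in seen:
            duplicates.append(k)
        seen.add(k)
    return duplicates


def foreign_key_violations(
    child: Table, fk: Sequence[str], parent: Table, parent_key: Sequence[str]
) -> List[Key]:
    """Return child key values that have no matching parent row."""
    parent_keys = {key_of(r, parent_key) for r in parent}
    return [key_of(r, fk) for r in child if key_of(r, fk) not in parent_keys]


# Hypothetical schema: enrollments has a composite primary key whose columns
# are also foreign keys (overlapping constraints); submissions is a sequential
# child table that references enrollments through a composite foreign key.
students = [{"student_id": 1}, {"student_id": 2}]
courses = [{"course_id": "db"}, {"course_id": "ml"}]
enrollments = [
    {"student_id": 1, "course_id": "db"},
    {"student_id": 2, "course_id": "ml"},
]
submissions = [
    {"student_id": 1, "course_id": "db", "seq": 1},
    {"student_id": 1, "course_id": "ml", "seq": 1},  # student 1 is not enrolled in "ml"
]

pair = ["student_id", "course_id"]
pk_dupes = primary_key_violations(enrollments, pair)
single_fk = foreign_key_violations(submissions, ["student_id"], students, ["student_id"])
composite_fk = foreign_key_violations(submissions, pair, enrollments, pair)
```

Here `single_fk` is empty because student 1 exists, while `composite_fk` contains `(1, "ml")` because that pair never appears in `enrollments`. A generator that treats each key column independently could emit exactly this kind of row.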
Related papers
- RelGNN: Composite Message Passing for Relational Deep Learning [56.48834369525997]
We introduce RelGNN, a novel GNN framework specifically designed to capture the unique characteristics of relational databases.
At the core of our approach is the introduction of atomic routes, which are sequences of nodes forming high-order tripartite structures.
RelGNN consistently achieves state-of-the-art accuracy with up to 25% improvement.
arXiv Detail & Related papers (2025-02-10T18:58:40Z)
- Towards Privacy-Preserving Relational Data Synthesis via Probabilistic Relational Models [3.877001015064152]
Probabilistic relational models provide a well-established formalism to combine first-order logic and probabilistic models.
The field of artificial intelligence requires increasingly large amounts of relational training data for various machine learning tasks.
Collecting real-world data is often challenging due to privacy concerns, data protection regulations, high costs, and so on.
arXiv Detail & Related papers (2024-09-06T11:24:25Z)
- RelBench: A Benchmark for Deep Learning on Relational Databases [78.52438155603781]
We present RelBench, a public benchmark for solving tasks over databases with graph neural networks.
We use RelBench to conduct the first comprehensive study of Relational Deep Learning (RDL).
RDL learns better whilst reducing human work needed by more than an order of magnitude.
arXiv Detail & Related papers (2024-07-29T14:46:13Z)
- Differentially Private Synthetic Data Generation for Relational Databases [9.532509662034062]
We introduce a first-of-its-kind algorithm that can be combined with any existing differentially private (DP) synthetic data generation mechanism.
Our algorithm iteratively refines the relationship between individual synthetic tables to minimize their approximation errors.
arXiv Detail & Related papers (2024-05-29T00:25:07Z)
- Relational Deep Learning: Graph Representation Learning on Relational Databases [69.7008152388055]
We introduce an end-to-end representation approach to learn on data laid out across multiple tables.
Message Passing Graph Neural Networks can then automatically learn across the graph to extract representations that leverage all data input.
arXiv Detail & Related papers (2023-12-07T18:51:41Z)
- GFS: Graph-based Feature Synthesis for Prediction over Relational Databases [39.975491511390985]
We propose a novel framework called Graph-based Feature Synthesis (GFS).
GFS formulates a relational database as a heterogeneous graph database.
In an experiment over four real-world multi-table relational databases, GFS outperforms previous methods designed for relational databases.
arXiv Detail & Related papers (2023-12-04T16:54:40Z)
- TSGM: A Flexible Framework for Generative Modeling of Synthetic Time Series [61.436361263605114]
Time series data are often scarce or highly sensitive, which precludes the sharing of data between researchers and industrial organizations.
We introduce Time Series Generative Modeling (TSGM), an open-source framework for the generative modeling of synthetic time series.
arXiv Detail & Related papers (2023-05-19T10:11:21Z)
- Beyond Privacy: Navigating the Opportunities and Challenges of Synthetic Data [91.52783572568214]
Synthetic data may become a dominant force in the machine learning world, promising a future where datasets can be tailored to individual needs.
We discuss which fundamental challenges the community needs to overcome for wider relevance and application of synthetic data.
arXiv Detail & Related papers (2023-04-07T16:38:40Z)
- Generating Realistic Synthetic Relational Data through Graph Variational Autoencoders [47.89542334125886]
We combine the variational autoencoder framework with graph neural networks to generate realistic synthetic relational databases.
The results indicate that real databases' structures are accurately preserved in the resulting synthetic datasets.
arXiv Detail & Related papers (2022-11-30T10:40:44Z)
- Federated Learning with GAN-based Data Synthesis for Non-IID Clients [8.304185807036783]
Federated learning (FL) has recently emerged as a popular privacy-preserving collaborative learning paradigm.
We propose a novel framework, named Synthetic Data Aided Federated Learning (SDA-FL), to resolve this non-IID challenge by sharing synthetic data.
arXiv Detail & Related papers (2022-06-11T11:43:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.