IRG: Generating Synthetic Relational Databases using Deep Learning with Insightful Relational Understanding
- URL: http://arxiv.org/abs/2312.15187v2
- Date: Mon, 30 Dec 2024 02:41:33 GMT
- Title: IRG: Generating Synthetic Relational Databases using Deep Learning with Insightful Relational Understanding
- Authors: Jiayu Li, Zilong Zhao, Vikram Chundawat, Biplab Sikdar, Y. C. Tay
- Abstract summary: We propose the incremental relational generator (IRG) that successfully handles ubiquitous real-life situations. IRG ensures the preservation of relational schema integrity and offers a deep understanding of relationships beyond direct ancestors and descendants. Experiments on three open-source real-life relational datasets in different fields and at different scales demonstrate IRG's advantage in maintaining the synthetic data's relational schema validity, data fidelity, and utility.
- Score: 13.724085637262654
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Synthetic data has numerous applications, including but not limited to software testing at scale, privacy-preserving data sharing to enable smoother collaboration between stakeholders, and data augmentation for analytical and machine learning tasks. Relational databases, which are commonly used by corporations, governments, and financial institutions, present unique challenges for synthetic data generation due to their complex structures. Existing synthetic relational database generation approaches often assume idealized scenarios, such as every table having a perfect primary key column without composite and potentially overlapping primary or foreign key constraints, and fail to account for the sequential nature of certain tables. In this paper, we propose the incremental relational generator (IRG), which successfully handles these ubiquitous real-life situations. IRG ensures the preservation of relational schema integrity, offers a deep contextual understanding of relationships beyond direct ancestors and descendants, leverages the power of newly designed deep neural networks, and scales efficiently to handle larger datasets--a combination never achieved in previous works. Experiments on three open-source real-life relational datasets in different fields at different scales demonstrate IRG's advantage in maintaining the synthetic data's relational schema validity, data fidelity, and utility.
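To make the setting concrete, below is a minimal, hypothetical sketch of the kind of incremental, parent-first generation the abstract describes: tables are synthesized in dependency order, child rows are conditioned on already-generated parent rows, and a composite key (cid, seq) stays valid by construction. The schema, fan-out model, and Gaussian column model are illustrative assumptions, not IRG's actual networks.

```python
# Illustrative sketch only (not the authors' code): incremental, parent-first
# generation of a toy relational schema with a composite primary key.
# Table names, the degree model, and the Gaussian column model are assumptions.
import numpy as np

rng = np.random.default_rng(0)

def generate_customers(n):
    # Parent table: primary key `cid`, one numeric attribute.
    return [{"cid": i, "balance": float(rng.normal(100.0, 20.0))} for i in range(n)]

def generate_orders(customers):
    # Child table with a composite primary key (cid, seq): generation is
    # conditioned on the already-generated parent row, so foreign keys are
    # valid by construction and child attributes depend on parent context.
    rows = []
    for c in customers:
        n_orders = rng.poisson(2)                      # fan-out (degree) model
        for seq in range(n_orders):
            amount = max(0.0, rng.normal(0.1 * c["balance"], 5.0))
            rows.append({"cid": c["cid"], "seq": seq, "amount": amount})
    return rows

customers = generate_customers(5)
orders = generate_orders(customers)

# Referential integrity holds by construction.
assert {r["cid"] for r in orders} <= {c["cid"] for c in customers}
print(len(customers), "customers,", len(orders), "orders")
```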
Related papers
- Boosting Relational Deep Learning with Pretrained Tabular Models [18.34233986830027]
Graph Neural Networks (GNNs) offer a compelling alternative by inherently modeling these relationships.
Our framework achieves up to 33% performance improvement and a 526x inference speedup compared to GNNs.
arXiv Detail & Related papers (2025-04-07T11:19:04Z) - LLM-TabFlow: Synthetic Tabular Data Generation with Inter-column Logical Relationship Preservation [49.898152180805454]
This study is the first to explicitly address inter-column relationship preservation in synthetic tabular data generation.
LLM-TabFlow is a novel approach that captures complex inter-column relationships and compresses data, using score-based diffusion to model the distribution of the compressed data in latent space.
Our results show that LLM-TabFlow outperforms all baselines, fully preserving inter-column relationships while achieving the best balance between data fidelity, utility, and privacy.
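For readers unfamiliar with the second stage described above, the sketch below shows a generic Gaussian diffusion forward process and one DDPM-style reverse update applied to a compressed latent vector; the noise schedule, the zero-valued placeholder noise predictor, and the update rule are illustrative assumptions and not LLM-TabFlow's actual model.

```python
# Minimal, generic sketch (not the paper's pipeline): Gaussian forward diffusion
# over a compressed latent vector, plus a DDPM-style reverse update that would
# normally use a trained denoising network.
import numpy as np

rng = np.random.default_rng(0)
T = 100
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def forward_noise(z0, t):
    """Sample z_t ~ N(sqrt(alpha_bar_t) * z0, (1 - alpha_bar_t) I)."""
    eps = rng.standard_normal(z0.shape)
    return np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps, eps

def predicted_noise(z_t, t):
    # Placeholder for a trained denoising network; returning zeros means the
    # reverse pass below only illustrates the update rule, not real sampling.
    return np.zeros_like(z_t)

def reverse_step(z_t, t):
    """One DDPM-style reverse update using the (placeholder) noise predictor."""
    eps_hat = predicted_noise(z_t, t)
    mean = (z_t - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(alphas[t])
    if t == 0:
        return mean
    return mean + np.sqrt(betas[t]) * rng.standard_normal(z_t.shape)

z0 = rng.standard_normal(8)          # a compressed ("latent") row encoding
z, _ = forward_noise(z0, T - 1)      # fully noised latent
for t in reversed(range(T)):
    z = reverse_step(z, t)           # a trained model would reconstruct data here
print(z)
```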
arXiv Detail & Related papers (2025-03-04T00:47:52Z) - RelGNN: Composite Message Passing for Relational Deep Learning [56.48834369525997]
We introduce RelGNN, a novel GNN framework specifically designed to capture the unique characteristics of relational databases.
At the core of our approach is the introduction of atomic routes, which are sequences of nodes forming high-order tripartite structures.
RelGNN consistently achieves state-of-the-art accuracy with up to 25% improvement.
arXiv Detail & Related papers (2025-02-10T18:58:40Z) - RelBench: A Benchmark for Deep Learning on Relational Databases [78.52438155603781]
We present RelBench, a public benchmark for solving tasks over databases with graph neural networks.
We use RelBench to conduct the first comprehensive study of relational deep learning (RDL) infrastructure.
RDL learns better whilst reducing human work needed by more than an order of magnitude.
arXiv Detail & Related papers (2024-07-29T14:46:13Z) - Adapting Differentially Private Synthetic Data to Relational Databases [9.532509662034062]
We introduce a first-of-its-kind algorithm that can be combined with any existing differentially private (DP) synthetic data generation mechanism.
Our algorithm iteratively refines the relationship between individual synthetic tables to minimize their approximation errors.
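As a rough illustration of what iteratively refining the relationship between synthetic tables could look like in the simplest case, the toy loop below reassigns synthetic foreign keys until the parent fan-out histogram matches a target histogram; the target degrees and the greedy update rule are assumptions, not the paper's mechanism.

```python
# Toy illustration (not the paper's algorithm): iteratively reassign synthetic
# foreign keys so the parent fan-out histogram approaches a target histogram.
from collections import Counter
import random

random.seed(0)
parents = list(range(5))
child_fk = [random.choice(parents) for _ in range(12)]   # initial synthetic links
target = {0: 4, 1: 3, 2: 2, 3: 2, 4: 1}                  # desired fan-out per parent

def error(fk):
    counts = Counter(fk)
    return sum(abs(counts.get(p, 0) - target[p]) for p in parents)

for _ in range(100):                                      # iterative refinement
    counts = Counter(child_fk)
    over = [p for p in parents if counts.get(p, 0) > target[p]]
    under = [p for p in parents if counts.get(p, 0) < target[p]]
    if not over or not under:
        break
    # Move one child row from an over-represented parent to an under-represented one.
    idx = child_fk.index(over[0])
    child_fk[idx] = under[0]

print("final approximation error:", error(child_fk))
```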
arXiv Detail & Related papers (2024-05-29T00:25:07Z) - Relational Deep Learning: Graph Representation Learning on Relational Databases [69.7008152388055]
We introduce an end-to-end representation learning approach to learn directly on data laid out across multiple tables.
Message Passing Graph Neural Networks can then automatically learn across the graph to extract representations that leverage all data input.
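A minimal sketch of the idea of learning directly over linked tables: rows become nodes, foreign keys become edges, and a single mean-aggregation step passes child information to parents. The toy tables and the plain mean aggregator are assumptions; real message-passing GNNs use learned, multi-layer updates.

```python
# Minimal sketch (assumptions throughout): turn two linked tables into a graph
# whose edges follow the foreign key, then do one mean-aggregation "message
# passing" step from child nodes to their parent nodes.
import numpy as np

customers = {0: np.array([1.0, 0.0]), 1: np.array([0.0, 1.0])}   # node features
orders = [  # (order_id, customer_id foreign key, features)
    (10, 0, np.array([0.5, 0.5])),
    (11, 0, np.array([1.5, 0.0])),
    (12, 1, np.array([0.0, 2.0])),
]

# Edges induced by the foreign key: order -> customer
edges = [(oid, cid) for oid, cid, _ in orders]

def aggregate_children(customers, orders, edges):
    """Mean of child-order features per customer, concatenated to its own features."""
    feats = {oid: f for oid, _, f in orders}
    out = {}
    for cid, x in customers.items():
        msgs = [feats[oid] for oid, c in edges if c == cid]
        agg = np.mean(msgs, axis=0) if msgs else np.zeros_like(x)
        out[cid] = np.concatenate([x, agg])
    return out

print(aggregate_children(customers, orders, edges))
```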
arXiv Detail & Related papers (2023-12-07T18:51:41Z) - GFS: Graph-based Feature Synthesis for Prediction over Relational Databases [39.975491511390985]
We propose a novel framework called Graph-based Feature Synthesis (GFS).
GFS formulates the relational database as a heterogeneous graph database.
In an experiment over four real-world multi-table relational databases, GFS outperforms previous methods designed for relational databases.
arXiv Detail & Related papers (2023-12-04T16:54:40Z) - TSGM: A Flexible Framework for Generative Modeling of Synthetic Time Series [61.436361263605114]
Time series data are often scarce or highly sensitive, which precludes the sharing of data between researchers and industrial organizations.
We introduce Time Series Generative Modeling (TSGM), an open-source framework for the generative modeling of synthetic time series.
arXiv Detail & Related papers (2023-05-19T10:11:21Z) - Beyond Privacy: Navigating the Opportunities and Challenges of Synthetic Data [91.52783572568214]
Synthetic data may become a dominant force in the machine learning world, promising a future where datasets can be tailored to individual needs.
We discuss which fundamental challenges the community needs to overcome for wider relevance and application of synthetic data.
arXiv Detail & Related papers (2023-04-07T16:38:40Z) - Importance of Synthesizing High-quality Data for Text-to-SQL Parsing [71.02856634369174]
State-of-the-art text-to-SQL algorithms did not further improve on popular benchmarks when trained with augmented synthetic data.
We propose a novel framework that incorporates key relationships from the schema, imposes strong typing, and performs schema-weighted column sampling.
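As a hypothetical illustration of schema-weighted column sampling, the snippet below samples columns with probability inversely related to their table's distance from a seed table in the foreign-key graph; the distances and the weighting function are assumptions, not the paper's exact scheme.

```python
# Hypothetical illustration of schema-weighted column sampling: columns from
# tables closer to a "seed" table in the schema graph are sampled more often.
import random

random.seed(0)
# Distance of each table from the seed table in the foreign-key graph.
schema_distance = {"orders": 0, "customers": 1, "regions": 2}
columns = [("orders", "amount"), ("orders", "date"),
           ("customers", "age"), ("regions", "name")]

# Inverse-distance weighting (an assumption for illustration).
weights = [1.0 / (1 + schema_distance[table]) for table, _ in columns]
sampled = random.choices(columns, weights=weights, k=5)
print(sampled)
```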
arXiv Detail & Related papers (2022-12-17T02:53:21Z) - Generating Realistic Synthetic Relational Data through Graph Variational Autoencoders [47.89542334125886]
We combine the variational autoencoder framework with graph neural networks to generate realistic synthetic relational databases.
The results indicate that real databases' structures are accurately preserved in the resulting synthetic datasets.
arXiv Detail & Related papers (2022-11-30T10:40:44Z) - Row Conditional-TGAN for generating synthetic relational databases [0.0]
We propose the Row Conditional-Tabular Generative Adversarial Network (RC-TGAN) to support modeling and synthesizing relational databases.
The RC-TGAN models relationship information between tables by incorporating conditional data of parent rows into the design of the child table's GAN.
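A minimal sketch (not the RC-TGAN implementation) of conditioning a child-table generator on parent-row data: the parent features are simply concatenated to the noise input, so generated child rows can depend on their synthetic parent. The dimensions and the two-layer MLP are illustrative assumptions.

```python
# Illustrative sketch: a child-table generator conditioned on parent-row features.
import torch
import torch.nn as nn

class ConditionalChildGenerator(nn.Module):
    def __init__(self, noise_dim=8, parent_dim=4, child_dim=6):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(noise_dim + parent_dim, 32),
            nn.ReLU(),
            nn.Linear(32, child_dim),
        )

    def forward(self, noise, parent_row):
        # Parent context is injected alongside the noise, so generated child
        # rows can depend on the attributes of their (synthetic) parent.
        return self.net(torch.cat([noise, parent_row], dim=-1))

gen = ConditionalChildGenerator()
noise = torch.randn(3, 8)          # 3 child rows to generate
parent = torch.randn(3, 4)         # features of each row's parent
fake_children = gen(noise, parent)
print(fake_children.shape)         # torch.Size([3, 6])
```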
arXiv Detail & Related papers (2022-11-14T18:14:18Z) - Generative Adversarial Networks for Synthetic Data Generation: A Comparative Study [1.0896567381206714]
Generative Adversarial Networks (GANs) are gaining increasing attention as a means for synthesising data.
Here we consider the potential application of GANs for the purpose of generating synthetic census microdata.
arXiv Detail & Related papers (2021-12-03T14:23:17Z) - Learning to Synthesize Data for Semantic Parsing [57.190817162674875]
We propose a generative model which models the composition of programs and maps a program to an utterance.
Due to the simplicity of PCFG and pre-trained BART, our generative model can be efficiently learned from existing data at hand.
We evaluate our method in both in-domain and out-of-domain settings of text-to-SQL parsing on the standard benchmarks of GeoQuery and Spider.
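To make the PCFG side concrete, here is a toy probabilistic context-free grammar and sampler producing small SQL-like programs; a separate model (such as a pretrained sequence-to-sequence model) would then map each sampled program to a natural-language utterance. The grammar and probabilities are invented for illustration and are not the paper's grammar.

```python
# Toy illustration: sampling a program from a small probabilistic
# context-free grammar (PCFG).
import random

random.seed(0)
# nonterminal -> list of (expansion, probability)
PCFG = {
    "QUERY":  [(["SELECT", "COL", "FROM", "TABLE", "COND"], 1.0)],
    "COL":    [(["name"], 0.5), (["population"], 0.5)],
    "TABLE":  [(["city"], 0.6), (["country"], 0.4)],
    "COND":   [([], 0.5), (["WHERE", "COL", ">", "VALUE"], 0.5)],
    "VALUE":  [(["1000"], 0.5), (["1000000"], 0.5)],
}

def sample(symbol):
    """Recursively expand a symbol; terminals are returned as-is."""
    if symbol not in PCFG:
        return [symbol]
    expansions, probs = zip(*PCFG[symbol])
    chosen = random.choices(expansions, weights=probs, k=1)[0]
    tokens = []
    for s in chosen:
        tokens.extend(sample(s))
    return tokens

print(" ".join(sample("QUERY")))   # e.g. "SELECT name FROM city WHERE population > 1000"
```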
arXiv Detail & Related papers (2021-04-12T21:24:02Z) - Brainstorming Generative Adversarial Networks (BGANs): Towards Multi-Agent Generative Models with Distributed Private Datasets [70.62568022925971]
Generative adversarial networks (GANs) must be fed large datasets that adequately represent the data space.
In many scenarios, the available datasets may be limited and distributed across multiple agents, each of which is seeking to learn the distribution of the data on its own.
In this paper, a novel brainstorming GAN (BGAN) architecture is proposed, with which multiple agents can generate real-like data samples while operating in a fully distributed manner.
arXiv Detail & Related papers (2020-02-02T02:58:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.