Related papers: ClavaDDPM: Multi-relational Data Synthesis with Cluster-guided Diffusion Models

URL: http://arxiv.org/abs/2405.17724v2
Date: Thu, 14 Nov 2024 11:06:36 GMT
Title: ClavaDDPM: Multi-relational Data Synthesis with Cluster-guided Diffusion Models
Authors: Wei Pang, Masoumeh Shafieinejad, Lucy Liu, Stephanie Hazlewood, Xi He
Abstract summary: We introduce ClavaDDPM, a novel approach to synthesizing multi-relational (multi-table) data. ClavaDDPM uses clustering labels as intermediaries to model relationships between tables, specifically focusing on foreign key constraints. We show that ClavaDDPM significantly outperforms existing methods for these long-range dependencies while remaining competitive on utility metrics for single-table data.
Score: 4.725559485781692
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent research in tabular data synthesis has focused on single tables, whereas real-world applications often involve complex data with tens or hundreds of interconnected tables. Previous approaches to synthesizing multi-relational (multi-table) data fall short in two key aspects: scalability for larger datasets and capturing long-range dependencies, such as correlations between attributes spread across different tables. Inspired by the success of diffusion models in tabular data modeling, we introduce Cluster Latent Variable guided Denoising Diffusion Probabilistic Models (ClavaDDPM). This novel approach leverages clustering labels as intermediaries to model relationships between tables, specifically focusing on foreign key constraints. ClavaDDPM leverages the robust generation capabilities of diffusion models while incorporating efficient algorithms to propagate the learned latent variables across tables. This enables ClavaDDPM to capture long-range dependencies effectively. Extensive evaluations on multi-table datasets of varying sizes show that ClavaDDPM significantly outperforms existing methods for these long-range dependencies while remaining competitive on utility metrics for single-table data.

Related papers

Generating Synthetic Relational Tabular Data via Structural Causal Models [0.0]
We develop a novel framework that generates realistic synthetic relational data including causal relationships across tables. Our experiments confirm that this framework is able to construct relational datasets with complex inter-table dependencies mimicking real-world scenarios.
arXiv Detail & Related papers (2025-07-04T12:27:23Z)
Multimodal Tabular Reasoning with Privileged Structured Information [67.40011423365712]
We introduce TabUlar Reasoning with Bridged infOrmation (Turbo). Turbo benefits from a structure-aware reasoning trace generator based on DeepSeek-R1. Turbo achieves state-of-the-art performance (+7.2% vs. previous SOTA) across multiple datasets.
arXiv Detail & Related papers (2025-06-04T15:46:30Z)
RelDiff: Relational Data Generative Modeling with Graph-Based Diffusion Models [83.6013616017646]
RelDiff is a novel diffusion generative model that synthesizes complete relational databases by explicitly modeling their foreign key graph structure. RelDiff consistently outperforms prior methods in producing realistic and coherent synthetic relational databases.
arXiv Detail & Related papers (2025-05-31T21:01:02Z)
Joint Relational Database Generation via Graph-Conditional Diffusion Models [44.06390394789874]
Building generative models for relational databases (RDBs) is important for applications like privacy-preserving data release and augmenting real datasets. Most prior work either focuses on single-table generation or relies on autoregressive factorizations that impose a fixed table order and generate tables sequentially. We propose a fundamentally different approach: jointly modeling all tables in an RDB without imposing any order.
arXiv Detail & Related papers (2025-05-22T11:12:56Z)
TabRep: Training Tabular Diffusion Models with a Simple and Effective Continuous Representation [16.907006955584343]
Diffusion models have been the predominant generative model for data generation. We present TabRep, a training architecture trained with a unified continuous representation. Our results showcase that TabRep achieves superior performance across a broad suite of evaluations.
arXiv Detail & Related papers (2025-04-07T07:44:27Z)
LLM-TabFlow: Synthetic Tabular Data Generation with Inter-column Logical Relationship Preservation [49.898152180805454]
This study is the first to explicitly address inter-column relationship preservation in synthetic tabular data generation. LLM-TabFlow is a novel approach that captures complex inter-column relationships and compresses data, while using score-based diffusion to model the distribution of the compressed data in latent space. Our results show that LLM-TabFlow outperforms all baselines, fully preserving inter-column relationships while achieving the best balance between data fidelity, utility, and privacy.
arXiv Detail & Related papers (2025-03-04T00:47:52Z)
VaeDiff-DocRE: End-to-end Data Augmentation Framework for Document-level Relation Extraction [9.516897428263146]
Document-level Relation Extraction (DocRE) aims to identify relationships between entity pairs within a document. Most existing methods assume a uniform label distribution, resulting in suboptimal performance on real-world, imbalanced datasets. We propose a novel data augmentation approach using generative models to enhance data from the embedding space.
arXiv Detail & Related papers (2024-12-18T04:55:29Z)
TabDiff: a Multi-Modal Diffusion Model for Tabular Data Generation [91.50296404732902]
We introduce TabDiff, a joint diffusion framework that models all multi-modal distributions of tabular data in one model. Our key innovation is the development of a joint continuous-time diffusion process for numerical and categorical data. TabDiff achieves superior average performance over existing competitive baselines, with up to 22.5% improvement over the state-of-the-art model on pair-wise column correlation estimations.
arXiv Detail & Related papers (2024-10-27T22:58:47Z)
Synthesizing Text-to-SQL Data from Weak and Strong LLMs [68.69270834311259]
The capability gap between open-source and closed-source large language models (LLMs) remains a challenge in text-to-SQL tasks. We introduce a synthetic data approach that combines data produced by larger, more powerful models with error information data generated by smaller, not well-aligned models.
arXiv Detail & Related papers (2024-08-06T15:40:32Z)
LaTable: Towards Large Tabular Models [63.995130144110156]
Tabular generative foundation models are hard to build due to the heterogeneous feature spaces of different datasets. LaTable is a novel diffusion model that addresses these challenges and can be trained across different datasets. We find that LaTable outperforms baselines on in-distribution generation, and that finetuning LaTable can generate out-of-distribution datasets better with fewer samples.
arXiv Detail & Related papers (2024-06-25T16:03:50Z)
FedTabDiff: Federated Learning of Diffusion Probabilistic Models for Synthetic Mixed-Type Tabular Data Generation [5.824064631226058]
We introduce Federated Tabular Diffusion (FedTabDiff) for generating high-fidelity mixed-type tabular data without centralized access to the original datasets. FedTabDiff realizes a decentralized learning scheme that permits multiple entities to collaboratively train a generative model while respecting data privacy and locality. Experimental evaluations on real-world financial and medical datasets attest to the framework's capability to produce synthetic data that maintains high fidelity, utility, privacy, and coverage.
arXiv Detail & Related papers (2024-01-11T21:17:50Z)
TAP4LLM: Table Provider on Sampling, Augmenting, and Packing Semi-structured Data for Large Language Model Reasoning [55.33939289989238]
We propose TAP4LLM as a versatile pre-processor suite for leveraging large language models (LLMs) in table-based tasks effectively. It covers several distinct components: (1) table sampling to decompose large tables into manageable sub-tables based on query semantics, (2) table augmentation to enhance tables with additional knowledge from external sources or models, and (3) table packing & serialization to convert tables into various formats suitable for LLMs' understanding.
arXiv Detail & Related papers (2023-12-14T15:37:04Z)
Unite and Conquer: Plug & Play Multi-Modal Synthesis using Diffusion Models [54.1843419649895]
We propose a solution based on denoising diffusion probabilistic models (DDPMs). Our motivation for choosing diffusion models over other generative models comes from the flexible internal structure of diffusion models. Our method can unite multiple diffusion models trained on multiple sub-tasks and conquer the combined task.
arXiv Detail & Related papers (2022-12-01T18:59:55Z)
MAVEN-ERE: A Unified Large-scale Dataset for Event Coreference, Temporal, Causal, and Subevent Relation Extraction [78.61546292830081]
We construct a large-scale human-annotated ERE dataset MAVEN-ERE with improved annotation schemes. It contains 103,193 event coreference chains, 1,216,217 temporal relations, 57,992 causal relations, and 15,841 subevent relations. Experiments show that ERE on MAVEN-ERE is quite challenging, and considering relation interactions with joint learning can improve performances.
arXiv Detail & Related papers (2022-11-14T13:34:49Z)
TabDDPM: Modelling Tabular Data with Diffusion Models [33.202222842342465]
We introduce TabDDPM -- a diffusion model that can be universally applied to any dataset and handles any type of feature. We extensively evaluate TabDDPM on a wide set of benchmarks and demonstrate its superiority over existing GAN/VAE alternatives.
arXiv Detail & Related papers (2022-09-30T12:26:14Z)
Model Joins: Enabling Analytics Over Joins of Absent Big Tables [9.797488793708624]
This work puts forth a framework, Model Join, addressing these challenges. The framework integrates and joins the per-table models of the absent tables. The approximation stems from the models, but not from the Model Join framework.
arXiv Detail & Related papers (2022-06-21T14:28:24Z)
A Novel Global Feature-Oriented Relational Triple Extraction Model based on Table Filling [1.6295073821494463]
We propose a global feature-oriented triple extraction model that makes full use of the mentioned two kinds of global associations. Experimental results show our model is effective and it achieves state-of-the-art results on all of these datasets.
arXiv Detail & Related papers (2021-09-14T14:13:42Z)

This list is automatically generated from the titles and abstracts of the papers on this site.