PluRel: Synthetic Data unlocks Scaling Laws for Relational Foundation Models
- URL: http://arxiv.org/abs/2602.04029v1
- Date: Tue, 03 Feb 2026 21:35:18 GMT
- Title: PluRel: Synthetic Data unlocks Scaling Laws for Relational Foundation Models
- Authors: Vignesh Kothapalli, Rishabh Ranjan, Valter Hudovernik, Vijay Prakash Dwivedi, Johannes Hoffart, Carlos Guestrin, Jure Leskovec
- Abstract summary: We introduce PluRel, a framework to synthesize multi-tabular relational databases from scratch. In a step-by-step fashion, PluRel models (1) schemas with directed graphs, (2) inter-table primary-foreign key connectivity with bipartite graphs, and (3) feature distributions in tables via conditional causal mechanisms.
- Score: 51.42043158297229
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Relational Foundation Models (RFMs) facilitate data-driven decision-making by learning from complex multi-table databases. However, the diverse relational databases needed to train such models are rarely public due to privacy constraints. While there are methods to generate synthetic tabular data of arbitrary size, incorporating schema structure and primary-foreign key connectivity for multi-table generation remains challenging. Here we introduce PluRel, a framework to synthesize multi-tabular relational databases from scratch. In a step-by-step fashion, PluRel models (1) schemas with directed graphs, (2) inter-table primary-foreign key connectivity with bipartite graphs, and (3) feature distributions in tables via conditional causal mechanisms. The design space across these stages supports the synthesis of a wide range of diverse databases, while being computationally lightweight. Using PluRel, we observe for the first time that (1) RFM pretraining loss exhibits power-law scaling with the number of synthetic databases and total pretraining tokens, (2) scaling the number of synthetic databases improves generalization to real databases, and (3) synthetic pretraining yields strong base models for continued pretraining on real databases. Overall, our framework and results position synthetic data scaling as a promising paradigm for RFMs.
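The three stages named in the abstract can be illustrated with a minimal Python sketch. It is not the PluRel implementation: every function name here (`sample_schema`, `sample_fk_links`, `sample_features`, `synthesize_database`) is hypothetical, and the skewed parent weights and the linear-Gaussian mechanism are assumptions chosen only to keep the example short.

```python
import random


def sample_schema(num_tables, rng):
    """Stage 1 (assumed form): a random directed acyclic graph over tables.

    An edge (child, parent) means the child table holds a foreign key
    pointing at the parent table. Parents always have a smaller index.
    """
    edges = []
    for child in range(1, num_tables):
        num_parents = rng.randint(1, min(2, child))
        for parent in rng.sample(range(child), num_parents):
            edges.append((child, parent))
    return edges


def sample_fk_links(num_child_rows, num_parent_rows, rng):
    """Stage 2 (assumed form): a bipartite graph between child and parent rows.

    Each child row is assigned exactly one parent row; the 1/(i+1) weights
    are an illustrative way to make some parents much more popular.
    """
    weights = [1.0 / (i + 1) for i in range(num_parent_rows)]
    return rng.choices(range(num_parent_rows), weights=weights, k=num_child_rows)


def sample_features(schema, rows_per_table, rng):
    """Stage 3 (assumed form): one numeric column per table.

    A row's value is a linear function of the values of the parent rows
    it links to, plus Gaussian noise.
    """
    features, links = {}, {}
    # Tables are processed in index order, so parents are always filled first.
    for table in range(len(rows_per_table)):
        parents = [p for (c, p) in schema if c == table]
        links[table] = {
            p: sample_fk_links(rows_per_table[table], rows_per_table[p], rng)
            for p in parents
        }
        column = []
        for row in range(rows_per_table[table]):
            parent_signal = sum(features[p][links[table][p][row]] for p in parents)
            column.append(0.5 * parent_signal + rng.gauss(0.0, 1.0))
        features[table] = column
    return features, links


def synthesize_database(rows_per_table, seed=0):
    """Run the three stages in order and return the pieces of one database."""
    rng = random.Random(seed)
    schema = sample_schema(len(rows_per_table), rng)
    features, links = sample_features(schema, rows_per_table, rng)
    return {"schema": schema, "links": links, "features": features}


db = synthesize_database([5, 8, 10, 12], seed=0)
print("foreign-key edges (child, parent):", db["schema"])
print("rows per table:", [len(db["features"][t]) for t in sorted(db["features"])])
```

Calling `synthesize_database` with different seeds yields different schemas, link patterns, and feature columns, which is the sense in which a staged design space can produce many distinct databases cheaply.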
Related papers
- Relational In-Context Learning via Synthetic Pre-training with Structural Prior [60.404256960057545]
RDB-PFN is the first relational foundation model trained purely via synthetic data. It is inspired by Prior-Data Fitted Networks (PFNs), where synthetic data generated from Structural Causal Models (SCMs) enables reasoning on single tables. Experiments verify RDB-PFN achieves strong few-shot performance on 19 real-world prediction tasks.
arXiv Detail & Related papers (2026-03-04T07:30:54Z)
- Mitra: Mixed Synthetic Priors for Enhancing Tabular Foundation Models [85.64873567417396]
We introduce Mitra, a TFM trained on a curated mixture of synthetic priors selected for their diversity, distinctiveness, and performance on real-world data. Mitra consistently outperforms state-of-the-art TFMs, such as TabPFNv2 and TabICL, across both classification and regression benchmarks.
arXiv Detail & Related papers (2025-10-24T07:15:06Z)
- Synthesize, Retrieve, and Propagate: A Unified Predictive Modeling Framework for Relational Databases [34.57267286892218]
We propose SRP, a unified predictive modeling framework that synthesizes features using the unary dependency. SRP is designed to fully capture both the unary and the composite dependencies within a relational database.
arXiv Detail & Related papers (2025-08-10T07:59:41Z)
- Generating Synthetic Relational Tabular Data via Structural Causal Models [0.0]
We develop a novel framework that generates realistic synthetic relational data including causal relationships across tables. Our experiments confirm that this framework is able to construct relational datasets with complex inter-table dependencies mimicking real-world scenarios.
arXiv Detail & Related papers (2025-07-04T12:27:23Z)
- Relational Deep Learning: Challenges, Foundations and Next-Generation Architectures [50.46688111973999]
Graph machine learning has led to a significant increase in the capabilities of models that learn on arbitrary graph-structured data. We present a new blueprint that enables end-to-end representation of 'relational entity graphs' without traditional feature engineering. We discuss key challenges including large-scale multi-table integration and the complexities of modeling temporal dynamics and heterogeneous data.
arXiv Detail & Related papers (2025-06-19T23:51:38Z)
- RelDiff: Relational Data Generative Modeling with Graph-Based Diffusion Models [83.6013616017646]
RelDiff is a novel diffusion generative model that synthesizes complete relational databases by explicitly modeling their foreign key graph structure. RelDiff consistently outperforms prior methods in producing realistic and coherent synthetic relational databases.
arXiv Detail & Related papers (2025-05-31T21:01:02Z)
- LLM-TabLogic: Preserving Inter-Column Logical Relationships in Synthetic Tabular Data via Prompt-Guided Latent Diffusion [49.898152180805454]
Synthetic datasets must maintain domain-specific logical consistency. Existing generative models often overlook these inter-column relationships. This study presents the first method to effectively preserve inter-column relationships without requiring domain knowledge.
arXiv Detail & Related papers (2025-03-04T00:47:52Z)
- GFS: Graph-based Feature Synthesis for Prediction over Relational Databases [39.975491511390985]
We propose a novel framework called Graph-based Feature Synthesis (GFS).
GFS formulates a relational database as a heterogeneous graph database.
In an experiment over four real-world multi-table relational databases, GFS outperforms previous methods designed for relational databases.
arXiv Detail & Related papers (2023-12-04T16:54:40Z)
- Generating Realistic Synthetic Relational Data through Graph Variational Autoencoders [47.89542334125886]
We combine the variational autoencoder framework with graph neural networks to generate realistic synthetic relational databases.
The results indicate that real databases' structures are accurately preserved in the resulting synthetic datasets.
arXiv Detail & Related papers (2022-11-30T10:40:44Z)