No Need to Train Your RDB Foundation Model
- URL: http://arxiv.org/abs/2602.13697v1
- Date: Sat, 14 Feb 2026 09:38:57 GMT
- Title: No Need to Train Your RDB Foundation Model
- Authors: Linjie Xu, Yanlin Zhang, Quan Gan, Minjie Wang, David Wipf,
- Abstract summary: We present a family of RDB encoders that can be seamlessly paired with already-existing single-table ICL foundation models. From a practical standpoint, we develop scalable SQL primitives to implement the encoder stage, resulting in an easy-to-use open-source RDB foundation model.
- Score: 21.996337463952255
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Relational databases (RDBs) contain vast amounts of heterogeneous tabular information that can be exploited for predictive modeling purposes. But since the space of potential targets is vast across enterprise settings, how can we \textit{avoid retraining} a new model each time we wish to predict a new quantity of interest? Foundation models based on in-context learning (ICL) offer a convenient option, but so far are largely restricted to single-table operability. In generalizing to multiple interrelated tables, it is essential to compress variably-sized RDB neighborhoods into fixed-length ICL samples for consumption by the decoder. However, the details here are critical: unlike existing supervised learning RDB pipelines, we provide theoretical and empirical evidence that ICL-specific compression should be constrained \emph{within} high-dimensional RDB columns where all entities share units and roles, not \textit{across} columns where the relevance of heterogeneous data types cannot possibly be determined without label information. Conditioned on this restriction, we then demonstrate that encoder expressiveness is actually not compromised by excluding trainable parameters. Hence we arrive at a principled family of RDB encoders that can be seamlessly paired with already-existing single-table ICL foundation models, whereby no training or fine-tuning is required. From a practical standpoint, we develop scalable SQL primitives to implement the encoder stage, resulting in an easy-to-use open-source RDB foundation model\footnote{\label{foot: RDBLearn_learn} https://github.com/HKUSHXLab/rdblearn} capable of robust performance on unseen datasets out of the box.
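The within-column compression described in the abstract can be illustrated with ordinary SQL aggregates: each entity's variably-sized neighborhood of related rows is reduced, one column at a time, to a fixed-length vector of statistics that a single-table ICL model can consume. The sketch below is purely illustrative (the paper's actual SQL primitives may differ), and all table and column names are hypothetical.

```python
import sqlite3

# Toy RDB: a "users" entity table and a variably-sized "orders" child table
# linked by a foreign key. All names here are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users  (user_id INTEGER PRIMARY KEY);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY,
                     user_id  INTEGER REFERENCES users(user_id),
                     amount   REAL);
INSERT INTO users  VALUES (1), (2);
INSERT INTO orders VALUES (10, 1, 5.0), (11, 1, 7.0), (12, 1, 9.0),
                          (20, 2, 4.0);
""")

# Within-column compression: each user's variable-length neighborhood of
# `amount` values is reduced to a fixed-length vector of per-column
# aggregates, yielding one fixed-width row per entity. No values from
# different columns are mixed.
rows = conn.execute("""
SELECT u.user_id,
       COUNT(o.order_id) AS n_orders,
       AVG(o.amount)     AS amount_mean,
       MIN(o.amount)     AS amount_min,
       MAX(o.amount)     AS amount_max
FROM users u LEFT JOIN orders o ON o.user_id = u.user_id
GROUP BY u.user_id
ORDER BY u.user_id
""").fetchall()
print(rows)  # [(1, 3, 7.0, 5.0, 9.0), (2, 1, 4.0, 4.0, 4.0)]
```

Because each aggregate summarizes a single column, no cross-column relevance judgment is required, which matches the paper's argument that relevance across heterogeneous columns cannot be determined without label information.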
Related papers
- Relational In-Context Learning via Synthetic Pre-training with Structural Prior [60.404256960057545]
RDB-PFN is the first relational foundation model trained purely on \textbf{synthetic} data. It is inspired by Prior-Data Fitted Networks (PFNs), where synthetic data generated from Structural Causal Models (SCMs) enables reasoning on single tables. Experiments verify that RDB-PFN achieves strong few-shot performance on 19 real-world prediction tasks.
arXiv Detail & Related papers (2026-03-04T07:30:54Z) - RDBLearn: Simple In-Context Prediction Over Relational Databases [21.996337463952255]
We show that single-table in-context prediction can be extended to relational databases with a simple recipe. We package this approach in \textit{RDBLearn}, an easy-to-use toolkit with a scikit-learn-style estimator interface. Across a broad collection of RelBench and 4DBInfer datasets, RDBLearn is the best-performing foundation model approach we evaluate.
arXiv Detail & Related papers (2026-02-14T09:24:04Z) - PluRel: Synthetic Data unlocks Scaling Laws for Relational Foundation Models [51.42043158297229]
We introduce PluRel, a framework to synthesize multi-tabular relational databases from scratch. In a step-by-step fashion, PluRel models (1) schemas with directed graphs, (2) inter-table primary-foreign key connectivity with bipartite graphs, and (3) feature distributions in tables via conditional causal mechanisms.
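The three-step recipe summarized above can be sketched in miniature. The snippet below is a toy illustration under assumed simplifications (a tiny hand-picked schema, Gaussian features), not PluRel's actual generator.

```python
import random

random.seed(0)

# Step (1): a schema as a directed graph over hypothetical table names;
# an edge (child, parent) means the child holds a foreign key into parent.
tables = ["users", "orders", "items"]
schema_edges = [("orders", "users"), ("orders", "items")]

# Step (2): primary/foreign-key connectivity as a bipartite graph:
# each child row is linked to one parent row per FK edge.
n_rows = {"users": 3, "items": 2, "orders": 5}
fk = {(c, p): [random.randrange(n_rows[p]) for _ in range(n_rows[c])]
      for (c, p) in schema_edges}

# Step (3): feature columns filled by a simple conditional mechanism:
# a child feature depends (noisily) on its linked parent's feature.
feats = {"users": [random.gauss(0, 1) for _ in range(n_rows["users"])]}
feats["orders"] = [feats["users"][fk[("orders", "users")][i]]
                   + random.gauss(0, 0.1) for i in range(n_rows["orders"])]

print(len(feats["orders"]))  # 5
```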
arXiv Detail & Related papers (2026-02-03T21:35:18Z) - Generalization Can Emerge in Tabular Foundation Models From a Single Table [38.07740881271672]
We show that simple self-supervised pre-training on just a \emph{single} real table can produce surprisingly strong transfer across heterogeneous benchmarks. We then connect this to the pre-training procedure shared by most TFMs and show that the number and quality of \emph{tasks} one can construct from a dataset is key to downstream performance.
arXiv Detail & Related papers (2025-11-12T19:12:40Z) - Relational Database Distillation: From Structured Tables to Condensed Graph Data [48.347717300340435]
We aim to distill large-scale RDBs into compact heterogeneous graphs while retaining the predictive power required by graph-based models. We further design a kernel ridge regression-guided objective with pseudo-labels, which produces quality features for the distilled graph.
arXiv Detail & Related papers (2025-10-08T13:05:31Z) - TabINR: An Implicit Neural Representation Framework for Tabular Data Imputation [0.6407815281667869]
We introduce TabINR, an auto-decoder-based Implicit Neural Representation framework that models tables as neural functions. We evaluate our framework across a diverse range of twelve real-world datasets and multiple missingness mechanisms.
arXiv Detail & Related papers (2025-10-01T17:24:35Z) - SPaRFT: Self-Paced Reinforcement Fine-Tuning for Large Language Models [51.74498855100541]
Large language models (LLMs) have shown strong reasoning capabilities when fine-tuned with reinforcement learning (RL). We propose \textbf{SPaRFT}, a self-paced learning framework that enables efficient learning based on the capability of the model being trained.
arXiv Detail & Related papers (2025-08-07T03:50:48Z) - DecoyDB: A Dataset for Graph Contrastive Learning in Protein-Ligand Binding Affinity Prediction [10.248499818896693]
Predicting the binding affinity of protein-ligand complexes plays a vital role in drug discovery. The widely used PDBbind dataset has fewer than 20K labeled complexes. We propose DecoyDB, a large-scale, structure-aware dataset for self-supervised graph contrastive learning.
arXiv Detail & Related papers (2025-07-08T20:02:53Z) - Joint Relational Database Generation via Graph-Conditional Diffusion Models [44.06390394789874]
Building generative models for relational databases (RDBs) is important for applications such as privacy-preserving data release and synthesizing realistic datasets. Most prior work either focuses on single-table generation or relies on autoregressive factorizations that impose a fixed table order and generate tables sequentially. We propose a fundamentally different approach: jointly modeling all tables in an RDB without imposing any order.
arXiv Detail & Related papers (2025-05-22T11:12:56Z) - TabDPT: Scaling Tabular Foundation Models on Real Data [20.00390825519329]
We propose an approach to combine ICL-based retrieval with self-supervised learning to train foundation models. We show that incorporating real data during the pre-training phase can lead to significantly faster training and better generalization to unseen data. Our resulting model, TabDPT, achieves top performance on both regression (CTR23) and classification (CC18) benchmarks.
arXiv Detail & Related papers (2024-10-23T18:00:00Z) - Relational Deep Learning: Graph Representation Learning on Relational Databases [69.7008152388055]
We introduce an end-to-end representation learning approach to learn on data laid out across multiple tables.
Message Passing Graph Neural Networks can then automatically learn across the graph to extract representations that leverage all data input.
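As a minimal sketch of the construction this summary describes, rows can become typed nodes and each primary-foreign key pair an edge. The schema names are hypothetical, and the simple count aggregation stands in for the learned messages a GNN would pass.

```python
# Hypothetical two-table schema: rows become typed nodes and each
# primary-foreign key reference becomes an edge, yielding a
# heterogeneous graph a message-passing GNN can operate on.
users  = [{"user_id": 1}, {"user_id": 2}]
orders = [{"order_id": 10, "user_id": 1},
          {"order_id": 11, "user_id": 1},
          {"order_id": 20, "user_id": 2}]

nodes = [("users", u["user_id"]) for u in users] + \
        [("orders", o["order_id"]) for o in orders]
# FK edge: each order row connects to the user row it references.
edges = [(("orders", o["order_id"]), ("users", o["user_id"]))
         for o in orders]

# One round of message passing: each user node aggregates a count of its
# incident order nodes (a stand-in for learned neural messages).
degree = {}
for (_, dst) in edges:
    degree[dst] = degree.get(dst, 0) + 1
print(degree)  # {('users', 1): 2, ('users', 2): 1}
```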
arXiv Detail & Related papers (2023-12-07T18:51:41Z) - Serving Deep Learning Model in Relational Databases [70.53282490832189]
Serving deep learning (DL) models on relational data has become a critical requirement across diverse commercial and scientific domains.
We highlight three pivotal paradigms: The state-of-the-art DL-centric architecture offloads DL computations to dedicated DL frameworks.
The potential UDF-centric architecture encapsulates one or more tensor computations into User Defined Functions (UDFs) within the relational database management system (RDBMS).
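The UDF-centric paradigm can be illustrated with SQLite's Python bindings, where a scalar function is registered inside the RDBMS and invoked directly from SQL. The toy threshold "model" below is a hypothetical stand-in for a real DL model, and the table names are invented for the sketch.

```python
import sqlite3

# UDF-centric serving sketch: the "model" is registered as a scalar UDF
# inside the RDBMS, so inference runs next to the data instead of being
# offloaded to an external DL framework.
def predict(amount):
    return 1 if amount > 5.0 else 0  # toy threshold "model"

conn = sqlite3.connect(":memory:")
conn.create_function("predict", 1, predict)
conn.executescript("""
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL);
INSERT INTO orders VALUES (1, 3.0), (2, 9.0);
""")

# Inference is now just a SQL query calling the registered UDF.
preds = conn.execute(
    "SELECT order_id, predict(amount) FROM orders ORDER BY order_id"
).fetchall()
print(preds)  # [(1, 0), (2, 1)]
```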
arXiv Detail & Related papers (2023-10-07T06:01:35Z) - Examining and Combating Spurious Features under Distribution Shift [94.31956965507085]
We define and analyze robust and spurious representations using the information-theoretic concept of minimal sufficient statistics.
We prove that even when there is only bias of the input distribution, models can still pick up spurious features from their training data.
Inspired by our analysis, we demonstrate that group DRO can fail when groups do not directly account for various spurious correlations.
arXiv Detail & Related papers (2021-06-14T05:39:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the generated list (including all information) and is not responsible for any consequences.