RDBLearn: Simple In-Context Prediction Over Relational Databases
- URL: http://arxiv.org/abs/2602.18495v1
- Date: Sat, 14 Feb 2026 09:24:04 GMT
- Title: RDBLearn: Simple In-Context Prediction Over Relational Databases
- Authors: Yanlin Zhang, Linjie Xu, Quan Gan, David Wipf, Minjie Wang
- Abstract summary: We show that tabular in-context learning (ICL) can be extended to relational prediction with a simple recipe. We package this approach in RDBLearn, an easy-to-use toolkit with a scikit-learn-style estimator interface. Across a broad collection of RelBench and 4DBInfer datasets, RDBLearn is the best-performing foundation model approach we evaluate.
- Score: 21.996337463952255
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in tabular in-context learning (ICL) show that a single pretrained model can adapt to new prediction tasks from a small set of labeled examples, avoiding per-task training and heavy tuning. However, many real-world tasks live in relational databases, where predictive signal is spread across multiple linked tables rather than a single flat table. We show that tabular ICL can be extended to relational prediction with a simple recipe: automatically featurize each target row using relational aggregations over its linked records, materialize the resulting augmented table, and run an off-the-shelf tabular foundation model on it. We package this approach in RDBLearn (https://github.com/HKUSHXLab/rdblearn), an easy-to-use toolkit with a scikit-learn-style estimator interface that makes it straightforward to swap different tabular ICL backends; a complementary agent-specific interface is provided as well. Across a broad collection of RelBench and 4DBInfer datasets, RDBLearn is the best-performing foundation model approach we evaluate, at times even outperforming strong supervised baselines trained or fine-tuned on each dataset.
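The three-step recipe in the abstract (aggregate linked records per target row, materialize the augmented table, run a tabular model on it) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy `users` and `orders` tables, the helper names `featurize` and `predict_in_context`, and the 1-nearest-neighbour predictor, which only stands in for a pretrained tabular foundation model. It is not the RDBLearn API.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy database: a target table (users) and a linked table (orders).
users = [
    {"user_id": 1, "age": 34, "churned": 0},
    {"user_id": 2, "age": 51, "churned": 1},
    {"user_id": 3, "age": 29, "churned": 0},
    {"user_id": 4, "age": 47, "churned": None},  # query row: label unknown
]
orders = [
    {"user_id": 1, "amount": 20.0},
    {"user_id": 1, "amount": 35.0},
    {"user_id": 2, "amount": 5.0},
    {"user_id": 3, "amount": 40.0},
    {"user_id": 3, "amount": 25.0},
    {"user_id": 3, "amount": 30.0},
    {"user_id": 4, "amount": 4.0},
]


def featurize(targets, children, key, value):
    """Add count/sum/mean aggregations of linked child rows to each target row."""
    grouped = defaultdict(list)
    for row in children:
        grouped[row[key]].append(row[value])
    augmented = []
    for row in targets:
        vals = grouped.get(row[key], [])
        augmented.append({
            **row,
            f"{value}_count": len(vals),
            f"{value}_sum": sum(vals),
            f"{value}_mean": mean(vals) if vals else 0.0,
        })
    return augmented


def predict_in_context(context, query, features, label):
    """Stand-in for a tabular ICL model: copy the label of the nearest context row."""
    def squared_distance(a, b):
        return sum((a[f] - b[f]) ** 2 for f in features)
    nearest = min(context, key=lambda row: squared_distance(row, query))
    return nearest[label]


# Step 1 and 2: featurize and materialize the augmented single table.
table = featurize(users, orders, key="user_id", value="amount")

# Step 3: labeled rows act as the in-context examples; the unlabeled row is the query.
context = [row for row in table if row["churned"] is not None]
query = next(row for row in table if row["churned"] is None)
features = ["age", "amount_count", "amount_sum", "amount_mean"]
prediction = predict_in_context(context, query, features, "churned")
print(prediction)
```

No training step appears in the sketch: the labeled rows are passed to the predictor at prediction time, which mirrors the in-context setting. In practice the nearest-neighbour stand-in would be replaced by a pretrained tabular ICL backend.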
Related papers
- Comparing Task-Agnostic Embedding Models for Tabular Data [1.6479389738270018]
This work specifically focuses on representation learning, i.e., on transferable, task-agnostic embeddings.
Tableizer features achieve comparable or superior performance while being up to three orders of magnitude faster than recent foundation models.
arXiv Detail & Related papers (2025-11-18T09:10:40Z)
- Generalization Can Emerge in Tabular Foundation Models From a Single Table [38.07740881271672]
We show that simple self-supervised pre-training on just a single real table can produce surprisingly strong transfer across heterogeneous benchmarks.
We then connect this to the pre-training procedure shared by most tabular foundation models (TFMs) and show that the number and quality of tasks one can construct from a dataset is key to downstream performance.
arXiv Detail & Related papers (2025-11-12T19:12:40Z)
- Relational Database Distillation: From Structured Tables to Condensed Graph Data [48.347717300340435]
We aim to distill large-scale relational databases (RDBs) into compact heterogeneous graphs while retaining the power required for graph-based models.
We further design a kernel ridge regression-guided objective with pseudo-labels, which produces quality features for the distilled graph.
arXiv Detail & Related papers (2025-10-08T13:05:31Z)
- TabICL: A Tabular Foundation Model for In-Context Learning on Large Data [15.08819125687632]
We introduce TabICL, a tabular foundation model for classification, pretrained on synthetic datasets with up to 60K samples.
Across 200 classification datasets from the TALENT benchmark, TabICL is on par with TabPFNv2 while being systematically faster (up to 10 times).
On 53 datasets with over 10K samples, TabICL surpasses both TabPFNv2 and CatBoost, demonstrating the potential of ICL for large data.
arXiv Detail & Related papers (2025-02-08T13:25:04Z)
- Towards Better Understanding Table Instruction Tuning: Decoupling the Effects from Data versus Models [62.47618742274461]
We fine-tune base models from the Mistral, OLMo, and Phi families on existing public training datasets.
Our replication achieves performance on par with or surpassing existing table LLMs.
We decouple the contributions of training data and the base model, providing insight into their individual impacts.
arXiv Detail & Related papers (2025-01-24T18:50:26Z)
- PORTAL: Scalable Tabular Foundation Models via Content-Specific Tokenization [7.036380633387952]
We introduce PORTAL (Pretraining One-Row-at-a-Time for All tabLes), a framework that handles various data modalities without the need for cleaning or preprocessing.
It can be effectively pre-trained on online-collected datasets and fine-tuned to match state-of-the-art methods on complex classification and regression tasks.
arXiv Detail & Related papers (2024-10-17T13:05:44Z)
- RelBench: A Benchmark for Deep Learning on Relational Databases [78.52438155603781]
We present RelBench, a public benchmark for solving predictive tasks over relational databases with graph neural networks.
We use RelBench to conduct the first comprehensive study of Relational Deep Learning (RDL).
RDL learns better models whilst reducing the human work needed by more than an order of magnitude.
arXiv Detail & Related papers (2024-07-29T14:46:13Z)
- Relational Deep Learning: Graph Representation Learning on Relational Databases [69.7008152388055]
We introduce an end-to-end representation approach to learn on data laid out across multiple tables.
Message Passing Graph Neural Networks can then automatically learn across the graph to extract representations that leverage all data input.
arXiv Detail & Related papers (2023-12-07T18:51:41Z)
- Rethinking Pre-Training in Tabular Data: A Neighborhood Embedding Perspective [71.45945607871715]
We propose Tabular data Pre-Training via Meta-representation (TabPTM).
The core idea is to embed data instances into a shared feature space, where each instance is represented by its distance to a fixed number of nearest neighbors and their labels.
Extensive experiments on 101 datasets confirm TabPTM's effectiveness in both classification and regression tasks, with and without fine-tuning.
arXiv Detail & Related papers (2023-10-31T18:03:54Z)
- TABBIE: Pretrained Representations of Tabular Data [22.444607481407633]
We devise a simple pretraining objective that learns exclusively from tabular data.
Unlike competing approaches, our model (TABBIE) provides embeddings of all table substructures.
A qualitative analysis of our model's learned cell, column, and row representations shows that it understands complex table semantics and numerical trends.
arXiv Detail & Related papers (2021-05-06T11:15:16Z)
- GraPPa: Grammar-Augmented Pre-Training for Table Semantic Parsing [117.98107557103877]
We present GraPPa, an effective pre-training approach for table semantic parsing.
We construct synthetic question-SQL pairs over high-quality tables via a synchronous context-free grammar.
To maintain the model's ability to represent real-world data, we also include masked language modeling.
arXiv Detail & Related papers (2020-09-29T08:17:58Z)