Real-TabPFN: Improving Tabular Foundation Models via Continued Pre-training With Real-World Data
- URL: http://arxiv.org/abs/2507.03971v1
- Date: Sat, 05 Jul 2025 09:39:07 GMT
- Title: Real-TabPFN: Improving Tabular Foundation Models via Continued Pre-training With Real-World Data
- Authors: Anurag Garg, Muhammad Ali, Noah Hollmann, Lennart Purucker, Samuel Müller, Frank Hutter
- Abstract summary: We show that this performance can be significantly boosted by a targeted continued pre-training phase. We demonstrate that leveraging a small, curated collection of large, real-world datasets for continued pre-training yields superior downstream predictive accuracy. Our resulting model, Real-TabPFN, achieves substantial performance gains on 29 datasets from the OpenML AutoML Benchmark.
- Score: 38.08600450054975
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Foundation models for tabular data, like TabPFN, achieve strong performance on small datasets when pre-trained solely on synthetic data. We show that this performance can be significantly boosted by a targeted continued pre-training phase. Specifically, we demonstrate that leveraging a small, curated collection of large, real-world datasets for continued pre-training yields superior downstream predictive accuracy compared to using broader, potentially noisier corpora like CommonCrawl or GitTables. Our resulting model, Real-TabPFN, achieves substantial performance gains on 29 datasets from the OpenML AutoML Benchmark.
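The abstract describes a continued pre-training phase on a small, curated collection of real-world datasets. As an illustration only, the sketch below shows one generic way such a phase could look: an in-context tabular classifier that has already been pre-trained (e.g., on synthetic priors) is further trained on support/query splits drawn from real datasets. The `TinyInContextClassifier` stand-in, the `continued_pretraining` helper, and the fixed feature dimensionality are assumptions made for the sketch, not the authors' actual architecture or training code.

```python
# Hypothetical sketch of continued pre-training on real tabular data.
# Assumptions (not from the paper): a toy stand-in model, a fixed feature
# dimensionality shared by all datasets, and a simple round-robin over datasets.
import numpy as np
import torch
import torch.nn as nn


class TinyInContextClassifier(nn.Module):
    """Toy stand-in for a TabPFN-style in-context learner (much simplified)."""

    def __init__(self, n_features: int, n_classes: int, d_model: int = 64):
        super().__init__()
        self.embed_x = nn.Linear(n_features, d_model)
        self.embed_y = nn.Embedding(n_classes + 1, d_model)  # extra index = "unknown" label
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, x_support, y_support, x_query):
        # Tokens: embedded (x, y) pairs for the support set, (x, unknown) for queries.
        unknown = torch.full((x_query.shape[0],), self.embed_y.num_embeddings - 1,
                             dtype=torch.long, device=x_query.device)
        tokens = torch.cat([
            self.embed_x(x_support) + self.embed_y(y_support),
            self.embed_x(x_query) + self.embed_y(unknown),
        ], dim=0).unsqueeze(0)                      # (1, n_support + n_query, d_model)
        h = self.encoder(tokens).squeeze(0)
        return self.head(h[x_support.shape[0]:])    # logits for the query rows only


def continued_pretraining(model, real_datasets, steps=1000, support_size=256, lr=1e-5):
    """Further train a (synthetically pre-trained) model on curated real datasets.

    real_datasets: list of (X, y) numpy arrays, each with > support_size rows.
    """
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    for step in range(steps):
        X, y = real_datasets[step % len(real_datasets)]       # one curated real dataset
        idx = np.random.permutation(len(X))
        sup, qry = idx[:support_size], idx[support_size:support_size + 128]
        logits = model(torch.tensor(X[sup], dtype=torch.float32),
                       torch.tensor(y[sup], dtype=torch.long),
                       torch.tensor(X[qry], dtype=torch.float32))
        loss = nn.functional.cross_entropy(logits, torch.tensor(y[qry], dtype=torch.long))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return model
```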
Related papers
- TabPFN: One Model to Rule Them All? [21.658323618943697]
We provide a tailored explanation of how TabPFN works for a statistics audience. We show that an out-of-the-box application of TabPFN vastly outperforms specialized state-of-the-art methods.
arXiv Detail & Related papers (2025-05-26T13:55:29Z)
- Prior-Fitted Networks Scale to Larger Datasets When Treated as Weak Learners [82.72552644267724]
BoostPFN can outperform standard PFNs with the same size of training samples in large datasets. High performance is maintained on datasets up to 50x the pre-training size of PFNs.
arXiv Detail & Related papers (2025-03-03T07:31:40Z)
- Towards Data-Efficient Pretraining for Atomic Property Prediction [51.660835328611626]
We show that pretraining on a task-relevant dataset can match or surpass large-scale pretraining. We introduce the Chemical Similarity Index (CSI), a novel metric inspired by computer vision's Fréchet Inception Distance.
arXiv Detail & Related papers (2025-02-16T11:46:23Z)
- Zero-shot Meta-learning for Tabular Prediction Tasks with Adversarially Pre-trained Transformer [2.1677183904102257]
We present an Adversarially Pre-trained Transformer (APT) that is able to perform zero-shot meta-learning on tabular prediction tasks without pre-training on any real-world dataset. APT is pre-trained with adversarial synthetic data agents that deliberately challenge the model with different synthetic datasets. We show that our framework matches state-of-the-art performance on small classification tasks without filtering on dataset characteristics.
arXiv Detail & Related papers (2025-02-06T23:58:11Z)
- Transformers Boost the Performance of Decision Trees on Tabular Data across Sample Sizes [135.68092471784516]
We propose a simple and lightweight approach for fusing large language models, as well as prior-data fitted networks, with gradient-boosted decision trees. We name these fusion methods LLM-Boost and PFN-Boost, respectively. We demonstrate state-of-the-art performance against numerous baselines and ensembling algorithms.
arXiv Detail & Related papers (2025-02-04T19:30:41Z)
- TabDPT: Scaling Tabular Foundation Models [20.00390825519329]
We show how to harness the power of real data to improve performance and generalization.
Our model achieves state-of-the-art performance on the CC18 (classification) and CTR23 (regression) benchmarks.
TabDPT also demonstrates strong scaling as both model size and amount of available data increase.
arXiv Detail & Related papers (2024-10-23T18:00:00Z)
- Fine-tuned In-Context Learning Transformers are Excellent Tabular Data Classifiers [22.33649426762373]
In this work, we extend TabPFN to the fine-tuning setting, resulting in a significant performance boost. We also discover that fine-tuning enables ICL-transformers to create complex decision boundaries. By combining both dataset generators, we create TabForestPFN, an ICL-transformer that achieves excellent fine-tuning performance and good zero-shot performance.
arXiv Detail & Related papers (2024-05-22T07:13:55Z)
- TuneTables: Context Optimization for Scalable Prior-Data Fitted Networks [90.00817095558094]
Prior-data fitted networks (PFNs) make use of pretraining and in-context learning to achieve strong performance on new tasks in a single forward pass.
We introduce TuneTables, a parameter-efficient fine-tuning strategy for PFNs that compresses large datasets into a smaller learned context.
We show that TuneTables can be used as an interpretability tool and can even be used to mitigate biases by optimizing a fairness objective.
arXiv Detail & Related papers (2024-02-17T00:02:23Z)
- TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second [48.87527918630822]
We present TabPFN, a trained Transformer that can do supervised classification for small datasets in less than a second.
TabPFN performs in-context learning (ICL): it learns to make predictions from sequences of labeled examples (a minimal usage sketch appears after this list).
We show that our method clearly outperforms boosted trees and performs on par with complex state-of-the-art AutoML systems with up to a 230x speedup.
arXiv Detail & Related papers (2022-07-05T07:17:43Z)
- Omni-supervised Facial Expression Recognition via Distilled Data [120.11782405714234]
We propose omni-supervised learning to exploit reliable samples in a large amount of unlabeled data for network training.
We experimentally verify that the resulting enlarged dataset can significantly improve the ability of the learned FER model.
To keep training on this enlarged dataset practical, we propose to apply a dataset distillation strategy that compresses it into several informative class-wise images.
arXiv Detail & Related papers (2020-05-18T09:36:51Z)
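As referenced in the TabPFN entry above, TabPFN predicts via in-context learning: the labeled training examples are supplied as context and the test rows are classified in a single forward pass. Below is a minimal usage sketch, assuming the `tabpfn` Python package's scikit-learn-style `TabPFNClassifier`; constructor options vary between package versions, so defaults are used here.

```python
# Minimal usage sketch for a TabPFN-style in-context classifier (assumed API:
# the scikit-learn-compatible TabPFNClassifier from the `tabpfn` package).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from tabpfn import TabPFNClassifier

X, y = load_breast_cancer(return_X_y=True)            # a small tabular dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = TabPFNClassifier()      # pre-trained weights; no gradient training happens here
clf.fit(X_train, y_train)     # "fit" stores the labeled examples as the in-context support set
y_pred = clf.predict(X_test)  # a single forward pass predicts all test rows

print(f"accuracy: {accuracy_score(y_test, y_pred):.3f}")
```

Since Real-TabPFN changes only the pre-training data, a checkpoint of it would presumably be used through the same interface.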
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.