Related papers: TabICL: A Tabular Foundation Model for In-Context Learning on Large Data

URL: http://arxiv.org/abs/2502.05564v1
Date: Sat, 08 Feb 2025 13:25:04 GMT
Title: TabICL: A Tabular Foundation Model for In-Context Learning on Large Data
Authors: Jingang Qu, David Holzmüller, Gaël Varoquaux, Marine Le Morvan
Abstract summary: We introduce TabICL, a foundation model for classification pretrained on synthetic datasets with up to 60K samples. It is on par with TabPFNv2 while being systematically faster (up to 10 times) and significantly outperforms all other approaches. On 56 datasets with over 10K samples, TabICL surpasses both TabPFNv2 and CatBoost, demonstrating the potential of ICL for large data.
Score: 15.08819125687632
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The long-standing dominance of gradient-boosted decision trees on tabular data is currently challenged by tabular foundation models using In-Context Learning (ICL): setting the training data as context for the test data and predicting in a single forward pass without parameter updates. While the very recent TabPFNv2 foundation model (2025) excels on tables with up to 10K samples, its alternating column- and row-wise attentions make handling large training sets computationally prohibitive. So, can ICL be effectively scaled and deliver a benefit for larger tables? We introduce TabICL, a tabular foundation model for classification, pretrained on synthetic datasets with up to 60K samples and capable of handling 500K samples on affordable resources. This is enabled by a novel two-stage architecture: a column-then-row attention mechanism to build fixed-dimensional embeddings of rows, followed by a transformer for efficient ICL. Across 200 classification datasets from the TALENT benchmark, TabICL is on par with TabPFNv2 while being systematically faster (up to 10 times), and significantly outperforms all other approaches. On 56 datasets with over 10K samples, TabICL surpasses both TabPFNv2 and CatBoost, demonstrating the potential of ICL for large data.
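The two ideas in the abstract, embedding each row into a fixed-dimensional vector and then predicting test labels from the training rows in one forward pass, can be illustrated with a toy NumPy sketch. This is not the TabICL implementation: the function names (`embed_rows`, `icl_predict`), the random untrained projection, the mean pooling, and the single unparameterized attention step are all assumptions made purely for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    """Scaled dot-product attention; works on 2-D or batched 3-D arrays."""
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


def embed_rows(X, d, rng):
    """Toy stage 1 (hypothetical): column-wise attention, then row-wise
    attention, then mean pooling to one d-dimensional vector per row."""
    w = rng.normal(size=(1, d))           # untrained projection (assumption)
    cells = X[:, :, None] * w             # (n_rows, n_cols, d)
    cols = np.swapaxes(cells, 0, 1)       # (n_cols, n_rows, d)
    cols = attention(cols, cols, cols)    # each column attends across rows
    cells = np.swapaxes(cols, 0, 1)       # back to (n_rows, n_cols, d)
    rows = attention(cells, cells, cells) # features interact within a row
    return rows.mean(axis=1)              # (n_rows, d)


def icl_predict(E_train, y_train, E_test, n_classes):
    """Toy stage 2 (hypothetical): test rows attend to training rows and
    average their one-hot labels. No parameters are updated."""
    Y = np.eye(n_classes)[y_train]        # (n_train, n_classes)
    return attention(E_test, E_train, Y)  # (n_test, n_classes)


rng = np.random.default_rng(0)
X_train = rng.normal(size=(8, 3))
y_train = rng.integers(0, 2, size=8)
X_test = rng.normal(size=(4, 3))

# Embed all rows together, then split into context (train) and queries (test).
E = embed_rows(np.vstack([X_train, X_test]), d=4, rng=rng)
proba = icl_predict(E[:8], y_train, E[8:], n_classes=2)
print(proba.shape)
```

Because the second stage operates on one fixed-size vector per row rather than on every cell, its attention cost in this sketch depends on the number of rows but not on the number of columns, which is the intuition behind separating row embedding from ICL.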

Related papers

Real-TabPFN: Improving Tabular Foundation Models via Continued Pre-training With Real-World Data [38.08600450054975]
We show that this performance can be significantly boosted by a targeted continued pre-training phase. We demonstrate that leveraging a small, curated collection of large, real-world datasets for continued pre-training yields superior predictive downstream accuracy. Our resulting model, Real-TabPFN, achieves substantial performance gains on 29 datasets from the OpenML AutoML Benchmark.
arXiv Detail & Related papers (2025-07-05T09:39:07Z)
ConTextTab: A Semantics-Aware Tabular In-Context Learner [0.0]
We introduce ConTextTab, integrating semantic understanding and alignment into a table-native ICL framework. Our model is competitive with SOTA across a broad set of benchmarks while setting a new standard on the semantically rich CARTE benchmark.
arXiv Detail & Related papers (2025-06-12T13:57:29Z)
TabFlex: Scaling Tabular Learning to Millions with Linear Attention [8.018661387739574]
Recent advancements, like TabPFN, excel in small-scale datasets but struggle to scale for large and complex datasets. Our work enhances the efficiency and scalability of TabPFN for larger datasets by incorporating linear attention mechanisms. Our model, TabFlex, efficiently handles tabular datasets with thousands of features and hundreds of classes, scaling seamlessly to millions of samples.
arXiv Detail & Related papers (2025-06-05T20:59:33Z)
A Closer Look at TabPFN v2: Strength, Limitation, and Extension [51.08999772842298]
Tabular Prior-data Fitted Network v2 (TabPFN v2) achieves unprecedented in-context learning accuracy across multiple datasets. In this paper, we evaluate TabPFN v2 on over 300 datasets, confirming its exceptional generalization capabilities on small- to medium-scale tasks.
arXiv Detail & Related papers (2025-02-24T17:38:42Z)
Transformers Boost the Performance of Decision Trees on Tabular Data across Sample Sizes [135.68092471784516]
We propose a simple and lightweight approach for fusing large language models and gradient-boosted decision trees. We name our fusion methods LLM-Boost and PFN-Boost, respectively. We demonstrate state-of-the-art performance against numerous baselines and ensembling algorithms.
arXiv Detail & Related papers (2025-02-04T19:30:41Z)
TabDPT: Scaling Tabular Foundation Models [20.00390825519329]
We show how to harness the power of real data to improve performance and generalization. Our model achieves state-of-the-art performance on the CC18 (classification) and CTR23 (regression) benchmarks. TabDPT also demonstrates strong scaling as both model size and amount of available data increase.
arXiv Detail & Related papers (2024-10-23T18:00:00Z)
LaTable: Towards Large Tabular Models [63.995130144110156]
Tabular generative foundation models are hard to build due to the heterogeneous feature spaces of different datasets. LaTable is a novel diffusion model that addresses these challenges and can be trained across different datasets. We find that LaTable outperforms baselines on in-distribution generation, and that finetuning LaTable can generate out-of-distribution datasets better with fewer samples.
arXiv Detail & Related papers (2024-06-25T16:03:50Z)
Mixture of In-Context Prompters for Tabular PFNs [33.76194735049027]
MIXTUREPFN is the Condorcet winner across 36 diverse datasets against 19 strong deep learning and tree-based baselines. It achieves the highest mean rank among Top-10 aforementioned algorithms with statistical significance.
arXiv Detail & Related papers (2024-05-25T09:47:59Z)
TuneTables: Context Optimization for Scalable Prior-Data Fitted Networks [90.00817095558094]
Prior-data fitted networks (PFNs) make use of pretraining and in-context learning to achieve strong performance on new tasks in a single forward pass. We introduce TuneTables, a parameter-efficient fine-tuning strategy for PFNs that compresses large datasets into a smaller learned context. We show that TuneTables can be used as an interpretability tool and can even be used to mitigate biases by optimizing a fairness objective.
arXiv Detail & Related papers (2024-02-17T00:02:23Z)
In-Context Data Distillation with TabPFN [11.553950697974825]
In-context data distillation (ICD) is a novel methodology that effectively eliminates these constraints by optimizing TabPFN's context. ICD efficiently enables TabPFN to handle significantly larger datasets with a fixed memory budget, improving TabPFN's quadratic memory complexity but at the cost of a linear number of tuning steps.
arXiv Detail & Related papers (2024-02-10T15:23:45Z)
Training-Free Generalization on Heterogeneous Tabular Data via Meta-Representation [67.30538142519067]
We propose Tabular data Pre-Training via Meta-representation (TabPTM). A deep neural network is then trained to associate these meta-representations with dataset-specific classification confidences. Experiments validate that TabPTM achieves promising performance in new datasets, even under few-shot scenarios.
arXiv Detail & Related papers (2023-10-31T18:03:54Z)
MediTab: Scaling Medical Tabular Data Predictors via Data Consolidation, Enrichment, and Refinement [44.693325083735424]
Tabular data prediction has been employed in medical applications such as patient health risk prediction. Previous predictors are often trained on manually curated small datasets.
arXiv Detail & Related papers (2023-05-20T03:37:09Z)
Generative Table Pre-training Empowers Models for Tabular Prediction [71.76829961276032]
We propose TapTap, the first attempt that leverages table pre-training to empower models for tabular prediction. TapTap can generate high-quality synthetic tables to support various applications, including privacy protection, low resource regime, missing value imputation, and imbalanced classification. It can be easily combined with various backbone models, including LightGBM, Multilayer Perceptron (MLP) and Transformer.
arXiv Detail & Related papers (2023-05-16T06:37:38Z)
TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second [48.87527918630822]
We present TabPFN, a trained Transformer that can do supervised classification for small datasets in less than a second. TabPFN performs in-context learning (ICL): it learns to make predictions using sequences of labeled examples. We show that our method clearly outperforms boosted trees and performs on par with complex state-of-the-art AutoML systems with up to 230$\times$ speedup.
arXiv Detail & Related papers (2022-07-05T07:17:43Z)
Scientific evidence extraction [0.0]
We propose a new dataset, Tables One Million (PubTables-1M), and a new class of metric, PubMed grid table similarity (GriTS). PubTables-1M is nearly twice as large as the previous largest comparable dataset. We show that object detection models trained on PubTables-1M produce excellent results out-of-the-box for all three tasks of detection, structure recognition, and functional analysis.
arXiv Detail & Related papers (2021-09-30T19:42:07Z)

This list is automatically generated from the titles and abstracts of the papers on this site.