ConTextTab: A Semantics-Aware Tabular In-Context Learner
- URL: http://arxiv.org/abs/2506.10707v2
- Date: Tue, 08 Jul 2025 19:44:57 GMT
- Title: ConTextTab: A Semantics-Aware Tabular In-Context Learner
- Authors: Marco Spinaci, Marek Polewczyk, Maximilian Schambach, Sam Thelin,
- Abstract summary: We introduce ConTextTab, integrating semantic understanding and alignment into a table-native ICL framework. Our model is competitive with SOTA across a broad set of benchmarks while setting a new standard on the semantically rich CARTE benchmark.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Tabular in-context learning (ICL) has recently achieved state-of-the-art (SOTA) performance on several tabular prediction tasks. Previously restricted to classification problems on small tables, tabular ICL has been extended to larger datasets by recent advances such as TabPFN and TabICL. While architecturally efficient and well adapted to tabular data structures, current table-native ICL architectures are trained exclusively on synthetic data and therefore do not fully leverage the rich semantics and world knowledge contained in real-world tabular data. At the other end of the spectrum, tabular ICL models based on pretrained large language models, such as TabuLa-8B, integrate deep semantic understanding and world knowledge but can only make use of a small amount of context due to inherent architectural limitations. Aiming to combine the best of both worlds, we introduce ConTextTab, which integrates semantic understanding and alignment into a table-native ICL framework. By employing specialized embeddings for different data modalities and by training on large-scale real-world tabular data, our model is competitive with SOTA across a broad set of benchmarks while setting a new standard on the semantically rich CARTE benchmark. Code and checkpoints are available at https://github.com/SAP-samples/contexttab
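To make the architectural idea concrete: the abstract describes routing each cell through an embedding specialized for its data modality before a table-native in-context transformer pass. Below is a minimal, hypothetical PyTorch sketch of that pattern; the module names, dimensions, summed-embedding routing, and column pooling are illustrative assumptions, not the released implementation (see the linked repository for the actual code).

```python
# Hypothetical sketch of modality-specific cell embeddings feeding a
# table-native in-context transformer. Names and sizes are assumptions.
import torch
import torch.nn as nn

class CellEmbedder(nn.Module):
    """Embeds each cell with an encoder chosen by its data modality."""
    def __init__(self, d_model=128, text_vocab=30522, date_buckets=64):
        super().__init__()
        self.text_emb = nn.Embedding(text_vocab, d_model)    # text cells (token ids)
        self.num_proj = nn.Linear(1, d_model)                # numeric cells (scalars)
        self.date_emb = nn.Embedding(date_buckets, d_model)  # date cells (bucketed)

    def forward(self, text_ids, numeric_vals, date_ids):
        # Inputs are (batch, rows, cols); summing the three modality embeddings
        # keeps this sketch short -- a real model would route per column type.
        return (self.text_emb(text_ids)
                + self.num_proj(numeric_vals.unsqueeze(-1))
                + self.date_emb(date_ids))

embedder = CellEmbedder()
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=128, nhead=4, batch_first=True),
    num_layers=2,
)

# A tiny table: 8 rows x 4 columns. Context rows carry labels; query rows
# attend to them in the same forward pass (the essence of tabular ICL).
text_ids = torch.randint(0, 30522, (1, 8, 4))
numeric_vals = torch.randn(1, 8, 4)
date_ids = torch.randint(0, 64, (1, 8, 4))
cells = embedder(text_ids, numeric_vals, date_ids)  # (1, 8, 4, 128)
rows = cells.mean(dim=2)                            # pool columns into row tokens
out = encoder(rows)                                 # (1, 8, 128)
```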
Related papers
- Multimodal Tabular Reasoning with Privileged Structured Information [67.40011423365712]
We introduce TabUlar Reasoning with Bridged infOrmation (Turbo). Turbo benefits from a structure-aware reasoning trace generator based on DeepSeek-R1. Turbo achieves state-of-the-art performance (+7.2% vs. previous SOTA) across multiple datasets.
arXiv Detail & Related papers (2025-06-04T15:46:30Z)
- TabSTAR: A Foundation Tabular Model With Semantically Target-Aware Representations [14.12892960275563]
Tabular Foundation Models can leverage real-world knowledge and generalize across diverse datasets. We introduce TabSTAR: a Foundation Tabular Model with Semantically Target-Aware Representations.
arXiv Detail & Related papers (2025-05-23T17:34:28Z)
- Make Still Further Progress: Chain of Thoughts for Tabular Data Leaderboard [27.224577475861214]
Tabular data, a fundamental data format in machine learning, is widely used in competitions and real-world applications. We propose an in-context ensemble framework for tabular prediction that leverages large language models. Our method constructs a context around each test instance from its nearest neighbors and the predictions of a pool of external models, as sketched below.
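As a rough illustration of such an in-context ensemble, the sketch below (an assumption-laden toy, not the paper's released code) serializes a test instance's nearest training neighbors together with external model predictions into a prompt that could be handed to an LLM; `build_context`, the prompt format, and the model names are all hypothetical.

```python
# Toy sketch: build an in-context prompt for one test instance from its
# nearest neighbors plus predictions of external models. All names assumed.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def build_context(X_train, y_train, x_test, external_preds, k=5):
    """Serialize the k nearest neighbors and external predictions as text."""
    index = NearestNeighbors(n_neighbors=k).fit(X_train)
    _, idx = index.kneighbors(x_test.reshape(1, -1))
    lines = [f"features={X_train[i].round(3).tolist()} -> label={y_train[i]}"
             for i in idx[0]]
    lines.append(f"external model predictions: {external_preds}")
    lines.append(f"query features={x_test.round(3).tolist()} -> label=?")
    return "\n".join(lines)

X_train = np.random.rand(100, 4)
y_train = np.random.randint(0, 2, 100)
prompt = build_context(X_train, y_train, np.random.rand(4),
                       {"xgboost": 1, "mlp": 0})
print(prompt)  # this text would be the LLM's in-context ensemble prompt
```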
arXiv Detail & Related papers (2025-05-19T17:52:58Z)
- TabICL: A Tabular Foundation Model for In-Context Learning on Large Data [15.08819125687632]
We introduce TabICL, a tabular foundation model for classification, pretrained on synthetic datasets with up to 60K samples. Across 200 classification datasets from the TALENT benchmark, TabICL is on par with TabPFNv2 while being systematically faster (up to 10 times). On 53 datasets with over 10K samples, TabICL surpasses both TabPFNv2 and CatBoost, demonstrating the potential of ICL for large data.
arXiv Detail & Related papers (2025-02-08T13:25:04Z)
- Scalable In-Context Learning on Tabular Data via Retrieval-Augmented Large Language Models [15.603556124006479]
We propose retrieval-augmented language models for scalable TabICL. Our approach incorporates a customized retrieval module, combined with retrieval-guided instruction tuning for LLMs. This enables LLMs to effectively leverage larger datasets, achieving significantly improved performance across 69 widely recognized datasets.
arXiv Detail & Related papers (2025-02-05T13:16:41Z)
- TabDPT: Scaling Tabular Foundation Models [20.00390825519329]
We show how to harness the power of real data to improve performance and generalization.
Our model achieves state-of-the-art performance on the CC18 (classification) and CTR23 (regression) benchmarks.
TabDPT also demonstrates strong scaling as both model size and amount of available data increase.
arXiv Detail & Related papers (2024-10-23T18:00:00Z)
- UniTabNet: Bridging Vision and Language Models for Enhanced Table Structure Recognition [55.153629718464565]
We introduce UniTabNet, a novel framework for table structure parsing based on an image-to-text model.
UniTabNet employs a "divide-and-conquer" strategy, utilizing an image-to-text model to decouple table cells and integrating both physical and logical decoders to reconstruct the complete table structure.
arXiv Detail & Related papers (2024-09-20T01:26:32Z)
- LaTable: Towards Large Tabular Models [63.995130144110156]
Tabular generative foundation models are hard to build due to the heterogeneous feature spaces of different datasets.
LaTable is a novel diffusion model that addresses these challenges and can be trained across different datasets.
We find that LaTable outperforms baselines on in-distribution generation, and that a finetuned LaTable can generate out-of-distribution datasets better with fewer samples.
arXiv Detail & Related papers (2024-06-25T16:03:50Z)
- PixT3: Pixel-based Table-To-Text Generation [66.96636025277536]
We present PixT3, a multimodal table-to-text model that overcomes the challenges of linearization and input size limitations.
Experiments on the ToTTo and Logic2Text benchmarks show that PixT3 is competitive with, and in some settings superior to, generators that operate solely on text.
arXiv Detail & Related papers (2023-11-16T11:32:47Z)
- TabLLM: Few-shot Classification of Tabular Data with Large Language Models [66.03023402174138]
We study the application of large language models to zero-shot and few-shot classification of tabular data.
We evaluate several serialization methods, including templates, table-to-text models, and large language models; a toy template example appears below.
This approach is also competitive with strong traditional baselines like gradient-boosted trees.
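For intuition about template-based serialization, one of the strategies evaluated in this line of work, here is a toy example; the function and phrasing are hypothetical illustrations, not TabLLM's exact templates.

```python
# Toy template serialization of a table row into a natural-language prompt.
def serialize_row(row: dict, label_col: str) -> str:
    parts = [f"The {k} is {v}." for k, v in row.items() if k != label_col]
    return " ".join(parts) + f" What is the {label_col}?"

row = {"age": 42, "occupation": "teacher", "income": "?"}
print(serialize_row(row, "income"))
# -> The age is 42. The occupation is teacher. What is the income?
```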
arXiv Detail & Related papers (2022-10-19T17:08:13Z)
- SubTab: Subsetting Features of Tabular Data for Self-Supervised Representation Learning [5.5616364225463055]
We introduce a new framework, Subsetting features of Tabular data (SubTab).
We argue that reconstructing the data from a subset of its features, rather than from its corrupted version in an autoencoder setting, can better capture its underlying representation; a minimal sketch follows.
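The sketch below assumes one random feature subset per forward pass for brevity (the paper divides features into multiple, possibly overlapping subsets); module names and sizes are illustrative, not the authors' code.

```python
# Toy autoencoder that reconstructs ALL features from a feature subset.
import torch
import torch.nn as nn

class SubsetAutoencoder(nn.Module):
    def __init__(self, n_features=16, subset=8, hidden=32):
        super().__init__()
        self.subset = subset
        self.encoder = nn.Sequential(nn.Linear(subset, hidden), nn.ReLU())
        self.decoder = nn.Linear(hidden, n_features)  # full reconstruction

    def forward(self, x):
        idx = torch.randperm(x.shape[1])[: self.subset]  # pick a feature subset
        return self.decoder(self.encoder(x[:, idx]))

model = SubsetAutoencoder()
x = torch.randn(4, 16)
loss = nn.functional.mse_loss(model(x), x)  # reconstruct full row from subset
loss.backward()
```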
arXiv Detail & Related papers (2021-10-08T20:11:09Z)
- GraPPa: Grammar-Augmented Pre-Training for Table Semantic Parsing [117.98107557103877]
We present GraPPa, an effective pre-training approach for table semantic parsing.
We construct synthetic question-SQL pairs over high-quality tables via a synchronous context-free grammar; a toy sketch of this generation scheme appears below.
To maintain the model's ability to represent real-world data, we also include masked language modeling.
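To illustrate grammar-based data generation, here is a toy synchronous expansion in the spirit of the approach; the grammar, rules, and helper are invented for this sketch and far simpler than the paper's SCFG.

```python
# Toy synchronous grammar: each rule expands an utterance and its SQL in lockstep.
import random

RULES = {
    "QUERY": [("show {COL} of {TABLE}", "SELECT {COL} FROM {TABLE}")],
    "COL":   [("the name", "name"), ("the price", "price")],
    "TABLE": [("products", "products"), ("stores", "stores")],
}

def expand(symbol):
    utt, sql = random.choice(RULES[symbol])
    for nt in ("COL", "TABLE"):
        if "{" + nt + "}" in utt:
            sub_utt, sub_sql = expand(nt)
            utt = utt.replace("{" + nt + "}", sub_utt)
            sql = sql.replace("{" + nt + "}", sub_sql)
    return utt, sql

print(expand("QUERY"))
# e.g. ('show the price of products', 'SELECT price FROM products')
```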
arXiv Detail & Related papers (2020-09-29T08:17:58Z)
- TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data [113.29476656550342]
We present TaBERT, a pretrained LM that jointly learns representations for NL sentences and tables.
TaBERT is trained on a large corpus of 26 million tables and their English contexts.
Implementation of the model will be available at http://fburl.com/TaBERT.
arXiv Detail & Related papers (2020-05-17T17:26:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.