Towards Benchmarking Foundation Models for Tabular Data With Text
- URL: http://arxiv.org/abs/2507.07829v1
- Date: Thu, 10 Jul 2025 15:01:31 GMT
- Title: Towards Benchmarking Foundation Models for Tabular Data With Text
- Authors: Martin Mráz, Breenda Das, Anshul Gupta, Lennart Purucker, Frank Hutter
- Abstract summary: We propose a series of simple yet effective ablation-style strategies for incorporating text into conventional pipelines. We benchmark how state-of-the-art tabular foundation models can handle textual data by manually curating a collection of real-world datasets with meaningful textual features.
- Score: 36.3195231571412
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Foundation models for tabular data are rapidly evolving, with increasing interest in extending them to support additional modalities such as free-text features. However, existing benchmarks for tabular data rarely include textual columns, and identifying real-world tabular datasets with semantically rich text features is non-trivial. We propose a series of simple yet effective ablation-style strategies for incorporating text into conventional tabular pipelines. Moreover, we benchmark how state-of-the-art tabular foundation models can handle textual data by manually curating a collection of real-world tabular datasets with meaningful textual features. Our study is an important step towards improving benchmarking of foundation models for tabular data with text.
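To make the ablation-style idea concrete, here is a minimal sketch of three ways a free-text column might be folded into, or ablated from, a conventional tabular pipeline: drop it, treat it as an opaque categorical, or embed it with TF-IDF. The column names and the choice of strategies are illustrative assumptions, not the paper's published recipe.

```python
# A minimal sketch of three ablation-style ways to handle a free-text column
# in a conventional tabular pipeline. Column names and strategies below are
# illustrative assumptions, not the paper's published recipe.
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

NUMERIC_COLS = ["age", "income"]  # hypothetical numeric features
TEXT_COL = "notes"                # hypothetical free-text feature


def make_pipeline(strategy: str) -> Pipeline:
    """Build one ablation variant: 'drop', 'categorical', or 'tfidf'."""
    if strategy == "drop":
        text_step = ("text", "drop", [TEXT_COL])
    elif strategy == "categorical":
        # Each distinct string becomes an integer code, ignoring semantics.
        encoder = OrdinalEncoder(handle_unknown="use_encoded_value",
                                 unknown_value=-1)
        text_step = ("text", encoder, [TEXT_COL])
    else:
        # TfidfVectorizer expects a 1-D column, hence the bare column name.
        text_step = ("text", TfidfVectorizer(max_features=256), TEXT_COL)
    features = ColumnTransformer(
        [("num", "passthrough", NUMERIC_COLS), text_step])
    return Pipeline([("features", features),
                     ("model", GradientBoostingClassifier())])
```

Fitting all three variants on the same train/test split isolates how much predictive signal the text column actually carries.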
Related papers
- TabSTAR: A Foundation Tabular Model With Semantically Target-Aware Representations [14.12892960275563]
Tabular Foundation Models can leverage real-world knowledge and generalize across diverse datasets. We introduce TabSTAR: a Foundation Tabular Model with Semantically Target-Aware Representations.
arXiv Detail & Related papers (2025-05-23T17:34:28Z)
- LaTable: Towards Large Tabular Models [63.995130144110156]
Tabular generative foundation models are hard to build due to the heterogeneous feature spaces of different datasets.
LaTable is a novel diffusion model that addresses these challenges and can be trained across different datasets.
We find that LaTable outperforms baselines on in-distribution generation, and that fine-tuning LaTable yields better out-of-distribution generation with fewer samples.
arXiv Detail & Related papers (2024-06-25T16:03:50Z)
- Exploring the Robustness of Language Models for Tabular Question Answering via Attention Analysis [7.486549276995143]
Large Language Models (LLMs) have been shown to tackle table comprehension tasks without specific training. We probe how in-context learning (ICL), model scale, instruction tuning, and domain bias affect Tabular QA (TQA). We reveal a strong correlation between perturbation-induced shifts in attention dispersion and drops in performance (one plausible dispersion measure is sketched after this entry).
arXiv Detail & Related papers (2024-06-18T15:41:15Z)
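The summary above does not define its dispersion measure; a common stand-in, assumed here, is the Shannon entropy of each attention distribution.

```python
# A minimal sketch of one plausible "attention dispersion" measure: the
# Shannon entropy of each attention row. The paper's exact metric is not
# given in the summary above, so entropy is an assumed stand-in.
import numpy as np


def attention_entropy(attn: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Entropy per attention row (each row sums to 1 over the tokens).

    High entropy = attention spread thinly over many tokens (dispersed);
    low entropy = attention concentrated on a few tokens (focused).
    """
    attn = np.clip(attn, eps, 1.0)
    return -(attn * np.log(attn)).sum(axis=-1)


# A table perturbation that shifts attention from focused to dispersed:
clean = np.array([[0.90, 0.05, 0.05]])
perturbed = np.array([[0.40, 0.30, 0.30]])
dispersion_shift = attention_entropy(perturbed) - attention_entropy(clean)
```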
- PixT3: Pixel-based Table-To-Text Generation [66.96636025277536]
We present PixT3, a multimodal table-to-text model that overcomes the challenges of linearization and input size limitations.
Experiments on the ToTTo and Logic2Text benchmarks show that PixT3 is competitive with and superior to generators that operate solely on text.
arXiv Detail & Related papers (2023-11-16T11:32:47Z)
- Generative Table Pre-training Empowers Models for Tabular Prediction [71.76829961276032]
We propose TapTap, the first attempt that leverages table pre-training to empower models for tabular prediction.
TapTap can generate high-quality synthetic tables to support various applications, including privacy protection, low resource regime, missing value imputation, and imbalanced classification.
It can be easily combined with various backbone models, including LightGBM, Multilayer Perceptron (MLP), and Transformer (a sketch of this generate-then-train recipe follows this entry).
arXiv Detail & Related papers (2023-05-16T06:37:38Z)
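As a rough illustration of combining a table generator with a LightGBM backbone, the sketch below augments real rows with synthetic ones before fitting. `sample_synthetic_rows` is a hypothetical placeholder for a pre-trained generator such as TapTap, whose actual API is not shown here; the placeholder just bootstrap-resamples real rows and jitters numeric columns.

```python
# A rough sketch of the generate-then-train recipe with a LightGBM backbone.
# `sample_synthetic_rows` is a hypothetical placeholder for a pre-trained
# table generator such as TapTap, whose actual API is not shown here.
import numpy as np
import pandas as pd
from lightgbm import LGBMClassifier


def sample_synthetic_rows(real: pd.DataFrame, n: int, target: str) -> pd.DataFrame:
    """Placeholder generator: resample real rows and jitter numeric features."""
    fake = real.sample(n, replace=True).reset_index(drop=True)
    num_cols = fake.drop(columns=[target]).select_dtypes(include=np.number).columns
    fake[num_cols] += np.random.normal(0.0, 0.01, size=fake[num_cols].shape)
    return fake


def fit_with_augmentation(train: pd.DataFrame, target: str) -> LGBMClassifier:
    """Train on real rows plus an equal number of synthetic ones."""
    synthetic = sample_synthetic_rows(train, n=len(train), target=target)
    augmented = pd.concat([train, synthetic], ignore_index=True)
    model = LGBMClassifier()
    model.fit(augmented.drop(columns=[target]), augmented[target])
    return model
```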
- Stylized Data-to-Text Generation: A Case Study in the E-Commerce Domain [53.22419717434372]
We propose a new task, stylized data-to-text generation, which aims to generate coherent text in a specific style.
This task is non-trivial due to three challenges: planning the logic of the generated text, handling unstructured style references, and coping with biased training samples.
We propose a novel stylized data-to-text generation model, named StyleD2T, comprising three components: logic planning-enhanced data embedding, mask-based style embedding, and unbiased stylized text generation.
arXiv Detail & Related papers (2023-05-05T03:02:41Z)
- Leveraging Data Recasting to Enhance Tabular Reasoning [21.970920861791015]
Prior work has mostly relied on two data generation strategies.
The first is human annotation, which yields linguistically diverse data but is difficult to scale.
The second is synthetic generation, which is scalable and cost-effective but lacks inventiveness.
arXiv Detail & Related papers (2022-11-23T00:04:57Z)
- TabLLM: Few-shot Classification of Tabular Data with Large Language Models [66.03023402174138]
We study the application of large language models to the zero-shot and few-shot classification of tabular data.
We evaluate several serialization methods, including templates, table-to-text models, and large language models (a minimal template serialization is sketched after this entry).
This approach is also competitive with strong traditional baselines like gradient-boosted trees.
arXiv Detail & Related papers (2022-10-19T17:08:13Z)
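The simplest serialization style evaluated in work like TabLLM can be sketched in a few lines: a fixed text template turns one table row into a natural-language description for an LLM prompt. The column names and the question suffix below are hypothetical.

```python
# A minimal sketch of template-based row serialization for an LLM prompt.
# The column names and the question suffix are hypothetical.
def serialize_row(row: dict) -> str:
    """Render 'The <column> is <value>.' for every cell of the row."""
    return " ".join(f"The {col.replace('_', ' ')} is {val}."
                    for col, val in row.items())


row = {"age": 42, "occupation": "teacher", "hours_per_week": 38}
prompt = (serialize_row(row)
          + " Does this person earn more than 50K per year? Answer yes or no.")
# -> "The age is 42. The occupation is teacher. The hours per week is 38. ..."
```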
- TabText: A Flexible and Contextual Approach to Tabular Data Representation [4.116980088382032]
TabText is a processing framework that extracts contextual information from tabular data structures.
We show that TabText improves the average and worst-case AUC performance of standard machine learning models by as much as 6%.
arXiv Detail & Related papers (2022-06-21T13:28:57Z)
- SubTab: Subsetting Features of Tabular Data for Self-Supervised Representation Learning [5.5616364225463055]
We introduce a new framework, Subsetting features of Tabular data (SubTab).
We argue that reconstructing the data from a subset of its features, rather than from its corrupted version, in an autoencoder setting can better capture the underlying representation (a minimal sketch follows this entry).
arXiv Detail & Related papers (2021-10-08T20:11:09Z)
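The subset-reconstruction idea can be sketched as an autoencoder whose encoder sees only a masked subset of the feature columns, while the loss penalizes reconstruction error on the full row. The layer sizes and the random mask below are illustrative; the paper's exact subsetting scheme is not reproduced here.

```python
# A minimal sketch of subset-reconstruction: encode a masked subset of the
# columns, but score reconstruction against the FULL row. Layer sizes and
# the random mask are illustrative, not the paper's exact configuration.
import torch
import torch.nn as nn


class SubsetAutoencoder(nn.Module):
    def __init__(self, n_features: int, n_hidden: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, n_hidden), nn.ReLU())
        self.decoder = nn.Linear(n_hidden, n_features)  # reconstructs ALL features

    def forward(self, x: torch.Tensor, keep: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x * keep))  # zero out dropped columns


x = torch.randn(16, 8)                   # a batch of 16 rows, 8 features
keep = (torch.rand(1, 8) > 0.5).float()  # random feature subset, shared per batch
model = SubsetAutoencoder(n_features=8)
loss = nn.functional.mse_loss(model(x, keep), x)  # target is the full row
loss.backward()
```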
- ToTTo: A Controlled Table-To-Text Generation Dataset [61.83159452483026]
ToTTo is an open-domain English table-to-text dataset with over 120,000 training examples.
We introduce a dataset construction process where annotators directly revise existing candidate sentences from Wikipedia.
While usually fluent, existing methods often hallucinate phrases that are not supported by the table.
arXiv Detail & Related papers (2020-04-29T17:53:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.