Related papers: TIME: TabPFN-Integrated Multimodal Engine for Robust Tabular-Image Learning

TIME: TabPFN-Integrated Multimodal Engine for Robust Tabular-Image Learning

URL: http://arxiv.org/abs/2506.00813v1
Date: Sun, 01 Jun 2025 03:29:30 GMT
Title: TIME: TabPFN-Integrated Multimodal Engine for Robust Tabular-Image Learning
Authors: Jiaqi Luo, Yuan Yuan, Shixin Xu,
Abstract summary: Tabular-image multimodal learning holds great promise for a variety of tasks, especially in medical applications.<n>We propose the TabPFN-Integrated Multimodal Engine (TIME), a novel multimodal framework that builds on the recently introduced TabPFN.<n>TIME generates robust, strong embeddings that are naturally resilient to missing data, and combines them with image features from pretrained vision backbones.
Score: 3.559225731091162
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Tabular-image multimodal learning, which integrates structured tabular data with imaging data, holds great promise for a variety of tasks, especially in medical applications. Yet, two key challenges remain: (1) the lack of a standardized, pretrained representation for tabular data, as is commonly available in vision and language domains; and (2) the difficulty of handling missing values in the tabular modality, which are common in real-world medical datasets. To address these issues, we propose the TabPFN-Integrated Multimodal Engine (TIME), a novel multimodal framework that builds on the recently introduced tabular foundation model, TabPFN. TIME leverages TabPFN as a frozen tabular encoder to generate robust, strong embeddings that are naturally resilient to missing data, and combines them with image features from pretrained vision backbones. We explore a range of fusion strategies and tabular encoders, and evaluate our approach on both natural and medical datasets. Extensive experiments demonstrate that TIME consistently outperforms competitive baselines across both complete and incomplete tabular inputs, underscoring its practical value in real-world multimodal learning scenarios.

Related papers

Multimodal Tabular Reasoning with Privileged Structured Information [67.40011423365712]
We introduce TabUlar Reasoning with Bridged infOrmation (sc Turbo)<n>sc Turbo benefits from a structure-aware reasoning trace generator based on DeepSeek-R1.<n>sc Turbo achieves state-of-the-art performance ($+7.2%$ vs. previous SOTA) across multiple datasets.
arXiv Detail & Related papers (2025-06-04T15:46:30Z)
TabNSA: Native Sparse Attention for Efficient Tabular Data Learning [13.110156202816112]
TabNSA is a novel deep learning framework that integrates Native Sparse Attention (NSA) with a TabMixer backbone.<n>NSA employs a hierarchical sparse attention mechanism, including token compression, selective preservation, and localized sliding windows.<n>Experiments show that TabNSA consistently outperforms state-of-the-art deep learning models.
arXiv Detail & Related papers (2025-03-12T21:13:41Z)
A Closer Look at TabPFN v2: Strength, Limitation, and Extension [51.08999772842298]
Tabular Prior-data Fitted Network v2 (TabPFN v2) achieves unprecedented in-context learning accuracy across multiple datasets.<n>In this paper, we evaluate TabPFN v2 on over 300 datasets, confirming its exceptional generalization capabilities on small- to medium-scale tasks.
arXiv Detail & Related papers (2025-02-24T17:38:42Z)
Knowledge-Aware Reasoning over Multimodal Semi-structured Tables [85.24395216111462]
This study investigates whether current AI models can perform knowledge-aware reasoning on multimodal structured data. We introduce MMTabQA, a new dataset designed for this purpose. Our experiments highlight substantial challenges for current AI models in effectively integrating and interpreting multiple text and image inputs.
arXiv Detail & Related papers (2024-08-25T15:17:43Z)
TIP: Tabular-Image Pre-training for Multimodal Classification with Incomplete Data [6.414759311130015]
We propose TIP, a novel framework for learning multimodal representations robust to incomplete data. Specifically, TIP investigates a self-supervised learning (SSL) strategy, including a masked reconstruction task for tackling data missingness. TIP outperforms state-of-the-art supervised/SSL image/multimodal algorithms in both complete and incomplete data scenarios.
arXiv Detail & Related papers (2024-07-10T12:16:15Z)
NativE: Multi-modal Knowledge Graph Completion in the Wild [51.80447197290866]
We propose a comprehensive framework NativE to achieve MMKGC in the wild. NativE proposes a relation-guided dual adaptive fusion module that enables adaptive fusion for any modalities. We construct a new benchmark called WildKGC with five datasets to evaluate our method.
arXiv Detail & Related papers (2024-03-28T03:04:00Z)
MMSum: A Dataset for Multimodal Summarization and Thumbnail Generation of Videos [106.06278332186106]
Multimodal summarization with multimodal output (MSMO) has emerged as a promising research direction. Numerous limitations exist within existing public MSMO datasets. We have meticulously curated the textbfMMSum dataset.
arXiv Detail & Related papers (2023-06-07T07:43:11Z)
Learning Representations without Compositional Assumptions [79.12273403390311]
We propose a data-driven approach that learns feature set dependencies by representing feature sets as graph nodes and their relationships as learnable edges. We also introduce LEGATO, a novel hierarchical graph autoencoder that learns a smaller, latent graph to aggregate information from multiple views dynamically.
arXiv Detail & Related papers (2023-05-31T10:36:10Z)
Best of Both Worlds: Multimodal Contrastive Learning with Tabular and Imaging Data [7.49320945341034]
We propose the first self-supervised contrastive learning framework to train unimodal encoders. Our solution combines SimCLR and SCARF, two leading contrastive learning strategies. We show the generalizability of our approach to natural images using the DVM car advertisement dataset.
arXiv Detail & Related papers (2023-03-24T15:44:42Z)
Learning Multimodal Data Augmentation in Feature Space [65.54623807628536]
LeMDA is an easy-to-use method that automatically learns to jointly augment multimodal data in feature space. We show that LeMDA can profoundly improve the performance of multimodal deep learning architectures.
arXiv Detail & Related papers (2022-12-29T20:39:36Z)
SubTab: Subsetting Features of Tabular Data for Self-Supervised Representation Learning [5.5616364225463055]
We introduce a new framework, Subsetting features of Tabular data (SubTab) In this paper, we introduce a new framework, Subsetting features of Tabular data (SubTab) We argue that reconstructing the data from the subset of its features rather than its corrupted version in an autoencoder setting can better capture its underlying representation.
arXiv Detail & Related papers (2021-10-08T20:11:09Z)

This list is automatically generated from the titles and abstracts of the papers in this site.