MTabVQA: Evaluating Multi-Tabular Reasoning of Language Models in Visual Space
- URL: http://arxiv.org/abs/2506.11684v1
- Date: Fri, 13 Jun 2025 11:21:00 GMT
- Title: MTabVQA: Evaluating Multi-Tabular Reasoning of Language Models in Visual Space
- Authors: Anshul Singh, Chris Biemann, Jan Strich
- Abstract summary: We introduce MTabVQA, a novel benchmark specifically designed for multi-tabular visual question answering. MTabVQA comprises 3,745 complex question-answer pairs that necessitate multi-hop reasoning across several visually rendered table images. We show that fine-tuning VLMs with MTabVQA-Instruct substantially improves their performance on visual multi-tabular reasoning.
- Score: 16.35255926212628
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-Language Models (VLMs) have demonstrated remarkable capabilities in interpreting visual layouts and text. However, a significant challenge remains in their ability to robustly interpret and reason over multi-tabular data presented as images, a common occurrence in real-world scenarios such as web pages and digital documents. Existing benchmarks typically address single tables or non-visual data (text/structured). This leaves a critical gap: they do not assess the ability to parse diverse table images, correlate information across them, and perform multi-hop reasoning on the combined visual data. To bridge that gap, we introduce MTabVQA, a novel benchmark specifically designed for multi-tabular visual question answering. MTabVQA comprises 3,745 complex question-answer pairs that necessitate multi-hop reasoning across several visually rendered table images. We provide extensive benchmark results for state-of-the-art VLMs on MTabVQA, revealing significant performance limitations. We further investigate post-training techniques to enhance these reasoning abilities and release MTabVQA-Instruct, a large-scale instruction-tuning dataset. Our experiments show that fine-tuning VLMs with MTabVQA-Instruct substantially improves their performance on visual multi-tabular reasoning. Code and dataset (https://huggingface.co/datasets/mtabvqa/MTabVQA-Eval) are available online (https://anonymous.4open.science/r/MTabVQA-EMNLP-B16E).
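For readers who want to inspect the benchmark directly, the snippet below is a minimal sketch of iterating over the evaluation set with the Hugging Face `datasets` library. The split name and field names (`question`, `answer`, `table_images`) are assumptions for illustration only and should be checked against the dataset card linked above.

```python
# Minimal sketch: iterate over MTabVQA-Eval QA pairs with Hugging Face `datasets`.
# Split and field names below are assumptions, not confirmed by the abstract;
# see https://huggingface.co/datasets/mtabvqa/MTabVQA-Eval for the actual schema.
from datasets import load_dataset

ds = load_dataset("mtabvqa/MTabVQA-Eval", split="test")  # split name assumed

for example in ds.select(range(3)):
    question = example.get("question")        # natural-language question (assumed field)
    answer = example.get("answer")            # gold answer (assumed field)
    tables = example.get("table_images", [])  # rendered table images (assumed field)
    print(f"Q: {question!r} | gold: {answer!r} | #tables: {len(tables)}")
```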
Related papers
- Visual-TableQA: Open-Domain Benchmark for Reasoning over Table Images [0.42970700836450476]
Visual-TableQA is a large-scale, open-domain dataset designed to evaluate and enhance visual reasoning over complex data. Visual-TableQA comprises 2.5k richly structured, rendered tables and 6k reasoning-intensive QA pairs, all produced at a cost of under USD 100.
arXiv Detail & Related papers (2025-09-09T17:52:26Z)
- Multimodal Tabular Reasoning with Privileged Structured Information [67.40011423365712]
We introduce TabUlar Reasoning with Bridged infOrmation (Turbo). Turbo benefits from a structure-aware reasoning trace generator based on DeepSeek-R1. Turbo achieves state-of-the-art performance (+7.2% vs. previous SOTA) across multiple datasets.
arXiv Detail & Related papers (2025-06-04T15:46:30Z)
- Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences. We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries. We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularities (a toy sketch of such granularity-wise pooling appears after this list).
arXiv Detail & Related papers (2025-02-18T12:00:47Z)
- Knowledge-Aware Reasoning over Multimodal Semi-structured Tables [85.24395216111462]
This study investigates whether current AI models can perform knowledge-aware reasoning on multimodal structured data.
We introduce MMTabQA, a new dataset designed for this purpose.
Our experiments highlight substantial challenges for current AI models in effectively integrating and interpreting multiple text and image inputs.
arXiv Detail & Related papers (2024-08-25T15:17:43Z)
- On Pre-training of Multimodal Language Models Customized for Chart Understanding [83.99377088129282]
This paper explores the training processes necessary to improve MLLMs' comprehension of charts. We introduce CHOPINLLM, an MLLM tailored for in-depth chart comprehension.
arXiv Detail & Related papers (2024-07-19T17:58:36Z)
- Multimodal Table Understanding [26.652797853893233]
Directly understanding tables from intuitive visual information is a crucial and urgent challenge for developing more practical applications.
We propose a new problem, multimodal table understanding, where the model needs to generate correct responses to various table-related requests.
We develop Table-LLaVA, a generalist multimodal large language model (MLLM), which significantly outperforms recent open-source MLLM baselines on 23 benchmarks.
arXiv Detail & Related papers (2024-06-12T11:27:03Z)
- TableVQA-Bench: A Visual Question Answering Benchmark on Multiple Table Domains [4.828743805126944]
This paper establishes a benchmark for table visual question answering, referred to as TableVQA-Bench. Notably, existing datasets have not incorporated images or QA pairs, which are two crucial components of TableVQA.
arXiv Detail & Related papers (2024-04-30T02:05:18Z)
- Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning [40.972648044298374]
Multi-Modal Large Language Models (MLLMs) have demonstrated impressive performance in various VQA tasks.
However, they often lack interpretability and struggle with complex visual inputs.
We introduce the large-scale Visual CoT dataset comprising 438k question-answer pairs.
We propose a multi-turn processing pipeline that dynamically focuses on visual inputs and provides interpretable thoughts.
arXiv Detail & Related papers (2024-03-25T17:59:23Z)
- Towards Complex Document Understanding By Discrete Reasoning [77.91722463958743]
Document Visual Question Answering (VQA) aims to understand visually-rich documents to answer questions in natural language.
We introduce a new Document VQA dataset, named TAT-DQA, which consists of 3,067 document pages and 16,558 question-answer pairs.
We develop a novel model named MHST that takes into account information from multiple modalities, including text, layout, and visual content, to intelligently address different types of questions.
arXiv Detail & Related papers (2022-07-25T01:43:19Z)
- MGA-VQA: Multi-Granularity Alignment for Visual Question Answering [75.55108621064726]
Learning to answer visual questions is a challenging task since the multi-modal inputs lie in two different feature spaces.
We propose a Multi-Granularity Alignment architecture for the Visual Question Answering task (MGA-VQA). Our model splits alignment into different levels to learn better correlations without needing additional data or annotations.
arXiv Detail & Related papers (2022-01-25T22:30:54Z)
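As flagged in the Matryoshka Multimodal Embedder (MME) entry above, the following is a toy PyTorch sketch of granularity-wise visual-token pooling: average-pooling a fixed sequence of visual tokens down to several nested lengths. This is an illustrative assumption about one way such compression could work, not the paper's actual MME architecture.

```python
# Toy sketch of Matryoshka-style visual-token compression (illustrative only;
# not the MME architecture from the paper). Average-pools a sequence of visual
# tokens down to several nested granularities.
import torch
import torch.nn.functional as F

def pool_tokens(tokens: torch.Tensor, lengths=(256, 64, 16)) -> dict[int, torch.Tensor]:
    """tokens: (batch, seq_len, dim) -> {target_len: (batch, target_len, dim)}."""
    # adaptive_avg_pool1d expects (batch, channels, length), so transpose first.
    x = tokens.transpose(1, 2)  # (batch, dim, seq_len)
    return {
        n: F.adaptive_avg_pool1d(x, n).transpose(1, 2)  # back to (batch, n, dim)
        for n in lengths
    }

visual_tokens = torch.randn(2, 576, 1024)  # e.g., 576 patch tokens of width 1024
compressed = pool_tokens(visual_tokens)
for n, t in compressed.items():
    print(n, tuple(t.shape))  # coarser granularities keep fewer tokens
```

A retriever built this way could match queries against the coarsest token set first and fall back to finer granularities only when needed, which is the usual motivation for nested (Matryoshka-style) representations.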