Visual-TableQA: Open-Domain Benchmark for Reasoning over Table Images
- URL: http://arxiv.org/abs/2509.07966v1
- Date: Tue, 09 Sep 2025 17:52:26 GMT
- Title: Visual-TableQA: Open-Domain Benchmark for Reasoning over Table Images
- Authors: Boammani Aser Lompo, Marc Haraoui
- Abstract summary: Visual-TableQA is a large-scale, open-domain dataset designed to evaluate and enhance visual reasoning over complex tabular data. Visual-TableQA comprises 2.5k richly structured LaTeX-rendered tables and 6k reasoning-intensive QA pairs, all produced at a cost of under USD 100.
- Score: 0.42970700836450476
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visual reasoning over structured data such as tables is a critical capability for modern vision-language models (VLMs), yet current benchmarks remain limited in scale, diversity, or reasoning depth, especially when it comes to rendered table images. Addressing this gap, we introduce Visual-TableQA, a large-scale, open-domain multimodal dataset specifically designed to evaluate and enhance visual reasoning over complex tabular data. Our generation pipeline is modular, scalable, and fully autonomous, involving multiple reasoning LLMs collaborating across distinct roles: generation, validation, and inspiration. Visual-TableQA comprises 2.5k richly structured LaTeX-rendered tables and 6k reasoning-intensive QA pairs, all produced at a cost of under USD 100. To promote diversity and creativity, our pipeline performs multi-model collaborative data generation via cross-model prompting ('inspiration') and LLM-jury filtering. Stronger models seed layouts and topics that weaker models elaborate, collectively distilling diverse reasoning patterns and visual structures into the dataset. Empirical results show that models fine-tuned on Visual-TableQA generalize robustly to external benchmarks, outperforming several proprietary models despite the dataset's synthetic nature. The full pipeline and resources are publicly available at https://github.com/AI-4-Everyone/Visual-TableQA.
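The abstract outlines a three-role generation loop: stronger models seed layouts and topics ("inspiration"), weaker models elaborate them ("generation"), and an LLM jury filters the output ("validation"). The sketch below illustrates that flow under stated assumptions; `call_llm`, the model names, the prompts, and the voting rule are hypothetical placeholders, not the released pipeline's API.

```python
# Minimal sketch of the multi-role pipeline described in the abstract:
# inspiration -> generation -> LLM-jury validation.
# `call_llm` is a hypothetical stand-in for a real LLM API call.
from dataclasses import dataclass


def call_llm(model: str, prompt: str) -> str:
    """Placeholder for a real LLM API call."""
    return f"[{model}] response to: {prompt[:50]}..."


@dataclass
class Example:
    table_latex: str
    qa_pairs: str


def generate_example(strong: str, weak: str) -> Example:
    # Inspiration: a stronger model proposes a table topic and layout.
    seed = call_llm(strong, "Propose a topic and LaTeX layout for a rich table.")
    # Generation: a weaker model elaborates the seed into a table and QA pairs.
    table = call_llm(weak, f"Write the full LaTeX table for this seed:\n{seed}")
    qa = call_llm(weak, f"Write reasoning-intensive QA pairs for:\n{table}")
    return Example(table_latex=table, qa_pairs=qa)


def jury_keeps(example: Example, jurors: list[str], quorum: float = 0.5) -> bool:
    # Validation: each juror votes; keep the example only if enough approve.
    votes = [
        call_llm(j, "Answer yes/no: are these QA pairs correct and answerable "
                    f"from the table alone?\n{example.qa_pairs}")
        for j in jurors
    ]
    approvals = sum("yes" in v.lower() for v in votes)
    return approvals >= quorum * len(jurors)
```

With real endpoints behind `call_llm`, looping `generate_example` and keeping only jury-approved examples reproduces the general shape of the pipeline; the exact prompts and filtering criteria are in the linked repository.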
Related papers
- ShowTable: Unlocking Creative Table Visualization with Collaborative Reflection and Refinement [58.957050610762565]
ShowTable is a pipeline that synergizes MLLMs with diffusion models via a progressive self-correcting process. The MLLM acts as the central orchestrator, reasoning over the visual plan and judging visual errors. We introduce TableVisBench, a new benchmark with 800 challenging instances across 5 evaluation dimensions.
arXiv Detail & Related papers (2025-12-15T13:21:50Z)
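A minimal sketch of the reflect-and-refine loop the ShowTable summary describes, assuming an MLLM that plans and critiques and a diffusion model that renders; `mllm` and `diffusion_render` are hypothetical placeholders, not the paper's interfaces.

```python
# Reflect-and-refine loop: plan, render, critique, revise until error-free.
def mllm(prompt: str, image: bytes | None = None) -> str:
    """Placeholder for a multimodal LLM call."""
    return "OK"


def diffusion_render(plan: str) -> bytes:
    """Placeholder for a text-to-image diffusion call."""
    return b"<image bytes>"


def render_table(request: str, max_rounds: int = 3) -> bytes:
    plan = mllm(f"Draft a visual plan for this table: {request}")
    image = diffusion_render(plan)
    for _ in range(max_rounds):
        critique = mllm(f"List visual errors in this rendering of: {plan}", image)
        if critique.strip() == "OK":  # the orchestrator judges it error-free
            break
        plan = mllm(f"Revise the plan to fix these errors:\n{critique}")
        image = diffusion_render(plan)
    return image
```

- TABLET: A Large-Scale Dataset for Robust Visual Table Understanding [46.96642907587549]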
Existing visual table understanding (VTU) datasets offer fixed examples with single visualizations and pre-defined instructions. We introduce TABLET, a large-scale VTU dataset with 4 million examples across 20 tasks, grounded in 2 million unique tables where 88% preserve original visualizations.
arXiv Detail & Related papers (2025-09-25T14:14:27Z)
- TableDART: Dynamic Adaptive Multi-Modal Routing for Table Understanding [52.59372043981724]
TableDART is a training-efficient framework that integrates multimodal views by reusing pretrained single-modality models. In addition, we propose a novel agent for cross-modal knowledge integration that analyzes outputs from text- and image-based models.
arXiv Detail & Related papers (2025-09-18T07:00:13Z)
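A toy illustration of dynamic modality routing in TableDART's spirit: a cheap policy picks the text view, the image view, or both, with an agent reconciling the two outputs. The routing heuristic and every callable here are assumptions, not the paper's components.

```python
# Route each table to a text model, an image model, or both.
from typing import Callable


def route(table_text: str) -> str:
    # Toy policy (an assumption): plain tables go to the text model, heavily
    # merged layouts to the image model, borderline cases to both.
    merges = table_text.count("\\multirow") + table_text.count("\\multicolumn")
    if merges == 0:
        return "text"
    if merges > 3:
        return "image"
    return "both"


def answer(question: str, table_text: str, table_image: bytes,
           text_model: Callable, image_model: Callable,
           agent: Callable) -> str:
    choice = route(table_text)
    if choice == "text":
        return text_model(question, table_text)
    if choice == "image":
        return image_model(question, table_image)
    # 'both': the agent integrates the two single-modality answers.
    return agent(text_model(question, table_text),
                 image_model(question, table_image))
```

- LimiX: Unleashing Structured-Data Modeling Capability for Generalist Intelligence [61.46575527504109]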
LimiX treats structured data as a joint distribution over variables and missingness. We evaluate LimiX across 10 large structured-data benchmarks with broad regimes of sample size, feature dimensionality, class number, categorical-to-numerical feature ratio, missingness, and sample-to-feature ratios.
arXiv Detail & Related papers (2025-09-03T17:39:08Z)
- RealHiTBench: A Comprehensive Realistic Hierarchical Table Benchmark for Evaluating LLM-Based Table Analysis [16.572608600078922]
RealHiTBench is a benchmark designed to evaluate the performance of Large Language Models (LLMs) across a variety of input formats. Our experimental results, using 25 state-of-the-art LLMs, demonstrate that RealHiTBench is indeed a challenging benchmark. We also develop TreeThinker, a tree-based pipeline that organizes hierarchical headers into a tree structure.
arXiv Detail & Related papers (2025-06-16T12:19:08Z)
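The tree idea behind TreeThinker is concrete enough to sketch: hierarchical headers become parent-child paths that can be linearized for prompting. The node layout below is an assumption, not the paper's released code.

```python
# Organize hierarchical table headers into a tree, then flatten to paths.
from dataclasses import dataclass, field


@dataclass
class HeaderNode:
    name: str
    children: list["HeaderNode"] = field(default_factory=list)


def insert_path(root: HeaderNode, path: list[str]) -> None:
    """Insert one header path, e.g. ['Revenue', '2024', 'Q1']."""
    node = root
    for name in path:
        match = next((c for c in node.children if c.name == name), None)
        if match is None:
            match = HeaderNode(name)
            node.children.append(match)
        node = match


def linearize(node: HeaderNode, prefix: str = "") -> list[str]:
    """Flatten the tree back into 'A > B > C' paths for prompting."""
    label = f"{prefix} > {node.name}" if prefix else node.name
    if not node.children:
        return [label]
    paths = []
    for child in node.children:
        paths.extend(linearize(child, label))
    return paths


root = HeaderNode("table")
insert_path(root, ["Revenue", "2024", "Q1"])
insert_path(root, ["Revenue", "2024", "Q2"])
print(linearize(root))  # ['table > Revenue > 2024 > Q1', 'table > Revenue > 2024 > Q2']
```

- Multimodal Tabular Reasoning with Privileged Structured Information [67.40011423365712]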
We introduce TabUlar Reasoning with Bridged infOrmation (Turbo). Turbo benefits from a structure-aware reasoning trace generator based on DeepSeek-R1. Turbo achieves state-of-the-art performance (+7.2% vs. the previous SOTA) across multiple datasets.
arXiv Detail & Related papers (2025-06-04T15:46:30Z)
- MMTBENCH: A Unified Benchmark for Complex Multimodal Table Reasoning [40.95790862132066]
MMTBENCH is a benchmark consisting of 500 real-world multimodal tables drawn from diverse sources. MMTBENCH questions cover four question types (Explicit, Implicit, Answer Mention, and Visual Based), five reasoning types (Mathematical, Extrema Identification, Fact Verification, Vision Based, and Others), and eight table types.
arXiv Detail & Related papers (2025-05-27T21:09:11Z)
- GTR: Graph-Table-RAG for Cross-Table Question Answering [53.11230952572134]
We propose the first Graph-Table-RAG framework, namely GTR, which reorganizes table corpora into a heterogeneous graph. GTR exhibits superior cross-table question-answering performance while maintaining high deployment efficiency, demonstrating its real-world practical applicability.
arXiv Detail & Related papers (2025-04-02T04:24:41Z)
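One plausible reading of GTR's graph reorganization, sketched with networkx: tables and columns become typed nodes, and shared columns create two-hop paths between tables that a retriever can traverse. The exact schema is an assumption, not the paper's graph definition.

```python
# Reorganize a table corpus into a heterogeneous table/column graph.
import networkx as nx


def build_table_graph(corpus: dict[str, list[str]]) -> nx.Graph:
    """corpus maps table_id -> list of column names."""
    g = nx.Graph()
    for table_id, columns in corpus.items():
        g.add_node(table_id, node_type="table")
        for col in columns:
            col_node = f"col::{col.lower()}"
            g.add_node(col_node, node_type="column")
            g.add_edge(table_id, col_node, edge_type="has_column")
    return g


corpus = {
    "t1": ["country", "population"],
    "t2": ["country", "gdp"],
}
g = build_table_graph(corpus)
# t1 and t2 are two hops apart via the shared 'country' column, so a
# retriever can hop between them for cross-table questions.
print(nx.shortest_path(g, "t1", "t2"))  # ['t1', 'col::country', 't2']
```

- TabGLM: Tabular Graph Language Model for Learning Transferable Representations Through Multi-Modal Consistency Minimization [2.1067477213933503]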
TabGLM (Tabular Graph Language Model) is a novel multi-modal architecture designed to model both structural and semantic information from a table. It transforms each row of a table into a fully connected graph and serialized text, which are encoded using a graph neural network (GNN) and a text encoder, respectively. Evaluations across 25 benchmark datasets demonstrate substantial performance gains.
arXiv Detail & Related papers (2025-02-26T05:32:45Z)
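TabGLM's dual-view construction is easy to make concrete. A minimal sketch, assuming cells as graph nodes and a 'column is value' serialization; the real model encodes the two views with a GNN and a text encoder and trains them to agree.

```python
# Build TabGLM-style dual views of a table row: graph + serialized text.
from itertools import combinations


def row_to_graph(header: list[str], row: list[str]):
    """Nodes are (column, value) cells; every pair of cells is connected."""
    nodes = list(zip(header, row))
    edges = list(combinations(range(len(nodes)), 2))  # fully connected
    return nodes, edges


def row_to_text(header: list[str], row: list[str]) -> str:
    """Serialize the row for the text encoder."""
    return "; ".join(f"{h} is {v}" for h, v in zip(header, row))


header = ["name", "age", "city"]
row = ["Ada", "36", "London"]
nodes, edges = row_to_graph(header, row)  # 3 nodes, 3 undirected edges
text = row_to_text(header, row)           # 'name is Ada; age is 36; city is London'
# A GNN embeds (nodes, edges), a text encoder embeds `text`, and training
# minimizes the disagreement between the two row embeddings.
```

- Knowledge-Aware Reasoning over Multimodal Semi-structured Tables [85.24395216111462]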
This study investigates whether current AI models can perform knowledge-aware reasoning on multimodal structured data.
We introduce MMTabQA, a new dataset designed for this purpose.
Our experiments highlight substantial challenges for current AI models in effectively integrating and interpreting multiple text and image inputs.
arXiv Detail & Related papers (2024-08-25T15:17:43Z)
- TabPedia: Towards Comprehensive Visual Table Understanding with Concept Synergy [81.76462101465354]
We present a novel large vision-language model, TabPedia, equipped with a concept synergy mechanism.
This unified framework allows TabPedia to seamlessly integrate VTU tasks, such as table detection, table structure recognition, table querying, and table question answering.
To better evaluate the VTU task in real-world scenarios, we establish a new and comprehensive table VQA benchmark, ComTQA.
arXiv Detail & Related papers (2024-06-03T13:54:05Z)