WikiMixQA: A Multimodal Benchmark for Question Answering over Tables and Charts
- URL: http://arxiv.org/abs/2506.15594v1
- Date: Wed, 18 Jun 2025 16:09:18 GMT
- Title: WikiMixQA: A Multimodal Benchmark for Question Answering over Tables and Charts
- Authors: Negar Foroutan, Angelika Romanou, Matin Ansaripour, Julian Martin Eisenschlos, Karl Aberer, Rémi Lebret
- Abstract summary: This paper introduces WikiMixQA, a benchmark for evaluating cross-modal reasoning over tables and charts extracted from 4,000 Wikipedia pages. We evaluate 12 state-of-the-art vision-language models, revealing that while proprietary models achieve ~70% accuracy when provided with direct context, their performance deteriorates significantly when retrieval from long documents is required.
- Score: 14.966795545558474
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Documents are fundamental to preserving and disseminating information, often incorporating complex layouts, tables, and charts that pose significant challenges for automatic document understanding (DU). While vision-language large models (VLLMs) have demonstrated improvements across various tasks, their effectiveness in processing long-context vision inputs remains unclear. This paper introduces WikiMixQA, a benchmark comprising 1,000 multiple-choice questions (MCQs) designed to evaluate cross-modal reasoning over tables and charts extracted from 4,000 Wikipedia pages spanning seven distinct topics. Unlike existing benchmarks, WikiMixQA emphasizes complex reasoning by requiring models to synthesize information from multiple modalities. We evaluate 12 state-of-the-art vision-language models, revealing that while proprietary models achieve ~70% accuracy when provided with direct context, their performance deteriorates significantly when retrieval from long documents is required. Among these, GPT-4-o is the only model exceeding 50% accuracy in this setting, whereas open-source models perform considerably worse, with a maximum accuracy of 27%. These findings underscore the challenges of long-context, multi-modal reasoning and establish WikiMixQA as a crucial benchmark for advancing document understanding research.
Related papers
- When Tables Go Crazy: Evaluating Multimodal Models on French Financial Documents [3.4992819560032267]
Vision-language models (VLMs) perform well on many document understanding tasks, yet their reliability in specialized, non-English domains remains underexplored. We introduce Multimodal Finance Eval, the first multimodal benchmark for evaluating French financial document understanding. The dataset contains 1,204 expert-validated questions spanning text extraction, table comprehension, chart interpretation, and multi-turn conversational reasoning.
arXiv Detail & Related papers (2026-02-11T00:04:56Z)
- UEval: A Benchmark for Unified Multimodal Generation [27.555018737280772]
We introduce UEval, a benchmark to evaluate unified models capable of generating both images and text. UEval comprises 1,000 expert-curated questions that require both images and text in the model output. Our curated questions cover a wide range of reasoning types, from step-by-step guides to textbook explanations.
arXiv Detail & Related papers (2026-01-29T18:59:52Z)
- VeriSciQA: An Auto-Verified Dataset for Scientific Visual Question Answering [53.662676566188175]
A key bottleneck lies in the lack of public, large-scale, high-quality Scientific Visual Question Answering (SVQA) datasets. We propose a verification-centric Generate-then-Verify framework that first generates QA pairs with figure-associated textual context. We instantiate this framework to curate VeriSciQA, a dataset of 20,351 QA pairs spanning 20 scientific domains and 12 figure types.
arXiv Detail & Related papers (2025-11-25T04:14:52Z)
- Benchmarking Retrieval-Augmented Multimodal Generation for Document Question Answering [60.062194349648195]
Document Visual Question Answering (DocVQA) faces dual challenges in processing lengthy multimodal documents. Current document retrieval-augmented generation (DocRAG) methods remain limited by their text-centric approaches. We introduce MMDocRAG, a comprehensive benchmark featuring 4,055 expert-annotated QA pairs with multi-page, cross-modal evidence chains.
arXiv Detail & Related papers (2025-05-22T09:52:57Z)
- ChartMuseum: Testing Visual Reasoning Capabilities of Large Vision-Language Models [37.54872845368151]
We conduct a case study using a synthetic dataset solvable only through visual reasoning. We then introduce ChartMuseum, a new Chart Question Answering (QA) benchmark containing 1,162 expert-annotated questions. Although humans achieve 93% accuracy, the best-performing model Gemini-2.5-Pro attains only 63.0%, and the leading open-source LVLM Qwen2.5-VL-72B-Instruct achieves only 38.5%.
arXiv Detail & Related papers (2025-05-19T17:59:27Z)
- M-DocSum: Do LVLMs Genuinely Comprehend Interleaved Image-Text in Document Summarization? [49.53982792497275]
We investigate whether Large Vision-Language Models (LVLMs) genuinely comprehend interleaved image-text in the document. Existing document understanding benchmarks often assess LVLMs using question-answer formats. We introduce a novel and challenging Multimodal Document Summarization Benchmark (M-DocSum-Bench). M-DocSum-Bench comprises 500 high-quality arXiv papers, along with interleaved multimodal summaries aligned with human preferences.
arXiv Detail & Related papers (2025-03-27T07:28:32Z)
- MMCR: Benchmarking Cross-Source Reasoning in Scientific Papers [10.311462547308823]
This work presents MMCR, a benchmark designed to evaluate Vision-Language Models' capacity for reasoning with cross-source information from scientific papers. Experiments with 18 VLMs demonstrate that cross-source reasoning presents a substantial challenge for existing models.
arXiv Detail & Related papers (2025-03-21T05:02:20Z)
- REAL-MM-RAG: A Real-World Multi-Modal Retrieval Benchmark [16.55516587540082]
We introduce REAL-MM-RAG, an automatically generated benchmark designed to address four key properties essential for real-world retrieval. We propose a multi-difficulty-level scheme based on query rephrasing to evaluate models' semantic understanding beyond keyword matching. Our benchmark reveals significant model weaknesses, particularly in handling table-heavy documents and robustness to query rephrasing.
arXiv Detail & Related papers (2025-02-17T22:10:47Z)
- VisDoM: Multi-Document QA with Visually Rich Elements Using Multimodal Retrieval-Augmented Generation [100.06122876025063]
This paper introduces VisDoMBench, the first comprehensive benchmark designed to evaluate QA systems in multi-document settings. We propose VisDoMRAG, a novel multimodal Retrieval Augmented Generation (RAG) approach that simultaneously utilizes visual and textual RAG.
arXiv Detail & Related papers (2024-12-14T06:24:55Z)
- LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content [62.816876067499415]
We propose LiveXiv: a scalable evolving live benchmark based on scientific ArXiv papers. LiveXiv accesses domain-specific manuscripts at any given timestamp and proposes to automatically generate visual question-answer pairs. We benchmark multiple open and proprietary Large Multi-modal Models (LMMs) on the first version of our benchmark, showing its challenging nature and exposing the models' true abilities.
arXiv Detail & Related papers (2024-10-14T17:51:23Z)
- MMLongBench-Doc: Benchmarking Long-context Document Understanding with Visualizations [105.10376440302076]
This work presents MMLongBench-Doc, a long-context, multi-modal benchmark comprising 1,062 expert-annotated questions.
It is constructed upon 130 lengthy PDF-formatted documents with an average of 49.4 pages and 20,971 textual tokens.
Experiments on 14 LVLMs demonstrate that long-context DU greatly challenges current models.
arXiv Detail & Related papers (2024-07-01T17:59:26Z)
- CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs [62.84082370758761]
CharXiv is a comprehensive evaluation suite involving 2,323 charts from arXiv papers.
To ensure quality, all charts and questions are handpicked, curated, and verified by human experts.
Results reveal a substantial, previously underestimated gap between the reasoning skills of the strongest proprietary model and those of the strongest open-source model.
arXiv Detail & Related papers (2024-06-26T17:50:11Z)
- MT-Eval: A Multi-Turn Capabilities Evaluation Benchmark for Large Language Models [70.92847554971065]
We introduce MT-Eval, a comprehensive benchmark designed to evaluate multi-turn conversational abilities.
By analyzing human-LLM conversations, we categorize interaction patterns into four types: recollection, expansion, refinement, and follow-up.
Our evaluation of 11 well-known LLMs shows that while closed-source models generally surpass open-source ones, certain open-source models exceed GPT-3.5-Turbo in specific tasks.
arXiv Detail & Related papers (2024-01-30T04:50:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.