Can Multimodal Foundation Models Understand Schematic Diagrams? An Empirical Study on Information-Seeking QA over Scientific Papers
- URL: http://arxiv.org/abs/2507.10787v1
- Date: Mon, 14 Jul 2025 20:35:25 GMT
- Title: Can Multimodal Foundation Models Understand Schematic Diagrams? An Empirical Study on Information-Seeking QA over Scientific Papers
- Authors: Yilun Zhao, Chengye Wang, Chuhan Li, Arman Cohan
- Abstract summary: This paper introduces MISS-QA, the first benchmark designed to evaluate the ability of models to interpret schematic diagrams within scientific literature. MISS-QA comprises 1,500 expert-annotated examples over 465 scientific papers. We assess the performance of 18 frontier multimodal foundation models, including o4-mini, Gemini-2.5-Flash, and Qwen2.5-VL.
- Score: 22.83126850650448
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper introduces MISS-QA, the first benchmark specifically designed to evaluate the ability of models to interpret schematic diagrams within scientific literature. MISS-QA comprises 1,500 expert-annotated examples over 465 scientific papers. In this benchmark, models are tasked with interpreting schematic diagrams that illustrate research overviews and answering corresponding information-seeking questions based on the broader context of the paper. We assess the performance of 18 frontier multimodal foundation models, including o4-mini, Gemini-2.5-Flash, and Qwen2.5-VL. We reveal a significant performance gap between these models and human experts on MISS-QA. Our analysis of model performance on unanswerable questions and our detailed error analysis further highlight the strengths and limitations of current models, offering key insights to enhance models in comprehending multimodal scientific literature.
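As a rough illustration of the evaluation protocol described above, the sketch below scores answerable and unanswerable questions separately. The JSONL layout (fields such as diagram_path, context, question, answerable, answer), the ask_model wrapper, the exact-match scoring, and the "unanswerable" abstention label are all assumptions made for illustration, not details taken from the paper.
```python
import json

ABSTAIN = "unanswerable"  # hypothetical abstention label; the paper's exact protocol may differ


def ask_model(diagram_path: str, context: str, question: str) -> str:
    """Placeholder for a multimodal model call.

    A real implementation would send the schematic diagram image, the
    surrounding paper context, and the question, instructing the model
    to answer ABSTAIN when the paper does not contain the answer.
    """
    raise NotImplementedError


def evaluate(path: str) -> None:
    """Score answerable and unanswerable examples separately."""
    correct = {"answerable": 0, "unanswerable": 0}
    total = {"answerable": 0, "unanswerable": 0}
    with open(path, encoding="utf-8") as f:
        for line in f:
            # assumed fields: diagram_path, context, question, answerable, answer
            ex = json.loads(line)
            kind = "answerable" if ex["answerable"] else "unanswerable"
            total[kind] += 1
            pred = ask_model(ex["diagram_path"], ex["context"], ex["question"]).strip()
            if ex["answerable"]:
                # exact match is a stand-in metric, not the paper's scorer
                correct[kind] += int(pred == ex["answer"].strip())
            else:
                correct[kind] += int(pred.lower() == ABSTAIN)
    for kind, n in total.items():
        if n:
            print(f"{kind}: {correct[kind] / n:.1%} over {n} examples")
```
Reporting the two splits separately mirrors the paper's point that unanswerable questions probe a distinct failure mode (over-eager answering) from ordinary QA accuracy.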
Related papers
- SciVer: Evaluating Foundation Models for Multimodal Scientific Claim Verification [29.63899315962693]
SciVer consists of 3,000 expert-annotated examples over 1,113 scientific papers, covering four subsets, each representing a common reasoning type in multimodal scientific claim verification.
We assess the performance of 21 state-of-the-art multimodal foundation models, including o4-mini, Gemini-2.5-Flash, Llama-3.2-Vision, and Qwen2.5-VL.
Our experiments reveal a substantial performance gap between these models and human experts on SciVer.
arXiv Detail & Related papers (2025-06-18T15:43:26Z)
- ChartMuseum: Testing Visual Reasoning Capabilities of Large Vision-Language Models [37.54872845368151]
We conduct a case study using a synthetic dataset solvable only through visual reasoning.
We then introduce ChartMuseum, a new Chart Question Answering (QA) benchmark containing 1,162 expert-annotated questions.
Although humans achieve 93% accuracy, the best-performing model, Gemini-2.5-Pro, attains only 63.0%, and the leading open-source LVLM, Qwen2.5-VL-72B-Instruct, achieves only 38.5%.
arXiv Detail & Related papers (2025-05-19T17:59:27Z)
- VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models [121.03333569013148]
We introduce VisuLogic: a benchmark of 1,000 human-verified problems across six categories.
These question types assess the visual reasoning capabilities of MLLMs from multiple perspectives.
Most models score below 30% accuracy, only slightly above the 25% random baseline and far below the 51.4% achieved by humans.
arXiv Detail & Related papers (2025-04-21T17:59:53Z)
- MMCR: Benchmarking Cross-Source Reasoning in Scientific Papers [10.311462547308823]
This work presents MMCR, a benchmark designed to evaluate Vision-Language Models' capacity for reasoning with cross-source information from scientific papers.
Experiments with 18 VLMs demonstrate that cross-source reasoning presents a substantial challenge for existing models.
arXiv Detail & Related papers (2025-03-21T05:02:20Z)
- A Survey on Mechanistic Interpretability for Multi-Modal Foundation Models [74.48084001058672]
The rise of foundation models has transformed machine learning research.
Multimodal foundation models (MMFMs) pose unique interpretability challenges beyond unimodal frameworks.
This survey explores two key aspects: (1) the adaptation of LLM interpretability methods to multimodal models and (2) understanding the mechanistic differences between unimodal language models and crossmodal systems.
arXiv Detail & Related papers (2025-02-22T20:55:26Z)
- M3SciQA: A Multi-Modal Multi-Document Scientific QA Benchmark for Evaluating Foundation Models [27.910693214922052]
We introduce M3SciQA, a multi-modal, multi-document scientific question answering benchmark for evaluating foundation models.
M3SciQA consists of 1,452 expert-annotated questions spanning 70 natural language processing paper clusters.
Our results indicate that current foundation models still significantly underperform compared to human experts in multi-modal information retrieval and in reasoning across multiple scientific documents.
arXiv Detail & Related papers (2024-11-06T17:52:01Z)
- Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark [53.61633384281524]
PolyMATH is a benchmark aimed at evaluating the general cognitive reasoning abilities of MLLMs.
The best scores achieved on PolyMATH are 41%, 36%, and 27%, obtained by Claude-3.5 Sonnet, GPT-4o, and Gemini-1.5 Pro, respectively.
A further fine-grained error analysis reveals that these models struggle to understand spatial relations and perform drawn-out, high-level reasoning.
arXiv Detail & Related papers (2024-10-06T20:35:41Z)
- A Survey on Multimodal Benchmarks: In the Era of Large AI Models [13.299775710527962]
Multimodal Large Language Models (MLLMs) have brought substantial advancements in artificial intelligence.
This survey systematically reviews 211 benchmarks that assess MLLMs across four core domains: understanding, reasoning, generation, and application.
arXiv Detail & Related papers (2024-09-21T15:22:26Z)
- MMSci: A Dataset for Graduate-Level Multi-Discipline Multimodal Scientific Understanding [59.41495657570397]
We present a comprehensive dataset compiled from Nature Communications articles covering 72 scientific fields.
We evaluated 19 proprietary and open-source models on two benchmark tasks, figure captioning and multiple-choice question answering, and conducted human expert annotation.
Fine-tuning Qwen2-VL-7B with our task-specific data achieved better performance than GPT-4o and even human experts in multiple-choice evaluations.
arXiv Detail & Related papers (2024-07-06T00:40:53Z)
- Learning from models beyond fine-tuning [78.20895343699658]
Learn From Model (LFM) focuses on the research, modification, and design of foundation models (FMs) based on the model interface.
The study of LFM techniques can be broadly categorized into five major areas: model tuning, model distillation, model reuse, meta-learning, and model editing.
This paper gives a comprehensive review of the current FM-based methods from the perspective of LFM.
arXiv Detail & Related papers (2023-10-12T10:20:36Z)
- Multilingual Multi-Aspect Explainability Analyses on Machine Reading Comprehension Models [76.48370548802464]
This paper conducts a series of analytical experiments to examine the relationship between multi-head self-attention and final MRC system performance.
We discover that passage-to-question and passage-understanding attentions are the most important in the question-answering process.
Through comprehensive visualizations and case studies, we also observe several general findings on the attention maps, which can help explain how these models solve the questions (a minimal extraction sketch follows this entry).
arXiv Detail & Related papers (2021-08-26T04:23:57Z)
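As an illustration of the attention analysis described in the last entry above, the sketch below extracts per-head attention from an extractive QA model with Hugging Face Transformers and averages the attention mass flowing from passage tokens to question tokens. The checkpoint (deepset/roberta-base-squad2), the separator-based span detection, and the mean-mass aggregation are illustrative choices, not the setup studied in the paper.
```python
import torch
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

MODEL = "deepset/roberta-base-squad2"  # illustrative checkpoint, not the one analyzed in the paper
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForQuestionAnswering.from_pretrained(MODEL, output_attentions=True)

question = "What does the benchmark measure?"
passage = "MISS-QA measures how well models interpret schematic diagrams in papers."
inputs = tokenizer(question, passage, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# RoBERTa encodes pairs as "<s> question </s></s> passage </s>", so locate
# the question/passage boundary via the separator token positions.
sep_id = tokenizer.sep_token_id
sep_positions = (inputs["input_ids"][0] == sep_id).nonzero().flatten()
q_end = sep_positions[0].item()         # tokens 1..q_end-1 are the question
p_start = sep_positions[-2].item() + 1  # tokens p_start..-2 are the passage

# outputs.attentions is a tuple of (batch, heads, seq, seq) tensors, one per layer.
for layer, attn in enumerate(outputs.attentions):
    # mean passage-to-question attention mass for each head in this layer
    p2q = attn[0, :, p_start:-1, 1:q_end].mean(dim=(1, 2))
    print(f"layer {layer:2d}: strongest head = {p2q.argmax().item()}, mean = {p2q.mean():.4f}")
```
Averaging attention mass is only one way to quantify passage-to-question attention; the paper's own multilingual analyses may use different models and aggregations.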
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and accepts no responsibility for any consequences of its use.