SciVerse: Unveiling the Knowledge Comprehension and Visual Reasoning of LMMs on Multi-modal Scientific Problems
- URL: http://arxiv.org/abs/2503.10627v1
- Date: Thu, 13 Mar 2025 17:59:32 GMT
- Title: SciVerse: Unveiling the Knowledge Comprehension and Visual Reasoning of LMMs on Multi-modal Scientific Problems
- Authors: Ziyu Guo, Ray Zhang, Hao Chen, Jialin Gao, Dongzhi Jiang, Jiaze Wang, Pheng-Ann Heng
- Abstract summary: We introduce SciVerse, a multi-modal scientific evaluation benchmark to thoroughly assess Large Multi-modal Models (LMMs). We aim to investigate three key dimensions of LMMs: scientific knowledge comprehension, multi-modal content interpretation, and Chain-of-Thought (CoT) reasoning. Our extensive evaluation of different LMMs on SciVerse reveals critical limitations in their scientific proficiency and provides new insights into future developments.
- Score: 41.69093932236271
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rapid advancement of Large Multi-modal Models (LMMs) has enabled their application in scientific problem-solving, yet their fine-grained capabilities remain under-explored. In this paper, we introduce SciVerse, a multi-modal scientific evaluation benchmark to thoroughly assess LMMs across 5,735 test instances in five distinct versions. We aim to investigate three key dimensions of LMMs: scientific knowledge comprehension, multi-modal content interpretation, and Chain-of-Thought (CoT) reasoning. To unveil whether LMMs possess sufficient scientific expertise, we first transform each problem into three versions containing different levels of knowledge required for solving, i.e., Knowledge-free, -lite, and -rich. Then, to explore how LMMs interpret multi-modal scientific content, we annotate another two versions, i.e., Vision-rich and -only, marking more question information from texts to diagrams. Comparing the results of different versions, SciVerse systematically examines the professional knowledge stock and visual perception skills of LMMs in scientific domains. In addition, to rigorously assess CoT reasoning, we propose a new scientific CoT evaluation strategy, conducting a step-wise assessment on knowledge and logical errors in model outputs. Our extensive evaluation of different LMMs on SciVerse reveals critical limitations in their scientific proficiency and provides new insights into future developments. Project page: https://sciverse-cuhk.github.io
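The abstract describes, but does not specify, how the step-wise CoT evaluation separates knowledge errors from logical errors in model outputs. As a rough illustration only (not the authors' released code), the minimal Python sketch below assumes a hypothetical per-step judgement (e.g., produced by a human annotator or an LLM judge) and aggregates it into two scores; all class and function names are illustrative.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class StepVerdict(Enum):
    CORRECT = "correct"
    KNOWLEDGE_ERROR = "knowledge_error"   # wrong fact, formula, or domain concept
    LOGICAL_ERROR = "logical_error"       # invalid deduction from earlier steps


@dataclass
class StepJudgement:
    step_text: str
    verdict: StepVerdict


def score_cot(steps: List[StepJudgement]) -> Dict[str, float]:
    """Aggregate per-step verdicts into knowledge and logic scores in [0, 1]."""
    total = len(steps)
    if total == 0:
        return {"knowledge_score": 0.0, "logic_score": 0.0}
    knowledge_errors = sum(s.verdict is StepVerdict.KNOWLEDGE_ERROR for s in steps)
    logical_errors = sum(s.verdict is StepVerdict.LOGICAL_ERROR for s in steps)
    return {
        "knowledge_score": 1.0 - knowledge_errors / total,
        "logic_score": 1.0 - logical_errors / total,
    }


if __name__ == "__main__":
    # Hypothetical judged reasoning chain for a physics problem.
    judged = [
        StepJudgement("Apply F = ma with m = 2 kg, a = 3 m/s^2.", StepVerdict.CORRECT),
        StepJudgement("Therefore F = 5 N.", StepVerdict.LOGICAL_ERROR),  # arithmetic slip
    ]
    print(score_cot(judged))  # {'knowledge_score': 1.0, 'logic_score': 0.5}
```

How the benchmark actually weights or combines these error types is not stated in the abstract; the uniform per-step averaging above is an assumption made for clarity.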
Related papers
- Scientists' First Exam: Probing Cognitive Abilities of MLLM via Perception, Understanding, and Reasoning [59.518397361341556]
We present the Scientists' First Exam (SFE) benchmark, designed to evaluate the scientific cognitive capacities of Multimodal Large Language Models (MLLMs). SFE comprises 830 expert-verified VQA pairs across three question types, spanning 66 multimodal tasks across five high-value disciplines. Experiments reveal that current state-of-the-art GPT-o3 and InternVL-3 achieve only 34.08% and 26.52% on SFE, highlighting significant room for MLLMs to improve in scientific realms.
arXiv Detail & Related papers (2025-06-12T09:29:16Z) - Can AI Validate Science? Benchmarking LLMs for Accurate Scientific Claim $\rightarrow$ Evidence Reasoning [6.043212666944194]
We present CLAIM-BENCH, a benchmark for evaluating large language models' capabilities in scientific claim-evidence extraction and validation. We show that closed-source models like GPT-4 and Claude consistently outperform open-source counterparts in precision and recall. Strategically designed three-pass and one-by-one prompting approaches significantly improve LLMs' abilities to accurately link dispersed evidence with claims.
arXiv Detail & Related papers (2025-06-09T21:04:39Z) - CURIE: Evaluating LLMs On Multitask Scientific Long Context Understanding and Reasoning [7.41837850475371]
We introduce CURIE, a benchmark to measure the potential of Large Language Models (LLMs) in scientific problem-solving.
This benchmark introduces ten challenging tasks with a total of 580 problems and solution pairs curated by experts in six disciplines.
We evaluate a range of closed and open LLMs on tasks in CURIE which require domain expertise, comprehension of long in-context information, and multi-step reasoning.
arXiv Detail & Related papers (2025-03-14T17:53:03Z) - Exploring and Evaluating Multimodal Knowledge Reasoning Consistency of Multimodal Large Language Models [52.569132872560814]
Multimodal large language models (MLLMs) have achieved significant breakthroughs, enhancing understanding across text and vision. However, current MLLMs still face challenges in effectively integrating knowledge across these modalities during multimodal knowledge reasoning. We analyze and compare the extent of consistency degradation in multimodal knowledge reasoning within MLLMs.
arXiv Detail & Related papers (2025-03-03T09:01:51Z) - Position: Multimodal Large Language Models Can Significantly Advance Scientific Reasoning [51.11965014462375]
Multimodal Large Language Models (MLLMs) integrate text, images, and other modalities. This paper argues that MLLMs can significantly advance scientific reasoning across disciplines such as mathematics, physics, chemistry, and biology.
arXiv Detail & Related papers (2025-02-05T04:05:27Z) - A Comprehensive Survey of Scientific Large Language Models and Their Applications in Scientific Discovery [68.48094108571432]
Large language models (LLMs) have revolutionized the way text and other modalities of data are handled.
We aim to provide a more holistic view of the research landscape by unveiling cross-field and cross-modal connections between scientific LLMs.
arXiv Detail & Related papers (2024-06-16T08:03:24Z) - SciKnowEval: Evaluating Multi-level Scientific Knowledge of Large Language Models [35.98892300665275]
We introduce the SciKnowEval benchmark, a framework that evaluates large language models (LLMs) across five progressive levels of scientific knowledge.
These levels aim to assess the breadth and depth of scientific knowledge in LLMs, including memory, comprehension, reasoning, discernment, and application.
We benchmark 26 advanced open-source and proprietary LLMs using zero-shot and few-shot prompting strategies.
arXiv Detail & Related papers (2024-06-13T13:27:52Z) - Uni-SMART: Universal Science Multimodal Analysis and Research Transformer [22.90687836544612]
We present Uni-SMART, an innovative model designed for in-depth understanding of scientific literature.
Uni-SMART demonstrates superior performance over other text-focused LLMs.
Our exploration extends to practical applications, including patent infringement detection and nuanced analysis of charts.
arXiv Detail & Related papers (2024-03-15T13:43:47Z) - CMMMU: A Chinese Massive Multi-discipline Multimodal Understanding Benchmark [53.24896036161829]
We introduce a new Chinese Massive Multi-discipline Multimodal Understanding benchmark designed to evaluate LMMs on tasks demanding college-level subject knowledge and deliberate reasoning in a Chinese context.
CMMMU includes 12k manually collected multimodal questions from college exams, quizzes, and textbooks, covering six core disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering.
CMMMU focuses on complex perception and reasoning with domain-specific knowledge in the Chinese context.
arXiv Detail & Related papers (2024-01-22T13:34:34Z) - SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models [70.5763210869525]
We introduce SciBench, an expansive benchmark suite for Large Language Models (LLMs).
SciBench contains a dataset featuring a range of collegiate-level scientific problems from mathematics, chemistry, and physics domains.
The results reveal that the current LLMs fall short of delivering satisfactory performance, with the best overall score of merely 43.22%.
arXiv Detail & Related papers (2023-07-20T07:01:57Z)