MME-SCI: A Comprehensive and Challenging Science Benchmark for Multimodal Large Language Models
- URL: http://arxiv.org/abs/2508.13938v1
- Date: Tue, 19 Aug 2025 15:27:55 GMT
- Title: MME-SCI: A Comprehensive and Challenging Science Benchmark for Multimodal Large Language Models
- Authors: Jiacheng Ruan, Dan Jiang, Xian Gao, Ting Liu, Yuzhuo Fu, Yangyang Kang
- Abstract summary: Multimodal large language models (MLLMs) have achieved significant advancements across various domains. Benchmarks in the scientific domain have played an important role in assessing the reasoning capabilities of MLLMs. Existing benchmarks still face three key challenges: 1) insufficient evaluation of models' reasoning abilities in multilingual scenarios; 2) inadequate assessment of MLLMs' comprehensive modality coverage; 3) lack of fine-grained annotation of scientific knowledge points.
- Score: 27.425377320206444
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, multimodal large language models (MLLMs) have achieved significant advancements across various domains, and corresponding evaluation benchmarks have been continuously refined and improved. In this process, benchmarks in the scientific domain have played an important role in assessing the reasoning capabilities of MLLMs. However, existing benchmarks still face three key challenges: 1) Insufficient evaluation of models' reasoning abilities in multilingual scenarios; 2) Inadequate assessment of MLLMs' comprehensive modality coverage; 3) Lack of fine-grained annotation of scientific knowledge points. To address these gaps, we propose MME-SCI, a comprehensive and challenging benchmark. We carefully collected 1,019 high-quality question-answer pairs, which involve 3 distinct evaluation modes. These pairs cover four subjects, namely mathematics, physics, chemistry, and biology, and support five languages: Chinese, English, French, Spanish, and Japanese. We conducted extensive experiments on 16 open-source models and 4 closed-source models, and the results demonstrate that MME-SCI is broadly challenging for existing MLLMs. For instance, under the Image-only evaluation mode, o4-mini achieved accuracies of only 52.11%, 24.73%, 36.57%, and 29.80% in mathematics, physics, chemistry, and biology, respectively, indicating a significantly higher difficulty level compared to existing benchmarks. More importantly, using MME-SCI's multilingual and fine-grained knowledge attributes, we analyzed existing models' performance in depth and identified their weaknesses in specific domains. The data and evaluation code are available at https://github.com/JCruan519/MME-SCI.
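To make the reported numbers concrete, below is a minimal sketch of how per-subject accuracy under a single evaluation mode (e.g. Image-only) could be tallied from graded model outputs. The file name, record schema, and field names here are assumptions for illustration only; they are not the official MME-SCI data format or the API of the released evaluation code.

```python
import json
from collections import defaultdict

def per_subject_accuracy(results_path, mode="image_only"):
    """Tally per-subject accuracy from a file of graded records.

    Each record is assumed to look like
    {"subject": "physics", "language": "en", "mode": "image_only",
     "prediction": "B", "answer": "B"} -- an illustrative schema,
    not the official MME-SCI format.
    """
    correct, total = defaultdict(int), defaultdict(int)
    with open(results_path, "r", encoding="utf-8") as f:
        records = json.load(f)
    for rec in records:
        if rec.get("mode") != mode:
            continue  # restrict to the requested evaluation mode
        subject = rec["subject"]
        total[subject] += 1
        if rec["prediction"].strip().upper() == rec["answer"].strip().upper():
            correct[subject] += 1
    return {s: correct[s] / total[s] for s in total}

if __name__ == "__main__":
    for subject, acc in sorted(per_subject_accuracy("graded_outputs.json").items()):
        print(f"{subject}: {acc:.2%}")
```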
Related papers
- HiSciBench: A Hierarchical Multi-disciplinary Benchmark for Scientific Intelligence from Reading to Discovery [50.8841471967624]
HiSciBench is a hierarchical benchmark designed to evaluate foundation models across five levels that mirror the complete scientific workflow. HiSciBench contains 8,735 carefully curated instances spanning six major scientific disciplines.
arXiv Detail & Related papers (2025-12-28T12:08:05Z) - Evaluating Large Language Models on Multimodal Chemistry Olympiad Exams [20.432924845981255]
Multimodal scientific reasoning remains a significant challenge for large language models (LLMs). We systematically evaluate 40 proprietary and open-source multimodal LLMs on a curated benchmark of Olympiad-style chemistry questions. Our results reveal critical limitations in the scientific reasoning abilities of current MLLMs.
arXiv Detail & Related papers (2025-12-17T00:49:00Z) - Multi-Physics: A Comprehensive Benchmark for Multimodal LLMs Reasoning on Chinese Multi-Subject Physics Problems [15.023749693065406]
We introduce Multi-Physics, a comprehensive benchmark for Chinese physics reasoning that includes 5 difficulty levels. We employ a dual evaluation framework to evaluate 20 different MLLMs, analyzing both final answer accuracy and the step-by-step integrity of their chain-of-thought.
arXiv Detail & Related papers (2025-09-19T10:18:48Z) - MDK12-Bench: A Comprehensive Evaluation of Multimodal Large Language Models on Multidisciplinary Exams [50.293164501645975]
Multimodal large language models (MLLMs) integrate language and visual cues for problem-solving. Current benchmarks for measuring the intelligence of MLLMs suffer from limited scale, narrow coverage, and unstructured knowledge. We introduce MDK12-Bench, a large-scale multidisciplinary benchmark built from real-world K-12 exams spanning six disciplines.
arXiv Detail & Related papers (2025-08-09T06:21:10Z) - VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models [121.03333569013148]
We introduce VisuLogic, a benchmark of 1,000 human-verified problems across six categories. These questions allow the visual reasoning capabilities of MLLMs to be assessed from multiple perspectives. Most models score below 30% accuracy, only slightly above the 25% random baseline and far below the 51.4% achieved by humans.
arXiv Detail & Related papers (2025-04-21T17:59:53Z) - EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents [63.43699771428243]
EmbodiedBench is an extensive benchmark designed to evaluate vision-driven embodied agents. We evaluated 24 leading proprietary and open-source MLLMs within EmbodiedBench. MLLMs excel at high-level tasks but struggle with low-level manipulation, with the best model, GPT-4o, scoring only 28.9% on average.
arXiv Detail & Related papers (2025-02-13T18:11:34Z) - MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models [71.36392373876505]
We introduce MMIE, a large-scale benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs). MMIE comprises 20K meticulously curated multimodal queries, spanning 3 categories, 12 fields, and 102 subfields, including mathematics, coding, physics, literature, health, and arts. It supports both interleaved inputs and outputs, offering a mix of multiple-choice and open-ended question formats to evaluate diverse competencies.
arXiv Detail & Related papers (2024-10-14T04:15:00Z) - VisScience: An Extensive Benchmark for Evaluating K12 Educational Multi-modal Scientific Reasoning [20.56989082014445]
Multi-modal large language models (MLLMs) have demonstrated promising capabilities across various tasks. We present a detailed evaluation of the performance of 25 representative MLLMs in scientific reasoning. The best results observed include a 53.4% accuracy in mathematics by Claude3.5-Sonnet, 38.2% in physics by GPT-4o, and 47.0% in chemistry by Gemini-1.5-Pro.
arXiv Detail & Related papers (2024-09-10T01:20:26Z) - A Survey on Benchmarks of Multimodal Large Language Models [65.87641718350639]
This paper presents a comprehensive review of 200 benchmarks and evaluations for Multimodal Large Language Models (MLLMs).
We focus on (1) perception and understanding, (2) cognition and reasoning, (3) specific domains, (4) key capabilities, and (5) other modalities.
Our key argument is that evaluation should be regarded as a crucial discipline to better support the development of MLLMs.
arXiv Detail & Related papers (2024-08-16T09:52:02Z) - MULTI: Multimodal Understanding Leaderboard with Text and Images [24.04211732343361]
We present MULTI, a Chinese multimodal dataset derived from authentic examination questions. MULTI evaluates models using real-world examination standards, encompassing image-text comprehension, complex reasoning, and knowledge recall. Our evaluation highlights substantial room for MLLM advancement, with Qwen2-VL-72B achieving 76.9% accuracy on MULTI and 53.1% on MULTI-Elite, leading the 25 evaluated models.
arXiv Detail & Related papers (2024-02-05T16:41:02Z)