Evaluating Large Language Models on Multimodal Chemistry Olympiad Exams
- URL: http://arxiv.org/abs/2512.14989v1
- Date: Wed, 17 Dec 2025 00:49:00 GMT
- Title: Evaluating Large Language Models on Multimodal Chemistry Olympiad Exams
- Authors: Yiming Cui, Xin Yao, Yuxuan Qin, Xin Li, Shijin Wang, Guoping Hu
- Abstract summary: Multimodal scientific reasoning remains a significant challenge for large language models (LLMs). We systematically evaluate 40 proprietary and open-source multimodal LLMs on a curated benchmark of Olympiad-style chemistry questions. Our results reveal critical limitations in the scientific reasoning abilities of current MLLMs.
- Score: 20.432924845981255
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal scientific reasoning remains a significant challenge for large language models (LLMs), particularly in chemistry, where problem-solving relies on symbolic diagrams, molecular structures, and structured visual data. Here, we systematically evaluate 40 proprietary and open-source multimodal LLMs, including GPT-5, o3, Gemini-2.5-Pro, and Qwen2.5-VL, on a curated benchmark of Olympiad-style chemistry questions drawn from over two decades of U.S. National Chemistry Olympiad (USNCO) exams. These questions require integrated visual and textual reasoning across diverse modalities. We find that many models struggle with modality fusion: in some cases, removing the image even improves accuracy, indicating misalignment in vision-language integration. Chain-of-Thought prompting consistently enhances both accuracy and visual grounding, as demonstrated through ablation studies and occlusion-based interpretability. Our results reveal critical limitations in the scientific reasoning abilities of current MLLMs, providing actionable strategies for developing more robust and interpretable multimodal systems in chemistry. This work offers a timely benchmark for measuring progress in domain-specific multimodal AI and underscores the need for further advances at the intersection of artificial intelligence and scientific reasoning.
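Both probes mentioned in the abstract, the image-removal ablation and the occlusion-based interpretability analysis, reduce to re-querying the model under controlled perturbations of the visual input. The sketch below illustrates that common pattern under stated assumptions: `query_model` is a hypothetical wrapper around any multimodal chat API (the paper does not publish its evaluation harness), Pillow is used for image manipulation, and the 56-pixel patch size is an arbitrary illustrative choice.

```python
# Minimal sketch of the two perturbation probes; not the paper's code.
from typing import Callable, Optional
from PIL import Image  # Pillow, assumed available

# Hypothetical interface: query_model(question, image_or_None) -> answer string.
QueryFn = Callable[[str, Optional[Image.Image]], str]

def image_removal_ablation(query_model: QueryFn, question: str,
                           image: Image.Image, gold: str) -> tuple[bool, bool]:
    """Answer the same question with and without the image.
    If text-only accuracy beats multimodal accuracy over many items,
    vision-language fusion is hurting rather than helping."""
    with_image = query_model(question, image).strip() == gold
    text_only = query_model(question, None).strip() == gold
    return with_image, text_only

def occlusion_heatmap(query_model: QueryFn, question: str,
                      image: Image.Image, gold: str,
                      patch: int = 56) -> list[list[float]]:
    """Mask one image patch at a time and record whether the answer flips.
    A visually grounded model should flip only when a relevant region
    (e.g. part of a molecular structure) is hidden."""
    w, h = image.size
    heatmap = []
    for top in range(0, h, patch):
        row = []
        for left in range(0, w, patch):
            occluded = image.copy()
            box = (left, top, min(left + patch, w), min(top + patch, h))
            occluded.paste((128, 128, 128), box)  # gray out one patch (RGB image assumed)
            row.append(1.0 if query_model(question, occluded).strip() != gold else 0.0)
        heatmap.append(row)
    return heatmap
```

Averaging the two flags from `image_removal_ablation` over a benchmark yields the with-image versus text-only accuracies the abstract compares, and aggregating `occlusion_heatmap` over correctly answered items approximates the kind of visual-grounding signal an occlusion analysis produces.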
Related papers
- RxnBench: A Multimodal Benchmark for Evaluating Large Language Models on Chemical Reaction Understanding from Scientific Literature [25.978951548176706]
We introduce RxnBench, a benchmark designed to rigorously evaluate MLLMs on chemical reaction understanding from scientific PDFs. RxnBench comprises two tasks: Single-Figure QA (SF-QA), which tests fine-grained visual perception and mechanistic reasoning, and Full-Document QA (FD-QA), which challenges models to synthesize information from 108 articles. Our evaluation of MLLMs reveals a critical capability gap: while models excel at extracting explicit text, they struggle with deep chemical logic and precise structural recognition.
arXiv Detail & Related papers (2025-12-29T16:05:38Z) - HiSciBench: A Hierarchical Multi-disciplinary Benchmark for Scientific Intelligence from Reading to Discovery [50.8841471967624]
HiSciBench is a hierarchical benchmark designed to evaluate foundation models across five levels that mirror the complete scientific workflow. HiSciBench contains 8,735 carefully curated instances spanning six major scientific disciplines.
arXiv Detail & Related papers (2025-12-28T12:08:05Z) - PENDULUM: A Benchmark for Assessing Sycophancy in Multimodal Large Language Models [43.767942065379366]
Sycophancy is a tendency of AI models to agree with user input at the expense of factual accuracy or in contradiction of visual evidence. We introduce a comprehensive evaluation benchmark, PENDULUM, comprising approximately 2,000 human-curated Visual Question Answering pairs. We observe substantial variability in model robustness and a pronounced susceptibility to sycophantic and hallucinatory behavior.
arXiv Detail & Related papers (2025-12-22T12:49:12Z) - ChemVTS-Bench: Evaluating Visual-Textual-Symbolic Reasoning of Multimodal Large Language Models in Chemistry [14.083820970280668]
ChemVTS-Bench is a domain-authentic benchmark designed to evaluate the Visual-Textual-Symbolic (VTS) reasoning abilities of Multimodal Large Language Models (MLLMs). ChemVTS-Bench contains diverse and challenging chemical problems spanning organic molecules, inorganic materials, and 3D crystal structures. We develop an automated agent-based workflow that standardizes inference, verifies answers, and diagnoses failure modes.
arXiv Detail & Related papers (2025-11-22T04:24:24Z) - Multi-Physics: A Comprehensive Benchmark for Multimodal LLMs Reasoning on Chinese Multi-Subject Physics Problems [15.023749693065406]
We introduce Multi-Physics, a comprehensive benchmark for Chinese physics reasoning that includes 5 difficulty levels. We employ a dual evaluation framework to evaluate 20 different MLLMs, analyzing both final-answer accuracy and the step-by-step integrity of their chain-of-thought.
arXiv Detail & Related papers (2025-09-19T10:18:48Z) - $\text{M}^{2}$LLM: Multi-view Molecular Representation Learning with Large Language Models [59.125833618091846]
We propose a multi-view framework that integrates three perspectives: the molecular structure view, the molecular task view, and the molecular rules view. Experiments demonstrate that $\text{M}^{2}$LLM achieves state-of-the-art performance on multiple benchmarks across classification and regression tasks.
arXiv Detail & Related papers (2025-08-12T05:46:47Z) - MDK12-Bench: A Comprehensive Evaluation of Multimodal Large Language Models on Multidisciplinary Exams [50.293164501645975]
Multimodal large language models (MLLMs) integrate language and visual cues for problem-solving. Current benchmarks for measuring the intelligence of MLLMs suffer from limited scale, narrow coverage, and unstructured knowledge. We introduce MDK12-Bench, a large-scale multidisciplinary benchmark built from real-world K-12 exams spanning six disciplines.
arXiv Detail & Related papers (2025-08-09T06:21:10Z) - Can MLLMs Reason in Multimodality? EMMA: An Enhanced MultiModal ReAsoning Benchmark [73.27104042215207]
We introduce EMMA, a benchmark targeting organic multimodal reasoning across mathematics, physics, chemistry, and coding. EMMA tasks demand advanced cross-modal reasoning that cannot be addressed by reasoning independently in each modality. Our evaluation of state-of-the-art MLLMs on EMMA reveals significant limitations in handling complex multimodal and multi-step reasoning tasks.
arXiv Detail & Related papers (2025-01-09T18:55:52Z) - From Generalist to Specialist: A Survey of Large Language Models for Chemistry [14.317448405387195]
Large Language Models (LLMs) have significantly transformed our daily life and established a new paradigm in natural language processing (NLP). The predominant pretraining of LLMs on extensive web-based texts remains insufficient for advanced scientific discovery, particularly in chemistry. Although several studies have reviewed Pretrained Language Models (PLMs) in chemistry, there is a conspicuous absence of a systematic survey specifically focused on chemistry-oriented LLMs.
arXiv Detail & Related papers (2024-12-28T03:40:25Z) - Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark [53.61633384281524]
PolyMATH is a benchmark aimed at evaluating the general cognitive reasoning abilities of MLLMs.
The best scores achieved on PolyMATH are 41%, 36%, and 27%, obtained by Claude-3.5 Sonnet, GPT-4o and Gemini-1.5 Pro respectively.
A further fine-grained error analysis reveals that these models struggle to understand spatial relations and perform drawn-out, high-level reasoning.
arXiv Detail & Related papers (2024-10-06T20:35:41Z) - VisScience: An Extensive Benchmark for Evaluating K12 Educational Multi-modal Scientific Reasoning [20.56989082014445]
Multi-modal large language models (MLLMs) have demonstrated promising capabilities across various tasks. We present a detailed evaluation of the performance of 25 representative MLLMs in scientific reasoning. The best performances observed include a 53.4% accuracy in mathematics by Claude3.5-Sonnet, 38.2% in physics by GPT-4o, and 47.0% in chemistry by Gemini-1.5-Pro.
arXiv Detail & Related papers (2024-09-10T01:20:26Z) - ChemVLM: Exploring the Power of Multimodal Large Language Models in Chemistry Area [70.66610054938052]
We introduce ChemVLM, an open-source chemical multimodal large language model for chemical applications. ChemVLM is trained on a carefully curated bilingual dataset that enhances its ability to understand both textual and visual chemical information. We benchmark ChemVLM against a range of open-source and proprietary multimodal large language models on various tasks.
arXiv Detail & Related papers (2024-08-14T01:16:40Z) - MMSci: A Dataset for Graduate-Level Multi-Discipline Multimodal Scientific Understanding [59.41495657570397]
We present a comprehensive dataset compiled from Nature Communications articles covering 72 scientific fields. We evaluated 19 proprietary and open-source models on two benchmark tasks, figure captioning and multiple-choice, and conducted human expert annotation. Fine-tuning Qwen2-VL-7B with our task-specific data achieved better performance than GPT-4o and even human experts in multiple-choice evaluations.
arXiv Detail & Related papers (2024-07-06T00:40:53Z)