RxnBench: A Multimodal Benchmark for Evaluating Large Language Models on Chemical Reaction Understanding from Scientific Literature
- URL: http://arxiv.org/abs/2512.23565v2
- Date: Tue, 30 Dec 2025 15:18:47 GMT
- Authors: Hanzheng Li, Xi Fang, Yixuan Li, Chaozheng Huang, Junjie Wang, Xi Wang, Hongzhe Bai, Bojun Hao, Shenyu Lin, Huiqi Liang, Linfeng Zhang, Guolin Ke
- Abstract summary: We introduce RxnBench, a benchmark designed to rigorously evaluate MLLMs on chemical reaction understanding from scientific PDFs. RxnBench comprises two tasks: Single-Figure QA (SF-QA), which tests fine-grained visual perception and mechanistic reasoning, and Full-Document QA (FD-QA), which challenges models to synthesize information from 108 articles. Our evaluation of MLLMs reveals a critical capability gap: while models excel at extracting explicit text, they struggle with deep chemical logic and precise structural recognition.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The integration of Multimodal Large Language Models (MLLMs) into chemistry promises to revolutionize scientific discovery, yet their ability to comprehend the dense, graphical language of reactions within authentic literature remains underexplored. Here, we introduce RxnBench, a multi-tiered benchmark designed to rigorously evaluate MLLMs on chemical reaction understanding from scientific PDFs. RxnBench comprises two tasks: Single-Figure QA (SF-QA), which tests fine-grained visual perception and mechanistic reasoning using 1,525 questions derived from 305 curated reaction schemes, and Full-Document QA (FD-QA), which challenges models to synthesize information from 108 articles, requiring cross-modal integration of text, schemes, and tables. Our evaluation of MLLMs reveals a critical capability gap: while models excel at extracting explicit text, they struggle with deep chemical logic and precise structural recognition. Notably, models with inference-time reasoning significantly outperform standard architectures, yet none achieve 50% accuracy on FD-QA. These findings underscore the urgent need for domain-specific visual encoders and stronger reasoning engines to advance autonomous AI chemists.
Related papers
- Evaluating Large Language Models on Multimodal Chemistry Olympiad Exams [20.432924845981255]
Multimodal scientific reasoning remains a significant challenge for large language models (LLMs). We systematically evaluate 40 proprietary and open-source multimodal LLMs on a curated benchmark of Olympiad-style chemistry questions. Our results reveal critical limitations in the scientific reasoning abilities of current MLLMs.
arXiv Detail & Related papers (2025-12-17T00:49:00Z)
- ChemVTS-Bench: Evaluating Visual-Textual-Symbolic Reasoning of Multimodal Large Language Models in Chemistry [14.083820970280668]
ChemVTS-Bench is a domain-authentic benchmark designed to evaluate the Visual-Textual-Symbolic (VTS) reasoning abilities of Multimodal Large Language Models (MLLMs). ChemVTS-Bench contains diverse and challenging chemical problems spanning organic molecules, inorganic materials, and 3D crystal structures. We develop an automated agent-based workflow that standardizes inference, verifies answers, and diagnoses failure modes.
arXiv Detail & Related papers (2025-11-22T04:24:24Z)
- oMeBench: Towards Robust Benchmarking of LLMs in Organic Mechanism Elucidation and Reasoning [44.36582860924775]
We introduce oMeBench, the first large-scale, expert-curated benchmark for organic mechanism reasoning in organic chemistry. We also propose oMeS, a dynamic evaluation framework that combines step-level logic and chemical similarity.
arXiv Detail & Related papers (2025-10-09T03:13:31Z)
- QCBench: Evaluating Large Language Models on Domain-Specific Quantitative Chemistry [19.804237919102903]
QCBench is a quantitative-chemistry-oriented benchmark comprising 350 computational chemistry problems across 7 chemistry subfields. Each problem is structured to prevent shortcuts and demand explicit numerical reasoning. QCBench enables fine-grained diagnosis of computational weaknesses, reveals model-specific limitations, and lays the groundwork for future improvements.
arXiv Detail & Related papers (2025-08-03T08:55:42Z)
- A Multi-Agent System Enables Versatile Information Extraction from the Chemical Literature [8.306442315850878]
We develop a multimodal large language model (MLLM)-based multi-agent system for robust and automated chemical information extraction. Our system achieved an F1 score of 80.8% on a benchmark dataset of sophisticated multimodal chemical reaction graphics from the literature.
arXiv Detail & Related papers (2025-07-27T11:16:57Z)
- ChemActor: Enhancing Automated Extraction of Chemical Synthesis Actions with LLM-Generated Data [53.78763789036172]
We present ChemActor, a fully fine-tuned large language model (LLM) that acts as a chemical executor to convert between unstructured experimental procedures and structured action sequences. This framework integrates a data selection module, which selects data based on distribution divergence, with a general-purpose LLM to generate machine-executable actions from a single molecule input. Experiments on reaction-to-description (R2D) and description-to-action (D2A) tasks demonstrate that ChemActor achieves state-of-the-art performance, outperforming the baseline model by 10%.
arXiv Detail & Related papers (2025-06-30T05:11:19Z)
- Benchmarking Multimodal LLMs on Recognition and Understanding over Chemical Tables [48.39080455781475]
ChemTable is a large-scale benchmark of real-world chemical tables curated from the experimental sections of the literature. ChemTable includes expert-annotated cell polygons, logical layouts, and domain-specific labels, including reagents, catalysts, yields, and graphical components. We evaluated a range of representative multimodal models, including both open-source and closed-source models, on ChemTable and reported a series of findings with practical and conceptual insights.
arXiv Detail & Related papers (2025-06-13T00:45:41Z)
- MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research [57.61445960384384]
MicroVQA consists of 1,042 multiple-choice questions (MCQs) curated by biology experts across diverse microscopy modalities. Benchmarking on state-of-the-art MLLMs reveals a peak performance of 53%. Expert analysis of chain-of-thought responses shows that perception errors are the most frequent, followed by knowledge errors and then overgeneralization errors.
arXiv Detail & Related papers (2025-03-17T17:33:10Z)
- VOILA: Evaluation of MLLMs For Perceptual Understanding and Analogical Reasoning [63.0285363282581]
Multimodal Large Language Models (MLLMs) have become a powerful tool for integrating visual and textual information. We introduce VOILA, a benchmark designed to evaluate MLLMs' perceptual understanding and abstract relational reasoning. We reveal that current MLLMs struggle to comprehend inter-image relationships and exhibit limited capabilities in high-level relational reasoning.
arXiv Detail & Related papers (2025-02-25T23:36:19Z)
- MAPS: Advancing Multi-Modal Reasoning in Expert-Level Physical Science [62.96434290874878]
Current Multi-Modal Large Language Models (MLLMs) have shown strong capabilities in general visual reasoning tasks. We develop a new MLLM-based framework, named Multi-Modal Scientific Reasoning with Physics Perception and Simulation (MAPS). MAPS decomposes expert-level multi-modal reasoning tasks into physical diagram understanding via a Physical Perception Model (PPM) and reasoning with physical knowledge via a simulator.
arXiv Detail & Related papers (2025-01-18T13:54:00Z)
- MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making. We present MR-Ben, a process-based benchmark that demands meta-reasoning skills. Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.