SUPERChem: A Multimodal Reasoning Benchmark in Chemistry
- URL: http://arxiv.org/abs/2512.01274v1
- Date: Mon, 01 Dec 2025 04:46:35 GMT
- Title: SUPERChem: A Multimodal Reasoning Benchmark in Chemistry
- Authors: Zehua Zhao, Zhixian Huang, Junren Li, Siyu Lin, Junting Zhou, Fengqi Cao, Kun Zhou, Rui Ge, Tingting Long, Yuexiang Zhu, Yan Liu, Jie Zheng, Junnian Wei, Rong Zhu, Peng Zou, Wenyu Li, Zekai Cheng, Tian Ding, Yaxuan Wang, Yizhao Yan, Tingru Wei, Haowei Ming, Weijie Mao, Chen Sun, Yiming Liu, Zichen Wang, Zuo Zhang, Tong Yang, Hao Ma, Zhen Gao, Jian Pei
- Abstract summary: SUPERChem is a benchmark of 500 expert-curated reasoning-intensive chemistry problems. Each problem is paired with an expert-authored solution path. Evaluations against a human baseline of 40.3% accuracy show that even the best-performing model, GPT-5 (High), reaches only 38.5%.
- Score: 47.60627566673109
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current benchmarks for evaluating the chemical reasoning capabilities of Large Language Models (LLMs) are limited by oversimplified tasks, lack of process-level evaluation, and misalignment with expert-level chemistry skills. To address these issues, we introduce SUPERChem, a benchmark of 500 expert-curated reasoning-intensive chemistry problems, covering diverse subfields and provided in both multimodal and text-only formats. Original content and an iterative curation pipeline eliminate flawed items and mitigate data contamination. Each problem is paired with an expert-authored solution path, enabling Reasoning Path Fidelity (RPF) scoring to evaluate reasoning quality beyond final-answer accuracy. Evaluations against a human baseline of 40.3% accuracy show that even the best-performing model, GPT-5 (High), reaches only 38.5%, followed closely by Gemini 2.5 Pro (37.9%) and DeepSeek-V3.1-Think (37.3%). SUPERChem elicits multi-step, multimodal reasoning, reveals model-dependent effects of visual information, and distinguishes high-fidelity reasoners from heuristic ones. By providing a challenging benchmark and a reliable evaluation framework, SUPERChem aims to facilitate the advancement of LLMs toward expert-level chemical intelligence. The dataset of the benchmark is available at https://huggingface.co/datasets/ZehuaZhao/SUPERChem.
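The abstract's Reasoning Path Fidelity (RPF) idea is to score a model's reasoning trace against the expert-authored solution path, not just the final answer. As a rough, purely illustrative sketch (the paper defines its own scoring procedure; the function, step format, and similarity threshold below are assumptions, not the authors' method), a step-overlap fidelity score might look like:

```python
from difflib import SequenceMatcher


def step_fidelity(expert_steps, model_steps, threshold=0.6):
    """Toy fidelity score: the fraction of expert reasoning steps that
    have a sufficiently similar counterpart in the model's trace.
    Illustrative stand-in only, NOT the paper's actual RPF metric."""
    if not expert_steps:
        return 0.0
    matched = 0
    for expert in expert_steps:
        # Best string similarity between this expert step and any model step.
        best = max(
            (SequenceMatcher(None, expert.lower(), m.lower()).ratio()
             for m in model_steps),
            default=0.0,
        )
        if best >= threshold:
            matched += 1
    return matched / len(expert_steps)


# Hypothetical expert solution path and model reasoning trace.
expert = [
    "identify the functional group as a ketone",
    "apply Grignard addition to form a tertiary alcohol",
    "dehydrate under acid to give the alkene",
]
model = [
    "the functional group is a ketone",
    "grignard addition gives a tertiary alcohol",
]
print(step_fidelity(expert, model))
```

A score of 1.0 would mean every expert step has a close counterpart in the model's trace; lower scores flag models that reach an answer heuristically while skipping or mangling intermediate steps, which is the distinction the benchmark aims to surface.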
Related papers
- RxnBench: A Multimodal Benchmark for Evaluating Large Language Models on Chemical Reaction Understanding from Scientific Literature [25.978951548176706]
We introduce RxnBench, a benchmark designed to rigorously evaluate MLLMs on chemical reaction understanding from scientific PDFs. RxnBench comprises two tasks: Single-Figure QA (SF-QA), which tests fine-grained visual perception and mechanistic reasoning, and Full-Document QA (FD-QA), which challenges models to synthesize information from 108 articles. Our evaluation of MLLMs reveals a critical capability gap: while models excel at extracting explicit text, they struggle with deep chemical logic and precise structural recognition.
arXiv Detail & Related papers (2025-12-29T16:05:38Z) - ChemLabs on ChemO: A Multi-Agent System for Multimodal Reasoning on IChO 2025 [10.434011696348561]
ChemO is a new benchmark built from the International Chemistry Olympiad (IChO) 2025. ChemLabs is a hierarchical multi-agent framework that mimics human expert collaboration. Our top configuration achieves a score of 93.6 out of 100, surpassing an estimated human gold medal threshold.
arXiv Detail & Related papers (2025-11-20T10:15:39Z) - QCBench: Evaluating Large Language Models on Domain-Specific Quantitative Chemistry [19.804237919102903]
QCBench is a Quantitative Chemistry oriented benchmark comprising 350 computational chemistry problems across 7 chemistry subfields. Each problem is structured to prevent shortcuts and demand explicit numerical reasoning. QCBench enables fine-grained diagnosis of computational weaknesses, reveals model-specific limitations, and lays the groundwork for future improvements.
arXiv Detail & Related papers (2025-08-03T08:55:42Z) - ChemAU: Harness the Reasoning of LLMs in Chemical Research with Adaptive Uncertainty Estimation [21.30938446415292]
Chemistry problems typically involve long and complex reasoning steps, which contain specific terminology. ChemAU identifies gaps in chemistry knowledge and precisely supplements chemical expertise with the specialized domain model.
arXiv Detail & Related papers (2025-06-01T18:45:49Z) - ChemAgent: Self-updating Library in Large Language Models Improves Chemical Reasoning [64.2106664137118]
ChemAgent is a novel framework designed to improve the performance of large language models (LLMs). It is developed by decomposing chemical tasks into sub-tasks and compiling these sub-tasks into a structured collection that can be referenced for future queries. When presented with a new problem, ChemAgent retrieves and refines pertinent information from the library, which we call memory.
arXiv Detail & Related papers (2025-01-11T17:10:30Z) - Benchmarking large language models for materials synthesis: the case of atomic layer deposition [0.07528462379265576]
We introduce an open-ended question benchmark, ALDbench, to evaluate the performance of large language models (LLMs) in materials synthesis. Our benchmark comprises questions with difficulty ranging from graduate level to domain-expert level, current with the state of the art in the field.
arXiv Detail & Related papers (2024-12-13T05:10:29Z) - ChemEval: A Comprehensive Multi-Level Chemical Evaluation for Large Language Models [62.37850540570268]
Existing benchmarks in this domain fail to adequately meet the specific requirements of chemical research professionals.
ChemEval identifies 4 crucial progressive levels in chemistry, assessing 12 dimensions of LLMs across 42 distinct chemical tasks.
Results show that while general LLMs excel in literature understanding and instruction following, they fall short in tasks demanding advanced chemical knowledge.
arXiv Detail & Related papers (2024-09-21T02:50:43Z) - ChemVLM: Exploring the Power of Multimodal Large Language Models in Chemistry Area [70.66610054938052]
We introduce ChemVLM, an open-source chemical multimodal large language model for chemical applications. ChemVLM is trained on a carefully curated bilingual dataset that enhances its ability to understand both textual and visual chemical information. We benchmark ChemVLM against a range of open-source and proprietary multimodal large language models on various tasks.
arXiv Detail & Related papers (2024-08-14T01:16:40Z) - ScholarChemQA: Unveiling the Power of Language Models in Chemical Research Question Answering [54.80411755871931]
Question Answering (QA) effectively evaluates language models' reasoning and knowledge depth.
Chemical QA plays a crucial role in both education and research by effectively translating complex chemical information into a readily understandable format.
This dataset reflects typical real-world challenges, including an imbalanced data distribution and a substantial amount of unlabeled data that can be potentially useful.
We introduce a QAMatch model, specifically designed to effectively answer chemical questions by fully leveraging our collected data.
arXiv Detail & Related papers (2024-07-24T01:46:55Z) - ChemMiner: A Large Language Model Agent System for Chemical Literature Data Mining [56.15126714863963]
ChemMiner is an end-to-end framework for extracting chemical data from literature. ChemMiner incorporates three specialized agents: a text analysis agent for coreference mapping, a multimodal agent for non-textual information extraction, and a synthesis analysis agent for data generation. Experimental results demonstrate reaction identification rates comparable to human chemists while significantly reducing processing time, with high accuracy, recall, and F1 scores.
arXiv Detail & Related papers (2024-02-20T13:21:46Z) - ChemCrow: Augmenting large-language models with chemistry tools [0.9195187117013247]
Large-language models (LLMs) have shown strong performance in tasks across domains, but struggle with chemistry-related problems.
In this study, we introduce ChemCrow, an LLM chemistry agent designed to accomplish tasks across organic synthesis, drug discovery, and materials design.
Our agent autonomously planned and executed the syntheses of an insect repellent, three organocatalysts, and guided the discovery of a novel chromophore.
arXiv Detail & Related papers (2023-04-11T17:41:13Z) - Unassisted Noise Reduction of Chemical Reaction Data Sets [59.127921057012564]
We propose a machine learning-based, unassisted approach to remove chemically wrong entries from data sets.
Our results show an improved prediction quality for models trained on the cleaned and balanced data sets.
arXiv Detail & Related papers (2021-02-02T09:34:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.