Related papers: QCBench: Evaluating Large Language Models on Domain-Specific Quantitative Chemistry

URL: http://arxiv.org/abs/2508.01670v2
Date: Sat, 04 Oct 2025 05:53:57 GMT
Title: QCBench: Evaluating Large Language Models on Domain-Specific Quantitative Chemistry
Authors: Jiaqing Xie, Weida Wang, Ben Gao, Zhuo Yang, Haiyuan Wan, Shufei Zhang, Tianfan Fu, Yuqiang Li
Abstract summary: QCBench is a Quantitative Chemistry oriented benchmark comprising 350 computational chemistry problems across 7 chemistry subfields. Each problem is structured to prevent shortcuts and demand explicit numerical reasoning. QCBench enables fine-grained diagnosis of computational weaknesses, reveals model-specific limitations, and lays the groundwork for future improvements.
Score: 19.804237919102903
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Quantitative chemistry is central to modern chemical research, yet the ability of large language models (LLMs) to perform its rigorous, step-by-step calculations remains underexplored. To fill this gap, we propose QCBench, a Quantitative Chemistry oriented benchmark comprising 350 computational chemistry problems across 7 chemistry subfields: analytical chemistry, bio/organic chemistry, general chemistry, inorganic chemistry, physical chemistry, polymer chemistry and quantum chemistry. To systematically evaluate the mathematical reasoning abilities of LLMs, the problems are categorized into three tiers: easy, medium, and difficult. Each problem, rooted in realistic chemical scenarios, is structured to prevent heuristic shortcuts and demand explicit numerical reasoning. QCBench enables fine-grained diagnosis of computational weaknesses, reveals model-specific limitations across difficulty levels, and lays the groundwork for future improvements such as domain-adaptive fine-tuning or multi-modal integration. Evaluations on 24 LLMs demonstrate a consistent performance degradation with increasing task complexity, highlighting the current gap between language fluency and scientific computation accuracy. Code for QCBench is available at https://github.com/jiaqingxie/QCBench.

Related papers

ChemPro: A Progressive Chemistry Benchmark for Large Language Models [4.3441332321802095]
ChemPro is a progressive benchmark with 4100 natural language question-answer pairs in chemistry. It is designed to assess the proficiency of Large Language Models (LLMs) in a broad spectrum of general chemistry topics.
arXiv Detail & Related papers (2026-02-03T05:08:08Z)
RxnBench: A Multimodal Benchmark for Evaluating Large Language Models on Chemical Reaction Understanding from Scientific Literature [25.978951548176706]
We introduce RxnBench, a benchmark designed to rigorously evaluate MLLMs on chemical reaction understanding from scientific PDFs. RxnBench comprises two tasks: Single-Figure QA (SF-QA), which tests fine-grained visual perception and mechanistic reasoning, and Full-Document QA (FD-QA), which challenges models to synthesize information from 108 articles. Our evaluation of MLLMs reveals a critical capability gap: while models excel at extracting explicit text, they struggle with deep chemical logic and precise structural recognition.
arXiv Detail & Related papers (2025-12-29T16:05:38Z)
ChemVTS-Bench: Evaluating Visual-Textual-Symbolic Reasoning of Multimodal Large Language Models in Chemistry [14.083820970280668]
ChemVTS-Bench is a domain-authentic benchmark designed to evaluate the Visual-Textual-Symbolic (VTS) reasoning abilities of Multimodal Large Language Models (MLLMs). ChemVTS-Bench contains diverse and challenging chemical problems spanning organic molecules, inorganic materials, and 3D crystal structures. We develop an automated agent-based workflow that standardizes inference, verifies answers, and diagnoses failure modes.
arXiv Detail & Related papers (2025-11-22T04:24:24Z)
ChemDFM-R: An Chemical Reasoner LLM Enhanced with Atomized Chemical Knowledge [14.6026550444088]
This work focuses on the specific field of chemistry and develops a Chemical Reasoner LLM, ChemDFM-R. We first construct a comprehensive dataset of atomized knowledge points to enhance the model's understanding of the fundamental principles and logical structure of chemistry. Experiments on diverse chemical benchmarks demonstrate that ChemDFM-R achieves cutting-edge performance while providing interpretable, rationale-driven outputs.
arXiv Detail & Related papers (2025-07-29T16:40:49Z)
Benchmarking Multimodal LLMs on Recognition and Understanding over Chemical Tables [48.39080455781475]
ChemTable is a large-scale benchmark of real-world chemical tables curated from the experimental sections of literature. ChemTable includes expert-annotated cell polygons, logical layouts, and domain-specific labels, including reagents, catalysts, yields, and graphical components. We evaluated a range of representative multimodal models, including both open-source and closed-source models, on ChemTable and reported a series of findings with practical and conceptual insights.
arXiv Detail & Related papers (2025-06-13T00:45:41Z)
ChemAU: Harness the Reasoning of LLMs in Chemical Research with Adaptive Uncertainty Estimation [21.30938446415292]
Chemistry problems typically involve long and complex reasoning steps, which contain specific terminology. ChemAU identifies gaps in chemistry knowledge and precisely supplements chemical expertise with the specialized domain model.
arXiv Detail & Related papers (2025-06-01T18:45:49Z)
Beyond Chemical QA: Evaluating LLM's Chemical Reasoning with Modular Chemical Operations [43.623140005091535]
We introduce ChemCoTBench, a reasoning framework that bridges molecular structure understanding with arithmetic-inspired operations. ChemCoTBench formalizes chemical problem-solving into transparent, step-by-step reasoning. We evaluate models on two high-impact tasks: Molecular Property Optimization and Chemical Reaction Prediction.
arXiv Detail & Related papers (2025-05-27T15:15:44Z)
ChemAgent: Self-updating Library in Large Language Models Improves Chemical Reasoning [64.2106664137118]
ChemAgent is a novel framework designed to improve the performance of large language models (LLMs). It is developed by decomposing chemical tasks into sub-tasks and compiling these sub-tasks into a structured collection that can be referenced for future queries. When presented with a new problem, ChemAgent retrieves and refines pertinent information from the library, which we call memory.
arXiv Detail & Related papers (2025-01-11T17:10:30Z)
ChemEval: A Comprehensive Multi-Level Chemical Evaluation for Large Language Models [62.37850540570268]
Existing benchmarks in this domain fail to adequately meet the specific requirements of chemical research professionals. ChemEval identifies 4 crucial progressive levels in chemistry, assessing 12 dimensions of LLMs across 42 distinct chemical tasks. Results show that while general LLMs excel in literature understanding and instruction following, they fall short in tasks demanding advanced chemical knowledge.
arXiv Detail & Related papers (2024-09-21T02:50:43Z)
ChemVLM: Exploring the Power of Multimodal Large Language Models in Chemistry Area [50.15254966969718]
We introduce ChemVLM, an open-source chemical multimodal large language model for chemical applications. ChemVLM is trained on a carefully curated bilingual dataset that enhances its ability to understand both textual and visual chemical information. We benchmark ChemVLM against a range of open-source and proprietary multimodal large language models on various tasks.
arXiv Detail & Related papers (2024-08-14T01:16:40Z)
ScholarChemQA: Unveiling the Power of Language Models in Chemical Research Question Answering [54.80411755871931]
Question Answering (QA) effectively evaluates language models' reasoning and knowledge depth. Chemical QA plays a crucial role in both education and research by effectively translating complex chemical information into a readily understandable format. This dataset reflects typical real-world challenges, including an imbalanced data distribution and a substantial amount of unlabeled data that can be potentially useful. We introduce a QAMatch model, specifically designed to effectively answer chemical questions by fully leveraging our collected data.
arXiv Detail & Related papers (2024-07-24T01:46:55Z)
Are large language models superhuman chemists? [4.87961182129702]
Large language models (LLMs) have gained widespread interest due to their ability to process human language and perform tasks on which they have not been explicitly trained. Here, we introduce "ChemBench," an automated framework for evaluating the chemical knowledge and reasoning abilities of state-of-the-art LLMs. We curated more than 2,700 question-answer pairs, evaluated leading open- and closed-source LLMs, and found that the best models outperformed the best human chemists.
arXiv Detail & Related papers (2024-04-01T20:56:25Z)
ChemLLM: A Chemical Large Language Model [49.308528569982805]
Large language models (LLMs) have made impressive progress in chemistry applications. However, the community lacks an LLM specifically designed for chemistry. Here, we introduce ChemLLM, a comprehensive framework that features the first LLM dedicated to chemistry.
arXiv Detail & Related papers (2024-02-10T01:11:59Z)
Structured Chemistry Reasoning with Large Language Models [70.13959639460015]
Large Language Models (LLMs) excel in diverse areas, yet struggle with complex scientific reasoning, especially in chemistry. We introduce StructChem, a simple yet effective prompting strategy that offers the desired guidance and substantially boosts the LLMs' chemical reasoning capability. In tests across four chemistry areas (quantum chemistry, mechanics, physical chemistry, and kinetics), StructChem substantially enhances GPT-4's performance, with up to 30% peak improvement.
arXiv Detail & Related papers (2023-11-16T08:20:36Z)
ChemAlgebra: Algebraic Reasoning on Chemical Reactions [16.93639996082923]
It is unclear whether deep learning models have the ability to tackle reasoning tasks. ChemAlgebra is a benchmark for measuring the reasoning capabilities of deep learning models.
arXiv Detail & Related papers (2022-10-05T08:34:44Z)

This list is automatically generated from the titles and abstracts of the papers on this site.