Related papers: ChemPro: A Progressive Chemistry Benchmark for Large Language Models

ChemPro: A Progressive Chemistry Benchmark for Large Language Models

URL: http://arxiv.org/abs/2602.03108v1
Date: Tue, 03 Feb 2026 05:08:08 GMT
Title: ChemPro: A Progressive Chemistry Benchmark for Large Language Models
Authors: Aaditya Baranwal, Shruti Vyas,
Abstract summary: ChemPro is a progressive benchmark with 4100 natural language question-answer pairs in Chemistry.<n>It is designed to assess the proficiency of Large Language Models (LLMs) in a broad spectrum of general chemistry topics.
Score: 4.3441332321802095
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We introduce ChemPro, a progressive benchmark with 4100 natural language question-answer pairs in Chemistry, across 4 coherent sections of difficulty designed to assess the proficiency of Large Language Models (LLMs) in a broad spectrum of general chemistry topics. We include Multiple Choice Questions and Numerical Questions spread across fine-grained information recall, long-horizon reasoning, multi-concept questions, problem-solving with nuanced articulation, and straightforward questions in a balanced ratio, effectively covering Bio-Chemistry, Inorganic-Chemistry, Organic-Chemistry and Physical-Chemistry. ChemPro is carefully designed analogous to a student's academic evaluation for basic to high-school chemistry. A gradual increase in the question difficulty rigorously tests the ability of LLMs to progress from solving basic problems to solving more sophisticated challenges. We evaluate 45+7 state-of-the-art LLMs, spanning both open-source and proprietary variants, and our analysis reveals that while LLMs perform well on basic chemistry questions, their accuracy declines with different types and levels of complexity. These findings highlight the critical limitations of LLMs in general scientific reasoning and understanding and point towards understudied dimensions of difficulty, emphasizing the need for more robust methodologies to improve LLMs.

Related papers

SUPERChem: A Multimodal Reasoning Benchmark in Chemistry [47.60627566673109]
SUPERChem is a benchmark of 500 expert-curated reasoning-intensive chemistry problems.<n>Each problem is paired with an expert-authored solution path.<n> Evaluations against a human baseline of 40.3% accuracy show that even the best-performing model, GPT-5 (High), reaches only 38.5%.
arXiv Detail & Related papers (2025-12-01T04:46:35Z)
ChemOrch: Empowering LLMs with Chemical Intelligence via Synthetic Instructions [52.79349601462865]
ChemOrch is a framework that synthesizes chemically grounded instruction-response pairs.<n>ChemOrch enables controllable diversity and levels of difficulty for the generated tasks.
arXiv Detail & Related papers (2025-09-20T05:43:58Z)
QCBench: Evaluating Large Language Models on Domain-Specific Quantitative Chemistry [19.804237919102903]
QCBench is a Quantitative Chemistry oriented benchmark comprising 350 computational chemistry problems across 7 chemistry subfields.<n>Each problem is structured to prevent shortcuts and demand explicit numerical reasoning.<n>QCBench enables fine-grained diagnosis of computational weaknesses, reveals model-specific limitations, and lays the groundwork for future improvements.
arXiv Detail & Related papers (2025-08-03T08:55:42Z)
ChemAU: Harness the Reasoning of LLMs in Chemical Research with Adaptive Uncertainty Estimation [21.30938446415292]
Chemistry problems typically involve long and complex reasoning steps, which contain specific terminology.<n>ChemAU identifies gaps in chemistry knowledge and precisely supplements chemical expertise with the specialized domain model.
arXiv Detail & Related papers (2025-06-01T18:45:49Z)
From Generalist to Specialist: A Survey of Large Language Models for Chemistry [14.317448405387195]
Large Language Models (LLMs) have significantly transformed our daily life and established a new paradigm in natural language processing (NLP)<n>The predominant pretraining of LLMs on extensive web-based texts remains insufficient for advanced scientific discovery, particularly in chemistry.<n>Although several studies have reviewed Pretrained Language Models (PLMs) in chemistry, there is a conspicuous absence of a systematic survey specifically focused on chemistry-oriented LLMs.
arXiv Detail & Related papers (2024-12-28T03:40:25Z)
ChemEval: A Comprehensive Multi-Level Chemical Evaluation for Large Language Models [62.37850540570268]
Existing benchmarks in this domain fail to adequately meet the specific requirements of chemical research professionals. ChemEval identifies 4 crucial progressive levels in chemistry, assessing 12 dimensions of LLMs across 42 distinct chemical tasks. Results show that while general LLMs excel in literature understanding and instruction following, they fall short in tasks demanding advanced chemical knowledge.
arXiv Detail & Related papers (2024-09-21T02:50:43Z)
ChemVLM: Exploring the Power of Multimodal Large Language Models in Chemistry Area [70.66610054938052]
We introduce textbfChemVLM, an open-source chemical multimodal large language model for chemical applications.<n>ChemVLM is trained on a carefully curated bilingual dataset that enhances its ability to understand both textual and visual chemical information.<n>We benchmark ChemVLM against a range of open-source and proprietary multimodal large language models on various tasks.
arXiv Detail & Related papers (2024-08-14T01:16:40Z)
ChemLLM: A Chemical Large Language Model [49.308528569982805]
Large language models (LLMs) have made impressive progress in chemistry applications. However, the community lacks an LLM specifically designed for chemistry. Here, we introduce ChemLLM, a comprehensive framework that features the first LLM dedicated to chemistry.
arXiv Detail & Related papers (2024-02-10T01:11:59Z)
Structured Chemistry Reasoning with Large Language Models [70.13959639460015]
Large Language Models (LLMs) excel in diverse areas, yet struggle with complex scientific reasoning, especially in chemistry. We introduce StructChem, a simple yet effective prompting strategy that offers the desired guidance and substantially boosts the LLMs' chemical reasoning capability. Tests across four chemistry areas -- quantum chemistry, mechanics, physical chemistry, and kinetics -- StructChem substantially enhances GPT-4's performance, with up to 30% peak improvement.
arXiv Detail & Related papers (2023-11-16T08:20:36Z)

This list is automatically generated from the titles and abstracts of the papers in this site.