Assessing the Chemical Intelligence of Large Language Models
- URL: http://arxiv.org/abs/2505.07735v1
- Date: Mon, 12 May 2025 16:44:38 GMT
- Title: Assessing the Chemical Intelligence of Large Language Models
- Authors: Nicholas T. Runcie, Charlotte M. Deane, Fergus Imrie
- Abstract summary: Large Language Models are versatile, general-purpose tools with a wide range of applications. We created a novel benchmark, called ChemIQ, which consists of 796 questions assessing core concepts in organic chemistry. We found that the latest reasoning models can elucidate structures from 1H and 13C NMR data, correctly generating SMILES strings for 74% of molecules containing up to 10 heavy atoms, and in one case solving a structure comprising 21 heavy atoms.
- Score: 12.254249246104655
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models are versatile, general-purpose tools with a wide range of applications. Recently, the advent of "reasoning models" has led to substantial improvements in their abilities in advanced problem-solving domains such as mathematics and software engineering. In this work, we assessed the ability of reasoning models to directly perform chemistry tasks, without any assistance from external tools. We created a novel benchmark, called ChemIQ, which consists of 796 questions assessing core concepts in organic chemistry, focused on molecular comprehension and chemical reasoning. Unlike previous benchmarks, which primarily use multiple choice formats, our approach requires models to construct short-answer responses, more closely reflecting real-world applications. The reasoning models, exemplified by OpenAI's o3-mini, correctly answered 28%-59% of questions depending on the reasoning level used, with higher reasoning levels significantly increasing performance on all tasks. These models substantially outperformed the non-reasoning model, GPT-4o, which achieved only 7% accuracy. We found that Large Language Models can now convert SMILES strings to IUPAC names, a task earlier models were unable to perform. Additionally, we show that the latest reasoning models can elucidate structures from 1H and 13C NMR data, correctly generating SMILES strings for 74% of molecules containing up to 10 heavy atoms, and in one case solving a structure comprising 21 heavy atoms. For each task, we found evidence that the reasoning process mirrors that of a human chemist. Our results demonstrate that the latest reasoning models have the ability to perform advanced chemical reasoning.
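Because ChemIQ requires short-answer responses rather than multiple choice, grading a predicted SMILES string plausibly reduces to checking chemical equivalence rather than exact string match. The sketch below shows one way to do that with RDKit by canonicalising both strings; it is an illustrative assumption, not necessarily the benchmark's exact scoring procedure.

```python
# Sketch: grade a short-answer SMILES response by canonical-form comparison.
# Assumes RDKit; the exact ChemIQ scoring procedure may differ.
from typing import Optional

from rdkit import Chem


def canonical(smiles: str) -> Optional[str]:
    """Return canonical SMILES, or None if the string does not parse."""
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None


def same_molecule(predicted: str, reference: str) -> bool:
    """True if both SMILES parse and describe the same structure."""
    pred, ref = canonical(predicted), canonical(reference)
    return pred is not None and pred == ref


# Two different spellings of ethanol should be judged equivalent.
print(same_molecule("OCC", "CCO"))  # True
```

Canonicalisation makes the comparison insensitive to atom ordering, so a model is not penalised for writing a valid but differently ordered SMILES string.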
Related papers
- MolReasoner: Toward Effective and Interpretable Reasoning for Molecular LLMs [30.030008221150407]
MolReasoner is a two-stage framework designed to transition Large Language Models from memorization towards chemical reasoning. First, we propose Mol-SFT, which initializes the model's reasoning abilities via synthetic Chain-of-Thought (CoT) samples generated by GPT-4o and verified for chemical accuracy. Subsequently, Mol-RL applies reinforcement learning with specialized reward functions designed explicitly to align chemical structures with linguistic descriptions.
arXiv Detail & Related papers (2025-08-04T05:10:11Z) - UMA: A Family of Universal Models for Atoms [16.3404265902621]
We present a family of Universal Models for Atoms (UMA), designed to push the frontier of speed, accuracy, and generalization. UMA models are trained on half a billion unique 3D atomic structures by compiling data across multiple chemical domains. We evaluate UMA models on a diverse set of applications across multiple domains and find that, remarkably, a single model without any fine-tuning can perform similarly to or better than specialized models.
arXiv Detail & Related papers (2025-06-30T15:38:13Z) - Training a Scientific Reasoning Model for Chemistry [3.52064464182155]
We demonstrate that reasoning models can be post-trained for chemistry without additional domain pretraining. We report ether0, a 24B parameter LLM that can reason in natural language and respond with chemical structures.
arXiv Detail & Related papers (2025-06-04T17:57:18Z) - Nemotron-CrossThink: Scaling Self-Learning beyond Math Reasoning [66.43194385702297]
Large Language Models (LLMs) have shown strong reasoning capabilities, particularly when enhanced through Reinforcement Learning (RL). We propose NEMOTRON-CROSSTHINK, a framework that systematically incorporates multi-domain corpora, including both synthetic and real-world question-answer pairs, into RL training to improve generalization across diverse reasoning tasks.
arXiv Detail & Related papers (2025-04-15T21:37:13Z) - Quantization Hurts Reasoning? An Empirical Study on Quantized Reasoning Models [48.98109982725689]
We conduct the first systematic study on quantized reasoning models, evaluating the open-sourced DeepSeek-R1-Distilled Qwen and LLaMA families. Our investigation covers weight, KV cache, and activation quantization using state-of-the-art algorithms at varying bit-widths. We identify model size, model origin, and task difficulty as critical determinants of performance.
arXiv Detail & Related papers (2025-04-07T08:22:45Z) - ChemAgent: Self-updating Library in Large Language Models Improves Chemical Reasoning [64.2106664137118]
ChemAgent is a novel framework designed to improve the performance of large language models (LLMs). It is developed by decomposing chemical tasks into sub-tasks and compiling these sub-tasks into a structured collection that can be referenced for future queries. When presented with a new problem, ChemAgent retrieves and refines pertinent information from the library, which we call memory.
arXiv Detail & Related papers (2025-01-11T17:10:30Z) - ProcessBench: Identifying Process Errors in Mathematical Reasoning [62.80402845414901]
We introduce ProcessBench for measuring the ability to identify erroneous steps in mathematical reasoning. ProcessBench consists of 3,400 test cases, primarily focused on competition- and Olympiad-level math problems. We conduct extensive evaluation on ProcessBench, involving two types of models: process reward models (PRMs) and critic models.
arXiv Detail & Related papers (2024-12-09T15:11:40Z) - Pre-trained Molecular Language Models with Random Functional Group Masking [54.900360309677794]
We propose a SMILES-based Molecular Language Model that randomly masks SMILES subsequences corresponding to specific molecular atoms (see the sketch after this list).
This technique aims to compel the model to better infer molecular structures and properties, thus enhancing its predictive capabilities.
arXiv Detail & Related papers (2024-11-03T01:56:15Z) - ChemLLM: A Chemical Large Language Model [49.308528569982805]
Large language models (LLMs) have made impressive progress in chemistry applications.
However, the community lacks an LLM specifically designed for chemistry.
Here, we introduce ChemLLM, a comprehensive framework that features the first LLM dedicated to chemistry.
arXiv Detail & Related papers (2024-02-10T01:11:59Z) - ChemDFM: A Large Language Foundation Model for Chemistry [27.864255196445324]
A more generic and efficient solution would be an AI model that could address many tasks and support free-form dialogue in the broad field of chemistry.
We develop ChemDFM, a pioneering LLM for chemistry trained on 34B tokens from chemical literature and textbooks, and fine-tuned using 2.7M instructions.
We have open-sourced the inference codes, evaluation datasets, and model weights of ChemDFM on Huggingface.
arXiv Detail & Related papers (2024-01-26T12:45:55Z) - What can Large Language Models do in chemistry? A comprehensive
benchmark on eight tasks [41.9830989458936]
Large Language Models (LLMs) with strong abilities in natural language processing tasks have emerged.
We aim to evaluate the capabilities of LLMs on a wide range of tasks across the chemistry domain.
arXiv Detail & Related papers (2023-05-27T14:17:33Z) - Specializing Smaller Language Models towards Multi-Step Reasoning [56.78474185485288]
We show that multi-step reasoning abilities can be distilled down from GPT-3.5 (≥ 175B) to T5 variants (≤ 11B).
We propose model specialization, which concentrates the model's ability on a target task.
arXiv Detail & Related papers (2023-01-30T08:51:19Z) - ChemAlgebra: Algebraic Reasoning on Chemical Reactions [16.93639996082923]
It is unclear whether deep learning models have the ability to tackle reasoning tasks.
ChemAlgebra is a benchmark for measuring the reasoning capabilities of deep learning models.
arXiv Detail & Related papers (2022-10-05T08:34:44Z) - Learning Latent Space Energy-Based Prior Model for Molecule Generation [59.875533935578375]
We learn a latent space energy-based prior model with SMILES representation for molecule modeling.
Our method is able to generate molecules with validity and uniqueness competitive with state-of-the-art models.
arXiv Detail & Related papers (2020-10-19T09:34:20Z) - Self-Supervised Graph Transformer on Large-Scale Molecular Data [73.3448373618865]
We propose a novel framework, GROVER, for molecular representation learning.
GROVER can learn rich structural and semantic information of molecules from enormous unlabelled molecular data.
We pre-train GROVER with 100 million parameters on 10 million unlabelled molecules -- the biggest GNN and the largest training dataset in molecular representation learning.
arXiv Detail & Related papers (2020-06-18T08:37:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences arising from its use.