Training a Scientific Reasoning Model for Chemistry
- URL: http://arxiv.org/abs/2506.17238v1
- Date: Wed, 04 Jun 2025 17:57:18 GMT
- Title: Training a Scientific Reasoning Model for Chemistry
- Authors: Siddharth M. Narayanan, James D. Braza, Ryan-Rhys Griffiths, Albert Bou, Geemi Wellawatte, Mayk Caldas Ramos, Ludovico Mitchener, Samuel G. Rodriques, Andrew D. White
- Abstract summary: We demonstrate that reasoning models can be post-trained for chemistry without additional domain pretraining. We report ether0, a 24B parameter LLM that can reason in natural language and respond with chemical structures.
- Score: 3.52064464182155
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reasoning models are large language models that emit a long chain-of-thought before answering, providing both higher accuracy and explicit reasoning for their response. A major question has been whether language model reasoning generalizes beyond mathematics, programming, and logic, where most previous work has focused. We demonstrate that reasoning models can be post-trained for chemistry without additional domain pretraining, and require substantially less data compared to contemporary domain-specific models. We report ether0, a 24B parameter LLM (based on Mistral-Small-24B) that can reason in natural language and respond with chemical structures. This reasoning model was trained with reinforcement learning on 640,730 experimentally-grounded chemistry problems across 375 tasks ranging from synthesizability, to blood-brain barrier permeability, to human receptor activity, to scent. Our model exceeds general-purpose chemistry models, frontier models, and human experts on molecular design tasks. It is also more data efficient relative to specialized models. We anticipate that this method can be applied to train data-efficient language models specialized for tasks across a wide variety of scientific domains.
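The abstract describes reinforcement learning on problems whose answers are chemical structures, which presupposes some automatic way of scoring a proposed structure. Below is a minimal, hypothetical sketch of such a check using RDKit: it validates a model's SMILES answer and compares it to a reference after canonicalization. The function name, the exact-match criterion, and the choice of RDKit are assumptions made for illustration; this is not the paper's reward implementation, and the tasks listed (synthesizability, permeability, receptor activity, scent) would in practice need task-specific verifiers.

```python
# Hypothetical sketch of a structure-checking reward of the kind RL
# post-training on chemistry tasks might use; not ether0's actual code.
from rdkit import Chem


def smiles_match_reward(predicted: str, reference: str) -> float:
    """Return 1.0 if the predicted SMILES is valid and denotes the same
    molecule as the reference (after canonicalization), else 0.0."""
    pred_mol = Chem.MolFromSmiles(predicted)
    ref_mol = Chem.MolFromSmiles(reference)
    if pred_mol is None or ref_mol is None:
        return 0.0  # invalid or unparsable SMILES earns no reward
    # Canonical SMILES comparison ignores superficial differences in
    # atom ordering or notation between equivalent strings.
    return float(Chem.MolToSmiles(pred_mol) == Chem.MolToSmiles(ref_mol))


# Example: "OCC" and "CCO" are two notations for ethanol.
print(smiles_match_reward("OCC", "CCO"))             # 1.0
print(smiles_match_reward("not a molecule", "CCO"))  # 0.0
```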
Related papers
- Evolution without Large Models: Training Language Model with Task Principles [52.44569608690695]
A common training approach for language models involves using a large-scale language model to expand a human-provided dataset. This method significantly reduces training costs by eliminating the need for extensive human data annotation. However, it still faces challenges such as high carbon emissions during data augmentation and the risk of data leakage.
arXiv Detail & Related papers (2025-07-08T13:52:45Z)
- Assessing the Chemical Intelligence of Large Language Models [12.254249246104655]
Large Language Models are versatile, general-purpose tools with a wide range of applications. We created a novel benchmark, called ChemIQ, which consists of 796 questions assessing core concepts in organic chemistry. We found that the latest reasoning models can elucidate structures from 1H and 13C NMR data, correctly generating SMILES strings for 74% of molecules containing up to 10 heavy atoms, and in one case solving a structure comprising 21 heavy atoms.
arXiv Detail & Related papers (2025-05-12T16:44:38Z)
- LICO: Large Language Models for In-Context Molecular Optimization [33.5918976228562]
We introduce LICO, a general-purpose model that extends arbitrary base LLMs for black-box optimization.
We train the model to perform in-context predictions on a diverse set of functions defined over the domain.
Once trained, LICO can generalize to unseen molecule properties simply via in-context prompting.
arXiv Detail & Related papers (2024-06-27T02:43:18Z)
- ChemLLM: A Chemical Large Language Model [49.308528569982805]
Large language models (LLMs) have made impressive progress in chemistry applications.
However, the community lacks an LLM specifically designed for chemistry.
Here, we introduce ChemLLM, a comprehensive framework that features the first LLM dedicated to chemistry.
arXiv Detail & Related papers (2024-02-10T01:11:59Z)
- Developing ChemDFM as a large language foundation model for chemistry [27.864255196445324]
A more generic and efficient solution would be an AI model that could address many tasks and support free-form dialogue in the broad field of chemistry. We develop ChemDFM, a pioneering LLM for chemistry trained on 34B tokens from chemical literature and textbooks, and fine-tuned using 2.7M instructions. We have open-sourced the inference codes, evaluation datasets, and model weights of ChemDFM on Huggingface.
arXiv Detail & Related papers (2024-01-26T12:45:55Z)
- Specializing Smaller Language Models towards Multi-Step Reasoning [56.78474185485288]
We show that multi-step reasoning abilities can be distilled from GPT-3.5 (≥ 175B parameters) to T5 variants (≤ 11B).
We propose model specialization to focus the model's ability on a target task.
arXiv Detail & Related papers (2023-01-30T08:51:19Z)
- Unifying Molecular and Textual Representations via Multi-task Language Modelling [11.474894472719543]
We propose the first multi-domain, multi-task language model that can solve a wide range of tasks in both the chemical and natural language domains.
Our model can handle chemical and natural language concurrently, without requiring expensive pre-training on single domains or task-specific models.
Our work suggests that such models can robustly and efficiently accelerate discovery in physical sciences.
arXiv Detail & Related papers (2023-01-29T23:56:45Z)
- A Molecular Multimodal Foundation Model Associating Molecule Graphs with Natural Language [63.60376252491507]
We propose a molecular multimodal foundation model which is pretrained on molecular graphs and their semantically related textual data.
We believe that our model would have a broad impact on AI-empowered fields across disciplines such as biology, chemistry, materials, environment, and medicine.
arXiv Detail & Related papers (2022-09-12T00:56:57Z)
- Solving Quantitative Reasoning Problems with Language Models [53.53969870599973]
We introduce Minerva, a large language model pretrained on general natural language data and further trained on technical content.
The model achieves state-of-the-art performance on technical benchmarks without the use of external tools.
We also evaluate our model on over two hundred undergraduate-level problems in physics, biology, chemistry, economics, and other sciences.
arXiv Detail & Related papers (2022-06-29T18:54:49Z)
- Keeping it Simple: Language Models can learn Complex Molecular Distributions [0.0]
We introduce several challenging generative modeling tasks by compiling especially complex distributions of molecules.
The results demonstrate that language models are powerful generative models, capable of adeptly learning complex molecular distributions.
arXiv Detail & Related papers (2021-12-06T13:40:58Z)
- To what extent do human explanations of model behavior align with actual model behavior? [91.67905128825402]
We investigated the extent to which human-generated explanations of models' inference decisions align with how models actually make these decisions.
We defined two alignment metrics that quantify how well natural language human explanations align with model sensitivity to input words.
We find that a model's alignment with human explanations is not predicted by the model's accuracy on NLI.
arXiv Detail & Related papers (2020-12-24T17:40:06Z)
- Learning Latent Space Energy-Based Prior Model for Molecule Generation [59.875533935578375]
We learn a latent-space energy-based prior model with a SMILES representation for molecule modeling.
Our method is able to generate molecules with validity and uniqueness competitive with state-of-the-art models; a sketch of how these two metrics are commonly computed appears after this list.
arXiv Detail & Related papers (2020-10-19T09:34:20Z)
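Validity and uniqueness, mentioned in the last entry above, are standard metrics for molecular generative models: validity is the fraction of generated SMILES strings that parse into molecules, and uniqueness is the fraction of distinct canonical SMILES among the valid ones. The sketch below shows one common way to compute them with RDKit; it is a generic illustration, not code from any of the papers listed.

```python
# Generic illustration of two standard generative-chemistry metrics;
# not taken from any of the papers above.
from rdkit import Chem


def validity_and_uniqueness(smiles_list: list[str]) -> tuple[float, float]:
    """Validity: fraction of strings that parse into molecules.
    Uniqueness: fraction of distinct canonical SMILES among valid molecules."""
    if not smiles_list:
        return 0.0, 0.0
    mols = [Chem.MolFromSmiles(s) for s in smiles_list]
    valid = [m for m in mols if m is not None]
    validity = len(valid) / len(smiles_list)
    if not valid:
        return validity, 0.0
    canonical = {Chem.MolToSmiles(m) for m in valid}
    return validity, len(canonical) / len(valid)


# Example: one invalid string and one duplicate ("OCC" and "CCO" are both ethanol).
print(validity_and_uniqueness(["CCO", "OCC", "c1ccccc1", "bad_smiles"]))
# -> (0.75, 0.6666666666666666)
```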
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.