mCLM: A Function-Infused and Synthesis-Friendly Modular Chemical Language Model
- URL: http://arxiv.org/abs/2505.12565v1
- Date: Sun, 18 May 2025 22:52:39 GMT
- Title: mCLM: A Function-Infused and Synthesis-Friendly Modular Chemical Language Model
- Authors: Carl Edwards, Chi Han, Gawon Lee, Thao Nguyen, Bowen Jin, Chetan Kumar Prasad, Sara Szymkuć, Bartosz A. Grzybowski, Ying Diao, Jiawei Han, Ge Liu, Hao Peng, Martin D. Burke, Heng Ji
- Abstract summary: We propose mCLM, a modular Chemical-Language Model tokenizing molecules into building blocks and learning a bilingual language model of both natural language descriptions of functions and molecule building blocks. In experiments on 430 FDA-approved drugs, we find mCLM capable of significantly improving 5 out of 6 chemical functions critical to determining drug potentials.
- Score: 65.69164455183956
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite their ability to understand chemical knowledge and accurately generate sequential representations, large language models (LLMs) remain limited in their capacity to propose novel molecules with drug-like properties. In addition, the molecules that LLMs propose can often be challenging to make in the lab. To more effectively enable the discovery of functional small molecules, LLMs need to learn a molecular language. However, current LLMs are limited to encoding molecules atom by atom. In this paper, we argue that, just as texts are tokenized into (sub-)word tokens instead of characters, molecules should be decomposed and reassembled at the level of functional building blocks, i.e., parts of molecules that bring unique functions and serve as effective building blocks for real-world automated laboratory synthesis. This motivates us to propose mCLM, a modular Chemical-Language Model tokenizing molecules into building blocks and learning a bilingual language model of both natural language descriptions of functions and molecule building blocks. By reasoning on such functional building blocks, mCLM is guaranteed to generate efficiently synthesizable molecules, thanks to recent progress in block-based chemistry, while also improving the functions of molecules in a principled manner. In experiments on 430 FDA-approved drugs, we find mCLM capable of significantly improving 5 out of 6 chemical functions critical to determining drug potentials. More importantly, mCLM can reason on multiple functions and improve FDA-rejected drugs ("fallen angels") over multiple iterations to remedy their shortcomings.
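The core idea, decomposing a molecule into reusable building-block tokens rather than atoms or SMILES characters, can be illustrated with standard tooling. Below is a minimal sketch using RDKit's BRICS fragmentation as a stand-in for mCLM's synthesis-aware block vocabulary, which the abstract does not specify; the function name `block_tokenize` is ours.

```python
# Minimal sketch: molecule -> building-block "tokens" (assumes RDKit installed).
# BRICS is a generic retrosynthetic rule set used here only as an analogy for
# mCLM's actual synthesis-aware block vocabulary.
from rdkit import Chem
from rdkit.Chem import BRICS

def block_tokenize(smiles: str) -> list[str]:
    """Fragment a molecule into block-level tokens instead of atom/character tokens."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"invalid SMILES: {smiles}")
    # Dummy atoms ([n*]) in the fragments mark the cut bonds, i.e. the
    # attachment points a generative model could use to reassemble blocks.
    return sorted(BRICS.BRICSDecompose(mol))

print(block_tokenize("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin
```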
Related papers
- Mol-LLaMA: Towards General Understanding of Molecules in Large Molecular Language Model [55.87790704067848]
Mol-LLaMA is a large molecular language model that grasps general knowledge centered on molecules. To improve molecular understanding, we propose a module that integrates complementary information from different molecular encoders.
arXiv Detail & Related papers (2025-02-19T05:49:10Z)
- Can LLMs Generate Diverse Molecules? Towards Alignment with Structural Diversity [16.964217425866746]
We propose a new method for fine-tuning molecular generative LLMs to autoregressively generate a set of structurally diverse molecules. Our approach consists of two stages: (1) supervised fine-tuning to adapt LLMs to autoregressively generate molecules in a sequence and (2) reinforcement learning to maximize structural diversity within the generated molecules.
arXiv Detail & Related papers (2024-10-04T04:25:36Z)
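As a concrete (hypothetical) instance of the diversity objective in stage (2) of the entry above, one common choice is the mean pairwise Tanimoto distance over Morgan fingerprints; the paper's exact reward is not given in this listing.

```python
# Sketch of a structural-diversity reward: mean pairwise Tanimoto distance
# over Morgan fingerprints (assumes RDKit). Illustrative, not the paper's code.
from itertools import combinations
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def diversity_reward(batch_smiles: list[str]) -> float:
    fps = []
    for s in batch_smiles:
        mol = Chem.MolFromSmiles(s)
        if mol is not None:  # skip invalid generations
            fps.append(AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048))
    if len(fps) < 2:
        return 0.0
    dists = [1.0 - DataStructs.TanimotoSimilarity(a, b) for a, b in combinations(fps, 2)]
    return sum(dists) / len(dists)  # higher = more structurally diverse
```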
- FARM: Functional Group-Aware Representations for Small Molecules [55.281754551202326]
We introduce Functional Group-Aware Representations for Small Molecules (FARM). FARM is a novel model designed to bridge the gap between SMILES, natural language, and molecular graphs. We evaluate FARM on the MoleculeNet dataset, where it achieves state-of-the-art performance on 11 out of 13 tasks.
arXiv Detail & Related papers (2024-10-02T23:04:58Z)
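Functional-group awareness of the kind FARM describes can be approximated with SMARTS substructure matching; the patterns below are standard textbook examples, not FARM's vocabulary.

```python
# Sketch: tag functional groups in a molecule via SMARTS matching (RDKit).
# The group/pattern table is a small illustrative subset, not FARM's.
from rdkit import Chem

FUNCTIONAL_GROUPS = {
    "carboxylic_acid": Chem.MolFromSmarts("C(=O)[OX2H1]"),
    "ester": Chem.MolFromSmarts("C(=O)O[#6]"),
    "primary_or_secondary_amine": Chem.MolFromSmarts("[NX3;H2,H1;!$(NC=O)]"),
    "aromatic_ring": Chem.MolFromSmarts("a1aaaaa1"),
}

def tag_functional_groups(smiles: str) -> list[str]:
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return []
    return [name for name, patt in FUNCTIONAL_GROUPS.items()
            if mol.HasSubstructMatch(patt)]

print(tag_functional_groups("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin: acid, ester, ring
```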
- SmileyLlama: Modifying Large Language Models for Directed Chemical Space Exploration [2.5159482339113084]
We show that a general-purpose large language model (LLM) can be transformed into a chemical language model (CLM). We benchmark SmileyLlama by comparing it to CLMs trained from scratch on large amounts of ChEMBL data for their ability to generate valid and novel drug-like molecules.
arXiv Detail & Related papers (2024-09-03T18:59:20Z)
- MolX: Enhancing Large Language Models for Molecular Learning with A Multi-Modal Extension [34.586861881519134]
Large Language Models (LLMs) with their strong task-handling capabilities have shown remarkable advancements across a spectrum of fields. This study seeks to enhance the ability of LLMs to comprehend molecules by equipping them with a multi-modal external module, namely MolX. In particular, instead of directly using a SMILES string to represent a molecule, we utilize specific encoders to extract fine-grained features from both the SMILES string and 2D molecular graph representations.
arXiv Detail & Related papers (2024-06-10T20:25:18Z)
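One way to read the MolX design above, fusing a SMILES-string view and a 2D-graph view into inputs an LLM can consume, is sketched below in PyTorch. The dimensions, module name, and concat-then-project fusion are illustrative assumptions, not the published architecture.

```python
# Sketch: fuse two molecule encoders into "soft tokens" for an LLM (PyTorch).
# All sizes and the concat+project fusion are assumptions for illustration.
import torch
import torch.nn as nn

class MolEncoderFusion(nn.Module):
    def __init__(self, smiles_dim=256, graph_dim=300, llm_dim=4096, n_tokens=8):
        super().__init__()
        self.n_tokens, self.llm_dim = n_tokens, llm_dim
        self.proj = nn.Linear(smiles_dim + graph_dim, n_tokens * llm_dim)

    def forward(self, smiles_emb: torch.Tensor, graph_emb: torch.Tensor) -> torch.Tensor:
        # Concatenate the two molecular views and project to n_tokens vectors
        # in the LLM's embedding space, to be prepended to the text embeddings.
        fused = torch.cat([smiles_emb, graph_emb], dim=-1)
        return self.proj(fused).view(-1, self.n_tokens, self.llm_dim)

tokens = MolEncoderFusion()(torch.randn(2, 256), torch.randn(2, 300))
print(tokens.shape)  # torch.Size([2, 8, 4096])
```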
- LDMol: Text-to-Molecule Diffusion Model with Structurally Informative Latent Space [55.5427001668863]
We present a novel latent diffusion model dubbed LDMol for text-conditioned molecule generation.
LDMol comprises a molecule autoencoder that produces a learnable and structurally informative feature space.
We show that LDMol can be applied to downstream tasks such as molecule-to-text retrieval and text-guided molecule editing.
arXiv Detail & Related papers (2024-05-28T04:59:13Z)
- DrugLLM: Open Large Language Model for Few-shot Molecule Generation [20.680942401843772]
DrugLLM learns how to modify molecules in drug discovery by predicting the next molecule based on past modifications.
In computational experiments, DrugLLM can generate new molecules with expected properties based on limited examples.
arXiv Detail & Related papers (2024-05-07T09:18:13Z)
- Interactive Molecular Discovery with Natural Language [69.89287960545903]
We propose the conversational molecular design, a novel task adopting natural language for describing and editing target molecules.
To better accomplish this task, we design ChatMol, a knowledgeable and versatile generative pre-trained model, enhanced by injecting experimental property information.
arXiv Detail & Related papers (2023-06-21T02:05:48Z)
- Empowering Molecule Discovery for Molecule-Caption Translation with Large Language Models: A ChatGPT Perspective [53.300288393173204]
Large Language Models (LLMs) have shown remarkable performance in various cross-modal tasks.
In this work, we propose an In-context Few-Shot Molecule Learning paradigm for molecule-caption translation.
We evaluate the effectiveness of MolReGPT on molecule-caption translation, including molecule understanding and text-based molecule generation.
arXiv Detail & Related papers (2023-06-11T08:16:25Z)
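The in-context few-shot paradigm in the entry above amounts to assembling retrieved (molecule, caption) pairs into a prompt for a general LLM. A minimal sketch follows; the template wording and function name are ours, not MolReGPT's actual prompt.

```python
# Sketch: build a few-shot molecule-captioning prompt from retrieved examples.
# Template text is illustrative; the paper's real prompt is not in this listing.
def build_fewshot_prompt(examples: list[tuple[str, str]], query_smiles: str) -> str:
    parts = ["Translate each molecule (SMILES) into a natural-language caption."]
    for smi, caption in examples:  # examples would come from a retriever
        parts.append(f"Molecule: {smi}\nCaption: {caption}")
    parts.append(f"Molecule: {query_smiles}\nCaption:")
    return "\n\n".join(parts)

print(build_fewshot_prompt([("CCO", "Ethanol, a simple alcohol.")], "c1ccccc1"))
```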
- Scalable Fragment-Based 3D Molecular Design with Reinforcement Learning [68.8204255655161]
We introduce a novel framework for scalable 3D design that uses a hierarchical agent to build molecules.
In a variety of experiments, we show that our agent, guided only by energy considerations, can efficiently learn to produce molecules with over 100 atoms.
arXiv Detail & Related papers (2022-02-01T18:54:24Z)
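An agent "guided only by energy considerations", as in the last entry, needs a cheap per-candidate energy oracle; a common stand-in is an RDKit force field, sketched below (the paper's actual energy evaluation may differ).

```python
# Sketch: score a candidate molecule by MMFF94 force-field energy (RDKit).
# A quick proxy for an energy-based reward; illustrative only.
from rdkit import Chem
from rdkit.Chem import AllChem

def mmff_energy(smiles: str) -> float:
    mol = Chem.AddHs(Chem.MolFromSmiles(smiles))
    AllChem.EmbedMolecule(mol, randomSeed=0)  # generate a 3D conformer
    AllChem.MMFFOptimizeMolecule(mol)         # relax the geometry
    props = AllChem.MMFFGetMoleculeProperties(mol)
    ff = AllChem.MMFFGetMoleculeForceField(mol, props)
    return ff.CalcEnergy()                    # kcal/mol; lower is more stable

print(mmff_energy("CCO"))
```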
This list is automatically generated from the titles and abstracts of the papers on this site.