How to Make Large Language Models Generate 100% Valid Molecules?
- URL: http://arxiv.org/abs/2509.23099v1
- Date: Sat, 27 Sep 2025 04:14:19 GMT
- Title: How to Make Large Language Models Generate 100% Valid Molecules?
- Authors: Wen Tao, Jing Tang, Alvin Chan, Bryan Hooi, Baolong Bi, Nanyun Peng, Yuansheng Liu, Yiwei Wang
- Abstract summary: Large language models (LLMs) can learn to perform a wide range of tasks from just a few examples. However, generating valid molecules using representations like SMILES is challenging for LLMs in few-shot settings. We introduce SmiSelf, a cross-chemical language framework for invalid SMILES correction.
- Score: 82.89165081942201
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Molecule generation is key to drug discovery and materials science, enabling the design of novel compounds with specific properties. Large language models (LLMs) can learn to perform a wide range of tasks from just a few examples. However, generating valid molecules using representations like SMILES is challenging for LLMs in few-shot settings. In this work, we explore how LLMs can generate 100% valid molecules. We evaluate whether LLMs can use SELFIES, a representation where every string corresponds to a valid molecule, for valid molecule generation but find that LLMs perform worse with SELFIES than with SMILES. We then examine LLMs' ability to correct invalid SMILES and find their capacity limited. Finally, we introduce SmiSelf, a cross-chemical language framework for invalid SMILES correction. SmiSelf converts invalid SMILES to SELFIES using grammatical rules, leveraging SELFIES' mechanisms to correct the invalid SMILES. Experiments show that SmiSelf ensures 100% validity while preserving molecular characteristics and maintaining or even enhancing performance on other metrics. SmiSelf helps expand LLMs' practical applications in biomedicine and is compatible with all SMILES-based generative models. Code is available at https://github.com/wentao228/SmiSelf.
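To make the mechanism concrete, the sketch below illustrates the kind of pipeline the abstract describes: because a string of SELFIES symbols decodes to a chemically valid molecule, an unparseable SMILES can be repaired by mapping its tokens into SELFIES and decoding back. This is a minimal illustration using the open-source `selfies` and RDKit packages, not the authors' implementation: the helper names (`ensure_valid`, `naive_smiles_to_selfies`) are hypothetical, and the token mapping is deliberately simplified (it drops ring closures and branches, which SmiSelf's actual grammatical rules translate).

```python
# Minimal sketch of the SmiSelf idea; NOT the authors' implementation.
# Assumes: pip install rdkit selfies
import re

import selfies as sf
from rdkit import Chem

# Split a SMILES string into bracket atoms, organic-subset atoms,
# bond symbols, ring-closure digits, and branch parentheses.
SMILES_TOKEN = re.compile(r"\[[^\]]*\]|Br|Cl|[BCNOPSFI]|[bcnops]|=|#|[()]|\d")


def is_valid(smiles: str) -> bool:
    """Return True iff RDKit can parse the SMILES string."""
    return Chem.MolFromSmiles(smiles) is not None


def naive_smiles_to_selfies(smiles: str) -> str:
    """Token-level SMILES -> SELFIES mapping (hypothetical, simplified).

    SmiSelf's actual grammatical rules also translate ring closures and
    branches; this sketch keeps only atoms and bonds, so digits and
    parentheses are dropped.
    """
    symbols = []
    bond = ""  # pending bond symbol ("", "=", or "#") for the next atom
    for tok in SMILES_TOKEN.findall(smiles):
        if tok in ("=", "#"):
            bond = tok
        elif tok in ("(", ")") or tok.isdigit():
            continue  # simplification: drop branch/ring syntax
        elif tok.startswith("["):
            # Bracket atoms pass through verbatim; this only works where
            # the SMILES and SELFIES spellings coincide (e.g., [C]).
            symbols.append(f"[{bond}{tok[1:-1]}]")
            bond = ""
        else:
            # Uppercase aromatic organic-subset atoms (c, n, ...);
            # SELFIES re-derives valences when decoding.
            symbols.append(f"[{bond}{tok.upper()}]")
            bond = ""
    return "".join(symbols)


def ensure_valid(smiles: str) -> str:
    """Return the input unchanged if parseable, else repair it via SELFIES."""
    if is_valid(smiles):
        return smiles
    # A well-formed SELFIES symbol string always decodes to a valid molecule.
    return sf.decoder(naive_smiles_to_selfies(smiles))


print(ensure_valid("CC(=O)O"))   # already valid: returned as-is
print(ensure_valid("CC(=O)(O"))  # unbalanced branch: repaired via SELFIES
```

The validity guarantee comes entirely from the SELFIES decoder, which enforces valence constraints at decode time; per the abstract, SmiSelf's contribution is a grammar-level translation of the invalid SMILES so that the repaired molecule preserves the original's characteristics.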
Related papers
- ChemMLLM: Chemical Multimodal Large Language Model [52.95382215206681]
We propose ChemMLLM, a unified chemical multimodal large language model for molecule understanding and generation. We also design five multimodal tasks across text, molecular SMILES strings, and images, and curate the corresponding datasets. Experimental results show that ChemMLLM achieves superior performance across all evaluated tasks.
arXiv Detail & Related papers (2025-05-22T07:32:17Z)
- An Empirical Study of Many-to-Many Summarization with Large Language Models [82.10000188179168]
Large language models (LLMs) have shown strong multi-lingual abilities, giving them the potential to perform many-to-many summarization (M2MS) in real applications. This work presents a systematic empirical study of LLMs' M2MS ability.
arXiv Detail & Related papers (2025-05-19T11:18:54Z)
- mCLM: A Function-Infused and Synthesis-Friendly Modular Chemical Language Model [65.69164455183956]
We propose mCLM, a modular chemical language model that tokenizes molecules into building blocks and learns a bilingual language model over both natural-language descriptions of functions and molecule building blocks. In experiments on 430 FDA-approved drugs, we find mCLM capable of significantly improving 5 out of 6 chemical functions critical to determining drug potential.
arXiv Detail & Related papers (2025-05-18T22:52:39Z)
- CUTE: Measuring LLMs' Understanding of Their Tokens [54.70665106141121]
Large Language Models (LLMs) show remarkable performance on a wide variety of tasks.
This raises the question: To what extent can LLMs learn orthographic information?
We propose a new benchmark, CUTE, which features a collection of tasks designed to test the orthographic knowledge of LLMs.
arXiv Detail & Related papers (2024-09-23T18:27:03Z)
- SmileyLlama: Modifying Large Language Models for Directed Chemical Space Exploration [2.5159482339113084]
We show that a general-purpose large language model (LLM) can be transformed into a chemical language model (CLM). We benchmark SmileyLlama by comparing it to CLMs trained from scratch on large amounts of ChEMBL data for their ability to generate valid and novel drug-like molecules.
arXiv Detail & Related papers (2024-09-03T18:59:20Z)
- MolX: Enhancing Large Language Models for Molecular Understanding With A Multi-Modal Extension [44.97089022713424]
Large Language Models (LLMs), with their strong task-handling capabilities, have shown remarkable advancements across a spectrum of fields. This study seeks to enhance the ability of LLMs to comprehend molecules by equipping them with a multi-modal external module, termed MolX. A hand-crafted molecular fingerprint is incorporated to leverage its embedded domain knowledge.
arXiv Detail & Related papers (2024-06-10T20:25:18Z)
- Can Large Language Models Empower Molecular Property Prediction? [16.5246941211725]
Molecular property prediction has gained significant attention due to its transformative potential in scientific disciplines.
Recently, the rapid development of Large Language Models (LLMs) has revolutionized the field of NLP.
In this work, we investigate whether LLMs can empower molecular property prediction from two perspectives: zero/few-shot molecular classification, and using new explanations generated by LLMs as representations of molecules.
arXiv Detail & Related papers (2023-07-14T16:06:42Z)
- Empowering Molecule Discovery for Molecule-Caption Translation with Large Language Models: A ChatGPT Perspective [53.300288393173204]
Large Language Models (LLMs) have shown remarkable performance in various cross-modal tasks.
In this work, we propose MolReGPT, an in-context few-shot molecule learning paradigm for molecule-caption translation.
We evaluate the effectiveness of MolReGPT on both molecule understanding and text-based molecule generation.
arXiv Detail & Related papers (2023-06-11T08:16:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.