SmileyLlama: Modifying Large Language Models for Directed Chemical Space Exploration
- URL: http://arxiv.org/abs/2409.02231v3
- Date: Mon, 30 Jun 2025 10:36:58 GMT
- Title: SmileyLlama: Modifying Large Language Models for Directed Chemical Space Exploration
- Authors: Joseph M. Cavanagh, Kunyang Sun, Andrew Gritsevskiy, Dorian Bagni, Yingze Wang, Thomas D. Bannister, Teresa Head-Gordon
- Abstract summary: We show that a general-purpose large language model (LLM) can be transformed into a chemical language model (CLM). We benchmark SmileyLlama by comparing it to CLMs trained from scratch on large amounts of ChEMBL data for their ability to generate valid and novel drug-like molecules.
- Score: 2.5159482339113084
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Here we show that a general-purpose large language model (LLM) chatbot, Llama-3.1-8B-Instruct, can be transformed via supervised fine-tuning of engineered prompts into a chemical language model (CLM), SmileyLlama, for molecule generation. We benchmark SmileyLlama by comparing it to CLMs trained from scratch on large amounts of ChEMBL data for their ability to generate valid and novel drug-like molecules. We also use direct preference optimization both to improve SmileyLlama's adherence to a prompt and to generate molecules within the iMiner reinforcement learning framework to predict new drug molecules with optimized 3D conformations and high binding affinity to drug targets, illustrated with the SARS-CoV-2 Main Protease. This overall framework allows an LLM to speak directly as a CLM that can generate molecules with user-specified properties, rather than acting only as a chatbot with knowledge of chemistry or as a helpful virtual assistant. While our dataset and analyses are geared toward drug discovery, this general procedure can be extended to other chemical applications such as chemical synthesis.
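The abstract mentions direct preference optimization (DPO) without giving details. As a rough illustration only, the sketch below computes the standard DPO loss for a single preference pair from sequence log-probabilities. It is not taken from the paper: the function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions made for this example, and the paper's actual training setup may differ.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Each argument is the summed log-probability of a full completion
    (for a CLM, e.g. a SMILES string) under either the model being
    trained ("policy") or the frozen reference model ("ref").
    All names and the default beta are hypothetical choices.
    """
    # How much more (or less) likely each completion became
    # relative to the reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Scaled preference margin; positive when the policy favours
    # the chosen completion more than the reference does.
    z = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(z)) computed as softplus(-z) in a numerically
    # stable way for both signs of z.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# Example: the policy has raised the chosen completion's likelihood
# and lowered the rejected one's, so the loss falls below log(2).
print(dpo_loss(-8.0, -12.0, -10.0, -10.0))
```

When the policy and reference agree exactly, the margin is zero and the loss equals log 2; the loss decreases as the policy shifts probability toward the preferred completion.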
Related papers
- ChemActor: Enhancing Automated Extraction of Chemical Synthesis Actions with LLM-Generated Data [53.78763789036172]
We present ChemActor, a fully fine-tuned large language model (LLM) as a chemical executor to convert between unstructured experimental procedures and structured action sequences.
This framework integrates a data selection module that selects data based on distribution divergence, with a general-purpose LLM, to generate machine-executable actions from a single molecule input.
Experiments on reaction-to-description (R2D) and description-to-action (D2A) tasks demonstrate that ChemActor achieves state-of-the-art performance, outperforming the baseline model by 10%.
arXiv Detail & Related papers (2025-06-30T05:11:19Z)
- Improving Chemical Understanding of LLMs via SMILES Parsing [18.532188836688928]
CLEANMOL is a novel framework that formulates SMILES parsing into a suite of clean and deterministic tasks.
We construct a molecular pretraining dataset with adaptive difficulty scoring and pre-train open-source LLMs on these tasks.
Our results show that CLEANMOL not only enhances structural comprehension but also achieves the best or competes with the baseline on the Mol-Instructions benchmark.
arXiv Detail & Related papers (2025-05-22T07:54:39Z)
- ChemMLLM: Chemical Multimodal Large Language Model [52.95382215206681]
We propose ChemMLLM, a unified chemical multimodal large language model for molecule understanding and generation.
Also, we design five multimodal tasks across text, molecular SMILES strings, and image, and curate the datasets.
Experimental results show that ChemMLLM achieves superior performance across all evaluated tasks.
arXiv Detail & Related papers (2025-05-22T07:32:17Z)
- mCLM: A Function-Infused and Synthesis-Friendly Modular Chemical Language Model [65.69164455183956]
We propose mCLM, a modular Chemical-Language Model tokenizing molecules into building blocks and learning a bilingual language model of both natural language descriptions of functions and molecule building blocks.
In experiments on 430 FDA-approved drugs, we find mCLM capable of significantly improving 5 out of 6 chemical functions critical to determining drug potentials.
arXiv Detail & Related papers (2025-05-18T22:52:39Z)
- FARM: Functional Group-Aware Representations for Small Molecules [55.281754551202326]
We introduce Functional Group-Aware Representations for Small Molecules (FARM).
FARM is a novel model designed to bridge the gap between SMILES, natural language, and molecular graphs.
We evaluate FARM on the MoleculeNet dataset, where it achieves state-of-the-art performance on 11 out of 13 tasks.
arXiv Detail & Related papers (2024-10-02T23:04:58Z)
- Many-Shot In-Context Learning for Molecular Inverse Design [56.65345962071059]
Large Language Models (LLMs) have demonstrated great performance in few-shot In-Context Learning (ICL).
We develop a new semi-supervised learning method that overcomes the lack of experimental data available for many-shot ICL.
As we show, the new method greatly improves upon existing ICL methods for molecular design while being accessible and easy to use for scientists.
arXiv Detail & Related papers (2024-07-26T21:10:50Z)
- MolX: Enhancing Large Language Models for Molecular Learning with A Multi-Modal Extension [34.586861881519134]
Large Language Models (LLMs) with their strong task-handling capabilities have shown remarkable advancements across a spectrum of fields.
This study seeks to enhance the ability of LLMs to comprehend molecules by equipping them with a multi-modal external module, namely MolX.
In particular, instead of directly using a SMILES string to represent a molecule, we utilize specific encoders to extract fine-grained features from both SMILES string and 2D molecular graph representations.
arXiv Detail & Related papers (2024-06-10T20:25:18Z)
- DrugLLM: Open Large Language Model for Few-shot Molecule Generation [20.680942401843772]
DrugLLM learns how to modify molecules in drug discovery by predicting the next molecule based on past modifications.
In computational experiments, DrugLLM can generate new molecules with expected properties based on limited examples.
arXiv Detail & Related papers (2024-05-07T09:18:13Z)
- Benchmarking Large Language Models for Molecule Prediction Tasks [7.067145619709089]
Large Language Models (LLMs) stand at the forefront of a number of Natural Language Processing (NLP) tasks.
This paper explores a fundamental question: Can LLMs effectively handle molecule prediction tasks?
We identify several classification and regression prediction tasks across six standard molecule datasets.
We compare their performance with existing Machine Learning (ML) models, which include text-based models and those specifically designed for analysing the geometric structure of molecules.
arXiv Detail & Related papers (2024-03-08T05:59:56Z)
- Large Language Models are In-Context Molecule Learners [22.06735237464927]
We propose In-Context Molecule Adaptation (ICMA), as a new paradigm allowing LLMs to learn the molecule-text alignment from context examples.
ICMA incorporates the following three stages: Hybrid Context Retrieval, Post-retrieval Re-ranking, and In-context Molecule Tuning.
We show that ICMT can empower LLMs to achieve state-of-the-art or comparable performance without extra training corpora and intricate structures.
arXiv Detail & Related papers (2024-03-07T03:58:28Z)
- Knowledge Fusion of Large Language Models [73.28202188100646]
This paper introduces the notion of knowledge fusion for large language models (LLMs).
We externalize their collective knowledge and unique strengths, thereby elevating the capabilities of the target model beyond those of any individual source LLM.
Our findings confirm that the fusion of LLMs can improve the performance of the target model across a range of capabilities such as reasoning, commonsense, and code generation.
arXiv Detail & Related papers (2024-01-19T05:02:46Z)
- Let Models Speak Ciphers: Multiagent Debate through Embeddings [84.20336971784495]
We introduce CIPHER (Communicative Inter-Model Protocol Through Embedding Representation) to address this issue.
By deviating from natural language, CIPHER offers an advantage of encoding a broader spectrum of information without any modification to the model weights.
This showcases the superiority and robustness of embeddings as an alternative "language" for communication among LLMs.
arXiv Detail & Related papers (2023-10-10T03:06:38Z)
- Can Large Language Models Empower Molecular Property Prediction? [16.5246941211725]
Molecular property prediction has gained significant attention due to its transformative potential in scientific disciplines.
Recently, the rapid development of Large Language Models (LLMs) has revolutionized the field of NLP.
In this work, we advance towards this objective through two perspectives: zero/few-shot molecular classification, and using the new explanations generated by LLMs as representations of molecules.
arXiv Detail & Related papers (2023-07-14T16:06:42Z)
- Empowering Molecule Discovery for Molecule-Caption Translation with Large Language Models: A ChatGPT Perspective [53.300288393173204]
Large Language Models (LLMs) have shown remarkable performance in various cross-modal tasks.
In this work, we propose an In-context Few-Shot Molecule Learning paradigm for molecule-caption translation.
We evaluate the effectiveness of MolReGPT on molecule-caption translation, including molecule understanding and text-based molecule generation.
arXiv Detail & Related papers (2023-06-11T08:16:25Z)
- LLM-Pruner: On the Structural Pruning of Large Language Models [65.02607075556742]
Large language models (LLMs) have shown remarkable capabilities in language understanding and generation.
We tackle the compression of LLMs within the bound of two constraints: being task-agnostic and minimizing the reliance on the original training dataset.
Our method, named LLM-Pruner, adopts structural pruning that selectively removes non-critical coupled structures.
arXiv Detail & Related papers (2023-05-19T12:10:53Z)
- Check Your Facts and Try Again: Improving Large Language Models with External Knowledge and Automated Feedback [127.75419038610455]
Large language models (LLMs) are able to generate human-like, fluent responses for many downstream tasks.
This paper proposes an LLM-Augmenter system, which augments a black-box LLM with a set of plug-and-play modules.
arXiv Detail & Related papers (2023-02-24T18:48:43Z)