InstructMol: Multi-Modal Integration for Building a Versatile and
Reliable Molecular Assistant in Drug Discovery
- URL: http://arxiv.org/abs/2311.16208v1
- Date: Mon, 27 Nov 2023 16:47:51 GMT
- Title: InstructMol: Multi-Modal Integration for Building a Versatile and
Reliable Molecular Assistant in Drug Discovery
- Authors: He Cao, Zijing Liu, Xingyu Lu, Yuan Yao, Yu Li
- Abstract summary: Large Language Models (LLMs) offer promise in reshaping interactions with complex molecular data.
Our novel contribution, InstructMol, effectively aligns molecular structures with natural language via an instruction-tuning approach.
InstructMol showcases substantial performance improvements in drug discovery-related molecular tasks.
- Score: 19.870192393785043
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rapid evolution of artificial intelligence in drug discovery encounters
challenges with generalization and extensive training, yet Large Language
Models (LLMs) offer promise in reshaping interactions with complex molecular
data. Our novel contribution, InstructMol, a multi-modal LLM, effectively
aligns molecular structures with natural language via an instruction-tuning
approach, utilizing a two-stage training strategy that adeptly combines limited
domain-specific data with molecular and textual information. InstructMol
showcases substantial performance improvements in drug discovery-related
molecular tasks, surpassing leading LLMs and significantly reducing the gap
with specialized models, thereby establishing a robust foundation for a
versatile and dependable drug discovery assistant.
Related papers
- DrugR: Optimizing Molecular Drugs through LLM-based Explicit Reasoning [24.70952870676648]
DrugR is a large language model that introduces explicit, step-by-step pharmacological reasoning into the optimization process.<n>Our approach integrates domain-specific continual pretraining, supervised fine-tuning via reverse data engineering, and self-balanced multi-granular reinforcement learning.<n> Experimental results demonstrate that DrugR achieves comprehensive enhancement across multiple properties without compromising structural similarity or target binding affinity.
arXiv Detail & Related papers (2026-02-09T02:26:25Z) - Agentic reinforcement learning empowers next-generation chemical language models for molecular design and synthesis [51.83339196548892]
ChemCraft is a novel framework that decouples chemical reasoning from knowledge storage.<n>ChemCraft achieves superior performance with minimal inference costs.<n>This work establishes a cost-effective and privacy-preserving paradigm for AI-aided chemistry.
arXiv Detail & Related papers (2026-01-25T04:23:34Z) - Improving Large Molecular Language Model via Relation-aware Multimodal Collaboration [34.099746438477816]
We propose CoLLaMo, a large language model-based molecular assistant equipped with a multi-level molecular modality-collaborative projector.<n>Our experiments demonstrate that our CoLLaMo enhances the molecular modality generalization capabilities of LMLMs.
arXiv Detail & Related papers (2026-01-18T04:38:19Z) - Enhancing Molecular Property Prediction with Knowledge from Large Language Models [15.273538257961905]
We propose a novel framework that integrates knowledge extracted from large language models with structural features derived from pre-trained molecular models to enhance molecular property prediction.<n>Our approach prompts LLMs to generate both domain-relevant knowledge and executable code for molecular vectorization, producing knowledge-based features that are subsequently fused with structural representations.
arXiv Detail & Related papers (2025-09-25T01:48:54Z) - $\ ext{M}^{2}$LLM: Multi-view Molecular Representation Learning with Large Language Models [59.125833618091846]
We propose a multi-view framework that integrates three perspectives: the molecular structure view, the molecular task view, and the molecular rules view.<n>Experiments demonstrate that $textM2$LLM achieves state-of-the-art performance on multiple benchmarks across classification and regression tasks.
arXiv Detail & Related papers (2025-08-12T05:46:47Z) - PharmAgents: Building a Virtual Pharma with Large Language Model Agents [19.589707628042422]
We introduce PharmAgents, a virtual pharmaceutical ecosystem driven by multi-agent collaboration.
The system integrates explainable, LLM-driven agents equipped with specialized machine learning models and computational tools.
It identifies potential therapeutic targets, discovers promising lead compounds, enhances binding affinity and key molecular properties, and performs in silico analyses of toxicity and synthetic feasibility.
arXiv Detail & Related papers (2025-03-28T06:02:53Z) - Biology Instructions: A Dataset and Benchmark for Multi-Omics Sequence Understanding Capability of Large Language Models [51.316001071698224]
We introduce Biology-Instructions, the first large-scale multi-omics biological sequences-related instruction-tuning dataset.
This dataset can bridge the gap between large language models (LLMs) and complex biological sequences-related tasks.
We also develop a strong baseline called ChatMultiOmics with a novel three-stage training pipeline.
arXiv Detail & Related papers (2024-12-26T12:12:23Z) - Y-Mol: A Multiscale Biomedical Knowledge-Guided Large Language Model for Drug Development [24.5979645373074]
Y-Mol is a knowledge-guided LLM designed to accomplish tasks across lead compound discovery, pre-clinic, and clinic prediction.
It learns from a corpus of publications, knowledge graphs, and expert-designed synthetic data.
Y-Mol significantly outperforms general-purpose LLMs in discovering lead compounds, predicting molecular properties, and identifying drug interaction events.
arXiv Detail & Related papers (2024-10-15T12:39:20Z) - FARM: Functional Group-Aware Representations for Small Molecules [55.281754551202326]
We introduce Functional Group-Aware Representations for Small Molecules (FARM)
FARM is a foundation model designed to bridge the gap between SMILES, natural language, and molecular graphs.
We rigorously evaluate FARM on the MoleculeNet dataset, where it achieves state-of-the-art performance on 10 out of 12 tasks.
arXiv Detail & Related papers (2024-10-02T23:04:58Z) - Many-Shot In-Context Learning for Molecular Inverse Design [56.65345962071059]
Large Language Models (LLMs) have demonstrated great performance in few-shot In-Context Learning (ICL)
We develop a new semi-supervised learning method that overcomes the lack of experimental data available for many-shot ICL.
As we show, the new method greatly improves upon existing ICL methods for molecular design while being accessible and easy to use for scientists.
arXiv Detail & Related papers (2024-07-26T21:10:50Z) - MolX: Enhancing Large Language Models for Molecular Learning with A Multi-Modal Extension [34.586861881519134]
Large Language Models (LLMs) with their strong task-handling capabilities have shown remarkable advancements across a spectrum of fields.
This study seeks to enhance the ability of LLMs to comprehend molecules by equipping them with a multi-modal external module, namely MolX.
In particular, instead of directly using a SMILES string to represent a molecule, we utilize specific encoders to extract fine-grained features from both SMILES string and 2D molecular graph representations.
arXiv Detail & Related papers (2024-06-10T20:25:18Z) - Instruction Multi-Constraint Molecular Generation Using a Teacher-Student Large Language Model [49.64512917330373]
We introduce a multi-constraint molecular generation large language model, TSMMG, akin to a student.
To train TSMMG, we construct a large set of text-molecule pairs by extracting molecular knowledge from these 'teachers'
We experimentally show that TSMMG remarkably performs in generating molecules meeting complex, natural language-described property requirements.
arXiv Detail & Related papers (2024-03-20T02:15:55Z) - A quantitative analysis of knowledge-learning preferences in large language models in molecular science [24.80165173525286]
Large language models (LLMs) introduce a fresh research paradigm to tackle scientific problems from a natural language processing (NLP) perspective.
LLMs significantly enhance our understanding and generation of molecules, often surpassing existing methods with their capabilities to decode and synthesize complex molecular patterns.
We propose a multi-modal benchmark, named ChEBI-20-MM, and perform 1263 experiments to assess the model's compatibility with data modalities and knowledge acquisition.
arXiv Detail & Related papers (2024-02-06T16:12:36Z) - MolTC: Towards Molecular Relational Modeling In Language Models [28.960416816491392]
We propose a novel framework for Molecular inTeraction prediction following Chain-of-Thought (CoT) theory termed MolTC.
Our experiments, conducted across various datasets involving over 4,000,000 molecular pairs, exhibit the superiority of our method over current GNN and LLM-based baselines.
arXiv Detail & Related papers (2024-02-06T07:51:56Z) - From molecules to scaffolds to functional groups: building context-dependent molecular representation via multi-channel learning [10.025809630976065]
This paper introduces a novel pre-training framework that learns robust and generalizable chemical knowledge.
Our approach demonstrates competitive performance across various molecular property benchmarks.
arXiv Detail & Related papers (2023-11-05T23:47:52Z) - Interactive Molecular Discovery with Natural Language [69.89287960545903]
We propose the conversational molecular design, a novel task adopting natural language for describing and editing target molecules.
To better accomplish this task, we design ChatMol, a knowledgeable and versatile generative pre-trained model, enhanced by injecting experimental property information.
arXiv Detail & Related papers (2023-06-21T02:05:48Z) - Empowering Molecule Discovery for Molecule-Caption Translation with Large Language Models: A ChatGPT Perspective [53.300288393173204]
Large Language Models (LLMs) have shown remarkable performance in various cross-modal tasks.
In this work, we propose an In-context Few-Shot Molecule Learning paradigm for molecule-caption translation.
We evaluate the effectiveness of MolReGPT on molecule-caption translation, including molecule understanding and text-based molecule generation.
arXiv Detail & Related papers (2023-06-11T08:16:25Z) - A Molecular Multimodal Foundation Model Associating Molecule Graphs with
Natural Language [63.60376252491507]
We propose a molecular multimodal foundation model which is pretrained from molecular graphs and their semantically related textual data.
We believe that our model would have a broad impact on AI-empowered fields across disciplines such as biology, chemistry, materials, environment, and medicine.
arXiv Detail & Related papers (2022-09-12T00:56:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.