Related papers: Empirical Evidence for the Fragment level Understanding on Drug Molecular Structure of LLMs

Empirical Evidence for the Fragment level Understanding on Drug Molecular Structure of LLMs

URL: http://arxiv.org/abs/2401.07657v1
Date: Mon, 15 Jan 2024 12:53:58 GMT
Title: Empirical Evidence for the Fragment level Understanding on Drug Molecular Structure of LLMs
Authors: Xiuyuan Hu, Guoqing Liu, Yang Zhao, Hao Zhang
Abstract summary: We investigate whether and how language models understand the chemical spatial structure from 1D sequences. The results indicate that language models can understand chemical structures from the perspective of molecular fragments.
Score: 16.508471997999496
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: AI for drug discovery has been a research hotspot in recent years, and SMILES-based language models has been increasingly applied in drug molecular design. However, no work has explored whether and how language models understand the chemical spatial structure from 1D sequences. In this work, we pre-train a transformer model on chemical language and fine-tune it toward drug design objectives, and investigate the correspondence between high-frequency SMILES substrings and molecular fragments. The results indicate that language models can understand chemical structures from the perspective of molecular fragments, and the structural knowledge learned through fine-tuning is reflected in the high-frequency SMILES substrings generated by the model.

Related papers

Mol-LLaMA: Towards General Understanding of Molecules in Large Molecular Language Model [55.87790704067848]
Mol-LLaMA is a large molecular language model that grasps the general knowledge centered on molecules. We introduce a module that integrates complementary information from different molecular encoders. Our experimental results demonstrate that Mol-LLaMA is capable of comprehending the general features of molecules.
arXiv Detail & Related papers (2025-02-19T05:49:10Z)
Pre-trained Molecular Language Models with Random Functional Group Masking [54.900360309677794]
We propose a SMILES-based underlineem Molecular underlineem Language underlineem Model, which randomly masking SMILES subsequences corresponding to specific molecular atoms. This technique aims to compel the model to better infer molecular structures and properties, thus enhancing its predictive capabilities.
arXiv Detail & Related papers (2024-11-03T01:56:15Z)
FARM: Functional Group-Aware Representations for Small Molecules [55.281754551202326]
We introduce Functional Group-Aware Representations for Small Molecules (FARM) FARM is a foundation model designed to bridge the gap between SMILES, natural language, and molecular graphs. We rigorously evaluate FARM on the MoleculeNet dataset, where it achieves state-of-the-art performance on 10 out of 12 tasks.
arXiv Detail & Related papers (2024-10-02T23:04:58Z)
MolTRES: Improving Chemical Language Representation Learning for Molecular Property Prediction [14.353313239109337]
MolTRES is a novel chemical language representation learning framework. It incorporates generator-discriminator training, allowing the model to learn from more challenging examples. Our model outperforms existing state-of-the-art models on popular molecular property prediction tasks.
arXiv Detail & Related papers (2024-07-09T01:14:28Z)
DrugLLM: Open Large Language Model for Few-shot Molecule Generation [20.680942401843772]
DrugLLM learns how to modify molecules in drug discovery by predicting the next molecule based on past modifications. In computational experiments, DrugLLM can generate new molecules with expected properties based on limited examples.
arXiv Detail & Related papers (2024-05-07T09:18:13Z)
From molecules to scaffolds to functional groups: building context-dependent molecular representation via multi-channel learning [10.025809630976065]
This paper introduces a novel pre-training framework that learns robust and generalizable chemical knowledge. Our approach demonstrates competitive performance across various molecular property benchmarks.
arXiv Detail & Related papers (2023-11-05T23:47:52Z)
Interactive Molecular Discovery with Natural Language [69.89287960545903]
We propose the conversational molecular design, a novel task adopting natural language for describing and editing target molecules. To better accomplish this task, we design ChatMol, a knowledgeable and versatile generative pre-trained model, enhanced by injecting experimental property information.
arXiv Detail & Related papers (2023-06-21T02:05:48Z)
An Equivariant Generative Framework for Molecular Graph-Structure Co-Design [54.92529253182004]
We present MolCode, a machine learning-based generative framework for underlineMolecular graph-structure underlineCo-design. In MolCode, 3D geometric information empowers the molecular 2D graph generation, which in turn helps guide the prediction of molecular 3D structure. Our investigation reveals that the 2D topology and 3D geometry contain intrinsically complementary information in molecule design.
arXiv Detail & Related papers (2023-04-12T13:34:22Z)
Difficulty in chirality recognition for Transformer architectures learning chemical structures from string [0.0]
We investigate the relationship between the learning progress of SMILES and chemical structure using a representative NLP model, the Transformer. We show that while the Transformer learns partial structures of molecules quickly, it requires extended training to understand overall structures.
arXiv Detail & Related papers (2023-03-21T04:47:45Z)
Domain-Agnostic Molecular Generation with Chemical Feedback [44.063584808910896]
MolGen is a pre-trained molecular language model tailored specifically for molecule generation. It internalizes structural and grammatical insights through the reconstruction of over 100 million molecular SELFIES. Our chemical feedback paradigm steers the model away from molecular hallucinations, ensuring alignment between the model's estimated probabilities and real-world chemical preferences.
arXiv Detail & Related papers (2023-01-26T17:52:56Z)
A Molecular Multimodal Foundation Model Associating Molecule Graphs with Natural Language [63.60376252491507]
We propose a molecular multimodal foundation model which is pretrained from molecular graphs and their semantically related textual data. We believe that our model would have a broad impact on AI-empowered fields across disciplines such as biology, chemistry, materials, environment, and medicine.
arXiv Detail & Related papers (2022-09-12T00:56:57Z)
Learning Latent Space Energy-Based Prior Model for Molecule Generation [59.875533935578375]
We learn latent space energy-based prior model with SMILES representation for molecule modeling. Our method is able to generate molecules with validity and uniqueness competitive with state-of-the-art models.
arXiv Detail & Related papers (2020-10-19T09:34:20Z)

This list is automatically generated from the titles and abstracts of the papers in this site.