SMI-Editor: Edit-based SMILES Language Model with Fragment-level Supervision
- URL: http://arxiv.org/abs/2412.05569v1
- Date: Sat, 07 Dec 2024 07:32:55 GMT
- Title: SMI-Editor: Edit-based SMILES Language Model with Fragment-level Supervision
- Authors: Kangjie Zheng, Siyue Liang, Junwei Yang, Bin Feng, Zequn Liu, Wei Ju, Zhiping Xiao, Ming Zhang
- Abstract summary: We propose SMI-Editor, a novel edit-based pre-trained SMILES LM.
SMI-Editor disrupts substructures within a molecule at random and feeds the resulting SMILES back into the model.
It then attempts to restore the original SMILES through an editing process.
Experimental results show that SMI-Editor achieves state-of-the-art performance across multiple downstream molecular tasks.
- Score: 11.997499811414837
- License:
- Abstract: SMILES, a crucial textual representation of molecular structures, has garnered significant attention as a foundation for pre-trained language models (LMs). However, most existing pre-trained SMILES LMs rely solely on single-token-level supervision during pre-training, failing to fully leverage the substructural information of molecules. This limitation makes the pre-training task overly simplistic and prevents the models from capturing richer molecular semantics. Moreover, during pre-training these SMILES LMs only ever process corrupted SMILES inputs and never encounter any valid SMILES, which leads to a train-inference mismatch. To address these challenges, we propose SMI-Editor, a novel edit-based pre-trained SMILES LM. SMI-Editor disrupts substructures within a molecule at random and feeds the resulting SMILES back into the model, which then attempts to restore the original SMILES through an editing process. This approach not only introduces fragment-level training signals but also enables the use of valid SMILES as inputs, allowing the model to learn how to reconstruct complete molecules from incomplete structures. As a result, the model demonstrates improved scalability and an enhanced ability to capture fragment-level molecular information. Experimental results show that SMI-Editor achieves state-of-the-art performance across multiple downstream molecular tasks, even outperforming several 3D molecular representation models.
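To make the corrupt-then-restore idea concrete, the following is a minimal Python sketch, assuming a simplified regex tokenizer and a random contiguous token span standing in for a molecular fragment. It only illustrates the kind of corruption and the keep/insert/delete edit targets such a model could be trained on; it is not the authors' fragmentation scheme or architecture.

```python
import random
import re
from difflib import SequenceMatcher

# Very rough SMILES tokenizer: bracket atoms, two-letter halogens,
# single-letter atoms, ring-closure digits, and bond/branch symbols.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFIbcnops]|\d|[()=#+\-/\\@%.]")

def tokenize(smiles):
    return SMILES_TOKEN.findall(smiles)

def corrupt(tokens, max_frac=0.3):
    """Delete one random contiguous token span, standing in for a fragment."""
    span = max(1, int(len(tokens) * random.uniform(0.1, max_frac)))
    start = random.randrange(len(tokens) - span + 1)
    return tokens[:start] + tokens[start + span:]

def edit_targets(corrupted, original):
    """Keep/insert/delete operations that restore the original token sequence."""
    ops = []
    matcher = SequenceMatcher(a=corrupted, b=original, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("keep", corrupted[i1:i2]))
        elif tag == "insert":
            ops.append(("insert", original[j1:j2]))
        elif tag == "delete":
            ops.append(("delete", corrupted[i1:i2]))
        else:  # "replace" decomposes into a delete plus an insert
            ops.append(("delete", corrupted[i1:i2]))
            ops.append(("insert", original[j1:j2]))
    return ops

if __name__ == "__main__":
    original = tokenize("CC(=O)Oc1ccccc1C(=O)O")   # aspirin
    corrupted = corrupt(original)
    print("corrupted:", "".join(corrupted))
    for op, toks in edit_targets(corrupted, original):
        print(op, "".join(toks))
```

Running the example prints a corrupted SMILES with a span removed and the edit operations needed to recover the original, which is the flavor of fragment-level supervision the abstract describes.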
Related papers
- Pre-trained Molecular Language Models with Random Functional Group Masking [54.900360309677794]
We propose a SMILES-based Molecular Language Model that randomly masks SMILES subsequences corresponding to specific molecular atoms.
This technique aims to compel the model to better infer molecular structures and properties, thus enhancing its predictive capabilities.
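As a rough illustration of this masking scheme (a hedged sketch, not the paper's implementation), the snippet below matches a functional group with an RDKit SMARTS pattern, picks one match at random, and masks the corresponding atom tokens in the SMILES string. The SMARTS patterns, the [MASK] symbol, and the simple tokenizer are assumptions made for the example.

```python
import random
import re
from rdkit import Chem

# Rough SMILES tokenizer; ATOM is the subset of tokens that denote atoms.
ATOM = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFIbcnops]")
TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFIbcnops]|\d|[()=#+\-/\\@%.]")

# Illustrative SMARTS patterns for two common functional groups.
FUNCTIONAL_GROUPS = {
    "carboxylic acid": "C(=O)[OX2H1]",
    "ester": "C(=O)O[#6]",
}

def mask_functional_group(smiles, smarts, mask="[MASK]"):
    """Replace the atom tokens of one randomly chosen SMARTS match with a mask."""
    mol = Chem.MolFromSmiles(smiles)
    matches = mol.GetSubstructMatches(Chem.MolFromSmarts(smarts))
    if not matches:
        return smiles
    # RDKit numbers atoms in the order they appear in the input SMILES,
    # so the i-th atom token in the text corresponds to atom index i.
    target = set(random.choice(matches))
    out, atom_idx = [], -1
    for tok in TOKEN.findall(smiles):
        if ATOM.fullmatch(tok):
            atom_idx += 1
            out.append(mask if atom_idx in target else tok)
        else:
            out.append(tok)
    return "".join(out)

if __name__ == "__main__":
    print(mask_functional_group("CC(=O)Oc1ccccc1C(=O)O",
                                FUNCTIONAL_GROUPS["carboxylic acid"]))
    # -> CC(=O)Oc1ccccc1[MASK](=[MASK])[MASK]
```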
arXiv Detail & Related papers (2024-11-03T01:56:15Z) - Unsupervised Pre-training with Language-Vision Prompts for Low-Data Instance Segmentation [105.23631749213729]
We propose a novel method for unsupervised pre-training in low-data regimes.
Inspired by the recently successful prompting technique, we introduce a new method, Unsupervised Pre-training with Language-Vision Prompts.
We show that our method can converge faster and perform better than CNN-based models in low-data regimes.
arXiv Detail & Related papers (2024-05-22T06:48:43Z) - Large Language Models are In-Context Molecule Learners [22.06735237464927]
We propose In-Context Molecule Adaptation (ICMA), as a new paradigm allowing LLMs to learn the molecule-text alignment from context examples.
ICMA incorporates the following three stages: Hybrid Context Retrieval, Post-retrieval Re-ranking, and In-context Molecule Tuning.
We show that ICMA can empower LLMs to achieve state-of-the-art or comparable performance without extra training corpora and intricate structures.
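The retrieval side of this pipeline can be pictured with a small sketch: fetch the molecules most similar to a query (here by Morgan-fingerprint Tanimoto similarity, one plausible stand-in for the hybrid retrieval and re-ranking stages) and pack their molecule-caption pairs into a few-shot prompt. The corpus, similarity measure, and prompt template are illustrative assumptions, not ICMA's exact design.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def fingerprint(smiles):
    """Morgan (ECFP-like) bit-vector fingerprint for one molecule."""
    mol = Chem.MolFromSmiles(smiles)
    return AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)

def retrieve(query, corpus, k=2):
    """Return the k (smiles, caption) pairs most similar to the query SMILES."""
    qfp = fingerprint(query)
    scored = [(DataStructs.TanimotoSimilarity(qfp, fingerprint(s)), s, c)
              for s, c in corpus]
    scored.sort(reverse=True)          # highest similarity first
    return [(s, c) for _, s, c in scored[:k]]

def build_prompt(query, examples):
    """Assemble retrieved molecule-caption pairs into a few-shot prompt."""
    shots = "\n".join(f"Molecule: {s}\nCaption: {c}\n" for s, c in examples)
    return f"{shots}Molecule: {query}\nCaption:"

if __name__ == "__main__":
    corpus = [
        ("CCO", "Ethanol, a simple primary alcohol."),
        ("CC(=O)O", "Acetic acid, a carboxylic acid."),
        ("c1ccccc1", "Benzene, an aromatic hydrocarbon."),
    ]
    print(build_prompt("CC(C)O", retrieve("CC(C)O", corpus)))
```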
arXiv Detail & Related papers (2024-03-07T03:58:28Z) - Can Large Language Models Understand Molecules? [0.0699049312989311]
We investigate how GPT and LLaMA, compared with models pre-trained on SMILES, perform at embedding SMILES strings for downstream tasks.
We find that SMILES embeddings generated using LLaMA outperform those from GPT in both molecular property prediction and drug-drug interaction (DDI) prediction tasks.
arXiv Detail & Related papers (2024-01-05T18:31:34Z) - Improving Input-label Mapping with Demonstration Replay for In-context
Learning [67.57288926736923]
In-context learning (ICL) is an emerging capability of large autoregressive language models.
We propose a novel ICL method called Repeated Demonstration with Sliding Causal Attention (RdSca).
We show that our method significantly improves the input-label mapping in ICL demonstrations.
arXiv Detail & Related papers (2023-10-30T14:29:41Z) - Empowering Molecule Discovery for Molecule-Caption Translation with Large Language Models: A ChatGPT Perspective [53.300288393173204]
Large Language Models (LLMs) have shown remarkable performance in various cross-modal tasks.
In this work, we propose MolReGPT, an In-Context Few-Shot Molecule Learning paradigm for molecule-caption translation.
We evaluate the effectiveness of MolReGPT on molecule-caption translation, including molecule understanding and text-based molecule generation.
arXiv Detail & Related papers (2023-06-11T08:16:25Z) - MolXPT: Wrapping Molecules with Text for Generative Pre-training [141.0924452870112]
MolXPT is a unified language model of text and molecules pre-trained on SMILES wrapped by text.
MolXPT outperforms strong baselines of molecular property prediction on MoleculeNet.
arXiv Detail & Related papers (2023-05-18T03:58:19Z) - t-SMILES: A Scalable Fragment-based Molecular Representation Framework for De Novo Molecule Generation [9.116670221263753]
This study introduces a flexible, fragment-based, multiscale molecular representation framework called t-SMILES.
It describes molecules using SMILES-type strings obtained by performing a breadth-first search on a full binary tree formed from a fragmented molecular graph.
It significantly outperforms classical SMILES, DeepSMILES, SELFIES and baseline models in goal-directed tasks.
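The serialization step can be sketched in a few lines: fragments sit in a full binary tree and a breadth-first traversal emits one SMILES-type string. The example fragments, the "&" placeholder for empty children, and the "^" separator below are assumptions made for illustration rather than the exact grammar defined in the t-SMILES paper.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    fragment: str                  # SMILES of one molecular fragment
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def bfs_serialize(root, empty="&", sep="^"):
    """Breadth-first traversal of a full binary tree of fragment SMILES."""
    out, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        if node is None:
            out.append(empty)      # placeholder for an absent child
            continue
        out.append(node.fragment)
        queue.append(node.left)
        queue.append(node.right)
    return sep.join(out)

if __name__ == "__main__":
    # Aspirin split into three illustrative fragments.
    tree = Node("c1ccccc1", Node("C(=O)O"), Node("OC(C)=O"))
    print(bfs_serialize(tree))
    # -> c1ccccc1^C(=O)O^OC(C)=O^&^&^&^&
```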
arXiv Detail & Related papers (2023-01-04T21:41:01Z) - From Cloze to Comprehension: Retrofitting Pre-trained Masked Language
Model to Pre-trained Machine Reader [130.45769668885487]
Pre-trained Machine Reader (PMR) is a novel method for retrofitting masked language models (MLMs) to pre-trained machine reading comprehension (MRC) models without acquiring labeled data.
To build the proposed PMR, we constructed a large volume of general-purpose and high-quality MRC-style training data.
PMR has the potential to serve as a unified model for tackling various extraction and classification tasks in the MRC formulation.
arXiv Detail & Related papers (2022-12-09T10:21:56Z) - Infusing Linguistic Knowledge of SMILES into Chemical Language Models [0.3655021726150368]
We grammatically parsed SMILES to obtain the connectivity between substructures and their types, which we call the grammatical knowledge of SMILES.
Our representation model outperformed previous compound representations for the prediction of molecular properties.
arXiv Detail & Related papers (2022-04-20T01:25:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.