Infusing Linguistic Knowledge of SMILES into Chemical Language Models
- URL: http://arxiv.org/abs/2205.00084v1
- Date: Wed, 20 Apr 2022 01:25:18 GMT
- Title: Infusing Linguistic Knowledge of SMILES into Chemical Language Models
- Authors: Ingoo Lee and Hojung Nam
- Abstract summary: We grammatically parsed SMILES to obtain connectivity between substructures and their type, which is called the grammatical knowledge of SMILES.
Our representation model outperformed previous compound representations for the prediction of molecular properties.
- Score: 0.3655021726150368
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: The simplified molecular-input line-entry system (SMILES) is the most popular
representation of chemical compounds. Therefore, many SMILES-based molecular
property prediction models have been developed. In particular,
transformer-based models show promising performance because the model utilizes
a massive chemical dataset for self-supervised learning. However, there is no
transformer-based model to overcome the inherent limitations of SMILES, which
result from the generation process of SMILES. In this study, we grammatically
parsed SMILES to obtain connectivity between substructures and their type,
which is called the grammatical knowledge of SMILES. First, we pretrained the
transformers with substructural tokens, which were parsed from SMILES. Then, we
used the training strategy 'same compound model' to better understand SMILES
grammar. In addition, we injected knowledge of connectivity and type into the
transformer with knowledge adapters. As a result, our representation model
outperformed previous compound representations for the prediction of molecular
properties. Finally, we analyzed the attention of the transformer model and
adapters, demonstrating that the proposed model understands the grammar of
SMILES.
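To make the described pipeline concrete, the sketch below illustrates the first step of the abstract: splitting a SMILES string into grammar-level tokens with coarse type labels and recovering which atom tokens are connected. This is a minimal illustration in Python, not the authors' parser; the regex, the type labels, and the adjacency rules are assumptions made for the example.

```python
# Minimal sketch (not the authors' parser): tokenize a SMILES string at the
# grammar level and recover which tokens are chemically connected. The token
# regex, type labels, and adjacency rules here are illustrative assumptions.
import re

# Order matters: multi-character pieces (bracket atoms, Cl/Br, %NN ring bonds)
# must be matched before single characters.
TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|%\d{2}|[BCNOPSFIbcnops]|[0-9]|[=#\-+\\/()@.])"
)

def tokenize(smiles):
    """Split a SMILES string into grammar-level tokens with coarse type labels."""
    tokens = []
    for tok in TOKEN_PATTERN.findall(smiles):
        if tok.startswith("[") or tok in {"Br", "Cl"} or tok.isalpha():
            ttype = "atom"
        elif tok.isdigit() or tok.startswith("%"):
            ttype = "ring_bond"
        elif tok in "=#-\\/":
            ttype = "bond"
        elif tok in "()":
            ttype = "branch"
        else:
            ttype = "other"
        tokens.append((tok, ttype))
    return tokens

def connectivity(tokens):
    """Return edges between atom-token indices implied by SMILES grammar:
    sequential bonds, branches (via a stack), and ring-closure labels."""
    edges = set()
    prev_atom = None          # index of the atom the next atom attaches to
    branch_stack = []         # saved attachment points at '('
    open_rings = {}           # ring label -> atom index waiting for closure
    for i, (tok, ttype) in enumerate(tokens):
        if ttype == "atom":
            if prev_atom is not None:
                edges.add((prev_atom, i))
            prev_atom = i
        elif tok == "(":
            branch_stack.append(prev_atom)
        elif tok == ")":
            prev_atom = branch_stack.pop()
        elif ttype == "ring_bond":
            if tok in open_rings:
                edges.add((open_rings.pop(tok), prev_atom))
            else:
                open_rings[tok] = prev_atom
        elif tok == ".":
            prev_atom = None  # disconnected fragment
    return sorted(edges)

if __name__ == "__main__":
    smiles = "CC(=O)Oc1ccccc1C(=O)O"   # aspirin
    toks = tokenize(smiles)
    print(toks)
    print(connectivity(toks))
```

A second, equally schematic piece shows the kind of bottleneck "knowledge adapter" that can be inserted into a pretrained transformer layer. The hidden and bottleneck sizes and the residual placement are assumptions; in the paper's setting the adapters additionally receive the extracted connectivity and type information rather than the hidden states alone.

```python
# Schematic residual bottleneck adapter, not the paper's exact module.
import torch
import torch.nn as nn

class KnowledgeAdapter(nn.Module):
    """Bottleneck adapter added after a transformer sub-layer."""
    def __init__(self, hidden_dim: int = 768, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.act = nn.GELU()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Adapter output is added back to the pretrained layer's hidden states.
        return hidden_states + self.up(self.act(self.down(hidden_states)))
```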
Related papers
- Pre-trained Molecular Language Models with Random Functional Group Masking [54.900360309677794]
We propose a SMILES-based Molecular Language Model that randomly masks SMILES subsequences corresponding to specific molecular atoms.
This technique aims to compel the model to better infer molecular structures and properties, thus enhancing its predictive capabilities.
arXiv Detail & Related papers (2024-11-03T01:56:15Z) - MolTRES: Improving Chemical Language Representation Learning for Molecular Property Prediction [14.353313239109337]
MolTRES is a novel chemical language representation learning framework.
It incorporates generator-discriminator training, allowing the model to learn from more challenging examples.
Our model outperforms existing state-of-the-art models on popular molecular property prediction tasks.
arXiv Detail & Related papers (2024-07-09T01:14:28Z) - Transformers Can Represent $n$-gram Language Models [56.06361029539347]
We focus on the relationship between transformer LMs and $n$-gram LMs, a simple and historically relevant class of language models.
We show that transformer LMs using the hard or sparse attention mechanisms can exactly represent any $n$-gram LM.
arXiv Detail & Related papers (2024-04-23T12:51:37Z) - Can Large Language Models Understand Molecules? [0.0699049312989311]
We investigate how GPT and LLaMA compare with models pre-trained on SMILES when embedding SMILES strings for downstream tasks.
We find that SMILES embeddings generated using LLaMA outperform those from GPT in both molecular property and DDI prediction tasks.
arXiv Detail & Related papers (2024-01-05T18:31:34Z) - MolXPT: Wrapping Molecules with Text for Generative Pre-training [141.0924452870112]
MolXPT is a unified language model of text and molecules pre-trained on SMILES wrapped by text.
MolXPT outperforms strong baselines of molecular property prediction on MoleculeNet.
arXiv Detail & Related papers (2023-05-18T03:58:19Z) - Difficulty in chirality recognition for Transformer architectures learning chemical structures from string [0.0]
We investigate the relationship between the learning progress of SMILES and chemical structure using a representative NLP model, the Transformer.
We show that while the Transformer learns partial structures of molecules quickly, it requires extended training to understand overall structures.
arXiv Detail & Related papers (2023-03-21T04:47:45Z) - Improving Rare Word Recognition with LM-aware MWER Training [50.241159623691885]
We introduce LMs in the learning of hybrid autoregressive transducer (HAT) models in the discriminative training framework.
For the shallow fusion setup, we use LMs during both hypotheses generation and loss computation, and the LM-aware MWER-trained model achieves 10% relative improvement.
For the rescoring setup, we learn a small neural module to generate per-token fusion weights in a data-dependent manner.
arXiv Detail & Related papers (2022-04-15T17:19:41Z) - Parameter Efficient Multimodal Transformers for Video Representation Learning [108.8517364784009]
This work focuses on reducing the parameters of multimodal Transformers in the context of audio-visual video representation learning.
We show that our approach reduces parameters up to 80%, allowing us to train our model end-to-end from scratch.
To demonstrate our approach, we pretrain our model on 30-second clips from Kinetics-700 and transfer it to audio-visual classification tasks.
arXiv Detail & Related papers (2020-12-08T00:16:13Z) - Predicting Chemical Properties using Self-Attention Multi-task Learning based on SMILES Representation [0.0]
In this study, we explore the structural differences among transformer variants and propose a new self-attention-based model.
The representation learning performance of the self-attention module was evaluated in a multi-task learning environment using imbalanced chemical datasets.
arXiv Detail & Related papers (2020-10-19T09:46:50Z) - Learning Latent Space Energy-Based Prior Model for Molecule Generation [59.875533935578375]
We learn a latent space energy-based prior model with the SMILES representation for molecule modeling.
Our method is able to generate molecules with validity and uniqueness competitive with state-of-the-art models.
arXiv Detail & Related papers (2020-10-19T09:34:20Z)