SELFormer: Molecular Representation Learning via SELFIES Language Models
- URL: http://arxiv.org/abs/2304.04662v2
- Date: Thu, 25 May 2023 09:14:14 GMT
- Title: SELFormer: Molecular Representation Learning via SELFIES Language Models
- Authors: Atakan Yüksel, Erva Ulusoy, Atabey Ünlü, Tunca Doğan
- Abstract summary: In this study, we propose SELFormer, a transformer architecture-based chemical language model.
SELFormer is pre-trained on two million drug-like compounds and fine-tuned for diverse molecular property prediction tasks.
Our performance evaluation has revealed that SELFormer outperforms all competing methods, including graph learning-based approaches and SMILES-based chemical language models, on predicting aqueous solubility of molecules and adverse drug reactions.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automated computational analysis of the vast chemical space is critical for
numerous fields of research such as drug discovery and material science.
Representation learning techniques have recently been employed with the primary
objective of generating compact and informative numerical expressions of
complex data. One approach to efficiently learn molecular representations is
processing string-based notations of chemicals via natural language processing
(NLP) algorithms. The majority of the methods proposed so far utilize SMILES
notations for this purpose; however, SMILES is associated with numerous
problems related to validity and robustness, which may prevent the model from
effectively uncovering the knowledge hidden in the data. In this study, we
propose SELFormer, a transformer architecture-based chemical language model
that utilizes a 100% valid, compact and expressive notation, SELFIES, as input,
in order to learn flexible and high-quality molecular representations.
SELFormer is pre-trained on two million drug-like compounds and fine-tuned for
diverse molecular property prediction tasks. Our performance evaluation has
revealed that SELFormer outperforms all competing methods, including graph
learning-based approaches and SMILES-based chemical language models, on
predicting aqueous solubility of molecules and adverse drug reactions. We also
visualized molecular representations learned by SELFormer via dimensionality
reduction, which indicated that even the pre-trained model can discriminate
molecules with differing structural properties. We shared SELFormer as a
programmatic tool, together with its datasets and pre-trained models. Overall,
our research demonstrates the benefit of using the SELFIES notation in the
context of chemical language modeling and opens up new possibilities for the
design and discovery of novel drug candidates with desired features.
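As an illustration of how a string-based notation can be fed to a language model, the following is a minimal sketch, not the authors' pipeline. It assumes only that a SELFIES string is a sequence of bracketed symbols; the helper names (`split_symbols`, `build_vocab`, `encode`), the special tokens, and the two example strings are hypothetical choices made for this sketch, and SELFormer's actual tokenizer may differ.

```python
import re

# Assumption: every SELFIES symbol is a bracketed token such as [C] or [=O].
# This sketch ignores the "." separator used for disconnected fragments.
SYMBOL_PATTERN = re.compile(r"\[[^\[\]]+\]")


def split_symbols(selfies_string):
    """Split a SELFIES string into its bracketed symbols."""
    symbols = SYMBOL_PATTERN.findall(selfies_string)
    # Reject input that contains characters outside the brackets.
    if "".join(symbols) != selfies_string:
        raise ValueError(f"not a bracketed symbol sequence: {selfies_string!r}")
    return symbols


def build_vocab(corpus, specials=("<pad>", "<unk>")):
    """Assign an integer id to each special token and each observed symbol."""
    vocab = {token: index for index, token in enumerate(specials)}
    for selfies_string in corpus:
        for symbol in split_symbols(selfies_string):
            vocab.setdefault(symbol, len(vocab))
    return vocab


def encode(selfies_string, vocab, max_len):
    """Map symbols to ids, then truncate or pad to a fixed length."""
    ids = [vocab.get(s, vocab["<unk>"]) for s in split_symbols(selfies_string)]
    ids = ids[:max_len]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))


# Illustrative strings only (ethanol-like and acetaldehyde-like examples).
corpus = ["[C][C][O]", "[C][C][=O]"]
vocab = build_vocab(corpus)
print(encode("[C][C][O]", vocab, max_len=5))  # -> [2, 2, 3, 0, 0]
print(encode("[C][N]", vocab, max_len=5))     # -> [2, 1, 0, 0, 0]
```

Because every symbol is a whole token, the resulting id sequences can be passed to a transformer encoder in the same way as word ids in ordinary NLP; symbols not seen when the vocabulary was built fall back to the `<unk>` id.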
Related papers
- Instruction Multi-Constraint Molecular Generation Using a Teacher-Student Large Language Model [50.756644656847165]
We introduce a multi-constraint molecular generation large language model, TSMMG, akin to a student.
To train TSMMG, we construct a large set of text-molecule pairs by extracting molecular knowledge from these 'teachers'.
We experimentally show that TSMMG performs remarkably well in generating molecules that meet complex property requirements described in natural language.
arXiv Detail & Related papers (2024-03-20T02:15:55Z)
- Multi-Modal Representation Learning for Molecular Property Prediction: Sequence, Graph, Geometry [6.049566024728809]
Deep learning-based molecular property prediction has emerged as a solution to the resource-intensive nature of traditional methods.
In this paper, we propose a novel multi-modal representation learning model, called SGGRL, for molecular property prediction.
To ensure consistency across modalities, SGGRL is trained to maximize the similarity of representations for the same molecule while minimizing similarity for different molecules.
arXiv Detail & Related papers (2024-01-07T02:18:00Z)
- Structure to Property: Chemical Element Embeddings and a Deep Learning Approach for Accurate Prediction of Chemical Properties [0.0]
This paper introduces a new machine learning model based on deep learning techniques, such as a multilayer encoder and decoder architecture, for classification tasks.
We demonstrate the opportunities offered by our approach by applying it to various types of input data, including organic and inorganic compounds.
The models used in this work exhibit a high degree of predictive power, underscoring the progress that can be made with refined machine learning.
arXiv Detail & Related papers (2023-09-17T19:41:32Z)
- Empowering Molecule Discovery for Molecule-Caption Translation with Large Language Models: A ChatGPT Perspective [53.300288393173204]
Large Language Models (LLMs) have shown remarkable performance in various cross-modal tasks.
In this work, we propose an In-context Few-Shot Molecule Learning paradigm for molecule-caption translation.
We evaluate the effectiveness of MolReGPT on molecule-caption translation, including molecule understanding and text-based molecule generation.
arXiv Detail & Related papers (2023-06-11T08:16:25Z)
- Implicit Geometry and Interaction Embeddings Improve Few-Shot Molecular Property Prediction [53.06671763877109]
We develop molecular embeddings that encode complex molecular characteristics to improve the performance of few-shot molecular property prediction.
Our approach leverages large amounts of synthetic data, namely the results of molecular docking calculations.
On multiple molecular property prediction benchmarks, training from the embedding space substantially improves Multi-Task, MAML, and Prototypical Network few-shot learning performance.
arXiv Detail & Related papers (2023-02-04T01:32:40Z)
- A Molecular Multimodal Foundation Model Associating Molecule Graphs with Natural Language [63.60376252491507]
We propose a molecular multimodal foundation model which is pretrained from molecular graphs and their semantically related textual data.
We believe that our model would have a broad impact on AI-empowered fields across disciplines such as biology, chemistry, materials, environment, and medicine.
arXiv Detail & Related papers (2022-09-12T00:56:57Z)
- Improving VAE based molecular representations for compound property prediction [0.0]
We propose a simple method to improve chemical property prediction performance of machine learning models.
We show the relation between the performance of property prediction models and the distance between property prediction dataset and the larger unlabeled dataset.
arXiv Detail & Related papers (2022-01-13T12:57:11Z)
- Model-agnostic multi-objective approach for the evolutionary discovery of mathematical models [55.41644538483948]
In modern data science, it is often more interesting to understand the properties of a model and which of its parts could be replaced to obtain better results.
We use multi-objective evolutionary optimization for composite data-driven model learning to obtain the algorithm's desired properties.
arXiv Detail & Related papers (2021-07-07T11:17:09Z)
- Do Large Scale Molecular Language Representations Capture Important Structural Information? [31.76876206167457]
We present molecular embeddings obtained by training an efficient transformer encoder model, referred to as MoLFormer.
Experiments show that the learned molecular representation performs competitively when compared to graph-based and fingerprint-based supervised learning baselines.
arXiv Detail & Related papers (2021-06-17T14:33:55Z)
- Reprogramming Language Models for Molecular Representation Learning [65.00999660425731]
We propose Representation Reprogramming via Dictionary Learning (R2DL) for adversarially reprogramming pretrained language models for molecular learning tasks.
The adversarial program learns a linear transformation between a dense source model input space (language data) and a sparse target model input space (e.g., chemical and biological molecule data) using a k-SVD solver.
R2DL achieves the baseline established by state-of-the-art toxicity prediction models trained on domain-specific data and outperforms the baseline in a limited training-data setting.
arXiv Detail & Related papers (2020-12-07T05:50:27Z)
- Predicting Chemical Properties using Self-Attention Multi-task Learning based on SMILES Representation [0.0]
In this study, we explore the structural differences of the transformer-variant model and propose a new self-attention based model.
The representation learning performance of the self-attention module was evaluated in a multi-task learning environment using imbalanced chemical datasets.
arXiv Detail & Related papers (2020-10-19T09:46:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.