SELFormer: Molecular Representation Learning via SELFIES Language Models
- URL: http://arxiv.org/abs/2304.04662v2
- Date: Thu, 25 May 2023 09:14:14 GMT
- Title: SELFormer: Molecular Representation Learning via SELFIES Language Models
- Authors: Atakan Yüksel, Erva Ulusoy, Atabey Ünlü, Tunca Doğan
- Abstract summary: In this study, we propose SELFormer, a transformer architecture-based chemical language model.
SELFormer is pre-trained on two million drug-like compounds and fine-tuned for diverse molecular property prediction tasks.
Our performance evaluation revealed that SELFormer outperforms all competing methods, including graph learning-based approaches and SMILES-based chemical language models, on predicting aqueous solubility and adverse drug reactions.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automated computational analysis of the vast chemical space is critical for
numerous fields of research such as drug discovery and material science.
Representation learning techniques have recently been employed with the primary
objective of generating compact and informative numerical expressions of
complex data. One approach to efficiently learn molecular representations is
processing string-based notations of chemicals via natural language processing
(NLP) algorithms. The majority of the methods proposed so far utilize SMILES
notations for this purpose; however, SMILES is associated with numerous
problems related to validity and robustness, which may prevent the model from
effectively uncovering the knowledge hidden in the data. In this study, we
propose SELFormer, a transformer architecture-based chemical language model
that utilizes a 100% valid, compact and expressive notation, SELFIES, as input,
in order to learn flexible and high-quality molecular representations.
SELFormer is pre-trained on two million drug-like compounds and fine-tuned for
diverse molecular property prediction tasks. Our performance evaluation
revealed that SELFormer outperforms all competing methods, including graph
learning-based approaches and SMILES-based chemical language models, on
predicting aqueous solubility of molecules and adverse drug reactions. We also
visualized molecular representations learned by SELFormer via dimensionality
reduction, which indicated that even the pre-trained model can discriminate
molecules with differing structural properties. We shared SELFormer as a
programmatic tool, together with its datasets and pre-trained models. Overall,
our research demonstrates the benefit of using the SELFIES notations in the
context of chemical language modeling and opens up new possibilities for the
design and discovery of novel drug candidates with desired features.
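To make the notation concrete, here is a minimal sketch that round-trips a molecule between SMILES and SELFIES using the open-source selfies Python library (the notation SELFormer consumes); the example molecule and the tokenization step are illustrative and are not taken from the SELFormer codebase.

```python
# pip install selfies
import selfies as sf

# Aspirin, written in SMILES notation.
smiles = "CC(=O)OC1=CC=CC=C1C(=O)O"

# Encode to SELFIES. Unlike raw SMILES, any string assembled from SELFIES
# symbols decodes to a syntactically valid molecule, which is the
# robustness property the paper builds on.
selfies_str = sf.encoder(smiles)
print(selfies_str)

# SELFIES splits cleanly into discrete symbols, a natural vocabulary
# for a chemical language model.
tokens = list(sf.split_selfies(selfies_str))
print(tokens)

# Decode back to SMILES; the string may differ from the input, but it
# represents the same molecule.
print(sf.decoder(selfies_str))
```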
Related papers
- Pre-trained Molecular Language Models with Random Functional Group Masking [54.900360309677794]
We propose a SMILES-based Molecular Language Model that randomly masks SMILES subsequences corresponding to specific molecular atoms.
This technique aims to compel the model to better infer molecular structures and properties, thus enhancing its predictive capabilities.
arXiv Detail & Related papers (2024-11-03T01:56:15Z)
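As a rough illustration of the masking idea in the functional-group-masking paper above, the sketch below masks one random contiguous span of SMILES tokens. The character-level tokenization and the [MASK] symbol are simplifying assumptions; the paper's own scheme targets subsequences tied to specific atoms.

```python
import random

def mask_random_span(tokens, mask_rate=0.15, mask_token="[MASK]"):
    """Replace one random contiguous span of tokens with mask symbols.

    A toy stand-in for the paper's atom-aware masking: a real
    implementation would pick spans aligned to functional groups.
    """
    tokens = list(tokens)
    span = max(1, int(len(tokens) * mask_rate))
    start = random.randrange(len(tokens) - span + 1)
    for i in range(start, start + span):
        tokens[i] = mask_token
    return tokens

# Character-level tokens of aspirin's SMILES, for illustration only.
print(mask_random_span(list("CC(=O)OC1=CC=CC=C1C(=O)O")))
```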
- FARM: Functional Group-Aware Representations for Small Molecules [55.281754551202326]
We introduce Functional Group-Aware Representations for Small Molecules (FARM).
FARM is a foundation model designed to bridge the gap between SMILES, natural language, and molecular graphs.
We rigorously evaluate FARM on the MoleculeNet dataset, where it achieves state-of-the-art performance on 10 out of 12 tasks.
arXiv Detail & Related papers (2024-10-02T23:04:58Z)
- Smirk: An Atomically Complete Tokenizer for Molecular Foundation Models [0.0]
We systematically evaluate thirteen chemistry-specific tokenizers for their coverage of the SMILES language.
We introduce two new tokenizers, smirk and smirk-gpe, which can represent the entirety of the OpenSMILES specification.
arXiv Detail & Related papers (2024-09-19T02:36:04Z)
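For context on what such chemistry-specific tokenizers do, the sketch below uses a regular-expression pattern that is common in the SMILES-modeling literature; it is not the smirk implementation, and unlike smirk it does not cover the full OpenSMILES grammar (bracket-atom internals, for example, are kept as single tokens).

```python
import re

# A widely used SMILES tokenization pattern from the modeling literature;
# bracket atoms like [NH3+] stay intact, two-letter elements are not split.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p"
    r"|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%\d{2}|\d)"
)

def tokenize_smiles(smiles):
    """Split a SMILES string into chemically meaningful tokens."""
    return SMILES_TOKEN.findall(smiles)

print(tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"))
# ['C', 'C', '(', '=', 'O', ')', 'O', 'c', '1', 'c', ...]
```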
- Crossing New Frontiers: Knowledge-Augmented Large Language Model Prompting for Zero-Shot Text-Based De Novo Molecule Design [0.0]
Our study explores the use of knowledge-augmented prompting of large language models (LLMs) for the zero-shot text-conditional de novo molecular generation task.
Our framework proves effective, outperforming state-of-the-art (SOTA) baseline models on benchmark datasets.
arXiv Detail & Related papers (2024-08-18T11:37:19Z)
- MolTRES: Improving Chemical Language Representation Learning for Molecular Property Prediction [14.353313239109337]
MolTRES is a novel chemical language representation learning framework.
It incorporates generator-discriminator training, allowing the model to learn from more challenging examples.
Our model outperforms existing state-of-the-art models on popular molecular property prediction tasks.
arXiv Detail & Related papers (2024-07-09T01:14:28Z)
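The generator-discriminator phrasing in the MolTRES entry above is reminiscent of ELECTRA-style replaced-token detection; the toy sketch below only shows how such training pairs can be constructed. The vocabulary, corruption rate, and random "generator" are illustrative assumptions, not MolTRES details.

```python
import random

VOCAB = ["C", "N", "O", "(", ")", "=", "1"]  # toy SMILES-like vocabulary

def make_rtd_pair(tokens, corrupt_rate=0.3):
    """Corrupt random positions; the discriminator's target is to flag
    which positions were replaced (1) versus kept (0)."""
    corrupted, labels = [], []
    for tok in tokens:
        if random.random() < corrupt_rate:
            new_tok = random.choice(VOCAB)  # a learned generator in practice
            corrupted.append(new_tok)
            labels.append(int(new_tok != tok))
        else:
            corrupted.append(tok)
            labels.append(0)
    return corrupted, labels

print(make_rtd_pair(list("CC(=O)O")))
```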
- Implicit Geometry and Interaction Embeddings Improve Few-Shot Molecular Property Prediction [53.06671763877109]
We develop molecular embeddings that encode complex molecular characteristics to improve the performance of few-shot molecular property prediction.
Our approach leverages large amounts of synthetic data, namely the results of molecular docking calculations.
On multiple molecular property prediction benchmarks, training from the embedding space substantially improves Multi-Task, MAML, and Prototypical Network few-shot learning performance.
arXiv Detail & Related papers (2023-02-04T01:32:40Z)
- A Molecular Multimodal Foundation Model Associating Molecule Graphs with Natural Language [63.60376252491507]
We propose a molecular multimodal foundation model which is pretrained from molecular graphs and their semantically related textual data.
We believe that our model would have a broad impact on AI-empowered fields across disciplines such as biology, chemistry, materials, environment, and medicine.
arXiv Detail & Related papers (2022-09-12T00:56:57Z)
- Model-agnostic multi-objective approach for the evolutionary discovery of mathematical models [55.41644538483948]
In modern data science, it is often more valuable to understand the properties of a model and which of its parts could be replaced to obtain better results.
We use multi-objective evolutionary optimization for composite data-driven model learning to obtain the algorithm's desired properties.
arXiv Detail & Related papers (2021-07-07T11:17:09Z)
- Do Large Scale Molecular Language Representations Capture Important Structural Information? [31.76876206167457]
We present molecular embeddings obtained by training an efficient transformer encoder model, referred to as MoLFormer.
Experiments show that the learned molecular representation performs competitively when compared to graph-based and fingerprint-based supervised learning baselines.
arXiv Detail & Related papers (2021-06-17T14:33:55Z)
- Reprogramming Language Models for Molecular Representation Learning [65.00999660425731]
We propose Representation Reprogramming via Dictionary Learning (R2DL) for adversarially reprogramming pretrained language models for molecular learning tasks.
The adversarial program learns a linear transformation between a dense source model input space (language data) and a sparse target model input space (e.g., chemical and biological molecule data) using a k-SVD solver.
R2DL matches the baseline established by state-of-the-art toxicity prediction models trained on domain-specific data and outperforms that baseline in a limited training-data setting.
arXiv Detail & Related papers (2020-12-07T05:50:27Z)
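As a rough sketch of the sparse-coding step the R2DL entry above describes, the snippet below uses scikit-learn's generic sparse_encode in place of the paper's k-SVD solver; all dimensions and data are made up for illustration.

```python
import numpy as np
from sklearn.decomposition import sparse_encode

rng = np.random.default_rng(0)

# Stand-ins: dense token embeddings from a pretrained source language
# model (the dictionary atoms) and a batch of target-domain inputs of
# the same dimensionality. Sizes are arbitrary.
source_vocab = rng.normal(size=(100, 64))   # 100 source tokens, 64-dim
target_batch = rng.normal(size=(20, 64))    # 20 target-domain items

# Express each target item as a sparse combination of source embeddings,
# i.e. a linear map between the sparse target space and the dense source
# space. R2DL learns this map with k-SVD; OMP is a simpler stand-in.
codes = sparse_encode(target_batch, source_vocab,
                      algorithm="omp", n_nonzero_coefs=5)
print(codes.shape)  # (20, 100): sparse weights over source tokens
```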
- Predicting Chemical Properties using Self-Attention Multi-task Learning based on SMILES Representation [0.0]
In this study, we explore the structural differences of transformer-variant models and propose a new self-attention-based model.
The representation learning performance of the self-attention module was evaluated in a multi-task learning environment using imbalanced chemical datasets.
arXiv Detail & Related papers (2020-10-19T09:46:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.