BARTSmiles: Generative Masked Language Models for Molecular
Representations
- URL: http://arxiv.org/abs/2211.16349v1
- Date: Tue, 29 Nov 2022 16:30:53 GMT
- Title: BARTSmiles: Generative Masked Language Models for Molecular
Representations
- Authors: Gayane Chilingaryan, Hovhannes Tamoyan, Ani Tevosyan, Nelly Babayan,
Lusine Khondkaryan, Karen Hambardzumyan, Zaven Navoyan, Hrant Khachatrian,
Armen Aghajanyan
- Abstract summary: We train BARTSmiles, a BART-like model with an order of magnitude more compute than previous self-supervised molecular representations.
In-depth evaluations show that BARTSmiles consistently outperforms other self-supervised representations across classification, regression, and generation tasks.
- Score: 10.012900591467938
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We discover a robust self-supervised strategy tailored towards molecular
representations for generative masked language models through a series of
tailored, in-depth ablations. Using this pre-training strategy, we train
BARTSmiles, a BART-like model with an order of magnitude more compute than
previous self-supervised molecular representations. In-depth evaluations show
that BARTSmiles consistently outperforms other self-supervised representations
across classification, regression, and generation tasks, setting a new
state-of-the-art on 11 tasks. We then quantitatively show that when applied to
the molecular domain, the BART objective learns representations that implicitly
encode our downstream tasks of interest. For example, by selecting seven
neurons from a frozen BARTSmiles, we can obtain a model whose performance is
within two percentage points of the fully fine-tuned model on the ClinTox task.
Lastly, we show that standard attribution interpretability methods, when
applied to BARTSmiles, highlight certain substructures that chemists use to
explain specific properties of molecules. The code and the pretrained model are
publicly available.
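The seven-neuron result above suggests a simple probing recipe: freeze the encoder, mean-pool its hidden states into one vector per molecule, pick a handful of label-correlated dimensions, and fit a linear classifier on just those. The following is a minimal sketch of that idea, not the authors' released code; the stand-in checkpoint (facebook/bart-base), the toy SMILES strings and labels, and the correlation-based neuron selection are all assumptions.

```python
# Minimal sketch (not the authors' code): probe a frozen BART-style SMILES
# encoder by fitting a linear classifier on a few selected neurons.
# Checkpoint, toy data, and the selection heuristic are placeholders.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoTokenizer, BartModel

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")  # stand-in checkpoint
model = BartModel.from_pretrained("facebook/bart-base").eval()

def embed(smiles_batch):
    """Mean-pool the frozen encoder's last hidden state over tokens."""
    enc = tokenizer(smiles_batch, padding=True, return_tensors="pt")
    with torch.no_grad():
        h = model.encoder(**enc).last_hidden_state        # (B, T, D)
    mask = enc["attention_mask"].unsqueeze(-1)            # (B, T, 1)
    return ((h * mask).sum(1) / mask.sum(1)).numpy()      # (B, D)

# Toy SMILES with placeholder binary labels (a real probe would use ClinTox).
train_smiles = ["CCO", "c1ccccc1", "CC(=O)O", "CCN(CC)CC"]
train_y = np.array([0, 1, 0, 1])
X = embed(train_smiles)

# Select the k neurons whose activations correlate most with the label,
# then fit a linear probe on those dimensions only.
k = 7
corr = np.abs([np.corrcoef(X[:, j], train_y)[0, 1] for j in range(X.shape[1])])
top = np.argsort(-np.nan_to_num(corr))[:k]
probe = LogisticRegression(max_iter=1000).fit(X[:, top], train_y)
print("selected neurons:", top, "train accuracy:", probe.score(X[:, top], train_y))
```

Swapping in the released BARTSmiles checkpoint and a real ClinTox split would turn this toy probe into an attempt at reproducing the reported result.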
Related papers
- Pre-trained Molecular Language Models with Random Functional Group Masking [54.900360309677794]
We propose a SMILES-based Molecular Language Model that randomly masks SMILES subsequences corresponding to specific molecular atoms (a generic span-masking sketch appears after this list).
This technique aims to compel the model to better infer molecular structures and properties, thus enhancing its predictive capabilities.
arXiv Detail & Related papers (2024-11-03T01:56:15Z)
- Multi-Modal Representation Learning for Molecular Property Prediction: Sequence, Graph, Geometry [6.049566024728809]
Deep learning-based molecular property prediction has emerged as a solution to the resource-intensive nature of traditional methods.
In this paper, we propose a novel multi-modal representation learning model, called SGGRL, for molecular property prediction.
To ensure consistency across modalities, SGGRL is trained to maximize the similarity of representations for the same molecule while minimizing similarity for different molecules.
arXiv Detail & Related papers (2024-01-07T02:18:00Z) - AdaMR: Adaptable Molecular Representation for Unified Pre-training Strategy [11.710702202071573]
We propose a new large-scale uniform pre-training strategy for small-molecule drugs, called Molecular Adjustable Representation (AdaMR).
AdaMR utilizes a granularity-adjustable molecular encoding strategy, which is accomplished through a pre-training job termed molecular canonicalization.
We fine-tuned our proposed pre-trained model on six molecular property prediction tasks and two generative tasks, achieving state-of-the-art (SOTA) results on five out of eight tasks.
arXiv Detail & Related papers (2023-12-28T10:53:17Z) - MultiModal-Learning for Predicting Molecular Properties: A Framework Based on Image and Graph Structures [2.5563339057415218]
MolIG is a novel MultiModaL molecular pre-training framework for predicting molecular properties based on Image and Graph structures.
It amalgamates the strengths of both molecular representation forms.
It exhibits enhanced performance in downstream tasks pertaining to molecular property prediction within benchmark groups.
arXiv Detail & Related papers (2023-11-28T10:28:35Z) - Molecule Design by Latent Space Energy-Based Modeling and Gradual
Distribution Shifting [53.44684898432997]
Generation of molecules with desired chemical and biological properties is critical for drug discovery.
We propose a probabilistic generative model to capture the joint distribution of molecules and their properties.
Our method achieves very strong performance on various molecule design tasks.
arXiv Detail & Related papers (2023-06-09T03:04:21Z) - Atomic and Subgraph-aware Bilateral Aggregation for Molecular
Representation Learning [57.670845619155195]
We introduce a new model for molecular representation learning called the Atomic and Subgraph-aware Bilateral Aggregation (ASBA).
ASBA addresses the limitations of previous atom-wise and subgraph-wise models by incorporating both types of information.
Our method offers a more comprehensive way to learn representations for molecular property prediction and has broad potential in drug and material discovery applications.
arXiv Detail & Related papers (2023-05-22T00:56:00Z) - t-SMILES: A Scalable Fragment-based Molecular Representation Framework for De Novo Molecule Generation [9.116670221263753]
This study introduces a flexible, fragment-based, multiscale molecular representation framework called t-SMILES.
It describes molecules using SMILES-type strings obtained by performing a breadth-first search on a full binary tree formed from a fragmented molecular graph.
It significantly outperforms classical SMILES, DeepSMILES, SELFIES and baseline models in goal-directed tasks.
arXiv Detail & Related papers (2023-01-04T21:41:01Z) - MolCPT: Molecule Continuous Prompt Tuning to Generalize Molecular
Representation Learning [77.31492888819935]
We propose a novel paradigm of "pre-train, prompt, fine-tune" for molecular representation learning, named molecule continuous prompt tuning (MolCPT).
MolCPT defines a motif prompting function that uses the pre-trained model to project the standalone input into an expressive prompt.
Experiments on several benchmark datasets show that MolCPT efficiently generalizes pre-trained GNNs for molecular property prediction.
arXiv Detail & Related papers (2022-12-20T19:32:30Z) - Exploring Target Representations for Masked Autoencoders [78.57196600585462]
We show that a careful choice of the target representation is unnecessary for learning good representations.
We propose a multi-stage masked distillation pipeline and use a randomly initialized model as the teacher.
The proposed method, masked knowledge distillation with bootstrapped teachers (dBOT), outperforms previous self-supervised methods by nontrivial margins.
arXiv Detail & Related papers (2022-09-08T16:55:19Z) - Goal-directed Generation of Discrete Structures with Conditional
Generative Models [85.51463588099556]
We introduce a novel approach to directly optimize a reinforcement learning objective, maximizing an expected reward.
We test our methodology on two tasks: generating molecules with user-defined properties and identifying short Python expressions that evaluate to a given target value.
arXiv Detail & Related papers (2020-10-05T20:03:13Z)
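Both BARTSmiles and the functional-group-masking paper listed above rely on corrupting SMILES strings and training a sequence-to-sequence model to reconstruct them. Below is a minimal, generic sketch of BART-style text infilling applied to a SMILES string, not the exact corruption pipeline of either paper; the regex tokenizer, mask ratio, and span-length distribution are assumptions.

```python
# Illustrative sketch of BART-style text infilling on SMILES: contiguous token
# spans are replaced with a single <mask> token; a seq2seq model would be
# trained to recover the original string. Tokenizer and hyperparameters are
# placeholders, not the published corruption scheme.
import random
import re

SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|Si|@@|@|=|#|\(|\)|\.|/|\\|\+|-|[0-9]|[A-Za-z])"
)

def tokenize(smiles: str) -> list:
    return SMILES_TOKEN.findall(smiles)

def infill_corrupt(tokens, mask_ratio=0.3, mean_span=3):
    """Replace roughly mask_ratio of tokens with single <mask> tokens, span by span."""
    out, i, budget = [], 0, int(len(tokens) * mask_ratio)
    while i < len(tokens):
        if budget > 0 and random.random() < mask_ratio:
            span = min(max(1, int(random.expovariate(1 / mean_span))), budget)
            out.append("<mask>")
            i += span
            budget -= span
        else:
            out.append(tokens[i])
            i += 1
    return out

random.seed(0)
smiles = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin
src = infill_corrupt(tokenize(smiles))
print("input :", " ".join(src))
print("target:", smiles)
```

Pre-training on many such (corrupted, original) pairs is the denoising objective the abstract refers to; variants such as masking whole functional groups change only how the corrupted spans are chosen.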