MolE: a molecular foundation model for drug discovery
- URL: http://arxiv.org/abs/2211.02657v1
- Date: Thu, 3 Nov 2022 21:22:05 GMT
- Title: MolE: a molecular foundation model for drug discovery
- Authors: Oscar Méndez-Lucio, Christos Nicolaou, Berton Earnshaw
- Abstract summary: MolE is a molecular foundation model that adapts the DeBERTa architecture to be used on molecular graphs.
We show that fine-tuning pretrained MolE achieves state-of-the-art results on 9 of the 22 ADMET tasks included in the Therapeutic Data Commons.
- Score: 0.2802437011072858
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Models that accurately predict properties based on chemical structure are
valuable tools in drug discovery. However, for many properties, public and
private training sets are typically small, and it is difficult for the models
to generalize well outside of the training data. Recently, large language
models have addressed this problem by using self-supervised pretraining on
large unlabeled datasets, followed by fine-tuning on smaller, labeled datasets.
In this paper, we report MolE, a molecular foundation model that adapts the
DeBERTa architecture to be used on molecular graphs together with a two-step
pretraining strategy. The first step of pretraining is a self-supervised
approach focused on learning chemical structures, and the second step is a
massive multi-task approach to learn biological information. We show that
fine-tuning pretrained MolE achieves state-of-the-art results on 9 of the 22
ADMET tasks included in the Therapeutic Data Commons.
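To make the pretrain-then-fine-tune workflow described in the abstract concrete, below is a minimal sketch of fine-tuning a pretrained molecular encoder on a small labeled ADMET endpoint. The encoder stub and synthetic data are placeholders for illustration only; this is not the released MolE model or a real ADMET benchmark loader.

```python
# Minimal sketch of the pretrain-then-fine-tune workflow, assuming a molecule encoder
# that returns one embedding per molecule. The encoder and data below are stand-ins,
# not the released MolE code.
import torch
import torch.nn as nn

class MolEncoderStub(nn.Module):
    """Stand-in for a pretrained molecular encoder (e.g. a DeBERTa-style graph transformer)."""
    def __init__(self, in_dim: int = 64, hidden_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU())

    def forward(self, x):                      # x: (batch, in_dim) molecule features
        return self.net(x)                     # (batch, hidden_dim) molecule embeddings

class FineTunedPropertyModel(nn.Module):
    def __init__(self, encoder: nn.Module, hidden_dim: int = 256):
        super().__init__()
        self.encoder = encoder                 # pretrained backbone, fine-tuned end to end
        self.head = nn.Linear(hidden_dim, 1)   # one scalar per ADMET endpoint

    def forward(self, x):
        return self.head(self.encoder(x)).squeeze(-1)

# Synthetic stand-in for a small labeled ADMET dataset (binary endpoint).
features = torch.randn(128, 64)
labels = torch.randint(0, 2, (128,)).float()

model = FineTunedPropertyModel(MolEncoderStub())
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)   # small learning rate, typical for fine-tuning
loss_fn = nn.BCEWithLogitsLoss()

for _ in range(3):                             # a few fine-tuning passes over the small dataset
    optimizer.zero_grad()
    loss = loss_fn(model(features), labels)
    loss.backward()
    optimizer.step()
```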
Related papers
- Two-Stage Pretraining for Molecular Property Prediction in the Wild [38.31911435361748]
We introduce MoleVers, a versatile pretrained model designed for various types of molecular property prediction in the wild.
MoleVers learns representations from large unlabeled datasets via masked atom prediction and dynamic denoising.
In the second stage, MoleVers is further pretrained using auxiliary labels obtained with inexpensive computational methods.
arXiv Detail & Related papers (2024-11-05T22:36:17Z) - Pre-trained Molecular Language Models with Random Functional Group Masking [54.900360309677794]
We propose a SMILES-based Molecular Language Model that randomly masks SMILES subsequences corresponding to specific molecular atoms.
This technique aims to compel the model to better infer molecular structures and properties, thus enhancing its predictive capabilities.
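As a rough illustration of this masking objective, the sketch below masks individual atom tokens in a SMILES string at random for a masked-language-model target. The paper masks subsequences tied to functional groups, so the atom-level tokenizer and mask rate here are simplifying assumptions, not the authors' code.

```python
# Rough sketch of masked-language-model data preparation on SMILES: randomly replace
# atom tokens with a mask symbol so the model must infer them from context.
import random
import re

# Very small SMILES tokenizer: bracket atoms, two-letter halogens, then single characters.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|.")
ATOM_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFIbcnops]")

def mask_smiles(smiles: str, mask_rate: float = 0.15, mask_symbol: str = "[MASK]"):
    tokens = SMILES_TOKEN.findall(smiles)
    masked, targets = [], []
    for tok in tokens:
        if ATOM_TOKEN.fullmatch(tok) and random.random() < mask_rate:
            masked.append(mask_symbol)
            targets.append(tok)          # the model is trained to recover these tokens
        else:
            masked.append(tok)
            targets.append(None)         # positions that do not contribute to the loss
    return masked, targets

masked, targets = mask_smiles("CC(=O)Oc1ccccc1C(=O)O")   # aspirin
print("".join(masked))
```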
arXiv Detail & Related papers (2024-11-03T01:56:15Z) - A Large Encoder-Decoder Family of Foundation Models For Chemical Language [1.1073864511426255]
This paper introduces a large encoder-decoder family of chemical foundation models pre-trained on a curated dataset of 91 million SMILES samples sourced from PubChem.
Our experiments across multiple benchmark datasets validate the capacity of the proposed model in providing state-of-the-art results for different tasks.
arXiv Detail & Related papers (2024-07-24T20:30:39Z) - Bi-level Contrastive Learning for Knowledge-Enhanced Molecule Representations [55.42602325017405]
We propose a novel method called GODE, which takes into account the two-level structure of individual molecules.
By pre-training two graph neural networks (GNNs) on different graph structures, combined with contrastive learning, GODE fuses molecular structures with their corresponding knowledge graph substructures.
When fine-tuned across 11 chemical property tasks, our model outperforms existing benchmarks, registering an average ROC-AUC improvement of 13.8% on classification tasks and an average RMSE/MAE improvement of 35.1% on regression tasks.
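The cross-view contrastive objective such a method relies on can be written as an InfoNCE-style loss between paired embeddings of the same molecule from two views (e.g. the molecular graph and its knowledge-graph substructure). The code below is a generic sketch of that loss, not GODE's actual implementation.

```python
# Minimal InfoNCE-style contrastive loss between two embedding views of the same molecules.
import torch
import torch.nn.functional as F

def info_nce(z_mol: torch.Tensor, z_kg: torch.Tensor, temperature: float = 0.1):
    """z_mol, z_kg: (batch, dim) embeddings of the same molecules from two views."""
    z_mol = F.normalize(z_mol, dim=-1)
    z_kg = F.normalize(z_kg, dim=-1)
    logits = z_mol @ z_kg.t() / temperature   # (batch, batch) pairwise similarities
    targets = torch.arange(z_mol.size(0))     # matching pairs sit on the diagonal
    return F.cross_entropy(logits, targets)

loss = info_nce(torch.randn(8, 128), torch.randn(8, 128))
```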
arXiv Detail & Related papers (2023-06-02T15:49:45Z) - MolCPT: Molecule Continuous Prompt Tuning to Generalize Molecular Representation Learning [77.31492888819935]
We propose a novel paradigm of "pre-train, prompt, fine-tune" for molecular representation learning, named molecule continuous prompt tuning (MolCPT).
MolCPT defines a motif prompting function that uses the pre-trained model to project the standalone input into an expressive prompt.
Experiments on several benchmark datasets show that MolCPT efficiently generalizes pre-trained GNNs for molecular property prediction.
arXiv Detail & Related papers (2022-12-20T19:32:30Z) - Improving Molecular Pretraining with Complementary Featurizations [20.86159731100242]
Molecular pretraining is a paradigm for solving a variety of tasks in computational chemistry and drug discovery.
We show that different featurization techniques convey chemical information differently.
We propose a simple and effective MOlecular pretraining framework with COmplementary featurizations (MOCO).
arXiv Detail & Related papers (2022-09-29T21:11:09Z) - ChemBERTa-2: Towards Chemical Foundation Models [0.0]
We build a chemical foundation model, ChemBERTa-2, using the language of SMILES.
In this work, we build upon ChemBERTa by optimizing the pretraining process.
To our knowledge, the 77M set constitutes one of the largest datasets used for molecular pretraining to date.
arXiv Detail & Related papers (2022-09-05T00:31:12Z) - Retrieval-based Controllable Molecule Generation [63.44583084888342]
We propose a new retrieval-based framework for controllable molecule generation.
We use a small set of molecules to steer the pre-trained generative model towards synthesizing molecules that satisfy the given design criteria.
Our approach is agnostic to the choice of generative models and requires no task-specific fine-tuning.
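The retrieval step such a framework depends on can be approximated by ranking an exemplar pool by fingerprint similarity to a query molecule. The sketch below uses RDKit Morgan fingerprints and Tanimoto similarity as stand-ins and omits how the retrieved set conditions the pretrained generator; it is not the authors' framework.

```python
# Illustrative sketch of the retrieval step only: rank an exemplar pool by Tanimoto
# similarity to a query and return the top-k molecules used to steer generation.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def fingerprint(smiles: str):
    mol = Chem.MolFromSmiles(smiles)
    return AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)

def retrieve_exemplars(query_smiles: str, pool_smiles: list[str], k: int = 5):
    query_fp = fingerprint(query_smiles)
    scored = [(DataStructs.TanimotoSimilarity(query_fp, fingerprint(s)), s) for s in pool_smiles]
    scored.sort(reverse=True)             # most similar exemplars first
    return [s for _, s in scored[:k]]     # small set handed to the generator as guidance

pool = ["CCO", "CCN", "c1ccccc1O", "CC(=O)Oc1ccccc1C(=O)O", "CCCCO"]
print(retrieve_exemplars("c1ccccc1", pool, k=3))
```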
arXiv Detail & Related papers (2022-08-23T17:01:16Z) - Few-Shot Graph Learning for Molecular Property Prediction [46.60746023179724]
We propose Meta-MGNN, a novel model for few-shot molecular property prediction.
To exploit unlabeled molecular information, Meta-MGNN further incorporates molecular structure- and attribute-based self-supervised modules and self-attentive task weights.
Extensive experiments on two public multi-property datasets demonstrate that Meta-MGNN outperforms a variety of state-of-the-art methods.
arXiv Detail & Related papers (2021-02-16T01:55:34Z) - Self-Supervised Graph Transformer on Large-Scale Molecular Data [73.3448373618865]
We propose a novel framework, GROVER, for molecular representation learning.
GROVER can learn rich structural and semantic information of molecules from enormous unlabelled molecular data.
We pre-train GROVER with 100 million parameters on 10 million unlabelled molecules -- the biggest GNN and the largest training dataset in molecular representation learning.
arXiv Detail & Related papers (2020-06-18T08:37:04Z) - A semi-supervised learning framework for quantitative structure-activity regression modelling [0.0]
We show that it is possible to make predictions which take into account the similarity of the testing compounds to those in the training data and adjust for the reporting selection bias.
We illustrate this approach using publicly available structure-activity data on a large set of compounds reported by GlaxoSmithKline.
arXiv Detail & Related papers (2020-01-07T07:56:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.