ChemBERTa-2: Towards Chemical Foundation Models
- URL: http://arxiv.org/abs/2209.01712v1
- Date: Mon, 5 Sep 2022 00:31:12 GMT
- Title: ChemBERTa-2: Towards Chemical Foundation Models
- Authors: Walid Ahmad, Elana Simon, Seyone Chithrananda, Gabriel Grand, Bharath
Ramsundar
- Abstract summary: We build a chemical foundation model, ChemBERTa-2, using the language of SMILES.
In this work, we build upon ChemBERTa by optimizing the pretraining process.
To our knowledge, the 77M set constitutes one of the largest datasets used for molecular pretraining to date.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large pretrained models such as GPT-3 have had tremendous impact on modern
natural language processing by leveraging self-supervised learning to learn
salient representations that can be used to readily finetune on a wide variety
of downstream tasks. We investigate the possibility of transferring such
advances to molecular machine learning by building a chemical foundation model,
ChemBERTa-2, using the language of SMILES. While labeled data for molecular
prediction tasks is typically scarce, libraries of SMILES strings are readily
available. In this work, we build upon ChemBERTa by optimizing the pretraining
process. We compare multi-task and self-supervised pretraining by varying
hyperparameters and pretraining dataset size, up to 77M compounds from PubChem.
To our knowledge, the 77M set constitutes one of the largest datasets used for
molecular pretraining to date. We find that with these pretraining
improvements, we are competitive with existing state-of-the-art architectures
on the MoleculeNet benchmark suite. We analyze the degree to which improvements
in pretraining translate to improvement on downstream tasks.
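To make the fine-tuning setup described in the abstract concrete, the following is a minimal sketch, assuming the Hugging Face transformers API and PyTorch; the checkpoint name, toy SMILES strings, and labels are illustrative assumptions rather than values taken from the paper. It shows how a pretrained SMILES language model can be adapted to a binary property-prediction task of the kind found in MoleculeNet.

    # Minimal sketch (not the authors' code): fine-tune a pretrained SMILES
    # language model on a binary molecular property-prediction task.
    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    # Assumed checkpoint id; substitute whichever ChemBERTa-2 checkpoint you use.
    checkpoint = "DeepChem/ChemBERTa-77M-MLM"
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

    smiles = ["CCO", "c1ccccc1O"]   # toy SMILES inputs
    labels = torch.tensor([0, 1])   # toy binary property labels

    batch = tokenizer(smiles, padding=True, return_tensors="pt")
    outputs = model(**batch, labels=labels)   # forward pass with cross-entropy loss
    outputs.loss.backward()                   # one gradient step of fine-tuning

In practice this forward/backward pass would sit inside a training loop with an optimizer and a MoleculeNet data loader; the snippet only illustrates the shape of the computation.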
Related papers
- Pre-trained Molecular Language Models with Random Functional Group Masking [54.900360309677794]
We propose a SMILES-based Molecular Language Model that randomly masks SMILES subsequences corresponding to specific molecular atoms.
This technique aims to compel the model to better infer molecular structures and properties, thus enhancing its predictive capabilities (a generic masked-SMILES pretraining sketch appears at the end of this page).
arXiv Detail & Related papers (2024-11-03T01:56:15Z) - Instruction Pre-Training: Language Models are Supervised Multitask Learners [115.95022434390181]
In this paper, we propose a framework that augments massive raw corpora with instruction-response pairs to pre-train language models (LMs).
In our experiments, we synthesize 200M instruction-response pairs covering 40+ task categories to verify the effectiveness of Instruction Pre-Training.
arXiv Detail & Related papers (2024-06-20T16:55:33Z) - GP-MoLFormer: A Foundation Model For Molecular Generation [31.569161097828893]
In this work, we extend the paradigm of training chemical language transformers on large-scale chemical datasets to generative tasks.
Specifically, we propose GP-MoLFormer, an autoregressive molecular string generator that is trained on more than 1.1B chemical SMILES.
We find GP-MoLFormer is able to generate a significant fraction of novel, valid, and unique SMILES even when the number of generated molecules is in the 10 billion range and the reference set is over a billion.
arXiv Detail & Related papers (2024-04-04T16:20:06Z) - Instruction Multi-Constraint Molecular Generation Using a Teacher-Student Large Language Model [49.64512917330373]
We introduce TSMMG, a multi-constraint molecular generation large language model that acts as the student in a teacher-student framework.
To train TSMMG, we construct a large set of text-molecule pairs by extracting molecular knowledge from a collection of 'teacher' models.
We experimentally show that TSMMG performs remarkably well at generating molecules that meet complex property requirements described in natural language.
arXiv Detail & Related papers (2024-03-20T02:15:55Z) - MolXPT: Wrapping Molecules with Text for Generative Pre-training [141.0924452870112]
MolXPT is a unified language model of text and molecules pre-trained on SMILES wrapped by text.
MolXPT outperforms strong baselines of molecular property prediction on MoleculeNet.
arXiv Detail & Related papers (2023-05-18T03:58:19Z) - Implicit Geometry and Interaction Embeddings Improve Few-Shot Molecular
Property Prediction [53.06671763877109]
We develop molecular embeddings that encode complex molecular characteristics to improve the performance of few-shot molecular property prediction.
Our approach leverages large amounts of synthetic data, namely the results of molecular docking calculations.
On multiple molecular property prediction benchmarks, training from the embedding space substantially improves Multi-Task, MAML, and Prototypical Network few-shot learning performance.
arXiv Detail & Related papers (2023-02-04T01:32:40Z) - MolE: a molecular foundation model for drug discovery [0.2802437011072858]
MolE is a molecular foundation model that adapts the DeBERTa architecture to be used on molecular graphs.
We show that fine-tuning pretrained MolE achieves state-of-the-art results on 9 of the 22 ADMET tasks included in the Therapeutic Data Commons.
arXiv Detail & Related papers (2022-11-03T21:22:05Z) - Improving Molecular Representation Learning with Metric
Learning-enhanced Optimal Transport [49.237577649802034]
We develop a novel optimal transport-based algorithm, termed MROT, to enhance the generalization capability of learned molecular representations on molecular regression problems.
MROT significantly outperforms state-of-the-art models, showing promising potential in accelerating the discovery of new substances.
arXiv Detail & Related papers (2022-02-13T04:56:18Z) - Do Large Scale Molecular Language Representations Capture Important
Structural Information? [31.76876206167457]
We present molecular embeddings obtained by training an efficient transformer encoder model, referred to as MoLFormer.
Experiments show that the learned molecular representation performs competitively, when compared to graph-based and fingerprint-based supervised learning baselines.
arXiv Detail & Related papers (2021-06-17T14:33:55Z) - ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular
Property Prediction [0.0]
In NLP, transformers have become the de facto standard for representation learning thanks to their strong downstream task transfer.
We make one of the first attempts to systematically evaluate transformers on molecular property prediction tasks via our ChemBERTa model.
Our results suggest that transformers offer a promising avenue of future work for molecular representation learning and property prediction.
arXiv Detail & Related papers (2020-10-19T21:41:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
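As referenced in the random functional group masking entry above, the self-supervised objective behind ChemBERTa-2 and several of the related papers is masked-language-model pretraining on SMILES strings. The sketch below is an illustrative assumption rather than the exact procedure of any paper listed here: the checkpoint id, the toy SMILES corpus, and the 15% masking rate are placeholders.

    # Minimal sketch of masked-language-model pretraining on SMILES strings.
    from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                              DataCollatorForLanguageModeling)

    checkpoint = "DeepChem/ChemBERTa-77M-MLM"   # assumed checkpoint id
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForMaskedLM.from_pretrained(checkpoint)

    corpus = ["CC(=O)Oc1ccccc1C(=O)O", "CN1CCC[C@H]1c1cccnc1"]   # toy SMILES corpus
    collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

    features = [tokenizer(s) for s in corpus]
    batch = collator(features)      # randomly masks ~15% of tokens and builds labels
    loss = model(**batch).loss      # cross-entropy over the masked positions only
    loss.backward()

Note that token-level random masking, as done by this collator, differs from masking whole SMILES subsequences tied to specific atoms or functional groups; the latter would require a custom collator that selects chemically meaningful spans.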