Uni-Mol2: Exploring Molecular Pretraining Model at Scale
- URL: http://arxiv.org/abs/2406.14969v2
- Date: Mon, 1 Jul 2024 09:08:44 GMT
- Title: Uni-Mol2: Exploring Molecular Pretraining Model at Scale
- Authors: Xiaohong Ji, Zhen Wang, Zhifeng Gao, Hang Zheng, Linfeng Zhang, Guolin Ke, Weinan E
- Abstract summary: We present Uni-Mol2, an innovative molecular pretraining model that integrates features at the atomic level, graph level, and geometry structure level.
We successfully scale Uni-Mol2 to 1.1 billion parameters through pretraining on 800 million conformations, making it the largest molecular pretraining model to date.
- Score: 27.172011090947823
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: In recent years, pretraining models have made significant advancements in the fields of natural language processing (NLP), computer vision (CV), and life sciences. The significant advancements in NLP and CV are predominantly driven by the expansion of model parameters and data size, a phenomenon now recognized as the scaling laws. However, scaling laws in molecular pretraining models remain unexplored. In this work, we present Uni-Mol2, an innovative molecular pretraining model that leverages a two-track transformer to effectively integrate features at the atomic level, graph level, and geometry structure level. Along with this, we systematically investigate the scaling law within molecular pretraining models, characterizing the power-law correlations between validation loss and model size, dataset size, and computational resources. Consequently, we successfully scale Uni-Mol2 to 1.1 billion parameters through pretraining on 800 million conformations, making it the largest molecular pretraining model to date. Extensive experiments show consistent improvement in the downstream tasks as the model size grows. The Uni-Mol2 model with 1.1B parameters also outperforms existing methods, achieving an average improvement of 27% on the QM9 dataset and 14% on the COMPAS-1D dataset.
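To make the power-law relationship concrete, here is a minimal sketch, with illustrative numbers rather than the paper's measurements, of fitting validation loss against model size with a saturating power law of the kind used throughout the scaling-law literature:

```python
# A minimal sketch, not the authors' code: fit validation loss vs. model
# size with a saturating power law L(N) = a * N**(-alpha) + L_inf.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n_params, a, alpha, l_inf):
    """Validation loss as a saturating power law in parameter count."""
    return a * n_params ** (-alpha) + l_inf

# Hypothetical (model size, validation loss) observations.
n = np.array([42e6, 84e6, 164e6, 310e6, 570e6, 1.1e9])
loss = np.array([0.233, 0.221, 0.211, 0.204, 0.198, 0.193])

(a, alpha, l_inf), _ = curve_fit(power_law, n, loss, p0=(1.0, 0.1, 0.1))
print(f"L(N) ~= {a:.3g} * N^(-{alpha:.3f}) + {l_inf:.3f}")
print("extrapolated loss at 5B params:", power_law(5e9, a, alpha, l_inf))
```

The fitted exponent is what a scaling study extrapolates from, while the irreducible term l_inf captures the loss floor that no amount of scale removes.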
Related papers
- Training Compute-Optimal Protein Language Models [48.79416103951816]
Most protein language models are trained with extensive compute resources until performance gains plateau.
Our investigation is grounded in a massive dataset consisting of 939 million protein sequences.
We trained over 300 models ranging from 3.5 million to 10.7 billion parameters on 5 to 200 billion unique tokens.
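As a rough illustration of what "compute-optimal" means here, the sketch below applies the standard C ≈ 6·N·D FLOP approximation for transformers; the tokens-per-parameter ratio is a Chinchilla-style assumption, not a value fitted by this paper:

```python
# A toy sketch (Chinchilla-style assumptions, not this paper's fit):
# training FLOPs C ~= 6 * N * D for N parameters and D tokens, with the
# budget split at a fixed tokens-per-parameter ratio D/N.
def flops(n_params: float, n_tokens: float) -> float:
    return 6.0 * n_params * n_tokens

def compute_optimal_split(budget_flops: float, tokens_per_param: float = 20.0):
    """Split a FLOP budget into (params, tokens) at a fixed ratio D/N."""
    n_params = (budget_flops / (6.0 * tokens_per_param)) ** 0.5
    return n_params, tokens_per_param * n_params

for c in (1e20, 1e21, 1e22):
    n, d = compute_optimal_split(c)
    print(f"C={c:.0e}: ~{n / 1e9:.2f}B params, ~{d / 1e9:.0f}B tokens, "
          f"check: {flops(n, d):.1e} FLOPs")
```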
arXiv Detail & Related papers (2024-11-04T14:58:37Z)
- Pre-trained Molecular Language Models with Random Functional Group Masking [54.900360309677794]
We propose a SMILES-based Molecular Language Model that randomly masks SMILES subsequences corresponding to specific molecular atoms.
This technique aims to compel the model to better infer molecular structures and properties, thus enhancing its predictive capabilities.
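The following sketch illustrates the general idea of masking contiguous SMILES subsequences; the regex tokenizer and span length are my assumptions, and the paper's exact functional-group selection rules are not reproduced:

```python
# A minimal sketch, assuming a simple regex SMILES tokenizer, of randomly
# masking a contiguous token subsequence in a SMILES string.
import random
import re

# Order matters: match bracket atoms and two-letter elements before single letters.
SMILES_TOKEN = re.compile(r"(\[[^\]]+\]|Br|Cl|Si|[BCNOPSFIbcnops]|[=#()+\-/\\%0-9])")

def tokenize(smiles: str) -> list[str]:
    return SMILES_TOKEN.findall(smiles)

def mask_subsequence(smiles: str, span: int = 3, mask_token: str = "[MASK]") -> str:
    toks = tokenize(smiles)
    if len(toks) <= span:
        return smiles
    start = random.randrange(len(toks) - span)
    return "".join(toks[:start] + [mask_token] * span + toks[start + span:])

print(mask_subsequence("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin, one span masked
```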
arXiv Detail & Related papers (2024-11-03T01:56:15Z)
- The Role of Model Architecture and Scale in Predicting Molecular Properties: Insights from Fine-Tuning RoBERTa, BART, and LLaMA [0.0]
This study introduces a systematic framework to compare the efficacy of Large Language Models (LLMs) for fine-tuning across various cheminformatics tasks.
We assessed three well-known models (RoBERTa, BART, and LLaMA) on their ability to predict molecular properties.
We found that LLaMA-based models generally offered the lowest validation loss, suggesting their superior adaptability across tasks and scales.
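A minimal sketch of the kind of fine-tuning setup such a comparison implies, treating property prediction as sequence regression over SMILES; the checkpoint name, toy molecules, and targets below are placeholders, not the study's configuration:

```python
# A hedged sketch (assumed setup, not the study's code) of one fine-tuning
# step for molecular property regression from SMILES strings.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "roberta-base"  # swap in a BART or LLaMA checkpoint to compare
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=1, problem_type="regression"
)

smiles = ["CCO", "c1ccccc1", "CC(=O)O"]           # toy molecules
targets = torch.tensor([[0.12], [0.85], [0.33]])  # hypothetical property values

batch = tokenizer(smiles, padding=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
out = model(**batch, labels=targets)  # MSE loss for a regression head
out.loss.backward()
optimizer.step()
print("one fine-tuning step, loss =", out.loss.item())
```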
arXiv Detail & Related papers (2024-05-02T02:20:12Z)
- Diffusion-Based Neural Network Weights Generation [80.89706112736353]
D2NWG is a diffusion-based neural network weights generation technique that efficiently produces high-performing weights for transfer learning.
Our method extends generative hyper-representation learning to recast the latent diffusion paradigm for neural network weights generation.
Our approach is scalable to large architectures such as large language models (LLMs), overcoming the limitations of current parameter generation techniques.
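As a heavily simplified sketch of the underlying mechanism, assuming a DDPM-style noise-prediction objective applied directly to flattened weight vectors (D2NWG itself works in a learned latent space, which is omitted here):

```python
# A simplified sketch, not D2NWG: train a DDPM-style denoiser on a bank of
# flattened weight vectors so new weights can be sampled by reverse diffusion.
import torch
import torch.nn as nn

dim, T = 256, 100  # weight-vector size and number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

denoiser = nn.Sequential(nn.Linear(dim + 1, 512), nn.SiLU(), nn.Linear(512, dim))
opt = torch.optim.Adam(denoiser.parameters(), lr=1e-3)

weights_bank = torch.randn(1024, dim)  # stand-in for a bank of trained weights

for step in range(200):  # toy training loop
    w0 = weights_bank[torch.randint(0, 1024, (64,))]
    t = torch.randint(0, T, (64,))
    noise = torch.randn_like(w0)
    ab = alphas_bar[t].unsqueeze(1)
    wt = ab.sqrt() * w0 + (1 - ab).sqrt() * noise  # forward diffusion
    pred = denoiser(torch.cat([wt, t.float().unsqueeze(1) / T], dim=1))
    loss = ((pred - noise) ** 2).mean()  # predict the added noise
    opt.zero_grad()
    loss.backward()
    opt.step()
```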
arXiv Detail & Related papers (2024-02-28T08:34:23Z)
- The Languini Kitchen: Enabling Language Modelling Research at Different Scales of Compute [66.84421705029624]
We introduce an experimental protocol that enables model comparisons based on equivalent compute, measured in accelerator hours.
We pre-process an existing large, diverse, and high-quality dataset of books that surpasses existing academic benchmarks in quality, diversity, and document length.
This work also provides two baseline models: a feed-forward model derived from the GPT-2 architecture and a recurrent model in the form of a novel LSTM with ten-fold throughput.
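A small sketch of the bookkeeping such a protocol implies: converting a shared accelerator-hour budget into a per-model token budget from measured throughput, so different models train on equivalent compute. The throughput figures here are hypothetical:

```python
# A toy sketch of compute-equivalent budgeting by accelerator hours,
# in the spirit of the protocol; numbers are illustrative only.
def token_budget(accelerator_hours: float, tokens_per_second: float) -> int:
    return int(accelerator_hours * 3600 * tokens_per_second)

models = {
    "gpt2-style-ff": 23_000,  # measured tokens/sec on one accelerator (hypothetical)
    "fast-lstm":     31_000,
}
budget_hours = 6.0
for name, tps in models.items():
    print(f"{name}: train on {token_budget(budget_hours, tps):,} tokens")
```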
arXiv Detail & Related papers (2023-09-20T10:31:17Z)
- Molecule-Morphology Contrastive Pretraining for Transferable Molecular Representation [0.0]
We introduce Molecule-Morphology Contrastive Pretraining (MoCoP), a framework for learning multi-modal representation of molecular graphs and cellular morphologies.
We scale MoCoP to approximately 100K molecules and 600K morphological profiles using data from the JUMP-CP Consortium.
Our findings suggest that integrating cellular morphologies with molecular graphs using MoCoP can significantly improve the performance of QSAR models.
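The sketch below shows a generic InfoNCE-style contrastive objective between molecule and morphology embeddings, the mechanism this kind of multi-modal pretraining rests on; the embeddings are stand-ins, not MoCoP's encoders:

```python
# A minimal sketch of a CLIP/InfoNCE-style contrastive loss between two
# modalities; placeholder embeddings stand in for real encoder outputs.
import torch
import torch.nn.functional as F

def contrastive_loss(mol_emb, morph_emb, temperature=0.07):
    mol = F.normalize(mol_emb, dim=1)
    morph = F.normalize(morph_emb, dim=1)
    logits = mol @ morph.t() / temperature  # pairwise similarities
    labels = torch.arange(len(mol))         # matched pairs on the diagonal
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.t(), labels)) / 2

mol_emb = torch.randn(32, 128)    # stand-in molecular-graph embeddings
morph_emb = torch.randn(32, 128)  # stand-in cellular-morphology embeddings
print(contrastive_loss(mol_emb, morph_emb))
```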
arXiv Detail & Related papers (2023-04-27T02:01:41Z)
- MolE: a molecular foundation model for drug discovery [0.2802437011072858]
MolE is a molecular foundation model that adapts the DeBERTa architecture to be used on molecular graphs.
We show that fine-tuning pretrained MolE achieves state-of-the-art results on 9 of the 22 ADMET tasks included in the Therapeutic Data Commons.
arXiv Detail & Related papers (2022-11-03T21:22:05Z)
- The Diminishing Returns of Masked Language Models to Science [0.7549732580284559]
We evaluate the impact of training data, model size, pretraining and finetuning time on 12 downstream scientific tasks.
We find that increasing model sizes, training data, or compute time does not always lead to significant improvements.
arXiv Detail & Related papers (2022-05-23T14:35:08Z)
- METRO: Efficient Denoising Pretraining of Large Scale Autoencoding Language Models with Model Generated Signals [151.3601429216877]
We present an efficient method of pretraining large-scale autoencoding language models using training signals generated by an auxiliary model.
We propose a recipe, namely "Model generated dEnoising TRaining Objective" (METRO).
The resultant models, METRO-LM, consisting of up to 5.4 billion parameters, achieve new state-of-the-art on the GLUE, SuperGLUE, and SQuAD benchmarks.
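A toy sketch of the model-generated denoising idea in the ELECTRA lineage that METRO refines: a small auxiliary generator fills masked positions, and the main model learns to detect which tokens were replaced. The architectures and the single step shown are illustrative, not METRO-LM:

```python
# A toy sketch of replaced-token-detection pretraining with an auxiliary
# generator; both models are placeholders, and only the main model's
# training step is shown.
import torch
import torch.nn as nn

vocab, dim = 1000, 64
generator = nn.Sequential(nn.Embedding(vocab, dim), nn.Linear(dim, vocab))
main_model = nn.Sequential(nn.Embedding(vocab, dim), nn.Linear(dim, 1))

tokens = torch.randint(0, vocab, (8, 32))  # a batch of token sequences
mask = torch.rand(8, 32) < 0.15            # 15% of positions corrupted

with torch.no_grad():                      # generator proposes replacements
    gen_logits = generator(tokens)
    sampled = torch.distributions.Categorical(logits=gen_logits).sample()
corrupted = torch.where(mask, sampled, tokens)

replaced = (corrupted != tokens).float()         # training signal
disc_logits = main_model(corrupted).squeeze(-1)  # replaced-token detection
loss = nn.functional.binary_cross_entropy_with_logits(disc_logits, replaced)
print("denoising (discrimination) loss:", loss.item())
```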
arXiv Detail & Related papers (2022-04-13T21:39:15Z)
- Semi-supervised physics-guided deep learning framework for predicting the I-V characteristics of GaN HEMT [0.0]
A semi-supervised physics-guided neural network (SPGNN) has been developed that predicts the I-V characteristics of a gallium nitride-based high electron mobility transistor (GaN HEMT). The framework is generic and can be applied to model phenomena from other fields of research, provided their behaviour is known.
The SPGNN significantly reduces the requirement of the training data by more than 80% for achieving similar or better performance than a traditional neural network (TNN) even for unseen conditions.
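A generic sketch of the physics-guided training idea, assuming a textbook square-law current model as the physics prior; the actual SPGNN uses known GaN HEMT behaviour and a different architecture:

```python
# A generic sketch, not the SPGNN code: add a penalty pushing predictions
# toward a known device equation, reducing the labeled data needed.
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(2, 64), nn.Tanh(), nn.Linear(64, 1))

def physics_current(vgs, vds, k=0.5, vth=1.0):
    """Simplified square-law drain current used as the physics prior."""
    vov = torch.clamp(vgs - vth, min=0.0)
    vmin = torch.minimum(vds, vov)
    return k * vmin * (2 * vov - vmin)

vgs, vds = torch.rand(256, 1) * 5, torch.rand(256, 1) * 10
x = torch.cat([vgs, vds], dim=1)
i_meas = physics_current(vgs, vds) + 0.01 * torch.randn(256, 1)  # toy "data"

pred = net(x)
data_loss = ((pred - i_meas) ** 2).mean()                        # supervised term
physics_loss = ((pred - physics_current(vgs, vds)) ** 2).mean()  # guidance term
loss = data_loss + 0.1 * physics_loss
loss.backward()
```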
arXiv Detail & Related papers (2021-10-20T18:48:50Z)
- Learning Neural Generative Dynamics for Molecular Conformation Generation [89.03173504444415]
We study how to generate molecular conformations (i.e., 3D structures) from a molecular graph.
We propose a novel probabilistic framework to generate valid and diverse conformations given a molecular graph.
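For intuition, the sketch below shows one classical deterministic route from a pairwise distance matrix to 3D coordinates (classical multidimensional scaling); the paper's contribution is the probabilistic model over such distances, which is not reproduced here:

```python
# A compact sketch of classical MDS: recover 3D coordinates (up to rigid
# motion) from a Euclidean pairwise distance matrix.
import numpy as np

def coords_from_distances(D: np.ndarray, dim: int = 3) -> np.ndarray:
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    G = -0.5 * J @ (D ** 2) @ J          # Gram matrix
    vals, vecs = np.linalg.eigh(G)
    idx = np.argsort(vals)[::-1][:dim]   # top-dim eigenpairs
    return vecs[:, idx] * np.sqrt(np.clip(vals[idx], 0, None))

# Round-trip check on random 3D points.
X = np.random.randn(5, 3)
D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
X_rec = coords_from_distances(D)
D_rec = np.linalg.norm(X_rec[:, None] - X_rec[None, :], axis=-1)
print(np.allclose(D, D_rec, atol=1e-6))  # distances preserved
```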
arXiv Detail & Related papers (2021-02-20T03:17:58Z)