Transformers for molecular property prediction: Domain adaptation efficiently improves performance
- URL: http://arxiv.org/abs/2503.03360v2
- Date: Fri, 07 Mar 2025 08:55:13 GMT
- Title: Transformers for molecular property prediction: Domain adaptation efficiently improves performance
- Authors: Afnan Sultan, Max Rausch-Dupont, Shahrukh Khan, Olga Kalinina, Andrea Volkamer, Dietrich Klakow
- Abstract summary: The aim of this study is to investigate and overcome some of the limitations of transformer models in predicting molecular properties. We examine the impact of pre-training dataset size and diversity on the performance of transformer models.
- Score: 12.556171106847811
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Most current transformer-based chemical language models are pre-trained on millions to billions of molecules. However, the improvement from such scaling in dataset size is not confidently linked to improved molecular property prediction. The aim of this study is to investigate and overcome some of the limitations of transformer models in predicting molecular properties. Specifically, we examine the impact of pre-training dataset size and diversity on the performance of transformer models and investigate the use of domain adaptation as a technique for improving model performance. First, our findings indicate that increasing the pre-training dataset size beyond 400K molecules from the GuacaMol dataset does not result in a significant improvement on four ADME endpoints, namely solubility, permeability, microsomal stability, and plasma protein binding. Second, our results demonstrate that domain adaptation, i.e., further training the transformer model on a small set of domain-relevant molecules (a few hundred to a few thousand) using multi-task regression of physicochemical properties, was sufficient to significantly improve performance for three of the four investigated ADME endpoints (P-value < 0.001). Finally, we observe that a model pre-trained on 400K molecules and domain adapted on a few hundred to a few thousand molecules performs similarly (P-value > 0.05) to more complex transformer models such as MolBERT (pre-trained on 1.3M molecules) and MolFormer (pre-trained on 100M molecules). A comparison to a random forest model trained on basic physicochemical properties showed performance similar to that of the examined transformer models. We believe that current transformer models can be improved through further systematic analysis of pre-training and downstream data, pre-training objectives, and scaling laws, ultimately leading to better and more helpful models.
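As an illustration of the domain-adaptation step described in the abstract, the sketch below continues training a generic pre-trained chemical language model on a small set of domain-relevant SMILES with a multi-task regression head over a few RDKit-computed physicochemical properties (molecular weight, logP, TPSA). This is a minimal sketch, not the authors' exact pipeline: the encoder checkpoint name, the choice of properties, and all hyperparameters are illustrative assumptions.

```python
# Hedged sketch of domain adaptation via multi-task regression of
# physicochemical properties. Checkpoint, properties, and hyperparameters
# are assumptions for illustration only.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoTokenizer, AutoModel
from rdkit import Chem
from rdkit.Chem import Descriptors

CHECKPOINT = "seyonec/ChemBERTa-zinc-base-v1"  # placeholder pre-trained encoder (assumption)

def physchem_targets(smiles):
    """Cheap physicochemical properties used as multi-task regression targets (assumes valid SMILES)."""
    mol = Chem.MolFromSmiles(smiles)
    return [Descriptors.MolWt(mol), Descriptors.MolLogP(mol), Descriptors.TPSA(mol)]

class MultiTaskRegressor(nn.Module):
    """Pre-trained encoder plus one shared linear head predicting all properties at once."""
    def __init__(self, checkpoint, n_tasks):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(checkpoint)
        self.head = nn.Linear(self.encoder.config.hidden_size, n_tasks)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        pooled = hidden[:, 0]  # first-token embedding as the molecule representation
        return self.head(pooled)

def domain_adapt(smiles_list, epochs=10, lr=1e-5):
    """Continue training the encoder on a few hundred/thousand domain-relevant SMILES."""
    tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
    targets = torch.tensor([physchem_targets(s) for s in smiles_list], dtype=torch.float32)
    targets = (targets - targets.mean(0)) / (targets.std(0) + 1e-8)  # per-task normalisation
    enc = tokenizer(smiles_list, padding=True, truncation=True, return_tensors="pt")
    loader = DataLoader(TensorDataset(enc["input_ids"], enc["attention_mask"], targets),
                        batch_size=32, shuffle=True)
    model = MultiTaskRegressor(CHECKPOINT, n_tasks=targets.shape[1])
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for ids, mask, y in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(ids, mask), y)
            loss.backward()
            optimizer.step()
    return model  # the encoder is now domain adapted
```

After this adaptation phase, the encoder would typically be fine-tuned on the actual ADME endpoint of interest; per the abstract, a few hundred to a few thousand domain-relevant molecules were sufficient for this step to yield significant gains on three of the four endpoints.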
Related papers
- Pre-trained Molecular Language Models with Random Functional Group Masking [54.900360309677794]
We propose a SMILES-based Molecular Language Model that randomly masks SMILES subsequences corresponding to specific molecular atoms.
This technique aims to compel the model to better infer molecular structures and properties, thus enhancing its predictive capabilities.
arXiv Detail & Related papers (2024-11-03T01:56:15Z) - Generative Model for Small Molecules with Latent Space RL Fine-Tuning to Protein Targets [4.047608146173188]
We introduce a modification to SAFE to reduce the number of invalid fragmented molecules generated during training.
Our model can generate novel molecules with a validity rate > 90% and a fragmentation rate < 1% by sampling from a latent space.
arXiv Detail & Related papers (2024-07-02T16:01:37Z) - Uni-Mol2: Exploring Molecular Pretraining Model at Scale [27.172011090947823]
We present Uni-Mol2, an innovative molecular pretraining model that integrates features at the atomic level, graph level, and geometry structure level.
We successfully scale Uni-Mol2 to 1.1 billion parameters through pretraining on 800 million conformations, making it the largest molecular pretraining model to date.
arXiv Detail & Related papers (2024-06-21T08:28:54Z) - Transformers for molecular property prediction: Lessons learned from the past five years [0.0]
We analyze the currently available models and explore key questions that arise when training and fine-tuning a transformer model for MPP.
We address the challenges in comparing different models, emphasizing the need for standardized data splitting and robust statistical analysis.
arXiv Detail & Related papers (2024-04-05T09:05:37Z) - GP-MoLFormer: A Foundation Model For Molecular Generation [30.06169570297667]
We extend the paradigm of training chemical language transformers on large-scale chemical datasets to generative tasks.
Specifically, we propose GP-MoLFormer, an autoregressive molecular string generator trained on more than 1.1 billion chemical SMILES strings.
arXiv Detail & Related papers (2024-04-04T16:20:06Z) - Molecule Design by Latent Prompt Transformer [76.2112075557233]
This work explores the challenging problem of molecule design by framing it as a conditional generative modeling task.
We propose a novel generative model comprising three components: (1) a latent vector with a learnable prior distribution; (2) a molecule generation model based on a causal Transformer, which uses the latent vector as a prompt; and (3) a property prediction model that predicts a molecule's target properties and/or constraint values using the latent prompt.
arXiv Detail & Related papers (2024-02-27T03:33:23Z) - Molecule Design by Latent Space Energy-Based Modeling and Gradual Distribution Shifting [53.44684898432997]
Generation of molecules with desired chemical and biological properties is critical for drug discovery.
We propose a probabilistic generative model to capture the joint distribution of molecules and their properties.
Our method achieves very strong performances on various molecule design tasks.
arXiv Detail & Related papers (2023-06-09T03:04:21Z) - Pre-training Transformers for Molecular Property Prediction Using Reaction Prediction [0.0]
Transfer learning has had a tremendous impact in fields like Computer Vision and Natural Language Processing.
We present a pre-training procedure for molecular representation learning using reaction data.
We show a statistically significant positive effect on 5 of the 12 tasks compared to a non-pre-trained baseline model.
arXiv Detail & Related papers (2022-07-06T14:51:38Z) - Accurate Machine Learned Quantum-Mechanical Force Fields for Biomolecular Simulations [51.68332623405432]
Molecular dynamics (MD) simulations allow atomistic insights into chemical and biological processes.
Recently, machine learned force fields (MLFFs) emerged as an alternative means to execute MD simulations.
This work proposes a general approach to constructing accurate MLFFs for large-scale molecular simulations.
arXiv Detail & Related papers (2022-05-17T13:08:28Z) - Molecular Attributes Transfer from Non-Parallel Data [57.010952598634944]
We formulate molecular optimization as a style transfer problem and present a novel generative model that could automatically learn internal differences between two groups of non-parallel data.
Experiments on two molecular optimization tasks, toxicity modification and synthesizability improvement, demonstrate that our model significantly outperforms several state-of-the-art methods.
arXiv Detail & Related papers (2021-11-30T06:10:22Z) - Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers [57.931830650323]
This paper presents scaling insights from pretraining and finetuning Transformers.
We show that, beyond model size alone, model shape matters for downstream fine-tuning.
We present improved scaling protocols whereby our redesigned models achieve similar downstream fine-tuning quality.
arXiv Detail & Related papers (2021-09-22T12:29:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.