Benchmarking Pretrained Molecular Embedding Models For Molecular Representation Learning
- URL: http://arxiv.org/abs/2508.06199v3
- Date: Sun, 14 Sep 2025 11:09:51 GMT
- Title: Benchmarking Pretrained Molecular Embedding Models For Molecular Representation Learning
- Authors: Mateusz Praski, Jakub Adamczyk, Wojciech Czech
- Abstract summary: Pretrained neural networks have attracted significant interest in chemistry and small molecule drug design. This study presents the most extensive comparison of such models to date, evaluating 25 models across 25 datasets.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Pretrained neural networks have attracted significant interest in chemistry and small molecule drug design. Embeddings from these models are widely used for molecular property prediction, virtual screening, and small data learning in molecular chemistry. This study presents the most extensive comparison of such models to date, evaluating 25 models across 25 datasets. Under a fair comparison framework, we assess models spanning various modalities, architectures, and pretraining strategies. Using a dedicated hierarchical Bayesian statistical testing model, we arrive at a surprising result: nearly all neural models show negligible or no improvement over the baseline ECFP molecular fingerprint. Only the CLAMP model, which is also based on molecular fingerprints, performs statistically significantly better than the alternatives. These findings raise concerns about the evaluation rigor in existing studies. We discuss potential causes, propose solutions, and offer practical recommendations.
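The ECFP baseline that the abstract finds so hard to beat is built by iteratively hashing each atom's local neighborhood and folding the resulting identifiers into a fixed-length bit vector. The following is a minimal pure-Python sketch of that Morgan/ECFP neighborhood-hashing idea on a toy molecular graph; it is an illustration of the concept only, not the RDKit implementation the paper actually uses, and the graph encoding (symbol list plus adjacency dict) is a simplification invented here.

```python
import hashlib

def ecfp_like(atoms, bonds, radius=2, nbits=1024):
    """Toy sketch of the ECFP/Morgan idea: iteratively hash each atom's
    neighborhood and fold the identifiers into a fixed-length bit vector.
    `atoms` is a list of element symbols; `bonds` maps atom index -> neighbor indices.
    """
    # Initial identifiers derived from the atom symbols alone (radius 0)
    ids = [int(hashlib.md5(a.encode()).hexdigest(), 16) for a in atoms]
    bits = set()
    for _ in range(radius + 1):
        # Fold the current identifiers into the bit vector
        for i in ids:
            bits.add(i % nbits)
        # Grow each neighborhood by one bond: combine an atom's identifier
        # with the sorted identifiers of its neighbors
        ids = [
            hash((ids[i],) + tuple(sorted(ids[j] for j in bonds[i])))
            for i in range(len(atoms))
        ]
    fp = [0] * nbits
    for b in bits:
        fp[b] = 1
    return fp

# Ethanol heavy-atom skeleton C-C-O, hydrogens implicit
atoms = ["C", "C", "O"]
bonds = {0: [1], 1: [0, 2], 2: [1]}
fp = ecfp_like(atoms, bonds)
print(sum(fp))  # number of set bits
```

In real pipelines the resulting bit vector is fed directly to a classical model (e.g. a random forest or logistic regression), which is exactly the kind of fingerprint-plus-simple-model baseline the study reports as matching most pretrained embeddings.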
Related papers
- Unveiling Scaling Behaviors in Molecular Language Models: Effects of Model Size, Data, and Representation [18.008217765253274]
We investigate the scaling behavior of molecular language models across both pretraining and downstream tasks. Results demonstrate clear scaling laws in molecular models for both pretraining and downstream transfer. We release the largest library of molecular language models to date to facilitate future research and development.
arXiv Detail & Related papers (2026-01-30T09:32:12Z) - Foundation Models for Discovery and Exploration in Chemical Space [57.97784111110166]
MIST is a family of molecular foundation models trained on large unlabeled datasets. We demonstrate the ability of these models to solve real-world problems across chemical space.
arXiv Detail & Related papers (2025-10-20T17:56:01Z) - Pre-trained Molecular Language Models with Random Functional Group Masking [54.900360309677794]
We propose a SMILES-based Molecular Language Model that randomly masks SMILES subsequences corresponding to specific molecular atoms.
This technique aims to compel the model to better infer molecular structures and properties, thus enhancing its predictive capabilities.
arXiv Detail & Related papers (2024-11-03T01:56:15Z) - Bi-level Contrastive Learning for Knowledge-Enhanced Molecule Representations [68.32093648671496]
We introduce GODE, which accounts for the dual-level structure inherent in molecules. Molecules possess an intrinsic graph structure and simultaneously function as nodes within a broader molecular knowledge graph. By pre-training two GNNs on different graph structures, GODE effectively fuses molecular structures with their corresponding knowledge graph substructures.
arXiv Detail & Related papers (2023-06-02T15:49:45Z) - Molecule-Morphology Contrastive Pretraining for Transferable Molecular Representation [0.0]
We introduce Molecule-Morphology Contrastive Pretraining (MoCoP), a framework for learning multi-modal representation of molecular graphs and cellular morphologies.
We scale MoCoP to approximately 100K molecules and 600K morphological profiles using data from the JUMP-CP Consortium.
Our findings suggest that integrating cellular morphologies with molecular graphs using MoCoP can significantly improve the performance of QSAR models.
arXiv Detail & Related papers (2023-04-27T02:01:41Z) - Implicit Geometry and Interaction Embeddings Improve Few-Shot Molecular Property Prediction [53.06671763877109]
We develop molecular embeddings that encode complex molecular characteristics to improve the performance of few-shot molecular property prediction.
Our approach leverages large amounts of synthetic data, namely the results of molecular docking calculations.
On multiple molecular property prediction benchmarks, training from the embedding space substantially improves Multi-Task, MAML, and Prototypical Network few-shot learning performance.
arXiv Detail & Related papers (2023-02-04T01:32:40Z) - Calibration and generalizability of probabilistic models on low-data chemical datasets with DIONYSUS [0.0]
We perform an extensive study of the calibration and generalizability of probabilistic machine learning models on small chemical datasets.
We analyse the quality of their predictions and uncertainties in a variety of tasks (binary, regression) and datasets.
We offer practical insights into model and feature choice for modelling small chemical datasets, a common scenario in new chemical experiments.
arXiv Detail & Related papers (2022-12-03T08:19:06Z) - Unraveling Key Elements Underlying Molecular Property Prediction: A Systematic Study [27.56700461408765]
Key elements underlying molecular property prediction remain largely unexplored.
We conduct an extensive evaluation of representative models using various representations on the MoleculeNet datasets.
In total, we have trained 62,820 models, including 50,220 models on fixed representations, 4,200 models on SMILES sequences and 8,400 models on molecular graphs.
arXiv Detail & Related papers (2022-09-26T14:07:59Z) - A multi-stage machine learning model on diagnosis of esophageal manometry [50.591267188664666]
The framework includes deep-learning models at the swallow-level stage and feature-based machine learning models at the study-level stage.
This is the first artificial-intelligence-style model to automatically predict the CC diagnosis of an HRM study from raw multi-swallow data.
arXiv Detail & Related papers (2021-06-25T20:09:23Z) - Using Molecular Embeddings in QSAR Modeling: Does it Make a Difference? [0.6299766708197883]
We reviewed the literature on methods for molecular embeddings and reproduced three unsupervised and two supervised molecular embedding techniques.
We compared these five methods concerning their performance in QSAR scenarios using different classification and regression datasets.
Our results highlight the need for conducting a careful comparison and analysis of the different embedding techniques prior to using them in drug design tasks.
arXiv Detail & Related papers (2021-03-20T21:45:22Z) - Few-Shot Graph Learning for Molecular Property Prediction [46.60746023179724]
We propose Meta-MGNN, a novel model for few-shot molecular property prediction.
To exploit unlabeled molecular information, Meta-MGNN further incorporates molecular structure- and attribute-based self-supervised modules and self-attentive task weights.
Extensive experiments on two public multi-property datasets demonstrate that Meta-MGNN outperforms a variety of state-of-the-art methods.
arXiv Detail & Related papers (2021-02-16T01:55:34Z) - Conditional Constrained Graph Variational Autoencoders for Molecule Design [70.59828655929194]
We present Conditional Constrained Graph Variational Autoencoder (CCGVAE), a model that implements this key idea in a state-of-the-art architecture.
We show improved results on several evaluation metrics on two commonly adopted datasets for molecule generation.
arXiv Detail & Related papers (2020-09-01T21:58:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.