MolPILE - large-scale, diverse dataset for molecular representation learning
- URL: http://arxiv.org/abs/2509.18353v2
- Date: Thu, 25 Sep 2025 10:48:11 GMT
- Title: MolPILE - large-scale, diverse dataset for molecular representation learning
- Authors: Jakub Adamczyk, Jakub Poziemski, Franciszek Job, Mateusz Król, Maciej Makowski,
- Abstract summary: MolPILE is a large-scale, diverse, and rigorously curated collection of 222 million compounds. This work provides a standardized resource for model training, addressing the pressing need for an ImageNet-like dataset in molecular chemistry.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The size, diversity, and quality of pretraining datasets critically determine the generalization ability of foundation models. Despite their growing importance in chemoinformatics, the effectiveness of molecular representation learning has been hindered by limitations in existing small-molecule datasets. To address this gap, we present MolPILE, a large-scale, diverse, and rigorously curated collection of 222 million compounds, constructed from 6 large-scale databases using an automated curation pipeline. We present a comprehensive analysis of current pretraining datasets, highlighting considerable shortcomings for training ML models, and demonstrate how retraining existing models on MolPILE yields improvements in generalization performance. This work provides a standardized resource for model training, addressing the pressing need for an ImageNet-like dataset in molecular chemistry.
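As a rough illustration of what an automated curation step of this kind involves, the sketch below canonicalizes, deduplicates, and size-filters SMILES strings with RDKit. The specific filters and thresholds are illustrative assumptions, not MolPILE's actual criteria.

```python
# A minimal sketch of an automated curation step in the spirit of the
# MolPILE pipeline; the filters and thresholds here are assumptions for
# illustration, not taken from the paper.
from rdkit import Chem
from rdkit.Chem import Descriptors

def curate(smiles_list):
    """Canonicalize, deduplicate, and filter a list of SMILES strings."""
    seen, kept = set(), []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)      # returns None for unparsable SMILES
        if mol is None:
            continue                       # drop invalid structures
        canonical = Chem.MolToSmiles(mol)  # canonical form for deduplication
        if canonical in seen:
            continue                       # drop exact duplicates
        # Illustrative molecular-weight window; real curation criteria differ.
        if not (30.0 <= Descriptors.MolWt(mol) <= 1000.0):
            continue
        seen.add(canonical)
        kept.append(canonical)
    return kept

# "OCC" deduplicates against "CCO"; "C" (methane) falls below the weight cutoff.
print(curate(["CCO", "OCC", "not_a_smiles", "C"]))
```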
Related papers
- Foundation Models for Discovery and Exploration in Chemical Space [57.97784111110166]
MIST is a family of molecular foundation models trained on large unlabeled datasets. We demonstrate the ability of these models to solve real-world problems across chemical space.
arXiv Detail & Related papers (2025-10-20T17:56:01Z)
- Pre-trained Molecular Language Models with Random Functional Group Masking [54.900360309677794]
We propose a SMILES-based Molecular Language Model that randomly masks SMILES subsequences corresponding to specific molecular atoms.
This technique aims to compel the model to better infer molecular structures and properties, thus enhancing its predictive capabilities.
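To make the masking objective concrete, here is a toy sketch of random token masking over a SMILES string; the tokenizer and mask rate are simplifying assumptions, not the paper's exact functional-group scheme.

```python
# Toy masked-language-model corruption of a SMILES string; the model's
# pretraining task is then to recover the hidden tokens.
import random
import re

# Very rough SMILES tokenizer: bracket atoms, two-letter elements, then single chars.
TOKEN_RE = re.compile(r"\[[^\]]+\]|Br|Cl|.")

def mask_smiles(smiles, mask_rate=0.15, mask_token="[MASK]", rng=random):
    tokens = TOKEN_RE.findall(smiles)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_rate:
            masked.append(mask_token)
            labels.append(tok)       # the model must recover the original token
        else:
            masked.append(tok)
            labels.append(None)      # ignored in the MLM loss
    return "".join(masked), labels

random.seed(0)
print(mask_smiles("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin with ~15% of tokens hidden
```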
arXiv Detail & Related papers (2024-11-03T01:56:15Z)
- A Large Encoder-Decoder Family of Foundation Models For Chemical Language [1.1073864511426255]
This paper introduces a large encoder-decoder family of chemical foundation models pre-trained on a curated dataset of 91 million SMILES samples sourced from PubChem.
Our experiments across multiple benchmark datasets validate the capacity of the proposed model in providing state-of-the-art results for different tasks.
arXiv Detail & Related papers (2024-07-24T20:30:39Z)
- Instruction Multi-Constraint Molecular Generation Using a Teacher-Student Large Language Model [49.64512917330373]
We introduce TSMMG, a large language model for multi-constraint molecular generation that acts as the student.
To train TSMMG, we construct a large set of text-molecule pairs by extracting molecular knowledge from a set of 'teachers'.
We experimentally show that TSMMG performs remarkably well in generating molecules that meet complex property requirements described in natural language.
arXiv Detail & Related papers (2024-03-20T02:15:55Z)
- Bi-level Contrastive Learning for Knowledge-Enhanced Molecule Representations [68.32093648671496]
We introduce GODE, which accounts for the dual-level structure inherent in molecules. Molecules possess an intrinsic graph structure and simultaneously function as nodes within a broader molecular knowledge graph. By pre-training two GNNs on different graph structures, GODE effectively fuses molecular structures with their corresponding knowledge graph substructures.
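The alignment between the two levels can be posed as a contrastive objective. Below is a generic InfoNCE-style sketch that pulls a molecule's graph embedding toward its knowledge-graph node embedding; the pairing scheme and temperature are illustrative assumptions, not GODE's exact loss.

```python
# Generic InfoNCE-style contrastive loss between two embedding views of the
# same molecule (graph encoder vs. knowledge-graph encoder); an assumption
# about the objective, not GODE's published formulation.
import numpy as np

def info_nce(graph_emb, kg_emb, temperature=0.1):
    """graph_emb, kg_emb: (N, d) arrays; row i of each view is the same molecule."""
    g = graph_emb / np.linalg.norm(graph_emb, axis=1, keepdims=True)
    k = kg_emb / np.linalg.norm(kg_emb, axis=1, keepdims=True)
    logits = g @ k.T / temperature               # scaled cosine similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_softmax))        # matched pairs sit on the diagonal

rng = np.random.default_rng(0)
print(info_nce(rng.normal(size=(8, 16)), rng.normal(size=(8, 16))))
```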
arXiv Detail & Related papers (2023-06-02T15:49:45Z)
- Calibration and generalizability of probabilistic models on low-data chemical datasets with DIONYSUS [0.0]
We perform an extensive study of the calibration and generalizability of probabilistic machine learning models on small chemical datasets.
We analyse the quality of their predictions and uncertainties across a variety of tasks (binary classification, regression) and datasets.
We offer practical insights into model and feature choice for modelling small chemical datasets, a common scenario in new chemical experiments.
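One standard calibration check used in studies of this kind is the expected calibration error (ECE). The sketch below computes a binned ECE for a binary classifier; the binning scheme is a common default, not necessarily DIONYSUS's exact protocol.

```python
# Minimal expected-calibration-error (ECE) computation for a binary
# classifier; equal-width probability bins are an illustrative choice.
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """probs: predicted P(y=1); labels: 0/1 ground truth."""
    probs, labels = np.asarray(probs, float), np.asarray(labels, float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (probs > lo) & (probs <= hi)
        if mask.any():
            gap = abs(probs[mask].mean() - labels[mask].mean())
            ece += mask.mean() * gap  # weight by the fraction of samples in the bin
    return ece

print(expected_calibration_error([0.9, 0.8, 0.2, 0.3], [1, 1, 0, 1]))  # 0.3
```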
arXiv Detail & Related papers (2022-12-03T08:19:06Z)
- MolE: a molecular foundation model for drug discovery [0.2802437011072858]
MolE is a molecular foundation model that adapts the DeBERTa architecture to be used on molecular graphs.
We show that fine-tuning pretrained MolE achieves state-of-the-art results on 9 of the 22 ADMET tasks included in the Therapeutic Data Commons.
arXiv Detail & Related papers (2022-11-03T21:22:05Z)
- Unraveling Key Elements Underlying Molecular Property Prediction: A Systematic Study [27.56700461408765]
Key elements underlying molecular property prediction remain largely unexplored.
We conduct an extensive evaluation of representative models using various representations on the MoleculeNet datasets.
In total, we have trained 62,820 models, including 50,220 models on fixed representations, 4,200 models on SMILES sequences and 8,400 models on molecular graphs.
arXiv Detail & Related papers (2022-09-26T14:07:59Z)
- Retrieval-based Controllable Molecule Generation [63.44583084888342]
We propose a new retrieval-based framework for controllable molecule generation.
We use a small set of molecules to steer the pre-trained generative model towards synthesizing molecules that satisfy the given design criteria.
Our approach is agnostic to the choice of generative models and requires no task-specific fine-tuning.
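The retrieval step of such a framework can be as simple as a fingerprint similarity search. The sketch below ranks a library by Tanimoto similarity to a query molecule that embodies the design criterion; the library, query, and k are illustrative assumptions.

```python
# Sketch of the retrieval half of a retrieval-based generation framework:
# rank a library by Tanimoto similarity to a query and return the top-k
# exemplars, which would then steer a pretrained generator.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def retrieve_exemplars(query_smiles, library_smiles, k=3):
    def fp(smi):
        return AllChem.GetMorganFingerprintAsBitVect(
            Chem.MolFromSmiles(smi), 2, nBits=2048)  # radius-2 Morgan bits
    query_fp = fp(query_smiles)
    scored = [(DataStructs.TanimotoSimilarity(query_fp, fp(smi)), smi)
              for smi in library_smiles]
    return sorted(scored, reverse=True)[:k]          # top-k most similar

library = ["CCO", "CCN", "c1ccccc1", "CC(=O)O", "CCCO"]
print(retrieve_exemplars("CCO", library, k=2))
```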
arXiv Detail & Related papers (2022-08-23T17:01:16Z)
- Do Large Scale Molecular Language Representations Capture Important Structural Information? [31.76876206167457]
We present molecular embeddings obtained by training an efficient transformer encoder model, referred to as MoLFormer.
Experiments show that the learned molecular representation performs competitively when compared to graph-based and fingerprint-based supervised learning baselines.
arXiv Detail & Related papers (2021-06-17T14:33:55Z)
- Self-Supervised Graph Transformer on Large-Scale Molecular Data [73.3448373618865]
We propose a novel framework, GROVER, for molecular representation learning.
GROVER can learn rich structural and semantic information of molecules from enormous unlabelled molecular data.
We pre-train GROVER with 100 million parameters on 10 million unlabelled molecules -- the biggest GNN and the largest training dataset in molecular representation learning.
arXiv Detail & Related papers (2020-06-18T08:37:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.