Smiles2Dock: an open large-scale multi-task dataset for ML-based molecular docking
- URL: http://arxiv.org/abs/2406.05738v1
- Date: Sun, 9 Jun 2024 11:13:03 GMT
- Title: Smiles2Dock: an open large-scale multi-task dataset for ML-based molecular docking
- Authors: Thomas Le Menestrel, Manuel Rivas
- Abstract summary: We introduce Smiles2Dock, an open large-scale multi-task dataset for molecular docking.
We dock 1.7 million ligands from the ChEMBL database against 15 AlphaFold proteins, giving us more than 25 million protein-ligand binding scores.
Our dataset and code are publicly available to support the development of novel ML-based methods for molecular docking.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Docking is a crucial component in drug discovery aimed at predicting the binding conformation and affinity between small molecules and target proteins. ML-based docking has recently emerged as a prominent approach, outpacing traditional methods like DOCK and AutoDock Vina in handling the growing scale and complexity of molecular libraries. However, the availability of comprehensive and user-friendly datasets for training and benchmarking ML-based docking algorithms remains limited. We introduce Smiles2Dock, an open large-scale multi-task dataset for molecular docking. We created a framework combining P2Rank and AutoDock Vina to dock 1.7 million ligands from the ChEMBL database against 15 AlphaFold proteins, giving us more than 25 million protein-ligand binding scores. The dataset leverages a wide range of high-accuracy AlphaFold protein models, encompasses a diverse set of biologically relevant compounds and enables researchers to benchmark all major approaches for ML-based docking such as Graph, Transformer and CNN-based methods. We also introduce a novel Transformer-based architecture for docking scores prediction and set it as an initial benchmark for our dataset. Our dataset and code are publicly available to support the development of novel ML-based methods for molecular docking to advance scientific research in this field.
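The dataset size quoted in the abstract can be sanity-checked with a short sketch. It assumes every ligand is docked against every protein (a full grid), which is consistent with the stated totals but not spelled out on this page. The function names, toy SMILES strings, and protein labels below are hypothetical placeholders, not part of the released code.

```python
def count_docking_pairs(n_ligands: int, n_proteins: int) -> int:
    """Number of ligand-protein pairs if every ligand is docked against every protein."""
    return n_ligands * n_proteins


def build_task_table(ligands, proteins):
    """One row per ligand, one score column (task) per protein; scores left unfilled."""
    return {lig: {prot: None for prot in proteins} for lig in ligands}


# Rounded figures from the abstract: 1.7 million ligands, 15 proteins.
total_pairs = count_docking_pairs(1_700_000, 15)
print(total_pairs)  # 25500000, i.e. "more than 25 million" scores

# Hypothetical toy inputs (illustrative only, not taken from the dataset).
toy_ligands = ["CCO", "c1ccccc1"]
toy_proteins = ["protein_A", "protein_B", "protein_C"]
toy_table = build_task_table(toy_ligands, toy_proteins)
print(len(toy_table), len(toy_table["CCO"]))  # 2 rows, 3 score columns each
```

In this layout each protein acts as one prediction task, so a multi-task model would take a ligand as input and fill in one score per column.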
Related papers
- GNNAS-Dock: Budget Aware Algorithm Selection with Graph Neural Networks for Molecular Docking [0.0]
This paper introduces GNNAS-Dock, a novel Graph Neural Network (GNN)-based automated algorithm selection system for molecular docking in blind docking situations.
GNNs are adapted to process the complex structural data of both ligands and proteins.
They benefit from inherent graph-like properties to predict the performance of various docking algorithms under different conditions.
arXiv Detail & Related papers (2024-11-19T16:01:54Z)
- Dockformer: A transformer-based molecular docking paradigm for large-scale virtual screening [29.947687129449278]
Deep learning algorithms can provide data-driven research and development models to increase the speed of the docking process.
A novel deep learning-based docking approach named Dockformer is introduced in this study.
The experimental results show that Dockformer achieves success rates of 90.53% and 82.71% on the PDBbind core set and PoseBusters benchmarks, respectively.
arXiv Detail & Related papers (2024-11-11T06:25:13Z)
- ETDock: A Novel Equivariant Transformer for Protein-Ligand Docking [36.14826783009814]
Traditional docking methods rely on scoring functions and deep learning to predict the docking between proteins and drugs.
In this paper, we propose a transformer neural network for protein-ligand docking pose prediction.
The experimental results on real datasets show that our model can achieve state-of-the-art performance.
arXiv Detail & Related papers (2023-10-12T06:23:12Z)
- DiffDock-PP: Rigid Protein-Protein Docking with Diffusion Models [47.73386438748902]
DiffDock-PP is a diffusion generative model that learns to translate and rotate unbound protein structures into their bound conformations.
We achieve state-of-the-art performance on DIPS with a median C-RMSD of 4.85, outperforming all considered baselines.
arXiv Detail & Related papers (2023-04-08T02:10:44Z)
- Heterogenous Ensemble of Models for Molecular Property Prediction [55.91865861896012]
We propose a method for considering different modalities on molecules.
We ensemble these models with a HuberRegressor.
This yields a winning solution to the 2nd edition of the OGB Large-Scale Challenge (2022).
arXiv Detail & Related papers (2022-11-20T17:25:26Z)
- Deep Surrogate Docking: Accelerating Automated Drug Discovery with Graph Neural Networks [0.9785311158871759]
We introduce Deep Surrogate Docking (DSD), a framework that applies deep learning-based surrogate modeling to accelerate the docking process substantially.
We show that the DSD workflow combined with the FiLMv2 architecture provides a 9.496x speedup in molecule screening with a 3% recall error rate.
arXiv Detail & Related papers (2022-11-04T19:36:02Z)
- Learning with MISELBO: The Mixture Cookbook [62.75516608080322]
We present the first ever mixture of variational approximations for a normalizing flow-based hierarchical variational autoencoder (VAE) with VampPrior and a PixelCNN decoder network.
We explain this cooperative behavior by drawing a novel connection between VI and adaptive importance sampling.
We obtain state-of-the-art results among VAE architectures in terms of negative log-likelihood on the MNIST and FashionMNIST datasets.
arXiv Detail & Related papers (2022-09-30T15:01:35Z)
- HelixFold-Single: MSA-free Protein Structure Prediction by Using Protein Language Model as an Alternative [61.984700682903096]
HelixFold-Single is proposed to combine a large-scale protein language model with the superior geometric learning capability of AlphaFold2.
Our proposed method pre-trains a large-scale protein language model with thousands of millions of primary sequences.
We obtain an end-to-end differentiable model to predict the 3D coordinates of atoms from only the primary sequence.
arXiv Detail & Related papers (2022-07-28T07:30:33Z)
- Direct Molecular Conformation Generation [217.4815525740703]
We propose a method that directly predicts the coordinates of atoms.
Our method achieves state-of-the-art results on four public benchmarks.
arXiv Detail & Related papers (2022-02-03T01:01:58Z)
- DOCKSTRING: easy molecular docking yields better benchmarks for ligand design [3.848364262836075]
We present DOCKSTRING, a bundle for meaningful and robust comparison of machine learning models consisting of three components.
The Python package implements a robust ligand and target preparation protocol that allows non-experts to obtain meaningful docking scores.
Our dataset is the first to include docking poses, as well as the first of its size that is a full matrix.
arXiv Detail & Related papers (2021-10-29T01:37:13Z)
- MIMOSA: Multi-constraint Molecule Sampling for Molecule Optimization [51.00815310242277]
Generative models and reinforcement learning approaches have had initial success but still face difficulties in simultaneously optimizing multiple drug properties.
We propose the MultI-constraint MOlecule SAmpling (MIMOSA) approach, a sampling framework that uses the input molecule as an initial guess and samples molecules from the target distribution.
arXiv Detail & Related papers (2020-10-05T20:18:42Z)