Targeting SARS-CoV-2 with AI- and HPC-enabled Lead Generation: A First
Data Release
- URL: http://arxiv.org/abs/2006.02431v1
- Date: Thu, 28 May 2020 01:33:07 GMT
- Title: Targeting SARS-CoV-2 with AI- and HPC-enabled Lead Generation: A First
Data Release
- Authors: Yadu Babuji, Ben Blaiszik, Tom Brettin, Kyle Chard, Ryan Chard, Austin
Clyde, Ian Foster, Zhi Hong, Shantenu Jha, Zhuozhao Li, Xuefeng Liu, Arvind
Ramanathan, Yi Ren, Nicholaus Saint, Marcus Schwarting, Rick Stevens,
Hubertus van Dam, Rick Wagner
- Abstract summary: This data release encompasses structural information on the 4.2 B molecules and 60 TB of pre-computed data.
One promising approach is to train machine learning (ML) and artificial intelligence (AI) tools to screen large numbers of small molecules.
Future releases will expand the data to include more detailed molecular simulations, computed models, and other products.
- Score: 8.090016327163564
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Researchers across the globe are seeking to rapidly repurpose existing drugs
or discover new drugs to counter the novel coronavirus disease (COVID-19)
caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). One
promising approach is to train machine learning (ML) and artificial
intelligence (AI) tools to screen large numbers of small molecules. As a
contribution to that effort, we are aggregating numerous small molecules from a
variety of sources, using high-performance computing (HPC) to compute diverse
properties of those molecules, using the computed properties to train ML/AI
models, and then using the resulting models for screening. In this first data
release, we make available 23 datasets collected from community sources
representing over 4.2 B molecules enriched with pre-computed: 1) molecular
fingerprints to aid similarity searches, 2) 2D images of molecules to enable
exploration and application of image-based deep learning methods, and 3) 2D and
3D molecular descriptors to speed development of machine learning models. This
data release encompasses structural information on the 4.2 B molecules and 60
TB of pre-computed data. Future releases will expand the data to include more
detailed molecular simulations, computed models, and other products.
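To make the three kinds of pre-computed artifacts concrete, the following is a minimal sketch using RDKit on two illustrative molecules. RDKit, the fingerprint settings, and the specific descriptors shown here are assumptions chosen for illustration, not necessarily the tools or property set used in this project's pipeline.

```python
# Illustrative sketch of the kinds of pre-computed artifacts described above
# (fingerprints, 2D images, 2D descriptors). Requires: pip install rdkit
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem, Descriptors, Draw

# Two example molecules (SMILES strings are illustrative, not from the dataset).
smiles = {
    "aspirin": "CC(=O)Oc1ccccc1C(=O)O",
    "salicylic_acid": "O=C(O)c1ccccc1O",
}
mols = {name: Chem.MolFromSmiles(s) for name, s in smiles.items()}

# 1) Morgan (ECFP-like) fingerprints to support similarity searches.
fps = {name: AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=2048)
       for name, m in mols.items()}
sim = DataStructs.TanimotoSimilarity(fps["aspirin"], fps["salicylic_acid"])
print(f"Tanimoto similarity: {sim:.3f}")

# 2) 2D depictions for image-based deep learning methods.
for name, m in mols.items():
    Draw.MolToFile(m, f"{name}.png", size=(300, 300))

# 3) A handful of 2D descriptors to feed machine learning models.
for name, m in mols.items():
    print(name, Descriptors.MolWt(m), Descriptors.MolLogP(m), Descriptors.TPSA(m))
```

In a full-scale pipeline these computations would be batched over billions of molecules on HPC resources; the snippet only shows the per-molecule artifacts.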
Related papers
- Data-Efficient Molecular Generation with Hierarchical Textual Inversion [48.816943690420224]
We introduce Hierarchical textual Inversion for Molecular generation (HI-Mol), a novel data-efficient molecular generation method.
HI-Mol is inspired by the importance of hierarchical information, e.g., both coarse- and fine-grained features, in understanding the molecule distribution.
Compared to the conventional textual inversion method in the image domain using a single-level token embedding, our multi-level token embeddings allow the model to effectively learn the underlying low-shot molecule distribution.
arXiv Detail & Related papers (2024-05-05T08:35:23Z)
- TwinBooster: Synergising Large Language Models with Barlow Twins and Gradient Boosting for Enhanced Molecular Property Prediction [0.0]
In this study, we use a fine-tuned large language model to integrate biological assays based on their textual information.
This architecture uses both assay information and molecular fingerprints to extract the true molecular information.
TwinBooster enables the prediction of properties for unseen bioassays and molecules, delivering state-of-the-art zero-shot learning performance.
arXiv Detail & Related papers (2024-01-09T10:36:20Z)
- Bi-level Contrastive Learning for Knowledge-Enhanced Molecule Representations [55.42602325017405]
We propose a novel method called GODE, which takes into account the two-level structure of individual molecules.
By pre-training two graph neural networks (GNNs) on different graph structures, combined with contrastive learning, GODE fuses molecular structures with their corresponding knowledge graph substructures.
When fine-tuned across 11 chemical property tasks, our model outperforms existing benchmarks, registering an average ROC-AUC uplift of 13.8% for classification tasks and an average RMSE/MAE improvement of 35.1% for regression tasks.
arXiv Detail & Related papers (2023-06-02T15:49:45Z)
- Implicit Geometry and Interaction Embeddings Improve Few-Shot Molecular Property Prediction [53.06671763877109]
We develop molecular embeddings that encode complex molecular characteristics to improve the performance of few-shot molecular property prediction.
Our approach leverages large amounts of synthetic data, namely the results of molecular docking calculations.
On multiple molecular property prediction benchmarks, training from the embedding space substantially improves Multi-Task, MAML, and Prototypical Network few-shot learning performance.
arXiv Detail & Related papers (2023-02-04T01:32:40Z)
- Graph-based Molecular Representation Learning [59.06193431883431]
Molecular representation learning (MRL) is a key step to build the connection between machine learning and chemical science.
Recently, MRL has achieved considerable progress, especially in methods based on deep molecular graph learning.
arXiv Detail & Related papers (2022-07-08T17:43:20Z)
- 3D Graph Contrastive Learning for Molecular Property Prediction [1.0152838128195467]
Self-supervised learning (SSL) is a method that learns the data representation by utilizing supervision inherent in the data.
We propose a novel contrastive learning framework, small-scale 3D Graph Contrastive Learning (3DGCL) for molecular property prediction.
arXiv Detail & Related papers (2022-05-31T04:45:31Z)
- MoleHD: Ultra-Low-Cost Drug Discovery using Hyperdimensional Computing [2.7462881838152913]
We present MoleHD, a method based on brain-inspired hyperdimensional computing (HDC) for molecular property prediction.
MoleHD achieves the highest ROC-AUC score on random and scaffold splits, on average, across 3 datasets.
To the best of our knowledge, this is the first HDC-based method for drug discovery; a minimal sketch of the general HDC encoding idea appears after this list.
arXiv Detail & Related papers (2021-06-05T13:33:21Z)
- Molecular machine learning with conformer ensembles [0.0]
We introduce multiple deep learning models that expand upon key architectures such as ChemProp and SchNet.
We then benchmark the performance trade-offs of these models on 2D, 3D and 4D representations in the prediction of drug activity.
The new architectures perform significantly better than 2D models, but their performance is often just as strong with a single conformer as with many.
arXiv Detail & Related papers (2020-12-15T17:44:48Z)
- Advanced Graph and Sequence Neural Networks for Molecular Property Prediction and Drug Discovery [53.00288162642151]
We develop MoleculeKit, a suite of comprehensive machine learning tools spanning different computational models and molecular representations.
Built on these representations, MoleculeKit includes both deep learning and traditional machine learning methods for graph and sequence data.
Results on both online and offline antibiotics discovery and molecular property prediction tasks show that MoleculeKit achieves consistent improvements over prior methods.
arXiv Detail & Related papers (2020-12-02T02:09:31Z)
- Self-Supervised Graph Transformer on Large-Scale Molecular Data [73.3448373618865]
We propose a novel framework, GROVER, for molecular representation learning.
GROVER can learn rich structural and semantic information of molecules from enormous unlabelled molecular data.
We pre-train GROVER with 100 million parameters on 10 million unlabelled molecules -- the biggest GNN and the largest training dataset in molecular representation learning.
arXiv Detail & Related papers (2020-06-18T08:37:04Z)
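The MoleHD entry above names hyperdimensional computing (HDC) without describing it. The snippet below is a rough, purely illustrative sketch of the general HDC idea (random bipolar hypervectors, bundling, and cosine comparison against class prototypes); the tokenization, dimensionality, and example molecules are assumptions and this is not MoleHD's actual pipeline.

```python
# Minimal illustration of hyperdimensional computing for molecule classification.
import numpy as np

DIM = 10_000                      # hypervector dimensionality (assumed)
rng = np.random.default_rng(0)

def random_hv():
    """Random bipolar hypervector in {-1, +1}^DIM."""
    return rng.choice([-1, 1], size=DIM)

def encode(tokens, codebook):
    """Bundle (sum, then binarize) the hypervectors of a molecule's tokens."""
    for t in tokens:
        if t not in codebook:
            codebook[t] = random_hv()
    bundled = np.sum([codebook[t] for t in tokens], axis=0)
    return np.sign(bundled + 0.1)  # break ties toward +1

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy usage: SMILES characters as tokens (real methods use smarter tokenizations);
# class prototypes are bundles of training molecules.
codebook = {}
actives = [encode(list(s), codebook) for s in ["CCO", "CCN"]]
inactives = [encode(list(s), codebook) for s in ["c1ccccc1"]]
proto_active = np.sign(np.sum(actives, axis=0) + 0.1)
proto_inactive = np.sign(np.sum(inactives, axis=0) + 0.1)

query = encode(list("CCOC"), codebook)
print("active score:", cosine(query, proto_active))
print("inactive score:", cosine(query, proto_inactive))
```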