Activity Cliff Prediction: Dataset and Benchmark
- URL: http://arxiv.org/abs/2302.07541v1
- Date: Wed, 15 Feb 2023 09:19:07 GMT
- Title: Activity Cliff Prediction: Dataset and Benchmark
- Authors: Ziqiao Zhang, Bangyi Zhao, Ailin Xie, Yatao Bian, Shuigeng Zhou
- Abstract summary: We first introduce ACNet, a large-scale dataset for AC prediction.
ACNet curates over 400K Matched Molecular Pairs (MMPs) against 190 targets.
We propose a baseline framework to benchmark the predictive performance of molecular representations encoded by deep neural networks for AC prediction.
- Score: 20.41770222873952
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Activity cliffs (ACs), generally defined as pairs of structurally
similar molecules that are active against the same bio-target but differ
significantly in binding potency, are of great importance to drug discovery. To
date, the AC prediction problem, i.e., predicting whether a pair of
molecules exhibits the AC relationship, has not been fully explored. In this
paper, we first introduce ACNet, a large-scale dataset for AC prediction. ACNet
curates over 400K Matched Molecular Pairs (MMPs) against 190 targets, including
over 20K MMP-cliffs and 380K non-AC MMPs, and provides five subsets for model
development and evaluation. Then, we propose a baseline framework to benchmark
the predictive performance of molecular representations encoded by deep neural
networks for AC prediction, under which 16 models are evaluated in experiments. Our
experimental results show that deep learning models can achieve good
performance when trained on tasks with an adequate amount of data,
while the imbalanced, low-data, and out-of-distribution characteristics of the
ACNet dataset remain challenging for deep neural networks. In
addition, the traditional ECFP method shows a natural advantage on MMP-cliff
prediction and outperforms the deep learning models on most of the data
subsets. To the best of our knowledge, our work constructs the first
large-scale dataset for AC prediction, which may stimulate the study of AC
prediction models and prompt further breakthroughs in AI-aided drug discovery.
The code and dataset are available at https://drugai.github.io/ACNet/.
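As described above, an MMP-cliff hinges on a large potency gap between two structurally similar compounds. A minimal sketch of the pair-labeling step, assuming the commonly used criterion of at least a 100-fold (2 log-unit) potency difference; the threshold and function name are illustrative, not ACNet's exact scheme:

```python
# Hedged sketch: labeling a matched molecular pair (MMP) as an
# activity cliff from the potency gap between its two compounds.
# Assumes pKi-style potencies (already -log10 of Ki) and a 100-fold
# default threshold; both are illustrative assumptions.
import math

def is_activity_cliff(pki_a: float, pki_b: float,
                      fold_threshold: float = 100.0) -> bool:
    """Return True if the potency gap between two structurally
    similar compounds meets the cliff threshold."""
    log_gap = abs(pki_a - pki_b)              # gap in log10 units
    return log_gap >= math.log10(fold_threshold)

print(is_activity_cliff(8.2, 5.2))  # 1000-fold gap -> True
print(is_activity_cliff(6.0, 5.5))  # ~3-fold gap -> False
```

Pairs below the threshold would correspond to the non-AC MMPs that make up the large majority (380K of 400K) of the dataset, which is the source of the class imbalance the abstract mentions.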
Related papers
- Transfer Learning for Molecular Property Predictions from Small Data Sets [0.0]
We benchmark common machine learning models for the prediction of molecular properties on two small data sets.
We present a transfer learning strategy that uses large data sets to pre-train the respective models, yielding more accurate models after fine-tuning on the original data sets.
arXiv Detail & Related papers (2024-04-20T14:25:34Z)
- A Meta-Learning Approach to Predicting Performance and Data Requirements [163.4412093478316]
We propose an approach to estimate the number of samples required for a model to reach a target performance.
We find that the power law, the de facto principle to estimate model performance, leads to large error when using a small dataset.
We introduce a novel piecewise power law (PPL) that handles the small-data and large-data regimes differently.
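The "de facto" power-law estimate the paper critiques can be computed with a simple log-log linear fit. A minimal sketch under the assumption that error scales as err(n) = a * n^(-b); the function names and synthetic data are illustrative, not the paper's method:

```python
# Hedged sketch: fitting a single power law err(n) = a * n^(-b)
# to (dataset size, error) observations via least squares in log space.
# This is the baseline estimator, not the paper's piecewise variant.
import math

def fit_power_law(sizes, errors):
    """Least-squares fit of log(err) = log(a) - b * log(n)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    xbar = sum(xs) / len(xs)
    ybar = sum(ys) / len(ys)
    slope = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
             / sum((x - xbar) ** 2 for x in xs))
    b = -slope
    a = math.exp(ybar + b * xbar)
    return a, b

def predict_error(a, b, n):
    return a * n ** (-b)

# Synthetic observations following err = 2 * n^-0.5 exactly:
sizes = [100, 1000, 10000]
errors = [2 * n ** -0.5 for n in sizes]
a, b = fit_power_law(sizes, errors)
print(round(a, 3), round(b, 3))  # recovers a ≈ 2.0, b ≈ 0.5
```

The paper's point is that a single fit of this form extrapolates poorly from small datasets, motivating a piecewise form with different parameters per regime.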
arXiv Detail & Related papers (2023-03-02T21:48:22Z)
- Clinical Deterioration Prediction in Brazilian Hospitals Based on Artificial Neural Networks and Tree Decision Models [56.93322937189087]
An extremely boosted neural network (XBNet) is used to predict clinical deterioration (CD)
The XGBoost model obtained the best results in predicting CD among Brazilian hospitals' data.
arXiv Detail & Related papers (2022-12-17T23:29:14Z)
- Deep Learning Architectures for FSCV, a Comparison [0.0]
Suitability is determined by the predictive performance in the "out-of-probe" case, the response to artificially induced electrical noise, and the ability to predict when the model will be errant for a given probe.
The InceptionTime architecture, a deep convolutional neural network, has the best absolute predictive performance of the models tested but was more susceptible to noise.
A naive multilayer perceptron architecture had the second lowest prediction error and was less affected by the artificial noise, suggesting that convolutions may not be as important for this task as one might suspect.
arXiv Detail & Related papers (2022-12-05T00:20:10Z)
- Ensemble Machine Learning Model Trained on a New Synthesized Dataset Generalizes Well for Stress Prediction Using Wearable Devices [3.006016887654771]
We investigate the generalization ability of models built on datasets containing a small number of subjects, recorded in single study protocols.
We propose and evaluate the use of ensemble techniques by combining gradient boosting with an artificial neural network to measure predictive power on new, unseen data.
arXiv Detail & Related papers (2022-09-30T00:20:57Z)
- Pre-training via Denoising for Molecular Property Prediction [53.409242538744444]
We describe a pre-training technique that utilizes large datasets of 3D molecular structures at equilibrium.
Inspired by recent advances in noise regularization, our pre-training objective is based on denoising.
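Denoising pre-training of this kind is commonly set up by perturbing equilibrium 3D coordinates with Gaussian noise and regressing the noise itself. A minimal sketch of the target construction, with array shapes and noise scale as illustrative assumptions rather than the paper's exact configuration:

```python
# Hedged sketch of a coordinate-denoising pre-training target:
# perturb equilibrium atom positions with Gaussian noise and keep
# the per-atom noise as the regression target. Noise scale and the
# toy geometry are illustrative assumptions.
import random

def make_denoising_example(coords, sigma=0.1, rng=random):
    """Given equilibrium coordinates [(x, y, z), ...], return the
    noised input and the per-atom noise the model should predict."""
    noise = [tuple(rng.gauss(0.0, sigma) for _ in range(3))
             for _ in coords]
    noised = [tuple(c + e for c, e in zip(atom, eps))
              for atom, eps in zip(coords, noise)]
    return noised, noise

coords = [(0.0, 0.0, 0.0), (1.1, 0.0, 0.0)]  # toy diatomic geometry
noised, target = make_denoising_example(coords, sigma=0.05)
# Pre-training would minimize the MSE between the network's output
# for `noised` and `target`.
```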
arXiv Detail & Related papers (2022-05-31T22:28:34Z)
- Learning brain MRI quality control: a multi-factorial generalization problem [0.0]
This work evaluated the performance of the MRIQC pipeline on various large-scale datasets.
We focused our analysis on the MRIQC preprocessing steps and tested the pipeline with and without them.
We concluded that a model trained with data from a heterogeneous population, such as the CATI dataset, provides the best scores on unseen data.
arXiv Detail & Related papers (2022-05-31T15:46:44Z)
- Comparing Test Sets with Item Response Theory [53.755064720563]
We evaluate 29 datasets using predictions from 18 pretrained Transformer models on individual test examples.
We find that Quoref, HellaSwag, and MC-TACO are best suited for distinguishing among state-of-the-art models.
We also observe that the span selection task format, used for QA datasets like QAMR or SQuAD2.0, is effective in differentiating between strong and weak models.
arXiv Detail & Related papers (2021-06-01T22:33:53Z)
- Comparing hundreds of machine learning classifiers and discrete choice models in predicting travel behavior: an empirical benchmark [3.0969191504482247]
This study seeks to provide a generalizable empirical benchmark by comparing hundreds of machine learning (ML) models and discrete choice models (DCMs).
Experiments evaluate both prediction accuracy and computational cost by spanning four hyper-dimensions.
Deep neural networks achieve the highest predictive performance, but at a relatively high computational cost.
arXiv Detail & Related papers (2021-02-01T19:45:47Z)
- A Systematic Approach to Featurization for Cancer Drug Sensitivity Predictions with Deep Learning [49.86828302591469]
We train >35,000 neural network models, sweeping over common featurization techniques.
We found the RNA-seq to be highly redundant and informative even with subsets larger than 128 features.
arXiv Detail & Related papers (2020-04-30T20:42:17Z)
- Assessing Graph-based Deep Learning Models for Predicting Flash Point [52.931492216239995]
Graph-based deep learning (GBDL) models were implemented in predicting flash point for the first time.
Average R2 and Mean Absolute Error (MAE) scores of MPNN are, respectively, 2.3% lower and 2.0 K higher than previous comparable studies.
arXiv Detail & Related papers (2020-02-26T06:10:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.