Machine learning on DNA-encoded library count data using an
uncertainty-aware probabilistic loss function
- URL: http://arxiv.org/abs/2108.12471v1
- Date: Fri, 27 Aug 2021 19:37:06 GMT
- Title: Machine learning on DNA-encoded library count data using an
uncertainty-aware probabilistic loss function
- Authors: Katherine S. Lim, Andrew G. Reidenbach, Bruce K. Hua, Jeremy W. Mason,
Christopher J. Gerry, Paul A. Clemons, Connor W. Coley
- Abstract summary: We show a regression approach to learning DEL enrichments of individual molecules using a custom negative log-likelihood loss function.
We illustrate this approach on a dataset of 108k compounds screened against CAIX, and a dataset of 5.7M compounds screened against sEH and SIRT2.
- Score: 1.5559232742666467
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: DNA-encoded library (DEL) screening and quantitative structure-activity
relationship (QSAR) modeling are two techniques used in drug discovery to find
small molecules that bind a protein target. Applying QSAR modeling to DEL data
can facilitate the selection of compounds for off-DNA synthesis and evaluation.
Such a combined approach has been shown recently by training binary classifiers
to learn DEL enrichments of aggregated "disynthons" to accommodate the sparse
and noisy nature of DEL data. However, a binary classifier cannot distinguish
between different levels of enrichment, and information is potentially lost
during disynthon aggregation. Here, we demonstrate a regression approach to
learning DEL enrichments of individual molecules using a custom negative
log-likelihood loss function that effectively denoises DEL data and introduces
opportunities for visualization of learned structure-activity relationships
(SAR). Our approach explicitly models the Poisson statistics of the sequencing
process used in the DEL experimental workflow under a frequentist view. We
illustrate this approach on a dataset of 108k compounds screened against CAIX,
and a dataset of 5.7M compounds screened against sEH and SIRT2. Due to the
treatment of uncertainty in the data through the negative log-likelihood loss
function, the models can ignore low-confidence outliers. While our approach
does not demonstrate a benefit for extrapolation to novel structures, we expect
our denoising and visualization pipeline to be useful in identifying SAR trends
and enriched pharmacophores in DEL data. Further, this approach to
uncertainty-aware regression is applicable to other sparse or noisy datasets
where the nature of stochasticity is known or can be modeled; in particular,
the Poisson enrichment ratio metric we use can apply to other settings that
compare sequencing count data between two experimental conditions.
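As a concrete illustration of the loss described in the abstract, the sketch below implements a simple Poisson negative log-likelihood for per-compound enrichment regression. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name, the +1 pseudocount, and the sequencing-depth arguments are illustrative choices, and the paper's exact parameterization of the sequencing model may differ.

```python
import torch

def poisson_enrichment_nll(pred_log_enrichment: torch.Tensor,
                           target_counts: torch.Tensor,
                           control_counts: torch.Tensor,
                           target_depth: float = 1.0,
                           control_depth: float = 1.0) -> torch.Tensor:
    """Illustrative Poisson NLL for per-compound DEL enrichment regression.

    Observed counts in the target selection are treated as Poisson draws whose
    rate is a control-derived baseline scaled by the model's predicted
    enrichment. Compounds with few reads yield flatter likelihoods, so noisy
    outliers contribute weaker gradients than confidently enriched compounds.
    """
    # Per-compound baseline rate from the control (e.g. bead-only) selection;
    # the +1 pseudocount that avoids zero rates is an assumption of this sketch.
    base_rate = (control_counts + 1.0) / control_depth

    # Expected target-condition counts under the predicted enrichment.
    expected = base_rate * torch.exp(pred_log_enrichment) * target_depth

    # Poisson negative log-likelihood, dropping the constant log(k!) term.
    nll = expected - target_counts * torch.log(expected)
    return nll.mean()


# Toy usage with hypothetical predictions and counts for three compounds.
pred = torch.tensor([0.2, 1.5, -0.3], requires_grad=True)
loss = poisson_enrichment_nll(pred,
                              target_counts=torch.tensor([5.0, 40.0, 1.0]),
                              control_counts=torch.tensor([4.0, 3.0, 2.0]))
loss.backward()
```

The design point this sketch tries to capture is that optimizing a count likelihood, rather than a squared error on point estimates of enrichment, is what lets low-count (low-confidence) observations contribute only weakly to the fit.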
Related papers
- DEL-Ranking: Ranking-Correction Denoising Framework for Elucidating Molecular Affinities in DNA-Encoded Libraries [43.47251247740565]
DNA-encoded library (DEL) screening has revolutionized the detection of protein-ligand interactions through read counts.
Noise in read counts, stemming from nonspecific interactions, can mislead this exploration process.
We present DEL-Ranking, a distribution-correction denoising framework that addresses these challenges.
arXiv Detail & Related papers (2024-10-19T02:32:09Z)
- Exploiting the Data Gap: Utilizing Non-ignorable Missingness to Manipulate Model Learning [13.797822374912773]
Adversarial Missingness (AM) attacks are motivated by maliciously engineering non-ignorable missingness mechanisms.
In this work we focus on associational learning in the context of AM attacks.
We formulate the learning of the adversarial missingness mechanism as a bi-level optimization.
arXiv Detail & Related papers (2024-09-06T17:10:28Z)
- Unlearnable Examples Detection via Iterative Filtering [84.59070204221366]
Deep neural networks are proven to be vulnerable to data poisoning attacks.
It is quite beneficial and challenging to detect poisoned samples from a mixed dataset.
We propose an Iterative Filtering approach for identifying unlearnable examples (UEs).
arXiv Detail & Related papers (2024-08-15T13:26:13Z)
- Extracting Training Data from Unconditional Diffusion Models [76.85077961718875]
Diffusion probabilistic models (DPMs) are being employed as mainstream models for generative artificial intelligence (AI).
We aim to establish a theoretical understanding of memorization in DPMs with 1) a memorization metric for theoretical analysis, 2) an analysis of conditional memorization with informative and random labels, and 3) two better evaluation metrics for measuring memorization.
Based on the theoretical analysis, we propose a novel data extraction method called Surrogate condItional Data Extraction (SIDE) that leverages a model trained on generated data as a surrogate condition to extract training data directly from unconditional diffusion models.
arXiv Detail & Related papers (2024-06-18T16:20:12Z)
- Assessing Neural Network Representations During Training Using Noise-Resilient Diffusion Spectral Entropy [55.014926694758195]
Entropy and mutual information in neural networks provide rich information on the learning process.
We leverage data geometry to access the underlying manifold and reliably compute these information-theoretic measures.
We show that they form noise-resistant measures of intrinsic dimensionality and relationship strength in high-dimensional simulated data.
arXiv Detail & Related papers (2023-12-04T01:32:42Z)
- Compositional Deep Probabilistic Models of DNA Encoded Libraries [6.206196935093064]
We introduce a compositional deep probabilistic model of DEL data, DEL-Compose, which decomposes molecular representations into their mono-synthon, di-synthon, and tri-synthon building blocks.
Our model demonstrates strong performance compared to count baselines, enriches the correct pharmacophores, and offers valuable insights via its intrinsic interpretable structure.
arXiv Detail & Related papers (2023-10-20T19:04:28Z)
- DEL-Dock: Molecular Docking-Enabled Modeling of DNA-Encoded Libraries [1.290382979353427]
We introduce a new paradigm, DEL-Dock, that combines ligand-based descriptors with 3-D spatial information from docked protein-ligand complexes.
We show that our model is capable of effectively denoising DEL count data to predict molecule enrichment scores.
arXiv Detail & Related papers (2022-11-30T22:00:24Z)
- DynImp: Dynamic Imputation for Wearable Sensing Data Through Sensory and Temporal Relatedness [78.98998551326812]
We argue that traditional methods have rarely made use of both the time-series dynamics of the data and the relatedness of features from different sensors.
We propose a model, termed DynImp, to handle missingness at different time points using nearest neighbors along the feature axis.
We show that the method can exploit the multi-modality features from related sensors and also learn from history time-series dynamics to reconstruct the data under extreme missingness.
arXiv Detail & Related papers (2022-09-26T21:59:14Z)
- Learn from Unpaired Data for Image Restoration: A Variational Bayes Approach [18.007258270845107]
We propose LUD-VAE, a deep generative method to learn the joint probability density function from data sampled from marginal distributions.
We apply our method to real-world image denoising and super-resolution tasks and train the models using the synthetic data generated by the LUD-VAE.
arXiv Detail & Related papers (2022-04-21T13:27:17Z)
- Bootstrapping Your Own Positive Sample: Contrastive Learning With Electronic Health Record Data [62.29031007761901]
This paper proposes a novel contrastive regularized clinical classification model.
We introduce two unique positive sampling strategies specifically tailored for EHR data.
Our framework yields highly competitive experimental results in predicting the mortality risk on real-world COVID-19 EHR data.
arXiv Detail & Related papers (2021-04-07T06:02:04Z)
- Efficient Causal Inference from Combined Observational and Interventional Data through Causal Reductions [68.6505592770171]
Unobserved confounding is one of the main challenges when estimating causal effects.
We propose a novel causal reduction method that replaces an arbitrary number of possibly high-dimensional latent confounders with a single latent confounder.
We propose a learning algorithm to estimate the parameterized reduced model jointly from observational and interventional data.
arXiv Detail & Related papers (2021-03-08T14:29:07Z)