Compositional Deep Probabilistic Models of DNA Encoded Libraries
- URL: http://arxiv.org/abs/2310.13769v2
- Date: Tue, 13 Feb 2024 18:15:09 GMT
- Authors: Benson Chen, Mohammad M. Sultan, Theofanis Karaletsos
- Abstract summary: We introduce a compositional deep probabilistic model of DEL data, DEL-Compose, which decomposes molecular representations into their mono-synthon, di-synthon, and tri-synthon building blocks.
Our model demonstrates strong performance compared to count baselines, enriches the correct pharmacophores, and offers valuable insights via its intrinsic interpretable structure.
- Score: 6.206196935093064
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: DNA-Encoded Library (DEL) technology has proven to be a powerful tool that uses
combinatorially constructed small molecules to facilitate highly efficient
screening assays. These selection experiments, involving multiple stages of
washing, elution, and identification of potent binders via unique DNA barcodes,
often generate complex data. This complexity can potentially mask the
underlying signals, necessitating the application of computational tools such
as machine learning to uncover valuable insights. We introduce a compositional
deep probabilistic model of DEL data, DEL-Compose, which decomposes molecular
representations into their mono-synthon, di-synthon, and tri-synthon building
blocks and capitalizes on the inherent hierarchical structure of these
molecules by modeling latent reactions between embedded synthons. Additionally,
we investigate methods to improve the observation models for DEL count data
such as integrating covariate factors to more effectively account for data
noise. Across two popular public benchmark datasets (CA-IX and HRP), our model
demonstrates strong performance compared to count baselines, enriches the
correct pharmacophores, and offers valuable insights via its intrinsic
interpretable structure, thereby providing a robust tool for the analysis of
DEL data.
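As a rough illustration of the decomposition the abstract describes, the sketch below enumerates the mono-, di-, and tri-synthon groupings of a single three-synthon molecule. It is not the authors' implementation: the function name `synthon_compositions` and the synthon IDs are hypothetical, and it only shows the combinatorial structure, not the learned embeddings or latent reactions.

```python
from itertools import combinations


def synthon_compositions(synthons):
    """Group all non-empty synthon subsets of one molecule by subset size.

    Keys are subset sizes (1 = mono-synthon, 2 = di-synthon, 3 = tri-synthon);
    values are lists of tuples that preserve the original synthon order.
    """
    return {
        size: list(combinations(synthons, size))
        for size in range(1, len(synthons) + 1)
    }


# Hypothetical synthon identifiers for one DEL molecule (assumed for illustration).
molecule = ("A12", "B07", "C33")
comps = synthon_compositions(molecule)

for size, groups in comps.items():
    print(size, groups)
```

For a three-synthon molecule this gives three mono-synthons, three di-synthons, and one tri-synthon, which is the set of building-block levels a compositional model of this kind would have to represent.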
Related papers
- KinDEL: DNA-Encoded Library Dataset for Kinase Inhibitors [2.0179908661487986]
We present KinDEL, one of the first large, publicly available DEL datasets on two kinases.
We benchmark different machine learning techniques to develop predictive models for hit identification.
We provide biophysical assay data, both on- and off-DNA, to validate our models on a smaller subset of molecules.
arXiv Detail & Related papers (2024-10-11T16:03:58Z)
- Unlocking Potential Binders: Multimodal Pretraining DEL-Fusion for Denoising DNA-Encoded Libraries [51.72836644350993]
We introduce the Multimodal Pretraining DEL-Fusion model (MPDF).
We develop pretraining tasks applying contrastive objectives between different compound representations and their text descriptions.
We propose a novel DEL-fusion framework that amalgamates compound information at the atomic, submolecular, and molecular levels.
arXiv Detail & Related papers (2024-09-07T17:32:21Z)
- Genetic Instruct: Scaling up Synthetic Generation of Coding Instructions for Large Language Models [54.51932175059004]
We introduce a scalable method for generating synthetic instructions to enhance the code generation capability of Large Language Models.
The proposed algorithm, Genetic-Instruct, mimics evolutionary processes, utilizing self-instruction to create numerous synthetic samples from a limited number of seeds.
arXiv Detail & Related papers (2024-07-29T20:42:59Z)
- Let's Synthesize Step by Step: Iterative Dataset Synthesis with Large Language Models by Extrapolating Errors from Small Models [69.76066070227452]
*Data Synthesis* is a promising way to train a small model with very little labeled data.
We propose *Synthesis Step by Step* (**S3**), a data synthesis framework that shrinks this distribution gap.
Our approach improves the performance of a small model by reducing the gap between the synthetic dataset and the real data.
arXiv Detail & Related papers (2023-10-20T17:14:25Z)
- A robust synthetic data generation framework for machine learning in High-Resolution Transmission Electron Microscopy (HRTEM) [1.0923877073891446]
Construction Zone is a Python package for rapidly generating complex nanoscale atomic structures.
We develop an end-to-end workflow for creating large simulated databases for training neural networks.
Using our results, we are able to achieve state-of-the-art segmentation performance on experimental HRTEM images of nanoparticles.
arXiv Detail & Related papers (2023-09-12T10:44:15Z)
- Fast and Functional Structured Data Generators Rooted in Out-of-Equilibrium Physics [62.997667081978825]
We address the challenge of using energy-based models to produce high-quality, label-specific data in structured datasets.
Traditional training methods encounter difficulties due to inefficient Markov chain Monte Carlo mixing.
We use a novel training algorithm that exploits non-equilibrium effects.
arXiv Detail & Related papers (2023-07-13T15:08:44Z)
- DEL-Dock: Molecular Docking-Enabled Modeling of DNA-Encoded Libraries [1.290382979353427]
We introduce a new paradigm, DEL-Dock, that combines ligand-based descriptors with 3-D spatial information from docked protein-ligand complexes.
We show that our model is capable of effectively denoising DEL count data to predict molecule enrichment scores.
arXiv Detail & Related papers (2022-11-30T22:00:24Z)
- BeCAPTCHA-Type: Biometric Keystroke Data Generation for Improved Bot Detection [63.447493500066045]
This work proposes a data driven learning model for the synthesis of keystroke biometric data.
The proposed method is compared with two statistical approaches based on Universal and User-dependent models.
Our experimental framework considers a dataset with 136 million keystroke events from 168 thousand subjects.
arXiv Detail & Related papers (2022-07-27T09:26:15Z)
- Machine learning on DNA-encoded library count data using an uncertainty-aware probabilistic loss function [1.5559232742666467]
We show a regression approach to learning DEL enrichments of individual molecules using a custom negative log-likelihood loss function.
We illustrate this approach on a dataset of 108k compounds screened against CAIX, and a dataset of 5.7M compounds screened against sEH and SIRT2.
arXiv Detail & Related papers (2021-08-27T19:37:06Z)
- A Systematic Approach to Featurization for Cancer Drug Sensitivity Predictions with Deep Learning [49.86828302591469]
We train >35,000 neural network models, sweeping over common featurization techniques.
We found the RNA-seq to be highly redundant and informative even with subsets larger than 128 features.
arXiv Detail & Related papers (2020-04-30T20:42:17Z)