Related papers: Synthetic Data from Diffusion Models Improve Drug Discovery Prediction

URL: http://arxiv.org/abs/2405.03799v1
Date: Mon, 6 May 2024 19:09:37 GMT
Title: Synthetic Data from Diffusion Models Improve Drug Discovery Prediction
Authors: Bing Hu, Ashish Saragadam, Anita Layton, Helen Chen
Abstract summary: Data sparsity makes data curation difficult for researchers looking to answer key research questions. We propose a novel diffusion GNN model Syngand capable of generating ligand and pharmacokinetic data end-to-end. We show initial promising results on the efficacy of the Syngand-generated synthetic target property data on downstream regression tasks with AqSolDB, LD50, and hERG central.
Score: 1.3686993145787065
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Artificial intelligence (AI) is increasingly used in every stage of drug development. Continuing breakthroughs in AI-based methods for drug discovery require the creation, improvement, and refinement of drug discovery data. We posit a new data challenge that slows the advancement of drug discovery AI: datasets are often collected independently from each other, often with little overlap, creating data sparsity. Data sparsity makes data curation difficult for researchers looking to answer key research questions requiring values posed across multiple datasets. We propose a novel diffusion GNN model Syngand capable of generating ligand and pharmacokinetic data end-to-end. We show and provide a methodology for sampling pharmacokinetic data for existing ligands using our Syngand model. We show the initial promising results on the efficacy of the Syngand-generated synthetic target property data on downstream regression tasks with AqSolDB, LD50, and hERG central. Using our proposed model and methodology, researchers can easily generate synthetic ligand data to help them explore research questions that require data spanning multiple datasets.

Related papers

Improving the Generation and Evaluation of Synthetic Data for Downstream Medical Causal Inference [89.5628648718851]
Causal inference is essential for developing and evaluating medical interventions. Real-world medical datasets are often difficult to access due to regulatory barriers. We present STEAM: a novel method for generating Synthetic data for Treatment Effect Analysis in Medicine.
arXiv Detail & Related papers (2025-10-21T16:16:00Z)
Generative Models for Synthetic Data: Transforming Data Mining in the GenAI Era [49.46005489386284]
This tutorial introduces the foundations and latest advances in synthetic data generation. Attendees will gain actionable insights into leveraging generative synthetic data to enhance data mining research and practice.
arXiv Detail & Related papers (2025-08-27T05:04:07Z)
VECT-GAN: A variationally encoded generative model for overcoming data scarcity in pharmaceutical science [32.92218213317144]
Existing datasets are often small and noisy, limiting their utility. We develop a generative model specifically designed for augmenting small, noisy datasets. We make our method, including VECT-GAN pre-trained on ChEMBL available as a pip package.
arXiv Detail & Related papers (2025-01-15T18:23:33Z)
Synthetic ECG Generation for Data Augmentation and Transfer Learning in Arrhythmia Classification [1.7614607439356635]
We explore the usefulness of synthetic data generated with different generative models from Deep Learning. We investigate the effects of transfer learning, by fine-tuning a synthetically pre-trained model and then adding increasing proportions of real data.
arXiv Detail & Related papers (2024-11-27T15:46:34Z)
Hierarchical Conditional Tabular GAN for Multi-Tabular Synthetic Data Generation [0.0]
We propose the algorithm HCTGAN to synthesize multi-tabular data from complex multi-tabular datasets. Our findings show that our proposed algorithm can more efficiently sample large amounts of synthetic data for deep and complex multi-tabular datasets. We conclude that the HCTGAN algorithm is suitable for generating large amounts of synthetic data efficiently for deep multi-tabular datasets with complex relationships.
arXiv Detail & Related papers (2024-11-11T14:09:26Z)
Exploring the Landscape for Generative Sequence Models for Specialized Data Synthesis [0.0]
This paper introduces a novel approach that leverages three generative models of varying complexity to synthesize Malicious Network Traffic. Our approach transforms numerical data into text, re-framing data generation as a language modeling task. Our method surpasses state-of-the-art generative models in producing high-fidelity synthetic data.
arXiv Detail & Related papers (2024-11-04T09:51:10Z)
Drug Discovery SMILES-to-Pharmacokinetics Diffusion Models with Deep Molecular Understanding [1.4952056744888913]
Imagand is a novel SMILES-to-Pharmacokinetic (S2PK) diffusion model capable of generating an array of PK target properties conditioned on SMILES inputs. Imagand is a promising solution for data overlap sparsity and allows researchers to efficiently generate ligand PK data for drug discovery research.
arXiv Detail & Related papers (2024-08-14T16:01:02Z)
Unconditional Latent Diffusion Models Memorize Patient Imaging Data: Implications for Openly Sharing Synthetic Data [2.04850174048739]
We train latent diffusion models on CT, MR, and X-ray datasets for synthetic data generation. We then detect the amount of training data memorized utilizing our novel self-supervised copy detection approach. Our findings show a surprisingly high degree of patient data memorization across all datasets.
arXiv Detail & Related papers (2024-02-01T22:58:21Z)
Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A Comprehensive Benchmark [56.8042116967334]
Synthetic data serves as an alternative in training machine learning models. Ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task. This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
arXiv Detail & Related papers (2023-10-25T20:32:02Z)
STAR: Boosting Low-Resource Information Extraction by Structure-to-Text Data Generation with Large Language Models [56.27786433792638]
STAR is a data generation method that leverages Large Language Models (LLMs) to synthesize data instances. We design fine-grained step-by-step instructions to obtain the initial data instances. Our experiments show that the data generated by STAR significantly improve the performance of low-resource event extraction and relation extraction tasks.
arXiv Detail & Related papers (2023-05-24T12:15:19Z)
Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task. We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z)
Synthetic data generation for a longitudinal cohort study -- Evaluation, method extension and reproduction of published data analysis results [0.32593385688760446]
In the health sector, access to individual-level data is often challenging due to privacy concerns. A promising alternative is the generation of fully synthetic data. In this study, we use a state-of-the-art synthetic data generation method.
arXiv Detail & Related papers (2023-05-12T13:13:55Z)
Drug Synergistic Combinations Predictions via Large-Scale Pre-Training and Graph Structure Learning [82.93806087715507]
Drug combination therapy is a well-established strategy for disease treatment with better effectiveness and less safety degradation. Deep learning models have emerged as an efficient way to discover synergistic combinations. Our framework achieves state-of-the-art results in comparison with other deep learning-based methods.
arXiv Detail & Related papers (2023-01-14T15:07:43Z)
ImDrug: A Benchmark for Deep Imbalanced Learning in AI-aided Drug Discovery [79.08833067391093]
Real-world pharmaceutical datasets often exhibit highly imbalanced distribution. We introduce ImDrug, a benchmark with an open-source Python library which consists of 4 imbalance settings, 11 AI-ready datasets, 54 learning tasks and 16 baseline algorithms tailored for imbalanced learning. It provides an accessible and customizable testbed for problems and solutions spanning a broad spectrum of the drug discovery pipeline.
arXiv Detail & Related papers (2022-09-16T13:35:57Z)
Unsupervised Domain Adaptive Learning via Synthetic Data for Person Re-identification [101.1886788396803]
Person re-identification (re-ID) has gained more and more attention due to its widespread applications in video surveillance. Unfortunately, the mainstream deep learning methods still need a large quantity of labeled data to train models. In this paper, we develop a data collector to automatically generate synthetic re-ID samples in a computer game, and construct a data labeler to simultaneously annotate them.
arXiv Detail & Related papers (2021-09-12T15:51:41Z)

This list is automatically generated from the titles and abstracts of the papers on this site.