VECT-GAN: A variationally encoded generative model for overcoming data scarcity in pharmaceutical science
- URL: http://arxiv.org/abs/2501.08995v2
- Date: Fri, 17 Jan 2025 08:58:48 GMT
- Title: VECT-GAN: A variationally encoded generative model for overcoming data scarcity in pharmaceutical science
- Authors: Youssef Abdalla, Marrisa Taub, Eleanor Hilton, Priya Akkaraju, Alexander Milanovic, Mine Orlu, Abdul W. Basit, Michael T Cook, Tapabrata Chakraborti, David Shorthouse
- Abstract summary: Existing datasets are often small and noisy, limiting their utility.
We develop a generative model specifically designed for augmenting small, noisy datasets.
We make our method, including VECT-GAN pre-trained on ChEMBL, available as a pip package.
- Score: 32.92218213317144
- Abstract: Data scarcity in pharmaceutical research has led to reliance on labour-intensive trial-and-error approaches for development rather than data-driven methods. While Machine Learning offers a solution, existing datasets are often small and noisy, limiting their utility. To address this, we developed a Variationally Encoded Conditional Tabular Generative Adversarial Network (VECT-GAN), a novel generative model specifically designed for augmenting small, noisy datasets. We introduce a pipeline where data is augmented before regression model development and demonstrate that this consistently and significantly improves performance over other state-of-the-art tabular generative models. We apply this pipeline across six pharmaceutical datasets, and highlight its real-world applicability by developing novel polymers with medically desirable mucoadhesive properties, which we made and experimentally characterised. Additionally, we pre-train the model on the ChEMBL database of drug-like molecules, leveraging knowledge distillation to enhance its generalisability, making it readily available for use on pharmaceutical datasets containing small molecules, an extremely common pharmaceutical task. We demonstrate the power of synthetic data for regularising small tabular datasets, highlighting its potential to become standard practice in pharmaceutical model development, and make our method, including VECT-GAN pre-trained on ChEMBL, available as a pip package.
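The augment-then-regress pipeline described in the abstract can be illustrated with a minimal sketch. This is not the VECT-GAN implementation and does not use the authors' pip package, whose API is not given here. As a loudly labelled assumption, a joint Gaussian fitted to the table stands in for the generative model, and ordinary least squares stands in for the downstream regressor. All function names and the toy data are hypothetical.

```python
import numpy as np


def fit_gaussian_generator(table):
    """Stand-in generator (assumption, NOT VECT-GAN): joint Gaussian over all columns."""
    mean = table.mean(axis=0)
    cov = np.cov(table, rowvar=False)
    return mean, cov


def sample_synthetic(generator, n_rows, rng):
    """Draw synthetic rows (features and target together) from the fitted Gaussian."""
    mean, cov = generator
    return rng.multivariate_normal(mean, cov, size=n_rows)


def fit_linear_regressor(features, target):
    """Ordinary least squares with an intercept column appended last."""
    design = np.column_stack([features, np.ones(len(features))])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef


def predict(coef, features):
    """Apply the fitted coefficients (intercept is the last entry)."""
    design = np.column_stack([features, np.ones(len(features))])
    return design @ coef


def augment_then_fit(features, target, n_synthetic, rng):
    """Augment the small table with synthetic rows, then fit the regressor."""
    table = np.column_stack([features, target])
    generator = fit_gaussian_generator(table)
    synthetic = sample_synthetic(generator, n_synthetic, rng)
    augmented = np.vstack([table, synthetic])
    coef = fit_linear_regressor(augmented[:, :-1], augmented[:, -1])
    return coef, augmented


# Toy small, noisy dataset (hypothetical values, for illustration only).
rng = np.random.default_rng(0)
X_small = rng.normal(size=(20, 3))
y_small = X_small @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=20)

coef, augmented = augment_then_fit(X_small, y_small, n_synthetic=200, rng=rng)
predictions = predict(coef, X_small)
```

The sketch only shows the order of operations: fit a generator on the small table, sample synthetic rows, and train the regressor on the combined data. It makes no claim about performance gains. Those depend on the generator, which in the paper is VECT-GAN rather than this Gaussian placeholder.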
Related papers
- Synthesizing Multimodal Electronic Health Records via Predictive Diffusion Models [69.06149482021071]
We propose a novel EHR data generation model called EHRPD.
It is a diffusion-based model designed to predict the next visit based on the current one while also incorporating time interval estimation.
We conduct experiments on two public datasets and evaluate EHRPD from fidelity, privacy, and utility perspectives.
arXiv Detail & Related papers (2024-06-20T02:20:23Z) - Synthetic Data from Diffusion Models Improve Drug Discovery Prediction [1.3686993145787065]
Data sparsity makes data curation difficult for researchers looking to answer key research questions.
We propose a novel diffusion GNN model Syngand capable of generating ligand and pharmacokinetic data end-to-end.
We show initial promising results on the efficacy of the Syngand-generated synthetic target property data on downstream regression tasks with AqSolDB, LD50, and hERG central.
arXiv Detail & Related papers (2024-05-06T19:09:37Z) - Discovering intrinsic multi-compartment pharmacometric models using Physics Informed Neural Networks [0.0]
We introduce PKINNs, a novel purely data-driven neural network model.
PKINNs efficiently discovers and models intrinsic multi-compartment-based pharmacometric structures.
The resulting models are both interpretable and explainable through Symbolic Regression methods.
arXiv Detail & Related papers (2024-04-30T19:31:31Z) - MedDiffusion: Boosting Health Risk Prediction via Diffusion-based Data Augmentation [58.93221876843639]
This paper introduces a novel, end-to-end diffusion-based risk prediction model, named MedDiffusion.
It enhances risk prediction performance by creating synthetic patient data during training to enlarge sample space.
It discerns hidden relationships between patient visits using a step-wise attention mechanism, enabling the model to automatically retain the most vital information for generating high-quality data.
arXiv Detail & Related papers (2023-10-04T01:36:30Z) - From Artificially Real to Real: Leveraging Pseudo Data from Large Language Models for Low-Resource Molecule Discovery [35.5507452011217]
Cross-modal techniques for molecule discovery frequently encounter the issue of data scarcity, hampering their performance and application.
We introduce a retrieval-based prompting strategy to construct high-quality pseudo data, then explore the optimal method to effectively leverage this pseudo data.
Experiments show that using pseudo data for domain adaptation outperforms all existing methods, while also requiring a smaller model scale, reduced data size and lower training cost.
arXiv Detail & Related papers (2023-09-11T02:35:36Z) - Drug Discovery under Covariate Shift with Domain-Informed Prior Distributions over Functions [30.305418761024143]
Real-world drug discovery tasks are often characterized by a scarcity of labeled data and a significant range of data.
We present a principled way to encode explicit prior knowledge of the data-generating process into a prior distribution.
We demonstrate that using Q-SAVI to integrate contextualized prior knowledge, such as chemical space, into the modeling process affords substantial accuracy and calibration.
arXiv Detail & Related papers (2023-07-14T05:01:10Z) - Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z) - Synthesizing Mixed-type Electronic Health Records using Diffusion Models [10.973115905786129]
Synthetic data generation is a promising solution to mitigate privacy concerns when sharing sensitive patient information.
Recent studies have shown that diffusion models offer several advantages over GANs, such as generation of more realistic synthetic data and stable training in generating data modalities, including image, text, and sound.
Our experiments demonstrate that TabDDPM outperforms the state-of-the-art models across all evaluation metrics, except for privacy, which confirms the trade-off between privacy and utility.
arXiv Detail & Related papers (2023-02-28T15:42:30Z) - Drug Synergistic Combinations Predictions via Large-Scale Pre-Training and Graph Structure Learning [82.93806087715507]
Drug combination therapy is a well-established strategy for disease treatment with better effectiveness and less safety degradation.
Deep learning models have emerged as an efficient way to discover synergistic combinations.
Our framework achieves state-of-the-art results in comparison with other deep learning-based methods.
arXiv Detail & Related papers (2023-01-14T15:07:43Z) - Retrieval-based Controllable Molecule Generation [63.44583084888342]
We propose a new retrieval-based framework for controllable molecule generation.
We use a small set of molecules to steer the pre-trained generative model towards synthesizing molecules that satisfy the given design criteria.
Our approach is agnostic to the choice of generative models and requires no task-specific fine-tuning.
arXiv Detail & Related papers (2022-08-23T17:01:16Z) - Towards an Automatic Analysis of CHO-K1 Suspension Growth in Microfluidic Single-cell Cultivation [63.94623495501023]
We propose a novel Machine Learning architecture, which allows us to infuse a deep neural network with human-powered abstraction on the level of data.
Specifically, we train a generative model simultaneously on natural and synthetic data, so that it learns a shared representation, from which a target variable, such as the cell count, can be reliably estimated.
arXiv Detail & Related papers (2020-10-20T08:36:51Z)