Towards Biologically Plausible and Private Gene Expression Data Generation
- URL: http://arxiv.org/abs/2402.04912v1
- Date: Wed, 7 Feb 2024 14:39:11 GMT
- Title: Towards Biologically Plausible and Private Gene Expression Data Generation
- Authors: Dingfan Chen, Marie Oestreich, Tejumade Afonja, Raouf Kerkouche,
Matthias Becker, Mario Fritz
- Abstract summary: Generative models trained with Differential Privacy (DP) are becoming increasingly prominent in the creation of synthetic data for downstream applications.
Existing literature, however, primarily focuses on basic benchmarking datasets and tends to report promising results only for elementary metrics and relatively simple data distributions.
We initiate a systematic analysis of how DP generative models perform in their natural application scenarios, specifically focusing on real-world gene expression data.
- Score: 47.72947816788821
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Generative models trained with Differential Privacy (DP) are becoming
increasingly prominent in the creation of synthetic data for downstream
applications. Existing literature, however, primarily focuses on basic
benchmarking datasets and tends to report promising results only for elementary
metrics and relatively simple data distributions. In this paper, we initiate a
systematic analysis of how DP generative models perform in their natural
application scenarios, specifically focusing on real-world gene expression
data. We conduct a comprehensive analysis of five representative DP generation
methods, examining them from various angles, such as downstream utility,
statistical properties, and biological plausibility. Our extensive evaluation
illuminates the unique characteristics of each DP generation method, offering
critical insights into the strengths and weaknesses of each approach, and
uncovering intriguing possibilities for future developments. Perhaps
surprisingly, our analysis reveals that most methods are capable of achieving
seemingly reasonable downstream utility, according to the standard evaluation
metrics considered in existing literature. Nevertheless, we find that none of
the DP methods are able to accurately capture the biological characteristics of
the real dataset. This observation suggests a potentially over-optimistic
assessment of current methodologies in this field and underscores a pressing
need for future enhancements in model design.
Related papers
- Emerging-properties Mapping Using Spatial Embedding Statistics: EMUSES [0.0]
EMUSES is an innovative approach to create high-dimensional embeddings that reveal latent structures within data.
By bridging the gap between predictive accuracy and interpretability, EMUSES offers researchers a powerful tool to understand the multifactorial origins of complex phenomena.
arXiv Detail & Related papers (2024-06-20T13:39:14Z)
- Synthesizing Multimodal Electronic Health Records via Predictive Diffusion Models [69.06149482021071]
We propose a novel EHR data generation model called EHRPD.
It is a diffusion-based model designed to predict the next visit based on the current one while also incorporating time interval estimation.
We conduct experiments on two public datasets and evaluate EHRPD from fidelity, privacy, and utility perspectives.
arXiv Detail & Related papers (2024-06-20T02:20:23Z)
- GenBench: A Benchmarking Suite for Systematic Evaluation of Genomic Foundation Models [56.63218531256961]
We introduce GenBench, a benchmarking suite specifically tailored for evaluating the efficacy of Genomic Foundation Models.
GenBench offers a modular and expandable framework that encapsulates a variety of state-of-the-art methodologies.
We provide a nuanced analysis of the interplay between model architecture and dataset characteristics on task-specific performance.
arXiv Detail & Related papers (2024-06-01T08:01:05Z)
- Seeing Unseen: Discover Novel Biomedical Concepts via Geometry-Constrained Probabilistic Modeling [53.7117640028211]
We present a geometry-constrained probabilistic modeling treatment to resolve the identified issues.
We incorporate a suite of critical geometric properties to impose proper constraints on the layout of constructed embedding space.
A spectral graph-theoretic method is devised to estimate the number of potential novel classes.
arXiv Detail & Related papers (2024-03-02T00:56:05Z)
- Synthetic location trajectory generation using categorical diffusion models [50.809683239937584]
Diffusion probabilistic models (DPMs) have rapidly evolved to be one of the predominant generative models for the simulation of synthetic data.
We propose using DPMs for the generation of synthetic individual location trajectories (ILTs) which are sequences of variables representing physical locations visited by individuals.
arXiv Detail & Related papers (2024-02-19T15:57:39Z)
- Heterogeneous Transfer Learning for Building High-Dimensional Generalized Linear Models with Disparate Datasets [0.0]
We describe a transfer learning approach for building high-dimensional generalized linear models.
We use data from a main study with detailed information on all predictors and an external, potentially much larger, study that has a more limited set of predictors.
arXiv Detail & Related papers (2023-12-20T06:11:59Z)
- Synthetic data generation for a longitudinal cohort study -- Evaluation, method extension and reproduction of published data analysis results [0.32593385688760446]
In the health sector, access to individual-level data is often challenging due to privacy concerns.
A promising alternative is the generation of fully synthetic data.
In this study, we use a state-of-the-art synthetic data generation method.
arXiv Detail & Related papers (2023-05-12T13:13:55Z)
- Artificial Text Detection via Examining the Topology of Attention Maps [58.46367297712477]
We propose three novel types of interpretable topological features for this task based on Topological Data Analysis (TDA).
We empirically show that the features derived from the BERT model outperform count- and neural-based baselines up to 10% on three common datasets.
The probing analysis of the features reveals their sensitivity to the surface and syntactic properties.
arXiv Detail & Related papers (2021-09-10T12:13:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.