Reproducibility-Oriented and Privacy-Preserving Genomic Dataset Sharing
- URL: http://arxiv.org/abs/2209.06327v5
- Date: Wed, 28 Aug 2024 15:24:07 GMT
- Title: Reproducibility-Oriented and Privacy-Preserving Genomic Dataset Sharing
- Authors: Yuzhou Jiang, Tianxi Ji, Pan Li, Erman Ayday
- Abstract summary: We propose an innovative method that involves a differential privacy-based scheme for sharing genomic datasets.
We show that our proposed scheme outperforms all other methods in detecting GWAS outcome errors, achieves better utility, and provides higher privacy protection against membership inference attacks (MIAs).
By utilizing our method, genomic researchers will be inclined to share a differentially private yet high-quality version of their datasets.
- Score: 8.959228247984337
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although genomic research has become increasingly widespread in recent years, few studies have shared datasets because of privacy concerns about genomic records. This hinders the reproduction and validation of research outcomes, which are crucial for catching errors, e.g., miscalculations, during the research process. To address the reproducibility issue of genome-wide association study (GWAS) outcomes, we propose an innovative method that involves a differential privacy-based scheme for sharing genomic datasets. The proposed scheme involves two stages. In the first stage, we generate a noisy copy of the target dataset by applying an optimized version of a previously proposed XOR mechanism on the binarized (encoded) dataset, where the binary noise generation considers biological features. However, the first stage introduces significant noise, making the dataset less suitable for direct GWAS outcome validation. Thus, in the second stage, we implement a post-processing technique that uses optimal transport to adjust the minor allele frequencies (MAFs) in the noisy dataset so that they align more closely with public MAF information, and then decode the dataset back to genomic space. We evaluate the proposed scheme on three real-life genomic datasets and compare it with a baseline approach (local differential privacy) and two synthesis-based solutions with regard to GWAS outcome validation, data utility, and resistance against membership inference attacks (MIAs). We show that our proposed scheme outperforms all other methods in detecting GWAS outcome errors, achieves better utility, and provides higher privacy protection against MIAs. By utilizing our method, genomic researchers will be inclined to share a differentially private yet high-quality version of their datasets.
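The first stage described in the abstract (binarize, add binary noise by XOR, decode) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and is not taken from the paper: the two-bit encoding, the function names, and especially the noise model, which uses independent bit flips as a simple stand-in for the paper's XOR mechanism and therefore does not reproduce its privacy guarantee. The optimal-transport step of the second stage is not implemented; the sketch only computes the allele frequencies that such a step would adjust.

```python
import numpy as np


def encode(genotypes):
    """Map allele counts 0/1/2 to two bits each (assumed encoding: 0->00, 1->10, 2->11)."""
    g = np.asarray(genotypes)
    return np.stack([g >= 1, g >= 2], axis=-1).astype(np.uint8)


def xor_perturb(bits, flip_prob, rng):
    """XOR the bit array with independent Bernoulli(flip_prob) noise.

    Stand-in only: the paper's mechanism generates its binary noise
    differently (it takes biological features into account).
    """
    noise = (rng.random(bits.shape) < flip_prob).astype(np.uint8)
    return np.bitwise_xor(bits, noise)


def decode(bits):
    """Map each bit pair back to an allele count by counting set bits."""
    return bits.sum(axis=-1).astype(np.int64)


def minor_allele_freq(genotypes):
    """Per-SNP frequency of the counted allele: column mean divided by 2."""
    return np.asarray(genotypes).mean(axis=0) / 2.0


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical toy dataset: 1000 individuals, 5 SNPs.
    data = rng.integers(0, 3, size=(1000, 5))
    noisy = decode(xor_perturb(encode(data), flip_prob=0.1, rng=rng))
    print("original MAFs:", minor_allele_freq(data))
    print("noisy MAFs:   ", minor_allele_freq(noisy))
```

Comparing the two printed frequency vectors shows how bit-level noise shifts the MAFs, which is the distortion the second stage is meant to correct against public MAF information.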
Related papers
- Acoustic Model Optimization over Multiple Data Sources: Merging and Valuation [13.009945735929445]
We propose a novel paradigm to solve salient problems plaguing the Automatic Speech Recognition field.
In the first stage, multiple acoustic models are trained based upon different subsets of the complete speech data.
In the second stage, two novel algorithms are utilized to generate a high-quality acoustic model.
arXiv Detail & Related papers (2024-10-21T03:48:23Z)
- PP-GWAS: Privacy Preserving Multi-Site Genome-wide Association Studies [2.516577526761521]
We present a novel algorithm PP-GWAS designed to improve upon existing standards in terms of computational efficiency and scalability without sacrificing data privacy.
Experimental evaluation with real world and synthetic data indicates that PP-GWAS can achieve computational speeds twice as fast as similar state-of-the-art algorithms.
We have assessed its performance using various datasets, emphasizing its potential in facilitating more efficient and private genomic analyses.
arXiv Detail & Related papers (2024-10-10T17:07:57Z)
- Reward-Augmented Data Enhances Direct Preference Alignment of LLMs [63.32585910975191]
We introduce reward-conditioned Large Language Models (LLMs) that learn from the entire spectrum of response quality within the dataset.
We propose an effective yet simple data relabeling method that conditions the preference pairs on quality scores to construct a reward-augmented dataset.
arXiv Detail & Related papers (2024-10-10T16:01:51Z)
- Mitigating the Privacy Issues in Retrieval-Augmented Generation (RAG) via Pure Synthetic Data [51.41288763521186]
Retrieval-augmented generation (RAG) enhances the outputs of language models by integrating relevant information retrieved from external knowledge sources.
RAG systems may face severe privacy risks when retrieving private data.
We propose using synthetic data as a privacy-preserving alternative for the retrieval data.
arXiv Detail & Related papers (2024-06-20T22:53:09Z)
- GenBench: A Benchmarking Suite for Systematic Evaluation of Genomic Foundation Models [56.63218531256961]
We introduce GenBench, a benchmarking suite specifically tailored for evaluating the efficacy of Genomic Foundation Models.
GenBench offers a modular and expandable framework that encapsulates a variety of state-of-the-art methodologies.
We provide a nuanced analysis of the interplay between model architecture and dataset characteristics on task-specific performance.
arXiv Detail & Related papers (2024-06-01T08:01:05Z)
- Private Set Generation with Discriminative Information [63.851085173614]
Differentially private data generation is a promising solution to the data privacy challenge.
Existing private generative models are struggling with the utility of synthetic samples.
We introduce a simple yet effective method that greatly improves the sample utility of state-of-the-art approaches.
arXiv Detail & Related papers (2022-11-07T10:02:55Z)
- Delving into High-Quality Synthetic Face Occlusion Segmentation Datasets [83.749895930242]
We propose two techniques for producing high-quality naturalistic synthetic occluded faces.
We empirically show the effectiveness and robustness of both methods, even for unseen occlusions.
We present two high-resolution real-world occluded face datasets with fine-grained annotations, RealOcc and RealOcc-Wild.
arXiv Detail & Related papers (2022-05-12T17:03:57Z)
- DRFLM: Distributionally Robust Federated Learning with Inter-client Noise via Local Mixup [58.894901088797376]
Federated learning has emerged as a promising approach for training a global model using data from multiple organizations without leaking their raw data.
We propose a general framework to solve the above two challenges simultaneously.
We provide comprehensive theoretical analysis including robustness analysis, convergence analysis, and generalization ability.
arXiv Detail & Related papers (2022-04-16T08:08:29Z)
- Non-stationary Gaussian process discriminant analysis with variable selection for high-dimensional functional data [0.0]
High-dimensional classification and feature selection are ubiquitous with the recent advancement in data acquisition technology.
These structures pose additional challenges to commonly used methods that rely mainly on a two-stage approach performing variable selection and classification separately.
We propose in this work a novel Gaussian process discriminant analysis (GPDA) that combines these steps in a unified framework.
arXiv Detail & Related papers (2021-09-29T03:35:49Z)
- Iterative Methods for Private Synthetic Data: Unifying Framework and New Methods [18.317488965846636]
We study private synthetic data generation for query release.
The goal is to construct a sanitized version of a sensitive dataset subject to differential privacy.
Under this framework, we propose two new methods.
arXiv Detail & Related papers (2021-06-14T04:19:35Z)