Private DNA Sequencing: Hiding Information in Discrete Noise
- URL: http://arxiv.org/abs/2101.12124v2
- Date: Mon, 04 Nov 2024 02:05:29 GMT
- Title: Private DNA Sequencing: Hiding Information in Discrete Noise
- Authors: Kayvon Mazooji, Roy Dong, Ilan Shomorony,
- Abstract summary: We study the problem of hiding a binary random variable $X$ with the additive noise provided by mixing DNA samples.
We characterize upper and lower bounds to the solution of this problem, which are empirically shown to be very close.
- Score: 6.647959476396793
- License:
- Abstract: When an individual's DNA is sequenced, sensitive medical information becomes available to the sequencing laboratory. A recently proposed way to hide an individual's genetic information is to mix in DNA samples of other individuals. We assume that the genetic content of these samples is known to the individual but unknown to the sequencing laboratory. Thus, these DNA samples act as "noise" to the sequencing laboratory, but still allow the individual to recover their own DNA samples afterward. Motivated by this idea, we study the problem of hiding a binary random variable $X$ (a genetic marker) with the additive noise provided by mixing DNA samples, using mutual information as a privacy metric. This is equivalent to the problem of finding a worst-case noise distribution for recovering $X$ from the noisy observation among a set of feasible discrete distributions. We characterize upper and lower bounds to the solution of this problem, which are empirically shown to be very close. The lower bound is obtained through a convex relaxation of the original discrete optimization problem, and yields a closed-form expression. The upper bound is computed via a greedy algorithm for selecting the mixing proportions.
Related papers
- Estimating Unknown Population Sizes Using the Hypergeometric Distribution [1.03590082373586]
We tackle the challenge of estimating discrete distributions when both the total population size and the sizes of its constituent categories are unknown.
We develop our approach to account for a data generating process where the ground-truth is a mixture of distributions conditional on a continuous latent variable.
Empirical data simulation demonstrates that our method outperforms other likelihood functions used to model count data.
arXiv Detail & Related papers (2024-02-22T01:53:56Z) - StyleGenes: Discrete and Efficient Latent Distributions for GANs [149.0290830305808]
We propose a discrete latent distribution for Generative Adversarial Networks (GANs)
Instead of drawing latent vectors from a continuous prior, we sample from a finite set of learnable latents.
We take inspiration from the encoding of information in biological organisms.
arXiv Detail & Related papers (2023-04-30T23:28:46Z) - How Does Pseudo-Labeling Affect the Generalization Error of the
Semi-Supervised Gibbs Algorithm? [73.80001705134147]
We provide an exact characterization of the expected generalization error (gen-error) for semi-supervised learning (SSL) with pseudo-labeling via the Gibbs algorithm.
The gen-error is expressed in terms of the symmetrized KL information between the output hypothesis, the pseudo-labeled dataset, and the labeled dataset.
arXiv Detail & Related papers (2022-10-15T04:11:56Z) - Kernel Density Estimation by Genetic Algorithm [0.0]
Genetic algorithm generates multiple subsamples of a given size with replacement from the original sample.
dominant subsamples in terms of fitness values are inherited by the next generation.
arXiv Detail & Related papers (2022-03-03T06:16:18Z) - Saliency Grafting: Innocuous Attribution-Guided Mixup with Calibrated
Label Mixing [104.630875328668]
Mixup scheme suggests mixing a pair of samples to create an augmented training sample.
We present a novel, yet simple Mixup-variant that captures the best of both worlds.
arXiv Detail & Related papers (2021-12-16T11:27:48Z) - MURAL: An Unsupervised Random Forest-Based Embedding for Electronic
Health Record Data [59.26381272149325]
We present an unsupervised random forest for representing data with disparate variable types.
MURAL forests consist of a set of decision trees where node-splitting variables are chosen at random.
We show that using our approach, we can visualize and classify data more accurately than competing approaches.
arXiv Detail & Related papers (2021-11-19T22:02:21Z) - Hiding Among the Clones: A Simple and Nearly Optimal Analysis of Privacy
Amplification by Shuffling [49.43288037509783]
We show that random shuffling amplifies differential privacy guarantees of locally randomized data.
Our result is based on a new approach that is simpler than previous work and extends to approximate differential privacy with nearly the same guarantees.
arXiv Detail & Related papers (2020-12-23T17:07:26Z) - DNA mixture deconvolution using an evolutionary algorithm with multiple
populations, hill-climbing, and guided mutation [0.8029049649310211]
DNA samples crime cases analysed in forensic genetics frequently contain DNA from multiple contributors.
In cases where one or more of the contributors were unknown, an objective of interest would be the separation, often called deconvolution, of these unknown profiles.
We introduced a multiple population evolutionary algorithm (MEA) to obtain deconvolutions of the unknown DNA profiles.
arXiv Detail & Related papers (2020-12-01T14:23:55Z) - RDP-GAN: A R\'enyi-Differential Privacy based Generative Adversarial
Network [75.81653258081435]
Generative adversarial network (GAN) has attracted increasing attention recently owing to its impressive ability to generate realistic samples with high privacy protection.
However, when GANs are applied on sensitive or private training examples, such as medical or financial records, it is still probable to divulge individuals' sensitive and private information.
We propose a R'enyi-differentially private-GAN (RDP-GAN), which achieves differential privacy (DP) in a GAN by carefully adding random noises on the value of the loss function during training.
arXiv Detail & Related papers (2020-07-04T09:51:02Z) - The Discrete Gaussian for Differential Privacy [23.977143445822897]
A key tool for building differentially private systems is adding Gaussian noise to the output of a function evaluated on a sensitive dataset.
Previous work has demonstrated that seemingly innocuous numerical errors can entirely destroy privacy.
We introduce and analyze the discrete Gaussian in the context of differential privacy.
arXiv Detail & Related papers (2020-03-31T18:00:00Z) - Minimax optimal goodness-of-fit testing for densities and multinomials
under a local differential privacy constraint [3.265773263570237]
We consider the consequences of local differential privacy constraints on goodness-of-fit testing.
We present a test that is adaptive to the smoothness parameter of the unknown density and remains minimax optimal up to a logarithmic factor.
arXiv Detail & Related papers (2020-02-11T08:41:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.