Quantifying Sample Anonymity in Score-Based Generative Models with
Adversarial Fingerprinting
- URL: http://arxiv.org/abs/2306.01363v1
- Date: Fri, 2 Jun 2023 08:37:38 GMT
- Title: Quantifying Sample Anonymity in Score-Based Generative Models with
Adversarial Fingerprinting
- Authors: Mischa Dombrowski and Bernhard Kainz
- Abstract summary: Training diffusion models on private data and disseminating the models and weights rather than the raw dataset paves the way for innovative large-scale data-sharing strategies.
This paper introduces a method for estimating the upper bound of the probability of reproducing identifiable training images during the sampling process.
Our results show that privacy-breaching images are reproduced at sampling time if the models were trained without care.
- Score: 3.8933108317492167
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in score-based generative models have led to a
surge in the development of downstream applications, ranging from data
augmentation and image and video generation to anomaly detection. Despite the
public availability of trained models, their potential for privacy-preserving
data sharing has not yet been fully explored. Training diffusion
models on private data and disseminating the models and weights rather than the
raw dataset paves the way for innovative large-scale data-sharing strategies,
particularly in healthcare, where safeguarding patients' personal health
information is paramount. However, publishing such models without individual
consent of, e.g., the patients from whom the data was acquired, necessitates
guarantees that identifiable training samples will never be reproduced, thus
protecting personal health data and satisfying the requirements of policymakers
and regulatory bodies. This paper introduces a method for estimating the upper
bound of the probability of reproducing identifiable training images during the
sampling process. This is achieved by designing an adversarial approach that
searches for anatomic fingerprints, such as medical devices or dermal art,
which could potentially be employed to re-identify training images. Our method
harnesses the learned score-based model to estimate the probability of the
entire subspace of the score function that may be utilized for one-to-one
reproduction of training samples. To validate our estimates, we generate
anomalies containing a fingerprint and investigate whether generated samples
from trained generative models can be uniquely mapped to the original training
samples. Overall, our results show that privacy-breaching images are reproduced
at sampling time if the models were trained without care.
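The validation described in the abstract, generating samples and checking whether any can be uniquely mapped back to a fingerprinted training image, can be approximated empirically. The Python sketch below is illustrative only and is not the authors' code; `sample_fn`, `embed_fn`, the distance threshold, and the sample count are hypothetical placeholders.

```python
# Minimal illustrative sketch (not the paper's implementation): estimate how
# often a trained generative model reproduces fingerprinted training images
# by nearest-neighbour matching in an embedding space.
import numpy as np

def estimate_reproduction_rate(sample_fn, embed_fn, train_embeddings,
                               n_samples=1000, threshold=0.05):
    """Fraction of generated samples whose nearest fingerprinted training
    embedding lies within `threshold`, a crude empirical proxy for the
    probability of one-to-one reproduction at sampling time."""
    matches = 0
    for _ in range(n_samples):
        image = sample_fn()              # draw one sample from the trained model
        z = embed_fn(image)              # map it to the embedding space
        dists = np.linalg.norm(train_embeddings - z, axis=1)
        if dists.min() < threshold:      # close enough to count as a re-identification
            matches += 1
    return matches / n_samples
```

In practice the threshold would need to be calibrated, for instance against distances between known duplicate or near-duplicate image pairs.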
Related papers
- Towards Reliable Verification of Unauthorized Data Usage in Personalized Text-to-Image Diffusion Models [23.09033991200197]
New personalization techniques have been proposed to customize pre-trained base models for crafting images with specific themes or styles.
Such lightweight solutions raise a new concern: whether the personalized models were trained on unauthorized data.
We introduce SIREN, a novel methodology to proactively trace unauthorized data usage in black-box personalized text-to-image diffusion models.
arXiv Detail & Related papers (2024-10-14T12:29:23Z)
- Training Data Attribution: Was Your Model Secretly Trained On Data Created By Mine? [17.714589429503675]
We propose an injection-free training data attribution method for text-to-image models.
Our approach involves developing algorithms to uncover distinct samples and using them as inherent watermarks.
Our experiments demonstrate that our method achieves an accuracy of over 80% in identifying the source of a suspicious model's training data.
arXiv Detail & Related papers (2024-09-24T06:23:43Z)
- Synthesizing Multimodal Electronic Health Records via Predictive Diffusion Models [69.06149482021071]
We propose a novel EHR data generation model called EHRPD.
It is a diffusion-based model designed to predict the next visit based on the current one while also incorporating time interval estimation.
We conduct experiments on two public datasets and evaluate EHRPD from fidelity, privacy, and utility perspectives.
arXiv Detail & Related papers (2024-06-20T02:20:23Z)
- MedDiffusion: Boosting Health Risk Prediction via Diffusion-based Data Augmentation [58.93221876843639]
This paper introduces a novel, end-to-end diffusion-based risk prediction model, named MedDiffusion.
It enhances risk prediction performance by creating synthetic patient data during training to enlarge the sample space.
It discerns hidden relationships between patient visits using a step-wise attention mechanism, enabling the model to automatically retain the most vital information for generating high-quality data.
arXiv Detail & Related papers (2023-10-04T01:36:30Z)
- Investigating Data Memorization in 3D Latent Diffusion Models for Medical Image Synthesis [0.6382686594288781]
We assess the memorization capacity of 3D latent diffusion models on photon-counting coronary computed tomography angiography and knee magnetic resonance imaging datasets.
Our results suggest that such latent diffusion models indeed memorize training data, and there is a dire need for devising strategies to mitigate memorization.
arXiv Detail & Related papers (2023-07-03T16:39:28Z)
- Private Gradient Estimation is Useful for Generative Modeling [25.777591229903596]
We present a new private generative modeling approach where samples are generated via Hamiltonian dynamics with gradients of the private dataset estimated by a well-trained network.
Our model is able to generate data with a resolution of 256x256.
arXiv Detail & Related papers (2023-05-18T02:51:17Z)
- Reconstructing Training Data from Model Gradient, Provably [68.21082086264555]
We reconstruct the training samples from a single gradient query at a randomly chosen parameter value.
Our findings demonstrate a provable attack that reveals sensitive training data, suggesting potentially severe threats to privacy (a generic gradient-matching sketch follows after this list).
arXiv Detail & Related papers (2022-12-07T15:32:22Z)
- Synthetic Model Combination: An Instance-wise Approach to Unsupervised Ensemble Learning [92.89846887298852]
Consider making a prediction over new test data without any opportunity to learn from a training set of labelled data.
Instead, access is given to a set of expert models and their predictions, alongside some limited information about the dataset used to train them.
arXiv Detail & Related papers (2022-10-11T10:20:31Z)
- Leveraging Adversarial Examples to Quantify Membership Information Leakage [30.55736840515317]
We develop a novel approach to address the problem of membership inference in pattern recognition models, based on the distance of a sample to the model's decision boundary as estimated with adversarial examples.
We argue that this quantity reflects the likelihood of the sample belonging to the training data (a minimal sketch follows after this list).
Our method performs comparably to, or even outperforms, state-of-the-art strategies.
arXiv Detail & Related papers (2022-03-17T19:09:38Z)
- Select-ProtoNet: Learning to Select for Few-Shot Disease Subtype Prediction [55.94378672172967]
We focus on the few-shot disease subtype prediction problem, identifying subgroups of similar patients.
We introduce meta learning techniques to develop a new model, which can extract the common experience or knowledge from interrelated clinical tasks.
Our new model is built upon a carefully designed meta-learner, the Prototypical Network, a simple yet effective meta-learning machine for few-shot image classification.
arXiv Detail & Related papers (2020-09-02T02:50:30Z)
- Automatic Recall Machines: Internal Replay, Continual Learning and the Brain [104.38824285741248]
Replay in neural networks involves training on sequential data with memorized samples, which counteracts forgetting of previous behavior caused by non-stationarity.
We present a method where these auxiliary samples are generated on the fly, given only the model that is being trained for the assessed objective.
Instead, the implicit memory of learned samples within the assessed model itself is exploited.
arXiv Detail & Related papers (2020-06-22T15:07:06Z)
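For the entry "Reconstructing Training Data from Model Gradient, Provably" above: that paper gives a provable reconstruction, so the sketch below only illustrates the general threat model with the common gradient-matching formulation, not the paper's method. The function name and hyper-parameters are hypothetical placeholders.

```python
# Minimal illustrative sketch of gradient-matching reconstruction: optimise a
# dummy input so that its gradient on a known model matches an observed
# gradient (e.g. one shared during federated training).
import torch

def reconstruct_from_gradient(model, loss_fn, observed_grads, x_shape, y,
                              steps=500, lr=0.1):
    """Return a reconstructed input whose gradient matches `observed_grads`."""
    x_dummy = torch.randn(x_shape, requires_grad=True)   # random initial guess
    optimizer = torch.optim.Adam([x_dummy], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = loss_fn(model(x_dummy), y)
        dummy_grads = torch.autograd.grad(loss, model.parameters(),
                                          create_graph=True)
        # squared distance between the dummy gradient and the observed gradient
        grad_diff = sum(((dg - og) ** 2).sum()
                        for dg, og in zip(dummy_grads, observed_grads))
        grad_diff.backward()
        optimizer.step()
    return x_dummy.detach()
```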
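For the entry "Leveraging Adversarial Examples to Quantify Membership Information Leakage" above, the following sketch illustrates the underlying idea under simplified assumptions: training members tend to lie farther from the decision boundary, so the norm of the smallest perturbation that flips the prediction can serve as a membership score. The attack loop, step size, and iteration budget below are simplifications, not the paper's exact estimator.

```python
# Minimal illustrative sketch of an adversarial-example-based membership score:
# the L2 norm of the perturbation at which the model's prediction first flips.
import torch
import torch.nn.functional as F

def boundary_distance_score(model, x, step=0.01, max_iters=100):
    """`x` is a single input with batch dimension 1. Returns the perturbation
    norm when the prediction first flips (or the final norm if it never does);
    larger values suggest the sample is more likely a training member."""
    model.eval()
    x = x.detach()
    target = model(x).argmax(dim=1)                  # original predicted class
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(max_iters):
        logits = model(x + delta)
        if logits.argmax(dim=1).item() != target.item():   # prediction flipped
            break
        loss = F.cross_entropy(logits, target)       # push away from original class
        loss.backward()
        with torch.no_grad():
            delta += step * delta.grad / (delta.grad.norm() + 1e-12)
        delta.grad.zero_()
    return delta.detach().norm().item()
```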