Extracting Training Data from Unconditional Diffusion Models
- URL: http://arxiv.org/abs/2410.02467v5
- Date: Thu, 28 Nov 2024 10:54:10 GMT
- Title: Extracting Training Data from Unconditional Diffusion Models
- Authors: Yunhao Chen, Shujie Wang, Difan Zou, Xingjun Ma
- Abstract summary: Diffusion probabilistic models (DPMs) are being employed as mainstream models for Generative Artificial Intelligence (GenAI).
We propose a novel data extraction method named Surrogate condItional Data Extraction (SIDE) that leverages a time-dependent classifier trained on generated data as surrogate conditions to extract training data from unconditional DPMs.
- Score: 32.18993348942877
- License:
- Abstract: As diffusion probabilistic models (DPMs) are being employed as mainstream models for Generative Artificial Intelligence (GenAI), the study of their memorization has attracted growing attention. Existing works in this field aim to establish an understanding of whether, or to what extent, DPMs learn via memorization. Such an understanding is crucial for identifying potential risks of data leakage and copyright infringement in diffusion models and, more importantly, for the trustworthy application of GenAI. Existing works have revealed that conditional DPMs are more prone to memorizing training data than unconditional DPMs, and most data extraction methods developed so far target conditional DPMs. Although unconditional DPMs are less prone to data extraction, further investigation into these attacks remains essential, since unconditional models serve as the foundation for conditional models like Stable Diffusion, and exploring these attacks deepens our understanding of memorization in DPMs. In this work, we propose a novel data extraction method named Surrogate condItional Data Extraction (SIDE) that leverages a time-dependent classifier trained on generated data as surrogate conditions to extract training data from unconditional DPMs. Empirical results demonstrate that it can extract training data in challenging scenarios where previous methods fail, and that it is, on average, over 50% more effective across different scales of the CelebA dataset. Furthermore, we provide a theoretical understanding of memorization in both conditional and unconditional DPMs, and of why SIDE is effective.
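The core mechanism described in the abstract, a time-dependent classifier fitted to the model's own generations and then used as a surrogate condition to guide sampling, can be illustrated with a short sketch. The following is a minimal, hypothetical PyTorch rendering of that idea as stated above, not the authors' implementation; the interfaces `eps_model` (noise predictor) and `noise_fn` (forward diffusion), the pseudo-labeling step, and all hyperparameters are assumptions.

```python
# Minimal sketch of the SIDE idea, assuming a DDPM-style unconditional model.
# All interfaces and hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TimeDependentClassifier(nn.Module):
    """Surrogate-condition classifier p(y | x_t, t) over noisy inputs."""
    def __init__(self, in_dim: int, n_classes: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim + 1, 256), nn.SiLU(),
            nn.Linear(256, n_classes),
        )

    def forward(self, x_t: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # Append a normalized timestep so the classifier is time-aware.
        return self.net(torch.cat([x_t, t[:, None].float() / 1000.0], dim=1))

def train_surrogate_classifier(clf, gen_samples, pseudo_labels, noise_fn,
                               steps=1000, batch=64, T=1000):
    """Train the classifier on *generated* samples with pseudo-labels
    (e.g., from k-means clustering), noised to random timesteps."""
    opt = torch.optim.Adam(clf.parameters(), lr=1e-4)
    for _ in range(steps):
        idx = torch.randint(0, gen_samples.size(0), (batch,))
        x0, y = gen_samples[idx], pseudo_labels[idx]
        t = torch.randint(0, T, (batch,))
        x_t = noise_fn(x0, t)  # forward-diffuse x0 to step t (placeholder)
        loss = F.cross_entropy(clf(x_t, t), y)
        opt.zero_grad(); loss.backward(); opt.step()

@torch.no_grad()
def guided_eps(x_t, t, y, eps_model, clf, sigma_t, scale=1.0):
    """One guidance step: shift the predicted noise by the classifier
    gradient, so the pseudo-class acts as a surrogate condition."""
    with torch.enable_grad():
        x_in = x_t.detach().requires_grad_(True)
        log_p = F.log_softmax(clf(x_in, t), dim=1)
        sel = log_p[torch.arange(len(y)), y].sum()
        grad = torch.autograd.grad(sel, x_in)[0]
    # Classifier guidance applied in the epsilon parameterization.
    return eps_model(x_t, t) - scale * sigma_t * grad
```

A plausible end-to-end loop would then cluster unconditionally generated samples into pseudo-classes, train the classifier on their noised versions, run guided sampling for each pseudo-class, and flag outputs that are near-duplicates of training images.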
Related papers
- Beyond Efficiency: Molecular Data Pruning for Enhanced Generalization [30.738229850748137]
MolPeg is a Molecular data Pruning framework for enhanced Generalization.
It focuses on the source-free data pruning scenario, where data pruning is applied with pretrained models.
It consistently outperforms existing data pruning (DP) methods across four downstream tasks.
arXiv Detail & Related papers (2024-09-02T09:06:04Z)
- Extracting Training Data from Unconditional Diffusion Models [76.85077961718875]
Diffusion probabilistic models (DPMs) are being employed as mainstream models for generative artificial intelligence (AI).
We aim to establish a theoretical understanding of memorization in DPMs with 1) a memorization metric for theoretical analysis, 2) an analysis of conditional memorization with informative and random labels, and 3) two better evaluation metrics for measuring memorization (a generic sketch of such a nearest-neighbor evaluation appears after this list).
Based on the theoretical analysis, we propose a novel data extraction method called Surrogate condItional Data Extraction (SIDE) that leverages a classifier trained on generated data as a surrogate condition to extract training data directly from unconditional diffusion models.
arXiv Detail & Related papers (2024-06-18T16:20:12Z)
- ArSDM: Colonoscopy Images Synthesis with Adaptive Refinement Semantic Diffusion Models [69.9178140563928]
Colonoscopy analysis is essential for assisting clinical diagnosis and treatment.
The scarcity of annotated data limits the effectiveness and generalization of existing methods.
We propose an Adaptive Refinement Semantic Diffusion Model (ArSDM) to generate colonoscopy images that benefit downstream tasks.
arXiv Detail & Related papers (2023-09-03T07:55:46Z)
- Diffusion Model as Representation Learner [86.09969334071478]
Diffusion Probabilistic Models (DPMs) have recently demonstrated impressive results on various generative tasks.
We propose a novel knowledge transfer method that leverages the knowledge acquired by DPMs for recognition tasks.
arXiv Detail & Related papers (2023-08-21T00:38:39Z)
- Synthetic Health-related Longitudinal Data with Mixed-type Variables Generated using Diffusion Models [2.140861702387444]
This paper presents a novel approach to simulating electronic health records (EHRs) using diffusion probabilistic models (DPMs).
We demonstrate the effectiveness of DPMs in synthesising longitudinal EHRs that capture mixed-type variables, including numeric, binary, and categorical variables.
arXiv Detail & Related papers (2023-03-22T03:15:33Z)
- On Calibrating Diffusion Probabilistic Models [78.75538484265292]
Diffusion probabilistic models (DPMs) have achieved promising results in diverse generative tasks.
We propose a simple way for calibrating an arbitrary pretrained DPM, with which the score matching loss can be reduced and the lower bounds of model likelihood can be increased.
Our calibration method is performed only once and the resulting models can be used repeatedly for sampling.
arXiv Detail & Related papers (2023-02-21T14:14:40Z)
- DisDiff: Unsupervised Disentanglement of Diffusion Probabilistic Models [42.58375679841317]
We propose a new task: disentanglement of Diffusion Probabilistic Models (DPMs).
The task is to automatically discover the inherent factors behind the observations and disentangle the gradient field of a DPM into sub-gradient fields.
We devise an unsupervised approach named DisDiff, achieving disentangled representation learning in the framework of DPMs.
arXiv Detail & Related papers (2023-01-31T15:58:32Z)
- SSM-DTA: Breaking the Barriers of Data Scarcity in Drug-Target Affinity Prediction [127.43571146741984]
Drug-Target Affinity (DTA) is of vital importance in early-stage drug discovery.
Wet-lab experiments remain the most reliable method, but they are time-consuming and resource-intensive.
Existing methods have primarily focused on developing techniques based on the available DTA data, without adequately addressing the data scarcity issue.
We present the SSM-DTA framework, which incorporates three simple yet highly effective strategies.
arXiv Detail & Related papers (2022-06-20T14:53:25Z)
- Prompting to Distill: Boosting Data-Free Knowledge Distillation via Reinforced Prompt [52.6946016535059]
Data-free knowledge distillation (DFKD) conducts knowledge distillation by eliminating the dependence on original training data.
We propose a prompt-based method, termed PromptDFD, that allows us to take advantage of learned language priors.
As shown in our experiments, the proposed method substantially improves the synthesis quality and achieves considerable improvements on distillation performance.
arXiv Detail & Related papers (2022-05-16T08:56:53Z)
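Several entries above, including the main paper, evaluate memorization by checking whether generated or extracted samples are near-duplicates of training images. As a generic illustration only (the papers define their own metrics, which may differ), here is a minimal nearest-neighbor hit-rate check in PyTorch; the distance, normalization, and threshold are assumptions.

```python
# Generic nearest-neighbor check of extraction success; illustrative only.
import torch

def extraction_hit_rate(extracted: torch.Tensor, train: torch.Tensor,
                        threshold: float) -> float:
    """Fraction of extracted samples whose nearest training image lies
    within `threshold` (dimension-normalized squared L2)."""
    d = torch.cdist(extracted, train).pow(2) / extracted.size(1)
    return (d.min(dim=1).values < threshold).float().mean().item()

# Usage with placeholder tensors standing in for flattened images.
ex = torch.rand(32, 3 * 64 * 64)   # "extracted" samples (placeholder)
tr = torch.rand(500, 3 * 64 * 64)  # training subset (placeholder)
print(f"hit rate: {extraction_hit_rate(ex, tr, threshold=0.05):.2%}")
```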