Related papers: Towards a Theoretical Understanding of Memorization in Diffusion Models

Towards a Theoretical Understanding of Memorization in Diffusion Models

URL: http://arxiv.org/abs/2410.02467v4
Date: Sun, 13 Oct 2024 16:51:04 GMT
Title: Towards a Theoretical Understanding of Memorization in Diffusion Models
Authors: Yunhao Chen, Xingjun Ma, Difan Zou, Yu-Gang Jiang,
Abstract summary: Diffusion probabilistic models (DPMs) are being employed as mainstream models for Generative Artificial Intelligence (GenAI) We provide a theoretical understanding of memorization in both conditional and unconditional DPMs under the assumption of model convergence. We propose a novel data extraction method named textbfSurrogate condItional Data Extraction (SIDE) that leverages a time-dependent classifier trained on the generated data as a surrogate condition to extract training data from unconditional DPMs.
Score: 76.85077961718875
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: As diffusion probabilistic models (DPMs) are being employed as mainstream models for Generative Artificial Intelligence (GenAI), the study of their memorization of training data has attracted growing attention. Existing works in this direction aim to establish an understanding of whether or to what extent DPMs learn via memorization. Such an understanding is crucial for identifying potential risks of data leakage and copyright infringement in diffusion models and, more importantly, for trustworthy application of GenAI. Existing works revealed that conditional DPMs are more prone to training data memorization than unconditional DPMs, and the motivated data extraction methods are mostly for conditional DPMs. However, these understandings are primarily empirical, and extracting training data from unconditional models has been found to be extremely challenging. In this work, we provide a theoretical understanding of memorization in both conditional and unconditional DPMs under the assumption of model convergence. Our theoretical analysis indicates that extracting data from unconditional models can also be effective by constructing a proper surrogate condition. Based on this result, we propose a novel data extraction method named \textbf{Surrogate condItional Data Extraction (SIDE)} that leverages a time-dependent classifier trained on the generated data as a surrogate condition to extract training data from unconditional DPMs. Empirical results demonstrate that our SIDE can extract training data in challenging scenarios where previous methods fail, and it is, on average, over 50\% more effective across different scales of the CelebA dataset.

Related papers

MISLEADER: Defending against Model Extraction with Ensembles of Distilled Models [56.09354775405601]
Model extraction attacks aim to replicate the functionality of a black-box model through query access.<n>Most existing defenses presume that attacker queries have out-of-distribution (OOD) samples, enabling them to detect and disrupt suspicious inputs.<n>We propose MISLEADER, a novel defense strategy that does not rely on OOD assumptions.
arXiv Detail & Related papers (2025-06-03T01:37:09Z)
Variance-Based Defense Against Blended Backdoor Attacks [0.0]
Backdoor attacks represent a subtle yet effective class of cyberattacks targeting AI models.<n>We propose a novel defense method that trains a model on the given dataset, detects poisoned classes, and extracts the critical part of the attack trigger.
arXiv Detail & Related papers (2025-06-02T09:01:35Z)
No Query, No Access [50.18709429731724]
We introduce the textbfVictim Data-based Adrial Attack (VDBA), which operates using only victim texts.<n>To prevent access to the victim model, we create a shadow dataset with publicly available pre-trained models and clustering methods.<n>Experiments on the Emotion and SST5 datasets show that VDBA outperforms state-of-the-art methods, achieving an ASR improvement of 52.08%.
arXiv Detail & Related papers (2025-05-12T06:19:59Z)
Improved Diffusion-based Generative Model with Better Adversarial Robustness [65.38540020916432]
Diffusion Probabilistic Models (DPMs) have achieved significant success in generative tasks. During the denoising process, the input data distributions differ between the training and inference stages.
arXiv Detail & Related papers (2025-02-24T12:29:16Z)
Transferable Adversarial Attacks on SAM and Its Downstream Models [87.23908485521439]
This paper explores the feasibility of adversarial attacking various downstream models fine-tuned from the segment anything model (SAM)<n>To enhance the effectiveness of the adversarial attack towards models fine-tuned on unknown datasets, we propose a universal meta-initialization (UMI) algorithm.
arXiv Detail & Related papers (2024-10-26T15:04:04Z)
Beyond Efficiency: Molecular Data Pruning for Enhanced Generalization [30.738229850748137]
MolPeg is a Molecular data Pruning framework for enhanced Generalization. It focuses on the source-free data pruning scenario, where data pruning is applied with pretrained models. It consistently outperforms existing DP methods across four downstream tasks.
arXiv Detail & Related papers (2024-09-02T09:06:04Z)
Extracting Training Data from Unconditional Diffusion Models [76.85077961718875]
diffusion probabilistic models (DPMs) are being employed as mainstream models for generative artificial intelligence (AI) We aim to establish a theoretical understanding of memorization in DPMs with 1) a memorization metric for theoretical analysis, 2) an analysis of conditional memorization with informative and random labels, and 3) two better evaluation metrics for measuring memorization. Based on the theoretical analysis, we propose a novel data extraction method called textbfSurrogate condItional Data Extraction (SIDE) that leverages a trained on generated data as a surrogate condition to extract training data directly from unconditional diffusion models.
arXiv Detail & Related papers (2024-06-18T16:20:12Z)
Self-Consistency Training for Density-Functional-Theory Hamiltonian Prediction [74.84850523400873]
We show that Hamiltonian prediction possesses a self-consistency principle, based on which we propose self-consistency training. It enables the model to be trained on a large amount of unlabeled data, hence addresses the data scarcity challenge. It is more efficient than running DFT to generate labels for supervised training, since it amortizes DFT calculation over a set of queries.
arXiv Detail & Related papers (2024-03-14T16:52:57Z)
Model Stealing Attack against Graph Classification with Authenticity, Uncertainty and Diversity [80.16488817177182]
GNNs are vulnerable to the model stealing attack, a nefarious endeavor geared towards duplicating the target model via query permissions. We introduce three model stealing attacks to adapt to different actual scenarios.
arXiv Detail & Related papers (2023-12-18T05:42:31Z)
Intriguing Properties of Data Attribution on Diffusion Models [33.77847454043439]
Data attribution seeks to trace desired outputs back to training data. Data attribution has become a module to properly assign for high-intuitive or copyrighted data.
arXiv Detail & Related papers (2023-11-01T13:00:46Z)
SCME: A Self-Contrastive Method for Data-free and Query-Limited Model Extraction Attack [18.998300969035885]
Model extraction attacks fool the target model by generating adversarial examples on a substitute model. We propose a novel data-free model extraction method named SCME, which considers both the inter- and intra-class diversity in synthesizing fake data.
arXiv Detail & Related papers (2023-10-15T10:41:45Z)
On Memorization in Diffusion Models [46.656797890144105]
We show that memorization behaviors tend to occur on smaller-sized datasets. We quantify the impact of the influential factors on these memorization behaviors in terms of effective model memorization (EMM) Our study holds practical significance for diffusion model users and offers clues to theoretical research in deep generative models.
arXiv Detail & Related papers (2023-10-04T09:04:20Z)
Diffusion Model as Representation Learner [86.09969334071478]
Diffusion Probabilistic Models (DPMs) have recently demonstrated impressive results on various generative tasks. We propose a novel knowledge transfer method that leverages the knowledge acquired by DPMs for recognition tasks.
arXiv Detail & Related papers (2023-08-21T00:38:39Z)
Deep Generative Modeling-based Data Augmentation with Demonstration using the BFBT Benchmark Void Fraction Datasets [3.341975883864341]
This paper explores the applications of deep generative models (DGMs) that have been widely used for image data generation to scientific data augmentation. Once trained, DGMs can be used to generate synthetic data that are similar to the training data and significantly expand the dataset size.
arXiv Detail & Related papers (2023-08-19T22:19:41Z)
Synthetic Health-related Longitudinal Data with Mixed-type Variables Generated using Diffusion Models [2.140861702387444]
This paper presents a novel approach to simulating electronic health records using diffusion probabilistic models (DPMs) We demonstrate the effectiveness of DPMs in synthesising longitudinal EHRs that capture mixed-type variables, including numeric, binary, and categorical variables.
arXiv Detail & Related papers (2023-03-22T03:15:33Z)
On Calibrating Diffusion Probabilistic Models [78.75538484265292]
diffusion probabilistic models (DPMs) have achieved promising results in diverse generative tasks. We propose a simple way for calibrating an arbitrary pretrained DPM, with which the score matching loss can be reduced and the lower bounds of model likelihood can be increased. Our calibration method is performed only once and the resulting models can be used repeatedly for sampling.
arXiv Detail & Related papers (2023-02-21T14:14:40Z)
DisDiff: Unsupervised Disentanglement of Diffusion Probabilistic Models [42.58375679841317]
We propose a new task, disentanglement of Diffusion Probabilistic Models (DPMs) The task is to automatically discover the inherent factors behind the observations and disentangle the gradient fields of DPM into sub-gradient fields. We devise an unsupervised approach named DisDiff, achieving disentangled representation learning in the framework of DPMs.
arXiv Detail & Related papers (2023-01-31T15:58:32Z)
Cluster-level pseudo-labelling for source-free cross-domain facial expression recognition [94.56304526014875]
We propose the first Source-Free Unsupervised Domain Adaptation (SFUDA) method for Facial Expression Recognition (FER) Our method exploits self-supervised pretraining to learn good feature representations from the target data. We validate the effectiveness of our method in four adaptation setups, proving that it consistently outperforms existing SFUDA methods when applied to FER.
arXiv Detail & Related papers (2022-10-11T08:24:50Z)
CausalAgents: A Robustness Benchmark for Motion Forecasting using Causal Relationships [8.679073301435265]
We construct a new benchmark for evaluating and improving model robustness by applying perturbations to existing data. We use these labels to perturb the data by deleting non-causal agents from the scene. Under non-causal perturbations, we observe a $25$-$38%$ relative change in minADE as compared to the original.
arXiv Detail & Related papers (2022-07-07T21:28:23Z)
SSM-DTA: Breaking the Barriers of Data Scarcity in Drug-Target Affinity Prediction [127.43571146741984]
Drug-Target Affinity (DTA) is of vital importance in early-stage drug discovery. wet experiments remain the most reliable method, but they are time-consuming and resource-intensive. Existing methods have primarily focused on developing techniques based on the available DTA data, without adequately addressing the data scarcity issue. We present the SSM-DTA framework, which incorporates three simple yet highly effective strategies.
arXiv Detail & Related papers (2022-06-20T14:53:25Z)
Prompting to Distill: Boosting Data-Free Knowledge Distillation via Reinforced Prompt [52.6946016535059]
Data-free knowledge distillation (DFKD) conducts knowledge distillation via eliminating the dependence of original training data. We propose a prompt-based method, termed as PromptDFD, that allows us to take advantage of learned language priors. As shown in our experiments, the proposed method substantially improves the synthesis quality and achieves considerable improvements on distillation performance.
arXiv Detail & Related papers (2022-05-16T08:56:53Z)
Data-SUITE: Data-centric identification of in-distribution incongruous examples [81.21462458089142]
Data-SUITE is a data-centric framework to identify incongruous regions of in-distribution (ID) data. We empirically validate Data-SUITE's performance and coverage guarantees.
arXiv Detail & Related papers (2022-02-17T18:58:31Z)
Leveraging Unlabeled Data to Predict Out-of-Distribution Performance [63.740181251997306]
Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions. In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data. We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence, predicting accuracy as the fraction of unlabeled examples.
arXiv Detail & Related papers (2022-01-11T23:01:12Z)
Incorporating Causal Graphical Prior Knowledge into Predictive Modeling via Simple Data Augmentation [92.96204497841032]
Causal graphs (CGs) are compact representations of the knowledge of the data generating processes behind the data distributions. We propose a model-agnostic data augmentation method that allows us to exploit the prior knowledge of the conditional independence (CI) relations. We experimentally show that the proposed method is effective in improving the prediction accuracy, especially in the small-data regime.
arXiv Detail & Related papers (2021-02-27T06:13:59Z)

This list is automatically generated from the titles and abstracts of the papers in this site.