Probing Unlearned Diffusion Models: A Transferable Adversarial Attack Perspective
- URL: http://arxiv.org/abs/2404.19382v1
- Date: Tue, 30 Apr 2024 09:14:54 GMT
- Title: Probing Unlearned Diffusion Models: A Transferable Adversarial Attack Perspective
- Authors: Xiaoxuan Han, Songlin Yang, Wei Wang, Yang Li, Jing Dong
- Abstract summary: Unlearning methods have been developed to erase concepts from diffusion models.
This paper aims to leverage the transferability of the adversarial attack to probe the unlearning robustness under a black-box setting.
Specifically, we employ an adversarial search strategy to search for the adversarial embedding which can transfer across different unlearned models.
- Score: 20.263233740360022
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Advanced text-to-image diffusion models raise safety concerns regarding identity privacy violation, copyright infringement, and Not Safe For Work content generation. In response, unlearning methods have been developed to erase these concepts from diffusion models. However, these unlearning methods only shift the text-to-image mapping and preserve the visual content within the generative space of diffusion models, leaving a fatal flaw that allows the erased concepts to be restored. This erasure trustworthiness problem needs probing, but previous methods are sub-optimal from two perspectives: (1) Lack of transferability: some methods operate within a white-box setting, requiring access to the unlearned model, and the learned adversarial input often fails to transfer to other unlearned models for concept restoration; (2) Limited attack: prompt-level methods struggle to restore narrow concepts, such as celebrity identity, from unlearned models. Therefore, this paper aims to leverage the transferability of the adversarial attack to probe the unlearning robustness under a black-box setting. This challenging scenario assumes that the unlearning method is unknown and the unlearned model is inaccessible for optimization, requiring the attack to be capable of transferring across different unlearned models. Specifically, we employ an adversarial search strategy to search for the adversarial embedding which can transfer across different unlearned models. This strategy adopts the original Stable Diffusion model as a surrogate model to iteratively erase and search for embeddings, enabling it to find the embedding that can restore the target concept for different unlearning methods. Extensive experiments demonstrate the transferability of the searched adversarial embedding across several state-of-the-art unlearning methods and its effectiveness for different levels of concepts.
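The "iteratively erase and search" idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the surrogate here is a small linear map standing in for the text-to-image mapping, the erase step is an assumed rank-one edit, the search step is plain gradient descent on the embedding, and every name (`erase`, `search`, `erase_and_search`) is hypothetical. It only shows the alternating loop on a surrogate, and makes no claim about transfer to other models.

```python
import numpy as np

def erase(W, emb, anchor_out):
    """Toy 'unlearning' (assumption): rank-one edit so that W_new @ emb == anchor_out."""
    residual = W @ emb - anchor_out
    return W - np.outer(residual, emb) / (emb @ emb)

def search(W, target_out, init_emb, steps=5000):
    """Gradient descent on the embedding to minimise ||W @ emb - target_out||^2."""
    emb = init_emb.astype(float).copy()
    # Step size 1/L, where L = 2 * sigma_max(W)^2 is the gradient's Lipschitz constant.
    lr = 1.0 / (2.0 * np.linalg.norm(W, 2) ** 2)
    for _ in range(steps):
        grad = 2.0 * W.T @ (W @ emb - target_out)
        emb -= lr * grad
    return emb

def erase_and_search(W, concept_emb, anchor_emb, rounds=2):
    """Alternate erasing the current embedding and searching for a new one."""
    target_out = W @ concept_emb   # what the original surrogate produces for the concept
    anchor_out = W @ anchor_emb    # output the erased concept is redirected to
    W_cur = W.copy()
    emb = concept_emb.astype(float).copy()
    for _ in range(rounds):
        W_cur = erase(W_cur, emb, anchor_out)    # surrogate forgets the current embedding
        emb = search(W_cur, target_out, emb)     # find an embedding that restores the concept
    return W_cur, emb, target_out

# Hypothetical toy data: a 2x4 "surrogate" and one-hot concept/anchor embeddings.
W = np.array([[1.0, 0.0, 0.5, 0.0],
              [0.0, 1.0, 0.0, 0.5]])
concept = np.array([1.0, 0.0, 0.0, 0.0])
anchor = np.array([0.0, 1.0, 0.0, 0.0])

W_erased, adv_emb, target = erase_and_search(W, concept, anchor)
print("erased model on original embedding:", W_erased @ concept)
print("erased model on searched embedding:", W_erased @ adv_emb)
```

In this toy, the erased surrogate no longer maps the original concept embedding to the target output, while the searched embedding does. Repeating the erase step before each search is meant to push the search away from embeddings that a single erasure would already cover.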
Related papers
- Unstable Unlearning: The Hidden Risk of Concept Resurgence in Diffusion Models [13.479224197351673]
We show that fine-tuning a text-to-image diffusion model on seemingly unrelated images can cause it to "relearn" concepts that were previously "unlearned".
Our findings underscore the fragility of composing incremental model updates.
arXiv Detail & Related papers (2024-10-10T16:10:27Z) - Score Forgetting Distillation: A Swift, Data-Free Method for Machine Unlearning in Diffusion Models [63.43422118066493]
Machine unlearning (MU) is a crucial foundation for developing safe, secure, and trustworthy GenAI models.
Traditional MU methods often rely on stringent assumptions and require access to real data.
This paper introduces Score Forgetting Distillation (SFD), an innovative MU approach that promotes the forgetting of undesirable information in diffusion models.
arXiv Detail & Related papers (2024-09-17T14:12:50Z) - Unlearning or Concealment? A Critical Analysis and Evaluation Metrics for Unlearning in Diffusion Models [7.9993879763024065]
We show that the objective functions used for unlearning in the existing methods lead to decoupling of the targeted concepts for the corresponding prompts.
The ineffectiveness of current methods stems primarily from their narrow focus on reducing generation probabilities for specific prompt sets.
We introduce two new evaluation metrics: Concept Retrieval Score (CRS) and Concept Confidence Score (CCS).
arXiv Detail & Related papers (2024-09-09T14:38:31Z) - Adversarial Robustification via Text-to-Image Diffusion Models [56.37291240867549]
Adversarial robustness has conventionally been regarded as a challenging property to encode in neural networks.
We develop a scalable and model-agnostic solution to achieve adversarial robustness without using any data.
arXiv Detail & Related papers (2024-07-26T10:49:14Z) - Six-CD: Benchmarking Concept Removals for Benign Text-to-image Diffusion Models [58.74606272936636]
Text-to-image (T2I) diffusion models have shown exceptional capabilities in generating images that closely correspond to textual prompts.
The models could be exploited for malicious purposes, such as generating images with violence or nudity, or creating unauthorized portraits of public figures in inappropriate contexts.
Concept removal methods have been proposed to modify diffusion models to prevent the generation of malicious and unwanted concepts.
arXiv Detail & Related papers (2024-06-21T03:58:44Z) - Unlearning Concepts in Diffusion Model via Concept Domain Correction and Concept Preserving Gradient [20.091446060893638]
This paper proposes a concept domain correction framework for unlearning concepts in diffusion models.
By aligning the output domains of sensitive concepts and anchor concepts through adversarial training, we enhance the generalizability of the unlearning results.
arXiv Detail & Related papers (2024-05-24T07:47:36Z) - Segue: Side-information Guided Generative Unlearnable Examples for Facial Privacy Protection in Real World [64.4289385463226]
We propose Segue: Side-information guided generative unlearnable examples.
To improve transferability, we introduce side information such as true labels and pseudo labels.
It can resist JPEG compression, adversarial training, and some standard data augmentations.
arXiv Detail & Related papers (2023-10-24T06:22:37Z) - Ablating Concepts in Text-to-Image Diffusion Models [57.9371041022838]
Large-scale text-to-image diffusion models can generate high-fidelity images with powerful compositional ability.
These models are typically trained on an enormous amount of Internet data, often containing copyrighted material, licensed images, and personal photos.
We propose an efficient method of ablating concepts in the pretrained model, preventing the generation of a target concept.
arXiv Detail & Related papers (2023-03-23T17:59:42Z) - Learning to Learn Transferable Attack [77.67399621530052]
Transfer adversarial attack is a non-trivial black-box adversarial attack that aims to craft adversarial perturbations on the surrogate model and then apply such perturbations to the victim model.
We propose a Learning to Learn Transferable Attack (LLTA) method, which makes the adversarial perturbations more generalized via learning from both data and model augmentation.
Empirical results on the widely-used dataset demonstrate the effectiveness of our attack method with a 12.85% higher success rate of transfer attack compared with the state-of-the-art methods.
arXiv Detail & Related papers (2021-12-10T07:24:21Z)