Related papers: On the Vulnerability of Concept Erasure in Diffusion Models

URL: http://arxiv.org/abs/2502.17537v1
Date: Mon, 24 Feb 2025 17:26:01 GMT
Title: On the Vulnerability of Concept Erasure in Diffusion Models
Authors: Lucas Beerens, Alex D. Richardson, Kaicheng Zhang, Dongdong Chen
Abstract summary: Research on machine unlearning has developed various concept erasure methods, which aim to remove the effect of unwanted data through post-hoc training. We show these erasure techniques are vulnerable, where images of supposedly erased concepts can still be generated using adversarially crafted prompts. We introduce RECORD, a coordinate-descent-based algorithm that discovers prompts capable of eliciting the generation of erased content.
Score: 13.916443687966039
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The proliferation of text-to-image diffusion models has raised significant privacy and security concerns, particularly regarding the generation of copyrighted or harmful images. To address these issues, research on machine unlearning has developed various concept erasure methods, which aim to remove the effect of unwanted data through post-hoc training. However, we show these erasure techniques are vulnerable, where images of supposedly erased concepts can still be generated using adversarially crafted prompts. We introduce RECORD, a coordinate-descent-based algorithm that discovers prompts capable of eliciting the generation of erased content. We demonstrate that RECORD significantly beats the attack success rate of current state-of-the-art attack methods. Furthermore, our findings reveal that models subjected to concept erasure are more susceptible to adversarial attacks than previously anticipated, highlighting the urgency for more robust unlearning approaches. We open source all our code at https://github.com/LucasBeerens/RECORD

Related papers

Erased or Dormant? Rethinking Concept Erasure Through Reversibility [8.454050090398713]
We evaluate two representative concept erasure methods, Unified Concept Editing and Erased Stable Diffusion. We show that erased concepts often reemerge with substantial visual fidelity after minimal adaptation. Our findings reveal critical limitations in existing concept erasure approaches.
arXiv Detail & Related papers (2025-05-22T03:26:46Z)
Erased but Not Forgotten: How Backdoors Compromise Concept Erasure [36.056298969999645]
We introduce a new threat model, Toxic Erasure (ToxE), and demonstrate how recent unlearning algorithms can be circumvented through targeted backdoor attacks. For explicit content erasure, ToxE attacks can elicit up to 9 times more exposed body parts, with DISA yielding an average increase by a factor of 2.9.
arXiv Detail & Related papers (2025-04-29T16:13:06Z)
Continual Unlearning for Foundational Text-to-Image Models without Generalization Erosion [56.35484513848296]
This research introduces 'continual unlearning', a novel paradigm that enables the targeted removal of multiple specific concepts from foundational generative models. We propose the Decremental Unlearning without Generalization Erosion (DUGE) algorithm, which selectively unlearns the generation of undesired concepts.
arXiv Detail & Related papers (2025-03-17T23:17:16Z)
TRCE: Towards Reliable Malicious Concept Erasure in Text-to-Image Diffusion Models [45.393001061726366]
Recent advances in text-to-image diffusion models enable photorealistic image generation, but they also risk producing malicious content, such as NSFW images. To mitigate risk, concept erasure methods are studied to facilitate the model to unlearn specific concepts. We propose TRCE, using a two-stage concept erasure strategy to achieve an effective trade-off between reliable erasure and knowledge preservation.
arXiv Detail & Related papers (2025-03-10T14:37:53Z)
TraSCE: Trajectory Steering for Concept Erasure [16.752023123940674]
Text-to-image diffusion models have been shown to generate harmful content such as not-safe-for-work (NSFW) images. We propose TraSCE, an approach to guide the diffusion trajectory away from generating harmful content.
arXiv Detail & Related papers (2024-12-10T16:45:03Z)
Towards Lifelong Few-Shot Customization of Text-to-Image Diffusion [50.26583654615212]
Lifelong few-shot customization for text-to-image diffusion aims to continually generalize existing models for new tasks with minimal data. In this study, we identify and categorize the catastrophic forgetting problems into two categories: relevant concepts forgetting and previous concepts forgetting. Unlike existing methods that rely on additional real data or offline replay of original concept data, our approach enables on-the-fly knowledge distillation to retain the previous concepts while learning new ones.
arXiv Detail & Related papers (2024-11-08T12:58:48Z)
Reliable and Efficient Concept Erasure of Text-to-Image Diffusion Models [76.39651111467832]
We introduce Reliable and Efficient Concept Erasure (RECE), a novel approach that modifies the model in 3 seconds without necessitating additional fine-tuning. To mitigate inappropriate content potentially represented by derived embeddings, RECE aligns them with harmless concepts in cross-attention layers. The derivation and erasure of new representation embeddings are conducted iteratively to achieve a thorough erasure of inappropriate concepts.
arXiv Detail & Related papers (2024-07-17T08:04:28Z)
Safeguard Text-to-Image Diffusion Models with Human Feedback Inversion [51.931083971448885]
We propose a framework named Human Feedback Inversion (HFI), where human feedback on model-generated images is condensed into textual tokens guiding the mitigation or removal of problematic images. Our experimental results demonstrate our framework significantly reduces objectionable content generation while preserving image quality, contributing to the ethical deployment of AI in the public sphere.
arXiv Detail & Related papers (2024-07-17T05:21:41Z)
Rethinking and Defending Protective Perturbation in Personalized Diffusion Models [21.30373461975769]
We study the fine-tuning process of personalized diffusion models (PDMs) through the lens of shortcut learning. PDMs are susceptible to minor adversarial perturbations, leading to significant degradation when fine-tuned on corrupted datasets. We propose a systematic defense framework that includes data purification and contrastive decoupling learning.
arXiv Detail & Related papers (2024-06-27T07:14:14Z)
Six-CD: Benchmarking Concept Removals for Benign Text-to-image Diffusion Models [58.74606272936636]
Text-to-image (T2I) diffusion models have shown exceptional capabilities in generating images that closely correspond to textual prompts. The models could be exploited for malicious purposes, such as generating images with violence or nudity, or creating unauthorized portraits of public figures in inappropriate contexts. Concept removal methods have been proposed to modify diffusion models to prevent the generation of malicious and unwanted concepts.
arXiv Detail & Related papers (2024-06-21T03:58:44Z)
Pruning for Robust Concept Erasing in Diffusion Models [27.67237515704348]
We introduce a new pruning-based strategy for concept erasing. Our method selectively prunes critical parameters associated with the concepts targeted for removal, thereby reducing the sensitivity of concept-related neurons. Experimental results show a significant enhancement in our model's ability to resist adversarial inputs.
arXiv Detail & Related papers (2024-05-26T11:42:20Z)
Unveiling and Mitigating Memorization in Text-to-image Diffusion Models through Cross Attention [62.671435607043875]
Research indicates that text-to-image diffusion models replicate images from their training data, raising tremendous concerns about potential copyright infringement and privacy risks. We reveal that during memorization, the cross-attention tends to focus disproportionately on the embeddings of specific tokens. We introduce an innovative approach to detect and mitigate memorization in diffusion models.
arXiv Detail & Related papers (2024-03-17T01:27:00Z)
Separable Multi-Concept Erasure from Diffusion Models [52.51972530398691]
We propose a Separable Multi-concept Eraser (SepME) to eliminate unsafe concepts from large-scale diffusion models. The latter separates optimizable model weights, making each weight increment correspond to a specific concept erasure. Extensive experiments indicate the efficacy of our approach in eliminating concepts, preserving model performance, and offering flexibility in the erasure or recovery of various concepts.
arXiv Detail & Related papers (2024-02-03T11:10:57Z)
A Dataset and Benchmark for Copyright Infringement Unlearning from Text-to-Image Diffusion Models [52.49582606341111]
Copyright law confers creators the exclusive rights to reproduce, distribute, and monetize their creative works. Recent progress in text-to-image generation has introduced formidable challenges to copyright enforcement. We introduce a novel pipeline that harmonizes CLIP, ChatGPT, and diffusion models to curate a dataset.
arXiv Detail & Related papers (2024-01-04T11:14:01Z)
Receler: Reliable Concept Erasing of Text-to-Image Diffusion Models via Lightweight Erasers [24.64639078273091]
Concept erasure in text-to-image diffusion models aims to disable pre-trained diffusion models from generating images related to a target concept. We propose Reliable Concept Erasing via Lightweight Erasers (Receler).
arXiv Detail & Related papers (2023-11-29T15:19:49Z)
To Generate or Not? Safety-Driven Unlearned Diffusion Models Are Still Easy To Generate Unsafe Images ... For Now [22.75295925610285]
Diffusion models (DMs) have revolutionized the generation of realistic and complex images. DMs also introduce potential safety hazards, such as producing harmful content and infringing data copyrights. Despite the development of safety-driven unlearning techniques, doubts about their efficacy persist.
arXiv Detail & Related papers (2023-10-18T10:36:34Z)
Towards Safe Self-Distillation of Internet-Scale Text-to-Image Diffusion Models [63.20512617502273]
We propose a method called SDD to prevent problematic content generation in text-to-image diffusion models. Our method eliminates a much greater proportion of harmful content from the generated images without degrading the overall image quality.
arXiv Detail & Related papers (2023-07-12T07:48:29Z)
Generative Model-Based Attack on Learnable Image Encryption for Privacy-Preserving Deep Learning [14.505867475659276]
We propose a novel generative model-based attack on learnable image encryption methods proposed for privacy-preserving deep learning. We use two state-of-the-art generative models: a StyleGAN-based model and latent diffusion-based one. Results show that images reconstructed by the proposed method have perceptual similarities to plain images.
arXiv Detail & Related papers (2023-03-09T05:00:17Z)

This list is automatically generated from the titles and abstracts of the papers on this site.