STEREO: Towards Adversarially Robust Concept Erasing from Text-to-Image Generation Models
- URL: http://arxiv.org/abs/2408.16807v1
- Date: Thu, 29 Aug 2024 17:29:26 GMT
- Title: STEREO: Towards Adversarially Robust Concept Erasing from Text-to-Image Generation Models
- Authors: Koushik Srivatsan, Fahad Shamshad, Muzammal Naseer, Karthik Nandakumar,
- Abstract summary: We propose an approach called STEREO that involves two distinct stages.
The first stage searches thoroughly enough for strong and diverse adversarial prompts that can regenerate an erased concept from a CEM.
In the second robustly erase once stage, we introduce an anchor-concept-based compositional objective to robustly erase the target concept at one go.
- Score: 18.64776777593743
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The rapid proliferation of large-scale text-to-image generation (T2IG) models has led to concerns about their potential misuse in generating harmful content. Though many methods have been proposed for erasing undesired concepts from T2IG models, they only provide a false sense of security, as recent works demonstrate that concept-erased models (CEMs) can be easily deceived to generate the erased concept through adversarial attacks. The problem of adversarially robust concept erasing without significant degradation to model utility (ability to generate benign concepts) remains an unresolved challenge, especially in the white-box setting where the adversary has access to the CEM. To address this gap, we propose an approach called STEREO that involves two distinct stages. The first stage searches thoroughly enough for strong and diverse adversarial prompts that can regenerate an erased concept from a CEM, by leveraging robust optimization principles from adversarial training. In the second robustly erase once stage, we introduce an anchor-concept-based compositional objective to robustly erase the target concept at one go, while attempting to minimize the degradation on model utility. By benchmarking the proposed STEREO approach against four state-of-the-art concept erasure methods under three adversarial attacks, we demonstrate its ability to achieve a better robustness vs. utility trade-off. Our code and models are available at https://github.com/koushiksrivats/robust-concept-erasing.
Related papers
- Fine-Grained Erasure in Text-to-Image Diffusion-based Foundation Models [56.35484513848296]
FADE (Fine grained Attenuation for Diffusion Erasure) is an adjacency-aware unlearning algorithm for text-to-image generative models.
It removes target concepts with minimal impact on correlated concepts, achieving a 12% improvement in retention performance over state-of-the-art methods.
arXiv Detail & Related papers (2025-03-25T15:49:48Z) - CRCE: Coreference-Retention Concept Erasure in Text-to-Image Diffusion Models [19.074434401274285]
We introduce CRCE, a novel concept erasure framework.
By explicitly modeling coreferential and retained concepts semantically, CRCE enables more precise concept removal.
Experiments demonstrate that CRCE outperforms existing methods on diverse erasure tasks.
arXiv Detail & Related papers (2025-03-18T13:09:01Z) - Sparse Autoencoder as a Zero-Shot Classifier for Concept Erasing in Text-to-Image Diffusion Models [24.15603438969762]
Interpret then Deactivate (ItD) is a novel framework to enable precise concept removal in T2I diffusion models.
ItD uses a sparse autoencoder to interpret each concept as a combination of multiple features.
It can be easily extended to erase multiple concepts without requiring further training.
arXiv Detail & Related papers (2025-03-12T14:46:40Z) - Reliable and Efficient Concept Erasure of Text-to-Image Diffusion Models [76.39651111467832]
We introduce Reliable and Efficient Concept Erasure (RECE), a novel approach that modifies the model in 3 seconds without necessitating additional fine-tuning.
To mitigate inappropriate content potentially represented by derived embeddings, RECE aligns them with harmless concepts in cross-attention layers.
The derivation and erasure of new representation embeddings are conducted iteratively to achieve a thorough erasure of inappropriate concepts.
arXiv Detail & Related papers (2024-07-17T08:04:28Z) - Six-CD: Benchmarking Concept Removals for Benign Text-to-image Diffusion Models [58.74606272936636]
Text-to-image (T2I) diffusion models have shown exceptional capabilities in generating images that closely correspond to textual prompts.
The models could be exploited for malicious purposes, such as generating images with violence or nudity, or creating unauthorized portraits of public figures in inappropriate contexts.
concept removal methods have been proposed to modify diffusion models to prevent the generation of malicious and unwanted concepts.
arXiv Detail & Related papers (2024-06-21T03:58:44Z) - Pruning for Robust Concept Erasing in Diffusion Models [27.67237515704348]
We introduce a new pruning-based strategy for concept erasing.
Our method selectively prunes critical parameters associated with the concepts targeted for removal, thereby reducing the sensitivity of concept-related neurons.
Experimental results show a significant enhancement in our model's ability to resist adversarial inputs.
arXiv Detail & Related papers (2024-05-26T11:42:20Z) - R.A.C.E.: Robust Adversarial Concept Erasure for Secure Text-to-Image Diffusion Model [31.2030795154036]
textbfRobust textbfAdrial textbfConcept textbfErase (RACE) is a novel approach designed to mitigate these risks.
RACE utilizes a sophisticated adversarial training framework to identify and mitigate adversarial text embeddings.
Our evaluations demonstrate RACE's effectiveness in defending against both white-box and black-box attacks.
arXiv Detail & Related papers (2024-05-25T19:56:01Z) - MACE: Mass Concept Erasure in Diffusion Models [11.12833789743765]
We introduce MACE, a finetuning framework for the task of mass concept erasure.
This task aims to prevent models from generating images that embody unwanted concepts when prompted.
We conduct extensive evaluations of MACE against prior methods across four different tasks.
arXiv Detail & Related papers (2024-03-10T08:50:56Z) - Separable Multi-Concept Erasure from Diffusion Models [52.51972530398691]
We propose a Separable Multi-concept Eraser (SepME) to eliminate unsafe concepts from large-scale diffusion models.
The latter separates optimizable model weights, making each weight increment correspond to a specific concept erasure.
Extensive experiments indicate the efficacy of our approach in eliminating concepts, preserving model performance, and offering flexibility in the erasure or recovery of various concepts.
arXiv Detail & Related papers (2024-02-03T11:10:57Z) - Receler: Reliable Concept Erasing of Text-to-Image Diffusion Models via Lightweight Erasers [24.64639078273091]
Concept erasure in text-to-image diffusion models aims to disable pre-trained diffusion models from generating images related to a target concept.
We propose Reliable Concept Erasing via Lightweight Erasers (Receler)
arXiv Detail & Related papers (2023-11-29T15:19:49Z) - Ring-A-Bell! How Reliable are Concept Removal Methods for Diffusion Models? [52.238883592674696]
Ring-A-Bell is a model-agnostic red-teaming tool for T2I diffusion models.
It identifies problematic prompts for diffusion models with the corresponding generation of inappropriate content.
Our results show that Ring-A-Bell, by manipulating safe prompting benchmarks, can transform prompts that were originally regarded as safe to evade existing safety mechanisms.
arXiv Detail & Related papers (2023-10-16T02:11:20Z) - Implicit Concept Removal of Diffusion Models [92.55152501707995]
Text-to-image (T2I) diffusion models often inadvertently generate unwanted concepts such as watermarks and unsafe images.
We present the Geom-Erasing, a novel concept removal method based on the geometric-driven control.
arXiv Detail & Related papers (2023-10-09T17:13:10Z) - Forget-Me-Not: Learning to Forget in Text-to-Image Diffusion Models [79.50701155336198]
textbfForget-Me-Not is designed to safely remove specified IDs, objects, or styles from a well-configured text-to-image model in as little as 30 seconds.
We demonstrate that Forget-Me-Not can effectively eliminate targeted concepts while maintaining the model's performance on other concepts.
It can also be adapted as a lightweight model patch for Stable Diffusion, allowing for concept manipulation and convenient distribution.
arXiv Detail & Related papers (2023-03-30T17:58:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.