R.A.C.E.: Robust Adversarial Concept Erasure for Secure Text-to-Image Diffusion Model
- URL: http://arxiv.org/abs/2405.16341v2
- Date: Tue, 23 Jul 2024 07:55:03 GMT
- Title: R.A.C.E.: Robust Adversarial Concept Erasure for Secure Text-to-Image Diffusion Model
- Authors: Changhoon Kim, Kyle Min, Yezhou Yang
- Abstract summary: Robust Adversarial Concept Erase (RACE) is a novel approach designed to mitigate the risk of T2I diffusion models reproducing sensitive content.
RACE utilizes a sophisticated adversarial training framework to identify and mitigate adversarial text embeddings.
Our evaluations demonstrate RACE's effectiveness in defending against both white-box and black-box attacks.
- Score: 31.2030795154036
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In the evolving landscape of text-to-image (T2I) diffusion models, the remarkable capability to generate high-quality images from textual descriptions faces challenges with the potential misuse of reproducing sensitive content. To address this critical issue, we introduce Robust Adversarial Concept Erase (RACE), a novel approach designed to mitigate these risks by enhancing the robustness of concept erasure methods for T2I models. RACE utilizes a sophisticated adversarial training framework to identify and mitigate adversarial text embeddings, significantly reducing the Attack Success Rate (ASR). Impressively, RACE achieves a 30 percentage point reduction in ASR for the "nudity" concept against the leading white-box attack method. Our extensive evaluations demonstrate RACE's effectiveness in defending against both white-box and black-box attacks, marking a significant advancement in protecting T2I diffusion models from generating inappropriate or misleading imagery. This work underlines the essential need for proactive defense measures in adapting to the rapidly advancing field of adversarial challenges. Our code is publicly available: https://github.com/chkimmmmm/R.A.C.E.
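The abstract does not spell out the training procedure, but the described adversarial loop (search for an adversarial text embedding that makes the erased model regenerate the concept, then update the model against it) can be pictured roughly as below. This is a minimal sketch under assumed interfaces (`erased_unet` and `frozen_unet` as callables returning noise predictions), not the authors' released code.

```python
import torch
import torch.nn.functional as F

def race_style_step(erased_unet, frozen_unet, text_emb, latents, t,
                    eps=0.1, inner_steps=10, inner_lr=1e-2):
    """Hedged sketch of one adversarial concept-erasure step.

    Inner loop (attack): find a small perturbation of the text embedding that
    makes the *erased* model reproduce the frozen model's prediction for the
    erased concept. Outer step (defense): penalize the erased model on that
    adversarial embedding. `erased_unet` / `frozen_unet` are assumed callables
    (latents, t, text_emb) -> noise prediction; they are NOT the RACE API.
    """
    delta = torch.zeros_like(text_emb, requires_grad=True)

    for _ in range(inner_steps):
        attack_loss = F.mse_loss(
            erased_unet(latents, t, text_emb + delta),
            frozen_unet(latents, t, text_emb).detach(),
        )
        (grad,) = torch.autograd.grad(attack_loss, delta)
        # Descend the loss so the erased model again matches the unsafe target,
        # while keeping the perturbation inside an eps-ball.
        delta = (delta - inner_lr * grad.sign()).clamp(-eps, eps)
        delta = delta.detach().requires_grad_(True)

    # Defense: make the model's prediction on the adversarial embedding match
    # its prediction on an unconditional (concept-free) embedding.
    safe_target = erased_unet(latents, t, torch.zeros_like(text_emb)).detach()
    defense_loss = F.mse_loss(
        erased_unet(latents, t, (text_emb + delta).detach()), safe_target)
    return defense_loss  # backpropagated into erased_unet's parameters
```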
Related papers
- Token-Level Constraint Boundary Search for Jailbreaking Text-to-Image Models [20.740929360321747]
Text-to-Image (T2I) generation poses risks related to the production of inappropriate or harmful content.
We propose TCBS-Attack, a query-based black-box jailbreak attack that searches for tokens located near the decision boundaries defined by text and image checkers.
Our method consistently outperforms state-of-the-art jailbreak attacks across various T2I models.
arXiv Detail & Related papers (2025-04-15T11:53:40Z)
- ACE: Concept Editing in Diffusion Models without Performance Degradation [14.874352344948482]
Diffusion-based text-to-image models have demonstrated remarkable capabilities in generating realistic images.
They raise societal and ethical concerns, such as the creation of unsafe content.
We propose ACE, a new editing method that enhances concept editing in diffusion models.
arXiv Detail & Related papers (2025-03-11T07:30:18Z)
- Distorting Embedding Space for Safety: A Defense Mechanism for Adversarially Robust Diffusion Models [4.5656369638728656]
Distorting Embedding Space (DES) is a text encoder-based defense mechanism.
DES transforms unsafe embeddings, extracted from a text encoder using unsafe prompts, toward carefully calculated safe embedding regions.
DES also neutralizes the nudity embedding, extracted using the prompt "nudity", by aligning it with a neutral embedding to enhance robustness against adversarial attacks.
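Based only on this summary, the defense can be pictured as a loss on the text encoder that pulls unsafe-prompt embeddings toward a precomputed safe region and pulls the "nudity" embedding toward a neutral one. The sketch below relies on hypothetical helpers (`text_encoder`, `tokenize`, `safe_anchor_emb`, `neutral_emb`) and is not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def des_style_loss(text_encoder, tokenize, unsafe_prompts, safe_anchor_emb, neutral_emb):
    """Hedged sketch of an embedding-distortion objective.

    `text_encoder(token_ids) -> (d,)` pooled prompt embedding; `tokenize`,
    `safe_anchor_emb` (centroid of a precomputed safe region) and
    `neutral_emb` are illustrative assumptions.
    """
    # Pull unsafe-prompt embeddings toward the safe anchor region.
    unsafe = torch.stack([text_encoder(tokenize(p)) for p in unsafe_prompts])   # (n, d)
    to_safe = (1 - F.cosine_similarity(unsafe, safe_anchor_emb.unsqueeze(0), dim=-1)).mean()

    # Align the embedding of the literal prompt "nudity" with a neutral embedding.
    nudity = text_encoder(tokenize("nudity"))                                   # (d,)
    to_neutral = 1 - F.cosine_similarity(nudity, neutral_emb, dim=-1)

    return to_safe + to_neutral  # minimized while fine-tuning the text encoder
```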
arXiv Detail & Related papers (2025-01-31T04:14:05Z)
- CE-SDWV: Effective and Efficient Concept Erasure for Text-to-Image Diffusion Models via a Semantic-Driven Word Vocabulary [19.677968878628576]
Large-scale text-to-image (T2I) diffusion models have achieved remarkable generative performance across various concepts.
Given practical privacy and safety constraints, however, the capability to generate NSFW (Not Safe For Work) concepts is undesirable.
We propose a framework that removes target concepts from T2I diffusion models in the text semantic space.
arXiv Detail & Related papers (2025-01-26T15:39:47Z)
- RT-Attack: Jailbreaking Text-to-Image Models via Random Token [24.61198605177661]
We introduce a two-stage query-based black-box attack method utilizing random search.
In the first stage, we establish a preliminary prompt by maximizing the semantic similarity between the adversarial and target harmful prompts.
In the second stage, we use this initial prompt to refine our approach, creating a detailed adversarial prompt aimed at jailbreaking.
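The first stage described above amounts to a greedy random search over tokens scored by text-embedding similarity. A minimal sketch under assumed helpers (`embed` returning a text embedding, `vocab` a list of candidate tokens) is given below; the second, image-aware refinement stage is omitted.

```python
import random
import torch
import torch.nn.functional as F

def rt_style_stage1(embed, target_prompt, vocab, prompt_len=8, iters=500):
    """Hedged sketch of the text-only stage of an RT-Attack-like random search:
    greedily accept random token swaps that increase semantic similarity to the
    target harmful prompt. `embed` is an assumed text-embedding function (e.g.
    a CLIP text-encoder wrapper returning a 1-D tensor)."""
    target_emb = embed(target_prompt)
    tokens = [random.choice(vocab) for _ in range(prompt_len)]
    best_sim = F.cosine_similarity(embed(" ".join(tokens)), target_emb, dim=-1).item()

    for _ in range(iters):
        pos = random.randrange(prompt_len)
        candidate = tokens.copy()
        candidate[pos] = random.choice(vocab)
        sim = F.cosine_similarity(embed(" ".join(candidate)), target_emb, dim=-1).item()
        if sim > best_sim:  # keep the swap only if similarity improves
            tokens, best_sim = candidate, sim

    return " ".join(tokens), best_sim
```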
arXiv Detail & Related papers (2024-08-25T17:33:40Z)
- DiffZOO: A Purely Query-Based Black-Box Attack for Red-teaming Text-to-Image Generative Model via Zeroth Order Optimization [20.958826487430194]
Red teaming attack methods are proposed to enhance or expose the T2I model's capability to generate unsuitable content.
We propose DiffZOO, which applies Zeroth Order Optimization to obtain gradient approximations and harnesses both C-PRV and D-PRV to enhance attack prompts.
Experiments on multiple state-of-the-art safety mechanisms show that DiffZOO attains an 8.5% higher average attack success rate than previous works.
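The core of a zeroth-order approach is estimating gradients from black-box score queries alone. Below is a generic sketch of such an estimator (not the paper's C-PRV/D-PRV machinery); `score_fn` stands in for whatever black-box objective the attack queries, e.g. a concept detector's score on the generated image.

```python
import torch

def zoo_gradient_estimate(score_fn, z, mu=1e-2, num_samples=20):
    """Hedged sketch of zeroth-order gradient estimation: query the black-box
    scalar score at symmetric perturbations of the continuous prompt
    representation `z` to approximate its gradient."""
    grad = torch.zeros_like(z)
    for _ in range(num_samples):
        u = torch.randn_like(z)
        grad += (score_fn(z + mu * u) - score_fn(z - mu * u)) / (2 * mu) * u
    return grad / num_samples
```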
arXiv Detail & Related papers (2024-08-18T03:16:59Z)
- EIUP: A Training-Free Approach to Erase Non-Compliant Concepts Conditioned on Implicit Unsafe Prompts [32.590822043053734]
Non-toxic text can still lead to the generation of non-compliant images; such prompts are referred to as implicit unsafe prompts.
We propose a simple yet effective approach that incorporates non-compliant concepts into an erasure prompt.
Our method exhibits superior erasure effectiveness while achieving high scores in image fidelity.
arXiv Detail & Related papers (2024-08-02T05:17:14Z)
- Reliable and Efficient Concept Erasure of Text-to-Image Diffusion Models [76.39651111467832]
We introduce Reliable and Efficient Concept Erasure (RECE), a novel approach that modifies the model in 3 seconds without necessitating additional fine-tuning.
To mitigate inappropriate content potentially represented by derived embeddings, RECE aligns them with harmless concepts in cross-attention layers.
The derivation and erasure of new representation embeddings are conducted iteratively to achieve a thorough erasure of inappropriate concepts.
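The fast, fine-tuning-free edit suggests a closed-form update of cross-attention projection weights in the spirit of UCE-style editing: map erased-concept embeddings to the projections of harmless anchors while keeping other embeddings unchanged. The least-squares sketch below is an assumption about that general technique, not the paper's exact derivation.

```python
import torch

def closed_form_projection_edit(W, erase_embs, anchor_embs, preserve_embs, lam=0.1):
    """Hedged sketch of a closed-form cross-attention edit.

    Solve for W' so that erased-concept embeddings project like harmless
    anchors while preserved embeddings keep their original projections.
    Shapes: W is (d_out, d_in); *_embs are (n, d_in). Names are illustrative.
    """
    sources = torch.cat([erase_embs, preserve_embs], dim=0)                # (n, d_in)
    targets = torch.cat([anchor_embs @ W.T, preserve_embs @ W.T], dim=0)   # (n, d_out)

    # Ridge-regularized least squares toward the original W keeps the edit local:
    #   min_W' ||sources W'^T - targets||^2 + lam * ||W' - W||^2
    eye = torch.eye(W.shape[1], dtype=W.dtype, device=W.device)
    A = sources.T @ sources + lam * eye
    B = sources.T @ targets + lam * W.T
    return torch.linalg.solve(A, B).T  # new (d_out, d_in) projection matrix
```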
arXiv Detail & Related papers (2024-07-17T08:04:28Z)
- Safeguard Text-to-Image Diffusion Models with Human Feedback Inversion [51.931083971448885]
We propose a framework named Human Feedback Inversion (HFI), where human feedback on model-generated images is condensed into textual tokens guiding the mitigation or removal of problematic images.
Our experimental results demonstrate our framework significantly reduces objectionable content generation while preserving image quality, contributing to the ethical deployment of AI in the public sphere.
arXiv Detail & Related papers (2024-07-17T05:21:41Z)
- Six-CD: Benchmarking Concept Removals for Benign Text-to-image Diffusion Models [58.74606272936636]
Text-to-image (T2I) diffusion models have shown exceptional capabilities in generating images that closely correspond to textual prompts.
The models could be exploited for malicious purposes, such as generating images with violence or nudity, or creating unauthorized portraits of public figures in inappropriate contexts.
Concept removal methods have been proposed to modify diffusion models so that they cannot generate malicious and unwanted concepts.
arXiv Detail & Related papers (2024-06-21T03:58:44Z)
- White-box Multimodal Jailbreaks Against Large Vision-Language Models [61.97578116584653]
We propose a more comprehensive strategy that jointly attacks both text and image modalities to exploit a broader spectrum of vulnerabilities within Large Vision-Language Models.
Our attack method begins by optimizing an adversarial image prefix from random noise to generate diverse harmful responses in the absence of text input.
An adversarial text suffix is integrated and co-optimized with the adversarial image prefix to maximize the probability of eliciting affirmative responses to various harmful instructions.
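One gradient step of the described joint optimization might look like the sketch below, where `vlm_loss` is an assumed differentiable loss (e.g. negative log-likelihood of a target affirmative response) and the text suffix is treated in embedding space for simplicity; real attacks typically optimize discrete suffix tokens.

```python
import torch

def multimodal_jailbreak_step(vlm_loss, image_prefix, text_suffix_embs,
                              lr_img=1e-2, lr_txt=1e-2):
    """Hedged sketch of co-optimizing an adversarial image prefix and text
    suffix. `vlm_loss(image, suffix_embs)` is an assumed differentiable
    scalar objective, not any specific model's API."""
    image_prefix = image_prefix.clone().requires_grad_(True)
    text_suffix_embs = text_suffix_embs.clone().requires_grad_(True)

    loss = vlm_loss(image_prefix, text_suffix_embs)
    g_img, g_txt = torch.autograd.grad(loss, (image_prefix, text_suffix_embs))

    # Signed-gradient descent on the image, plain descent on the suffix
    # embeddings; keep the image in a valid pixel range.
    new_img = (image_prefix - lr_img * g_img.sign()).clamp(0, 1).detach()
    new_txt = (text_suffix_embs - lr_txt * g_txt).detach()
    return new_img, new_txt
```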
arXiv Detail & Related papers (2024-05-28T07:13:30Z)
- Latent Guard: a Safety Framework for Text-to-image Generation [64.49596711025993]
Existing safety measures are based either on text blacklists, which can be easily circumvented, or on harmful content classification.
We propose Latent Guard, a framework designed to improve safety measures in text-to-image generation.
Inspired by blacklist-based approaches, Latent Guard learns a latent space on top of the T2I model's text encoder, where it is possible to check the presence of harmful concepts.
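Conceptually, such a check can be implemented as a small learned head over the frozen T2I text encoder's embeddings, flagging prompts that land close to any blacklisted concept in that learned space. The module below is an illustrative sketch; the architecture, training objective, and threshold are assumptions rather than the paper's design.

```python
import torch
import torch.nn.functional as F

class LatentGuardStyleChecker(torch.nn.Module):
    """Hedged sketch of a latent-space concept check on top of a text encoder."""

    def __init__(self, emb_dim=768, latent_dim=128):
        super().__init__()
        # Small projection head mapping text-encoder embeddings to a shared latent space.
        self.head = torch.nn.Sequential(
            torch.nn.Linear(emb_dim, latent_dim),
            torch.nn.ReLU(),
            torch.nn.Linear(latent_dim, latent_dim),
        )

    def forward(self, prompt_emb, concept_embs, threshold=0.7):
        # prompt_emb: (emb_dim,), concept_embs: (k, emb_dim) blacklisted concepts.
        z_prompt = F.normalize(self.head(prompt_emb), dim=-1)      # (latent_dim,)
        z_concepts = F.normalize(self.head(concept_embs), dim=-1)  # (k, latent_dim)
        sims = z_concepts @ z_prompt                                # (k,)
        return bool((sims > threshold).any()), sims  # flag prompt if any concept is close
```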
arXiv Detail & Related papers (2024-04-11T17:59:52Z)
- Evaluating the Robustness of Text-to-image Diffusion Models against Real-world Attacks [22.651626059348356]
Text-to-image (T2I) diffusion models (DMs) have shown promise in generating high-quality images from textual descriptions.
One fundamental question is whether existing T2I DMs are robust against variations over input texts.
This work provides the first robustness evaluation of T2I DMs against real-world attacks.
arXiv Detail & Related papers (2023-06-16T00:43:35Z)
- Forget-Me-Not: Learning to Forget in Text-to-Image Diffusion Models [79.50701155336198]
Forget-Me-Not is designed to safely remove specified IDs, objects, or styles from a well-configured text-to-image model in as little as 30 seconds.
We demonstrate that Forget-Me-Not can effectively eliminate targeted concepts while maintaining the model's performance on other concepts.
It can also be adapted as a lightweight model patch for Stable Diffusion, allowing for concept manipulation and convenient distribution.
arXiv Detail & Related papers (2023-03-30T17:58:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.