Prompting4Debugging: Red-Teaming Text-to-Image Diffusion Models by Finding Problematic Prompts
- URL: http://arxiv.org/abs/2309.06135v2
- Date: Sat, 8 Jun 2024 05:30:26 GMT
- Title: Prompting4Debugging: Red-Teaming Text-to-Image Diffusion Models by Finding Problematic Prompts
- Authors: Zhi-Yi Chin, Chieh-Ming Jiang, Ching-Chun Huang, Pin-Yu Chen, Wei-Chen Chiu
- Abstract summary: Text-to-image diffusion models have shown remarkable ability in high-quality content generation.
This work proposes Prompting4Debugging (P4D) as a tool that automatically finds problematic prompts for diffusion models.
Our results show that around half of the prompts in existing safe-prompting benchmarks that were originally considered "safe" can actually be manipulated to bypass many deployed safety mechanisms.
- Score: 63.61248884015162
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-to-image diffusion models, e.g. Stable Diffusion (SD), have lately shown remarkable ability in high-quality content generation and have become one of the representatives of the recent wave of transformative AI. Nevertheless, such advances come with an intensifying concern about the misuse of this generative technology, especially for producing copyrighted or NSFW (i.e. not safe for work) images. Although efforts have been made to filter inappropriate images/prompts or to remove undesirable concepts/styles via model fine-tuning, the reliability of these safety mechanisms against diversified problematic prompts remains largely unexplored. In this work, we propose Prompting4Debugging (P4D) as a debugging and red-teaming tool that automatically finds problematic prompts for diffusion models in order to test the reliability of a deployed safety mechanism. We demonstrate the efficacy of our P4D tool in uncovering new vulnerabilities of SD models equipped with safety mechanisms. In particular, our results show that around half of the prompts in existing safe-prompting benchmarks that were originally considered "safe" can actually be manipulated to bypass many deployed safety mechanisms, including concept removal, negative prompts, and safety guidance. Our findings suggest that, without comprehensive testing, evaluations on limited safe-prompting benchmarks can lead to a false sense of safety for text-to-image models.
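The abstract describes P4D only at a high level. For intuition, a minimal sketch of this style of prompt optimization is given below: a learnable soft prompt is optimized so that a safety-equipped model's noise predictions imitate those of an unconstrained model conditioned on the original problematic prompt. All function and variable names are illustrative placeholders, the diffusers-style UNet call and the noise-injection step are simplifying assumptions, and the exact P4D objective may differ.

```python
import torch
import torch.nn.functional as F

def find_problematic_prompt(orig_unet, safe_unet, target_text_emb,
                            latents, timesteps, emb_dim=768,
                            n_tokens=16, steps=300, lr=1e-2):
    """Optimize a continuous (soft) prompt so that the safety-equipped UNet's
    noise predictions on it match the unconstrained UNet's predictions on the
    original problematic prompt. `timesteps` is a 1-D tensor of diffusion
    timesteps. Simplified illustration only; not the authors' code."""
    device = latents.device
    # Learnable soft-prompt embeddings fed to cross-attention.
    adv_emb = torch.randn(1, n_tokens, emb_dim, device=device, requires_grad=True)
    opt = torch.optim.Adam([adv_emb], lr=lr)

    for _ in range(steps):
        t = timesteps[torch.randint(len(timesteps), (1,), device=device)]
        noise = torch.randn_like(latents)
        noisy = latents + noise  # stand-in for the scheduler's add_noise step

        with torch.no_grad():  # teacher: unconstrained model + problematic prompt
            target = orig_unet(noisy, t, encoder_hidden_states=target_text_emb).sample
        # Student: safety-equipped model + learnable prompt.
        pred = safe_unet(noisy, t, encoder_hidden_states=adv_emb).sample

        loss = F.mse_loss(pred, target)
        opt.zero_grad()
        loss.backward()
        opt.step()

    # In practice the soft prompt would be projected back to nearby vocabulary
    # tokens to obtain a human-readable problematic prompt.
    return adv_emb
```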
Related papers
- AdvI2I: Adversarial Image Attack on Image-to-Image Diffusion models [20.37481116837779]
AdvI2I is a novel framework that manipulates input images to induce diffusion models to generate NSFW content.
By optimizing a generator to craft adversarial images, AdvI2I circumvents existing defense mechanisms.
We show that both AdvI2I and AdvI2I-Adaptive can effectively bypass current safeguards.
arXiv Detail & Related papers (2024-10-28T19:15:06Z)
- SteerDiff: Steering towards Safe Text-to-Image Diffusion Models [5.781285400461636]
Text-to-image (T2I) diffusion models can be misused to produce inappropriate content.
We introduce SteerDiff, a lightweight adaptor module designed to act as an intermediary between user input and the diffusion model.
We conduct extensive experiments across various concept unlearning tasks to evaluate the effectiveness of our approach.
arXiv Detail & Related papers (2024-10-03T17:34:55Z)
- EIUP: A Training-Free Approach to Erase Non-Compliant Concepts Conditioned on Implicit Unsafe Prompts [32.590822043053734]
Non-toxic text still carries a risk of generating non-compliant images; such prompts are referred to as implicit unsafe prompts.
We propose a simple yet effective approach that incorporates non-compliant concepts into an erasure prompt.
Our method exhibits superior erasure effectiveness while achieving high scores in image fidelity.
arXiv Detail & Related papers (2024-08-02T05:17:14Z)
- Reliable and Efficient Concept Erasure of Text-to-Image Diffusion Models [76.39651111467832]
We introduce Reliable and Efficient Concept Erasure (RECE), a novel approach that modifies the model in 3 seconds without necessitating additional fine-tuning.
To mitigate inappropriate content potentially represented by derived embeddings, RECE aligns them with harmless concepts in cross-attention layers.
The derivation and erasure of new representation embeddings are conducted iteratively to achieve a thorough erasure of inappropriate concepts.
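The summary above mentions aligning derived concept embeddings with harmless anchors inside cross-attention layers. As a rough illustration only, the sketch below shows one way such an alignment can be done in closed form for a single key/value projection matrix, via a ridge-regularized least-squares edit in the spirit of earlier closed-form concept-editing work; the actual RECE update, its embedding derivation, and its iterative schedule are not reproduced here, and all names are placeholders.

```python
import torch

def align_concepts_closed_form(W, target_embs, anchor_embs, lam=0.1):
    """Edit a cross-attention key/value projection W (out_dim x d) so that each
    target-concept embedding c_i is mapped to the output W @ a_i of a harmless
    anchor a_i, by solving  min_W' sum_i ||W' c_i - W a_i||^2 + lam ||W' - W||^2.
    Illustrative only; not the RECE paper's exact formulation."""
    d = W.shape[1]
    lhs = lam * W.clone()                      # accumulates W a_i c_i^T + lam * W
    rhs = lam * torch.eye(d, device=W.device)  # accumulates c_i c_i^T + lam * I
    for c, a in zip(target_embs, anchor_embs):
        lhs = lhs + (W @ a).unsqueeze(1) @ c.unsqueeze(0)
        rhs = rhs + c.unsqueeze(1) @ c.unsqueeze(0)
    return lhs @ torch.linalg.inv(rhs)         # closed-form edited projection
```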
arXiv Detail & Related papers (2024-07-17T08:04:28Z)
- Watch the Watcher! Backdoor Attacks on Security-Enhancing Diffusion Models [65.30406788716104]
This work investigates the vulnerabilities of security-enhancing diffusion models.
We demonstrate that these models are highly susceptible to DIFF2, a simple yet effective backdoor attack.
Case studies show that DIFF2 can significantly reduce both post-purification and certified accuracy across benchmark datasets and models.
arXiv Detail & Related papers (2024-06-14T02:39:43Z)
- Ring-A-Bell! How Reliable are Concept Removal Methods for Diffusion Models? [52.238883592674696]
Ring-A-Bell is a model-agnostic red-teaming tool for T2I diffusion models.
It identifies problematic prompts for diffusion models with the corresponding generation of inappropriate content.
Our results show that Ring-A-Bell, by manipulating safe prompting benchmarks, can transform prompts originally regarded as safe into prompts that evade existing safety mechanisms.
arXiv Detail & Related papers (2023-10-16T02:11:20Z)
- ASSERT: Automated Safety Scenario Red Teaming for Evaluating the Robustness of Large Language Models [65.79770974145983]
ASSERT, Automated Safety Scenario Red Teaming, consists of three methods -- semantically aligned augmentation, target bootstrapping, and adversarial knowledge injection.
We partition our prompts into four safety domains for a fine-grained analysis of how the domain affects model performance.
We find statistically significant performance differences of up to 11% in absolute classification accuracy among semantically related scenarios, and absolute error rates of up to 19% in zero-shot adversarial settings.
arXiv Detail & Related papers (2023-10-14T17:10:28Z)
- SurrogatePrompt: Bypassing the Safety Filter of Text-to-Image Models via Substitution [21.93748586123046]
We develop and exhibit the first prompt attacks on Midjourney, resulting in the production of abundant NSFW images.
Our framework, SurrogatePrompt, systematically generates attack prompts, utilizing large language models, image-to-text, and image-to-image modules.
Results disclose an 88% success rate in bypassing Midjourney's proprietary safety filter with our attack prompts.
arXiv Detail & Related papers (2023-09-25T13:20:15Z)
- FLIRT: Feedback Loop In-context Red Teaming [71.38594755628581]
We propose an automatic red teaming framework that evaluates a given model and exposes its vulnerabilities.
Our framework uses in-context learning in a feedback loop to red team models and trigger them into unsafe content generation.
arXiv Detail & Related papers (2023-08-08T14:03:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site.