NSFW-Classifier Guided Prompt Sanitization for Safe Text-to-Image Generation
- URL: http://arxiv.org/abs/2506.18325v1
- Date: Mon, 23 Jun 2025 06:17:30 GMT
- Title: NSFW-Classifier Guided Prompt Sanitization for Safe Text-to-Image Generation
- Authors: Yu Xie, Chengjie Zeng, Lingyun Zhang, Yanwei Fu
- Abstract summary: "Jailbreak" attacks in large language models bypass restrictions through subtle prompt modifications. PromptSan is a novel approach to detoxify harmful prompts without altering model architecture. PromptSan achieves state-of-the-art performance in reducing harmful content generation across multiple metrics.
- Score: 47.03824997129498
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rapid advancement of text-to-image (T2I) models, such as Stable Diffusion, has enhanced their capability to synthesize images from textual prompts. However, this progress also raises significant risks of misuse, including the generation of harmful content (e.g., pornography, violence, discrimination), which contradicts the ethical goals of T2I technology and hinders its sustainable development. Inspired by "jailbreak" attacks in large language models, which bypass restrictions through subtle prompt modifications, this paper proposes NSFW-Classifier Guided Prompt Sanitization (PromptSan), a novel approach to detoxify harmful prompts without altering model architecture or degrading generation capability. PromptSan includes two variants: PromptSan-Modify, which iteratively identifies and replaces harmful tokens in input prompts using text NSFW classifiers during inference, and PromptSan-Suffix, which trains an optimized suffix token sequence to neutralize harmful intent while passing both text and image NSFW classifier checks. Extensive experiments demonstrate that PromptSan achieves state-of-the-art performance in reducing harmful content generation across multiple metrics, effectively balancing safety and usability.
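To make the first variant concrete, here is a minimal Python sketch of a PromptSan-Modify-style loop as the abstract describes it: a text NSFW classifier scores the prompt, the most harmful token is located, and it is replaced until the prompt passes. The classifier callable, the leave-one-out attribution rule, and the substitute vocabulary are hypothetical stand-ins, not the paper's actual components.

```python
# Hypothetical sketch of a PromptSan-Modify-style loop: iteratively find the
# token that contributes most to a text NSFW classifier's score and replace it.
# `nsfw_score` and `SAFE_SUBSTITUTES` are illustrative stand-ins, not the
# paper's actual classifier or vocabulary.
from typing import Callable, List

SAFE_SUBSTITUTES: List[str] = ["person", "scene", "artwork"]  # placeholder pool

def sanitize_prompt(prompt: str,
                    nsfw_score: Callable[[str], float],
                    threshold: float = 0.5,
                    max_iters: int = 10) -> str:
    tokens = prompt.split()
    for _ in range(max_iters):
        base = nsfw_score(" ".join(tokens))
        if base < threshold:
            break  # prompt now passes the text NSFW check
        # Leave-one-out attribution: the token whose removal lowers the
        # score the most is treated as the most harmful one.
        drops = [base - nsfw_score(" ".join(tokens[:i] + tokens[i + 1:]))
                 for i in range(len(tokens))]
        worst = max(range(len(tokens)), key=lambda i: drops[i])
        # Swap in the substitute that yields the lowest classifier score.
        tokens[worst] = min(SAFE_SUBSTITUTES,
                            key=lambda s: nsfw_score(" ".join(
                                tokens[:worst] + [s] + tokens[worst + 1:])))
    return " ".join(tokens)
```

PromptSan-Suffix would instead optimize a fixed suffix token sequence offline against both text and image NSFW classifiers, rather than editing tokens at inference time.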
Related papers
- PromptSafe: Gated Prompt Tuning for Safe Text-to-Image Generation [30.2092299298228]
Text-to-image (T2I) models are vulnerable to producing not-safe-for-work (NSFW) content, such as violent or explicit imagery. We propose PromptSafe, a gated prompt tuning framework that combines a lightweight, text-only supervised soft embedding with an inference-time gated control network. We show that PromptSafe achieves a SOTA unsafe generation rate (2.36%) while preserving high benign fidelity.
arXiv Detail & Related papers (2025-08-02T09:09:40Z)
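A minimal PyTorch sketch of the gated idea as summarized above: a learned safe soft embedding is prepended to the prompt embedding, scaled by a lightweight gate, so benign prompts are left nearly untouched. The module shapes and the blending rule are illustrative assumptions, not PromptSafe's actual design.

```python
# Illustrative sketch of a gated soft-embedding defense in the spirit of the
# summary above; architecture and dimensions are assumptions.
import torch
import torch.nn as nn

class GatedSafeEmbedding(nn.Module):
    def __init__(self, dim: int = 768, n_safe_tokens: int = 4):
        super().__init__()
        # Trainable "safe" soft tokens prepended to the prompt embedding.
        self.safe_tokens = nn.Parameter(torch.randn(n_safe_tokens, dim) * 0.02)
        # Lightweight gate: scores how risky the pooled prompt embedding looks.
        self.gate = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(),
                                  nn.Linear(128, 1), nn.Sigmoid())

    def forward(self, prompt_emb: torch.Tensor) -> torch.Tensor:
        # prompt_emb: (batch, seq_len, dim) token embeddings from the text encoder
        risk = self.gate(prompt_emb.mean(dim=1))            # (batch, 1)
        safe = self.safe_tokens.expand(prompt_emb.size(0), -1, -1)
        # Prepend the soft tokens, scaled by the gate, so benign prompts
        # (risk ~ 0) are left nearly untouched.
        return torch.cat([risk.unsqueeze(1) * safe, prompt_emb], dim=1)
```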
- GenBreak: Red Teaming Text-to-Image Generators Using Large Language Models [65.91565607573786]
Text-to-image (T2I) models can be misused to generate harmful content, including nudity or violence. Recent research on red-teaming and adversarial attacks against T2I models has notable limitations. We propose GenBreak, a framework that fine-tunes a red-team large language model (LLM) to systematically explore underlying vulnerabilities.
arXiv Detail & Related papers (2025-06-11T09:09:12Z)
- TokenProber: Jailbreaking Text-to-image Models via Fine-grained Word Impact Analysis [19.73325740171627]
We introduce TokenProber, a method designed for sensitivity-aware differential testing. Our approach is based on the key observation that adversarial prompts often succeed by exploiting discrepancies in how T2I models and safety checkers interpret sensitive content. Our evaluation of TokenProber against 5 safety checkers on 3 popular T2I models, using 324 NSFW prompts, demonstrates its superior effectiveness.
arXiv Detail & Related papers (2025-05-11T06:32:33Z)
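The discrepancy observation suggests a simple differential harness, sketched below. Both oracle callables and the toy synonym table are hypothetical stand-ins for TokenProber's actual sensitivity-aware search.

```python
# Hypothetical differential-testing harness in the spirit of TokenProber:
# mutate a prompt word by word and flag variants that a text safety checker
# passes but whose generated image an NSFW detector still flags.
from typing import Callable, Dict, List

SYNONYMS: Dict[str, List[str]] = {"nude": ["bare", "unclothed"]}  # toy table

def differential_test(prompt: str,
                      checker_passes: Callable[[str], bool],
                      image_is_nsfw: Callable[[str], bool]) -> List[str]:
    """Return prompt variants that slip past the checker but still
    produce NSFW images (i.e., checker/model discrepancies)."""
    jailbreaks = []
    words = prompt.split()
    for i, w in enumerate(words):
        for alt in SYNONYMS.get(w, []):
            variant = " ".join(words[:i] + [alt] + words[i + 1:])
            if checker_passes(variant) and image_is_nsfw(variant):
                jailbreaks.append(variant)
    return jailbreaks
```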
- PromptGuard: Soft Prompt-Guided Unsafe Content Moderation for Text-to-Image Models [15.510489107957005]
Text-to-image (T2I) models have been shown to be vulnerable to misuse, particularly in generating not-safe-for-work (NSFW) content. We present PromptGuard, a novel content moderation technique that draws inspiration from the system prompt mechanism in large language models (LLMs) for safety alignment.
arXiv Detail & Related papers (2025-01-07T05:39:21Z)
arXiv Detail & Related papers (2025-01-07T05:39:21Z) - Safe Text-to-Image Generation: Simply Sanitize the Prompt Embedding [16.188657772178747]
We propose Embedding Sanitizer (ES), which enhances the safety of text-to-image models by sanitizing inappropriate concepts in prompt embeddings. ES is the first interpretable safe generation framework that assigns a score to each token in the prompt to indicate its potential harmfulness.
arXiv Detail & Related papers (2024-11-15T16:29:02Z)
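A minimal sketch of the per-token scoring idea, assuming a small learned scorer and a simple attenuation rule (both illustrative, not the paper's trained sanitizer):

```python
# Illustrative sketch of token-level sanitization: score each token's
# embedding for harmfulness and attenuate harmful tokens in place.
import torch
import torch.nn as nn

class TokenScoreSanitizer(nn.Module):
    def __init__(self, dim: int = 768):
        super().__init__()
        self.scorer = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(),
                                    nn.Linear(128, 1), nn.Sigmoid())

    def forward(self, emb: torch.Tensor):
        # emb: (batch, seq_len, dim) prompt embeddings
        harm = self.scorer(emb)  # (batch, seq_len, 1); 1 = harmful
        # Return the sanitized embedding (harmful tokens scaled toward zero)
        # alongside the interpretable per-token harm scores.
        return emb * (1.0 - harm), harm.squeeze(-1)
```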
- AdvI2I: Adversarial Image Attack on Image-to-Image Diffusion models [20.37481116837779]
AdvI2I is a novel framework that manipulates input images to induce diffusion models to generate NSFW content. By optimizing a generator to craft adversarial images, AdvI2I circumvents existing defense mechanisms. We show that both AdvI2I and AdvI2I-Adaptive can effectively bypass current safeguards.
arXiv Detail & Related papers (2024-10-28T19:15:06Z)
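A toy sketch of the generator-based attack idea, using a differentiable surrogate NSFW scorer in place of the full diffusion pipeline; the generator architecture, perturbation bound, and training loop are illustrative assumptions, not AdvI2I's actual method.

```python
# Toy sketch: train a small network to emit a bounded perturbation that pushes
# a differentiable surrogate NSFW scorer toward "unsafe".
import torch
import torch.nn as nn

def train_adversarial_generator(images: torch.Tensor,
                                nsfw_scorer: nn.Module,
                                epsilon: float = 8 / 255,
                                steps: int = 100) -> nn.Module:
    # Tiny conv generator; tanh keeps outputs in [-1, 1] before scaling.
    gen = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                        nn.Conv2d(16, 3, 3, padding=1), nn.Tanh())
    opt = torch.optim.Adam(gen.parameters(), lr=1e-3)
    for _ in range(steps):
        delta = epsilon * gen(images)        # bounded perturbation
        adv = (images + delta).clamp(0, 1)
        loss = -nsfw_scorer(adv).mean()      # maximize the surrogate NSFW score
        opt.zero_grad()
        loss.backward()
        opt.step()
    return gen
```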
- SAFREE: Training-Free and Adaptive Guard for Safe Text-to-Image And Video Generation [65.30207993362595]
Unlearning/editing-based methods for safe generation remove harmful concepts from models but face several challenges. We propose SAFREE, a training-free approach for safe T2I and T2V. We detect a subspace corresponding to a set of toxic concepts in the text embedding space and steer prompt embeddings away from this subspace.
arXiv Detail & Related papers (2024-10-16T17:32:23Z)
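The subspace idea lends itself to a compact sketch: span a toxic subspace from embeddings of toxic concept phrases, then project prompt embeddings onto its orthogonal complement. The embedding source and subspace rank are illustrative assumptions, not SAFREE's exact procedure.

```python
# Minimal sketch of subspace steering in the text embedding space.
import torch

def toxic_subspace_basis(concept_embs: torch.Tensor, rank: int = 8) -> torch.Tensor:
    # concept_embs: (n_concepts, dim) embeddings of toxic concept phrases;
    # rank is assumed <= n_concepts. Top right-singular vectors span the
    # toxic subspace.
    _, _, vh = torch.linalg.svd(concept_embs, full_matrices=False)
    return vh[:rank]                                   # (rank, dim), orthonormal rows

def steer_away(prompt_emb: torch.Tensor, basis: torch.Tensor) -> torch.Tensor:
    # Remove the component of the embedding that lies in the toxic subspace.
    coeffs = prompt_emb @ basis.T                      # (..., rank)
    return prompt_emb - coeffs @ basis
```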
- Dynamic Prompt Optimizing for Text-to-Image Generation [63.775458908172176]
We introduce the Prompt Auto-Editing (PAE) method to improve text-to-image generative models. We employ an online reinforcement learning strategy to explore the weights and injection time steps of each word, leading to dynamic fine-control prompts.
arXiv Detail & Related papers (2024-04-05T13:44:39Z)
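The summary implies a prompt format in which each word carries a weight and an injection window over denoising steps. Below is a hypothetical sketch of that data structure and how per-step weights would be read off it; the names and weighting rule are assumptions, not PAE's exact implementation.

```python
# Sketch of a "dynamic fine-control prompt": each word carries a weight and a
# window of denoising steps during which it is injected.
from dataclasses import dataclass
from typing import List

@dataclass
class DynamicToken:
    word: str
    weight: float      # emphasis applied to this word's embedding
    start_step: int    # first denoising step at which the word is active
    end_step: int      # last denoising step at which the word is active

def weights_at_step(tokens: List[DynamicToken], step: int) -> List[float]:
    """Effective per-word weights at one denoising step: a word contributes
    its weight only inside its injection window, and 0 outside it."""
    return [t.weight if t.start_step <= step <= t.end_step else 0.0
            for t in tokens]

# Example: emphasize "sunset" only during the early, layout-forming steps.
prompt = [DynamicToken("castle", 1.0, 0, 50), DynamicToken("sunset", 1.3, 0, 20)]
print(weights_at_step(prompt, 30))   # [1.0, 0.0]
```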
- Get What You Want, Not What You Don't: Image Content Suppression for Text-to-Image Diffusion Models [86.92711729969488]
We analyze how to manipulate the text embeddings and remove unwanted content from them, and propose two methods: the first regularizes the text embedding matrix and effectively suppresses the undesired content, while the second further suppresses unwanted content generation from the prompt and encourages the generation of desired content.
arXiv Detail & Related papers (2024-02-08T03:15:06Z)
- Prompting4Debugging: Red-Teaming Text-to-Image Diffusion Models by Finding Problematic Prompts [63.61248884015162]
Text-to-image diffusion models have shown remarkable ability in high-quality content generation. This work proposes Prompting4Debugging (P4D) as a tool that automatically finds problematic prompts for diffusion models. Our results show that around half of the prompts in existing safe-prompting benchmarks that were originally considered "safe" can actually be manipulated to bypass many deployed safety mechanisms.
arXiv Detail & Related papers (2023-09-12T11:19:36Z)