PromptSafe: Gated Prompt Tuning for Safe Text-to-Image Generation
- URL: http://arxiv.org/abs/2508.01272v1
- Date: Sat, 02 Aug 2025 09:09:40 GMT
- Title: PromptSafe: Gated Prompt Tuning for Safe Text-to-Image Generation
- Authors: Zonglei Jing, Xiao Yang, Xiaoqian Li, Siyuan Liang, Aishan Liu, Mingchuan Zhang, Xianglong Liu,
- Abstract summary: Text-to-image (T2I) models are vulnerable to producing not-safe-for-work (NSFW) content, such as violent or explicit imagery.<n>We propose PromptSafe, a gated prompt tuning framework that combines a lightweight, text-only supervised soft embedding with an inference-time gated control network.<n>We show that PromptSafe achieves a SOTA unsafe generation rate (2.36%) while preserving high benign fidelity.
- Score: 30.2092299298228
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-to-image (T2I) models have demonstrated remarkable generative capabilities but remain vulnerable to producing not-safe-for-work (NSFW) content, such as violent or explicit imagery. While recent moderation efforts have introduced soft prompt-guided tuning by appending defensive tokens to the input, these approaches often rely on large-scale curated image-text datasets and apply static, one-size-fits-all defenses at inference time. However, this results not only in high computational cost and degraded benign image quality, but also in limited adaptability to the diverse and nuanced safety requirements of real-world prompts. To address these challenges, we propose PromptSafe, a gated prompt tuning framework that combines a lightweight, text-only supervised soft embedding with an inference-time gated control network. Instead of training on expensive image-text datasets, we first rewrite unsafe prompts into semantically aligned but safe alternatives using an LLM, constructing an efficient text-only training corpus. Based on this, we optimize a universal soft prompt that repels unsafe and attracts safe embeddings during the diffusion denoising process. To avoid over-suppressing benign prompts, we introduce a gated mechanism that adaptively adjusts the defensive strength based on estimated prompt toxicity, thereby aligning defense intensity with prompt risk and ensuring strong protection for harmful inputs while preserving benign generation quality. Extensive experiments across multiple benchmarks and T2I models show that PromptSafe achieves a SOTA unsafe generation rate (2.36%), while preserving high benign fidelity. Furthermore, PromptSafe demonstrates strong generalization to unseen harmful categories, robust transferability across diffusion model architectures, and resilience under adaptive adversarial attacks, highlighting its practical value for safe and scalable deployment.
Related papers
- Secure Tug-of-War (SecTOW): Iterative Defense-Attack Training with Reinforcement Learning for Multimodal Model Security [63.41350337821108]
We propose Secure Tug-of-War (SecTOW) to enhance the security of multimodal large language models (MLLMs)<n>SecTOW consists of two modules: a defender and an auxiliary attacker, both trained iteratively using reinforcement learning (GRPO)<n>We show that SecTOW significantly improves security while preserving general performance.
arXiv Detail & Related papers (2025-07-29T17:39:48Z) - HoliSafe: Holistic Safety Benchmarking and Modeling with Safety Meta Token for Vision-Language Model [52.72318433518926]
Existing safety-tuning datasets and benchmarks only partially consider how image-text interactions can yield harmful content.<n>We introduce a holistic safety dataset and benchmark, HoliSafe, that spans all five safe/unsafe image-text combinations.<n>We propose SafeLLaVA, a novel VLM augmented with a learnable safety meta token and a dedicated safety head.
arXiv Detail & Related papers (2025-06-05T07:26:34Z) - Shape it Up! Restoring LLM Safety during Finetuning [66.46166656543761]
Finetuning large language models (LLMs) enables user-specific customization but introduces critical safety risks.<n>We propose dynamic safety shaping (DSS), a framework that uses fine-grained safety signals to reinforce learning from safe segments of a response while suppressing unsafe content.<n>We present STAR-DSS, guided by STAR scores, that robustly mitigates finetuning risks and delivers substantial safety improvements across diverse threats, datasets, and model families.
arXiv Detail & Related papers (2025-05-22T18:05:16Z) - Tit-for-Tat: Safeguarding Large Vision-Language Models Against Jailbreak Attacks via Adversarial Defense [90.71884758066042]
Large vision-language models (LVLMs) introduce a unique vulnerability: susceptibility to malicious attacks via visual inputs.<n>We propose ESIII (Embedding Security Instructions Into Images), a novel methodology for transforming the visual space from a source of vulnerability into an active defense mechanism.
arXiv Detail & Related papers (2025-03-14T17:39:45Z) - Distorting Embedding Space for Safety: A Defense Mechanism for Adversarially Robust Diffusion Models [4.5656369638728656]
Distorting Embedding Space (DES) is a text encoder-based defense mechanism.<n>DES transforms unsafe embeddings, extracted from a text encoder using unsafe prompts, toward carefully calculated safe embedding regions.<n>DES also neutralizes the nudity embedding, extracted using prompt nudity", by aligning it with neutral embedding to enhance robustness against adversarial attacks.
arXiv Detail & Related papers (2025-01-31T04:14:05Z) - PromptGuard: Soft Prompt-Guided Unsafe Content Moderation for Text-to-Image Models [15.510489107957005]
Text-to-image (T2I) models have been shown to be vulnerable to misuse, particularly in generating not-safe-for-work (NSFW) content.<n>We present PromptGuard, a novel content moderation technique that draws inspiration from the system prompt mechanism in large language models (LLMs) for safety alignment.
arXiv Detail & Related papers (2025-01-07T05:39:21Z) - Safeguarding Text-to-Image Generation via Inference-Time Prompt-Noise Optimization [29.378296359782585]
Text-to-Image (T2I) diffusion models are widely recognized for their ability to generate high-quality and diverse images based on text prompts.<n>Current efforts to prevent inappropriate image generation for T2I models are easy to bypass and vulnerable to adversarial attacks.<n>We propose a novel, training-free approach, called Prompt-Noise Optimization (PNO), to mitigate unsafe image generation.
arXiv Detail & Related papers (2024-12-05T05:12:30Z) - SAFREE: Training-Free and Adaptive Guard for Safe Text-to-Image And Video Generation [65.30207993362595]
Unlearning/editing-based methods for safe generation remove harmful concepts from models but face several challenges.<n>We propose SAFREE, a training-free approach for safe T2I and T2V.<n>We detect a subspace corresponding to a set of toxic concepts in the text embedding space and steer prompt embeddings away from this subspace.
arXiv Detail & Related papers (2024-10-16T17:32:23Z) - Towards Comprehensive Post Safety Alignment of Large Language Models via Safety Patching [74.62818936088065]
textscSafePatching is a novel framework for comprehensive PSA.<n>textscSafePatching achieves a more comprehensive PSA than baseline methods.<n>textscSafePatching demonstrates its superiority in continual PSA scenarios.
arXiv Detail & Related papers (2024-05-22T16:51:07Z) - SurrogatePrompt: Bypassing the Safety Filter of Text-to-Image Models via Substitution [21.93748586123046]
We develop and exhibit the first prompt attacks on Midjourney, resulting in the production of abundant NSFW images.
Our framework, SurrogatePrompt, systematically generates attack prompts, utilizing large language models, image-to-text, and image-to-image modules.
Results disclose an 88% success rate in bypassing Midjourney's proprietary safety filter with our attack prompts.
arXiv Detail & Related papers (2023-09-25T13:20:15Z) - Prompting4Debugging: Red-Teaming Text-to-Image Diffusion Models by Finding Problematic Prompts [63.61248884015162]
Text-to-image diffusion models have shown remarkable ability in high-quality content generation.
This work proposes Prompting4 Debugging (P4D) as a tool that automatically finds problematic prompts for diffusion models.
Our result shows that around half of prompts in existing safe prompting benchmarks which were originally considered "safe" can actually be manipulated to bypass many deployed safety mechanisms.
arXiv Detail & Related papers (2023-09-12T11:19:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.