Training-Free Safe Text Embedding Guidance for Text-to-Image Diffusion Models
- URL: http://arxiv.org/abs/2510.24012v1
- Date: Tue, 28 Oct 2025 02:37:20 GMT
- Title: Training-Free Safe Text Embedding Guidance for Text-to-Image Diffusion Models
- Authors: Byeonghu Na, Mina Kang, Jiseok Kwak, Minsang Park, Jiwoo Shin, SeJoon Jun, Gayoung Lee, Jin-Hwa Kim, Il-Chul Moon
- Abstract summary: We propose Safe Text embedding Guidance (STG), a training-free approach to improve the safety of diffusion models. STG adjusts the text embeddings based on a safety function evaluated on the expected final denoised image. Experiments on various safety scenarios, including nudity, violence, and artist-style removal, show that STG consistently outperforms both training-based and training-free baselines.
- Score: 30.63803894651171
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-to-image models have recently made significant advances in generating realistic and semantically coherent images, driven by advanced diffusion models and large-scale web-crawled datasets. However, these datasets often contain inappropriate or biased content, raising concerns about the generation of harmful outputs when provided with malicious text prompts. We propose Safe Text embedding Guidance (STG), a training-free approach to improve the safety of diffusion models by guiding the text embeddings during sampling. STG adjusts the text embeddings based on a safety function evaluated on the expected final denoised image, allowing the model to generate safer outputs without additional training. Theoretically, we show that STG aligns the underlying model distribution with safety constraints, thereby achieving safer outputs while minimally affecting generation quality. Experiments on various safety scenarios, including nudity, violence, and artist-style removal, show that STG consistently outperforms both training-based and training-free baselines in removing unsafe content while preserving the core semantic intent of input prompts. Our code is available at https://github.com/aailab-kaist/STG.
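The abstract describes the mechanism concretely enough to sketch: at each sampling step, estimate the final denoised image from the current noisy latent, score it with a safety function, and move the text embedding along the gradient of that score. Below is a minimal sketch of one such step for an epsilon-prediction diffusion model; `unet`, `safety_score`, and the step size are illustrative stand-ins, not the authors' exact implementation.

```python
import torch

def stg_step(unet, safety_score, x_t, t, text_emb, alpha_bar_t, step_size=0.1):
    """One guidance step of safety-guided text-embedding sampling (schematic).

    unet(x_t, t, emb) -> predicted noise eps (epsilon-prediction model)
    safety_score(x0)  -> scalar tensor, higher = safer (hypothetical interface)
    """
    emb = text_emb.detach().requires_grad_(True)
    eps = unet(x_t, t, emb)
    # Tweedie-style estimate of the final denoised image from the noisy x_t.
    x0_hat = (x_t - (1.0 - alpha_bar_t) ** 0.5 * eps) / alpha_bar_t ** 0.5
    score = safety_score(x0_hat)
    # Nudge the text embedding in the direction that raises the safety score.
    (grad,) = torch.autograd.grad(score, emb)
    return (emb + step_size * grad).detach()
```

The updated embedding would then condition the usual denoising update, so no model weights change, which is what makes the approach training-free.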
Related papers
- SafeRedir: Prompt Embedding Redirection for Robust Unlearning in Image Generation Models [67.84174763413178]
We introduce SafeRedir, a lightweight inference-time framework for robust unlearning via prompt embedding redirection.
We show that SafeRedir achieves effective unlearning capability, high semantic and perceptual preservation, robust image quality, and enhanced resistance to adversarial attacks.
arXiv Detail & Related papers (2026-01-13T15:01:38Z)
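The summary gives only the high-level idea, but inference-time "prompt embedding redirection" plausibly reduces to blending flagged embeddings toward a safe anchor, as in the sketch below; the classifier interface, threshold, and interpolation rule are all assumptions for illustration, not SafeRedir's actual method.

```python
def redirect_embedding(prompt_emb, safe_anchor, unsafe_prob,
                       threshold=0.5, strength=0.8):
    """Blend a flagged prompt embedding toward a safe anchor (schematic).

    unsafe_prob: output of a hypothetical prompt-safety classifier in [0, 1].
    Benign prompts pass through unchanged, preserving semantics.
    """
    if unsafe_prob < threshold:
        return prompt_emb
    w = strength * unsafe_prob  # redirect harder the more unsafe the prompt looks
    return (1.0 - w) * prompt_emb + w * safe_anchor
```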
- Value-Aligned Prompt Moderation via Zero-Shot Agentic Rewriting for Safe Image Generation [11.663809872664103]
Current defenses struggle to align outputs with human values without sacrificing generation quality or incurring high costs.
We introduce VALOR, a zero-shot agentic framework for safer and more helpful text-to-image generation.
VALOR integrates layered prompt analysis with human-aligned value reasoning.
arXiv Detail & Related papers (2025-11-12T09:52:47Z)
- SafeGuider: Robust and Practical Content Safety Control for Text-to-Image Models [74.11062256255387]
Text-to-image models are highly vulnerable to adversarial prompts, which can bypass safety measures and produce harmful content.
We introduce SafeGuider, a two-step framework designed for robust safety control without compromising generation quality.
SafeGuider demonstrates exceptional effectiveness in minimizing attack success rates, achieving a maximum rate of only 5.48% across various attack scenarios.
arXiv Detail & Related papers (2025-10-05T10:24:48Z)
- SafeCtrl: Region-Based Safety Control for Text-to-Image Diffusion via Detect-Then-Suppress [48.20360860166279]
We introduce SafeCtrl, a lightweight, non-intrusive plugin that first precisely localizes unsafe content.
Instead of performing a hard A-to-B substitution, SafeCtrl then suppresses the harmful semantics, allowing the generative process to naturally and coherently resolve into a safe, context-aware alternative.
arXiv Detail & Related papers (2025-08-16T04:28:52Z)
- PromptSafe: Gated Prompt Tuning for Safe Text-to-Image Generation [30.2092299298228]
Text-to-image (T2I) models are vulnerable to producing not-safe-for-work (NSFW) content, such as violent or explicit imagery.
We propose PromptSafe, a gated prompt tuning framework that combines a lightweight, text-only supervised soft embedding with an inference-time gated control network.
We show that PromptSafe achieves a SOTA unsafe generation rate (2.36%) while preserving high benign fidelity.
arXiv Detail & Related papers (2025-08-02T09:09:40Z)
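A gated soft embedding of the kind this summary describes could look like the following sketch: a small gate network decides how strongly to mix learned safety tokens into the prompt embedding, so benign prompts are barely touched. The module layout and interfaces are assumptions, not PromptSafe's actual architecture.

```python
import torch
import torch.nn as nn

class GatedSoftPrompt(nn.Module):
    """Gated safety soft-prompt (schematic interface)."""

    def __init__(self, emb_dim, n_soft_tokens=4):
        super().__init__()
        self.soft = nn.Parameter(torch.zeros(n_soft_tokens, emb_dim))
        self.gate = nn.Sequential(nn.Linear(emb_dim, 1), nn.Sigmoid())

    def forward(self, prompt_emb):  # prompt_emb: (seq_len, emb_dim)
        # Gate in (0, 1): how strongly the safety tokens should intervene.
        g = self.gate(prompt_emb.mean(dim=0))
        # Prepend gate-scaled safety tokens to the prompt embedding.
        return torch.cat([g * self.soft, prompt_emb], dim=0)
```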
- SC-Pro: Training-Free Framework for Defending Unsafe Image Synthesis Attack [13.799517170191919]
Recent research has shown that safety checkers are vulnerable to adversarial attacks, allowing attackers to bypass them and generate Not Safe For Work (NSFW) images.
We propose SC-Pro, a training-free framework that readily defends against adversarial attacks aimed at generating NSFW images.
arXiv Detail & Related papers (2025-01-09T16:43:21Z)
- Safety Alignment Backfires: Preventing the Re-emergence of Suppressed Concepts in Fine-tuned Text-to-Image Diffusion Models [57.16056181201623]
Fine-tuning text-to-image diffusion models can inadvertently undo safety measures, causing models to relearn harmful concepts.
We present a novel but immediate solution called Modular LoRA, which involves training Safety Low-Rank Adaptation modules separately from Fine-Tuning LoRA components.
This method effectively prevents the re-learning of harmful content without compromising the model's performance on new tasks.
arXiv Detail & Related papers (2024-11-30T04:37:38Z)
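Because LoRA updates are additive low-rank terms, training the safety module separately and re-applying it after fine-tuning is mechanically simple. A minimal sketch under standard (A, B) LoRA factorization; variable names and unit scales are illustrative:

```python
import torch

def merge_lora(base_weight, loras, scales):
    """Apply several independent LoRA modules to one base weight matrix.

    loras: list of (A, B) pairs with A (r, in) and B (out, r);
    keeping the safety LoRA separate from the fine-tuning LoRA lets it be
    re-applied after fine-tuning rather than being overwritten by it.
    """
    w = base_weight.clone()
    for (A, B), s in zip(loras, scales):
        w += s * (B @ A)  # additive low-rank update
    return w

# Usage sketch: W' = W + dW_finetune + dW_safety
# merged = merge_lora(W, [(A_ft, B_ft), (A_safe, B_safe)], scales=[1.0, 1.0])
```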
- Safety Without Semantic Disruptions: Editing-free Safe Image Generation via Context-preserving Dual Latent Reconstruction [88.18235230849554]
Training multimodal generative models on large, uncurated datasets can expose users to harmful, unsafe, controversial, or culturally inappropriate outputs.
We leverage safe embeddings and a modified diffusion process with weighted tunable summation in the latent space to generate safer images.
We identify trade-offs between safety and censorship, which presents a necessary perspective in the development of ethical AI models.
arXiv Detail & Related papers (2024-11-21T09:47:13Z)
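The "weighted tunable summation in the latent space" suggests a convex blend of an original latent and a safe latent; a one-line sketch, with the weight and variable names as assumptions:

```python
def blend_latents(z_orig, z_safe, w=0.7):
    """Weighted tunable summation of original and safe latents (schematic).

    w controls the safety/fidelity trade-off the summary alludes to:
    w=0 keeps the original generation, w=1 fully adopts the safe latent.
    """
    return (1.0 - w) * z_orig + w * z_safe
```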
- Safe Text-to-Image Generation: Simply Sanitize the Prompt Embedding [16.188657772178747]
We propose Embedding Sanitizer (ES), which enhances the safety of text-to-image models by sanitizing inappropriate concepts in prompt embeddings.
ES is the first interpretable safe-generation framework that assigns a score to each token in the prompt to indicate its potential harmfulness.
arXiv Detail & Related papers (2024-11-15T16:29:02Z)
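Given per-token harmfulness scores like those the summary describes, sanitization could be as simple as masking high-risk token embeddings with a neutral one. The hard-replacement rule below is a hypothetical illustration, not ES's actual mechanism:

```python
import torch

def sanitize_tokens(token_embs, harm_scores, neutral_emb, threshold=0.6):
    """Replace high-risk token embeddings with a neutral embedding (schematic).

    token_embs:  (seq_len, emb_dim) prompt token embeddings
    harm_scores: (seq_len,) per-token harmfulness in [0, 1] from a
                 hypothetical scorer
    """
    mask = (harm_scores > threshold).unsqueeze(-1)  # (seq_len, 1)
    return torch.where(mask, neutral_emb.expand_as(token_embs), token_embs)
```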
- ShieldDiff: Suppressing Sexual Content Generation from Diffusion Models through Reinforcement Learning [7.099258248662009]
There is a potential risk that text-to-image (T2I) models can generate unsafe images with disturbing content.
In our work, we focus on eliminating NSFW (not safe for work) content generation from T2I models.
We propose a customized reward function consisting of CLIP (Contrastive Language-Image Pre-training) and nudity rewards to suppress nude content.
arXiv Detail & Related papers (2024-10-04T19:37:56Z)
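A reward of the described shape, encouraging prompt alignment while penalizing detected nudity, could be combined additively as below; the weighting and detector interface are assumptions:

```python
import torch

def shield_reward(image_feat, text_feat, nudity_prob, lam=1.0):
    """Composite RL reward (schematic): CLIP alignment minus a nudity penalty.

    image_feat / text_feat: normalized CLIP features of the generated image
    and the prompt; nudity_prob: output of a nudity detector in [0, 1].
    """
    clip_reward = torch.cosine_similarity(image_feat, text_feat, dim=-1)
    return clip_reward - lam * nudity_prob
```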
- SteerDiff: Steering towards Safe Text-to-Image Diffusion Models [6.112695628229525]
Text-to-image (T2I) diffusion models can be misused to produce inappropriate content.
We introduce SteerDiff, a lightweight adaptor module designed to act as an intermediary between user input and the diffusion model.
We conduct extensive experiments across various concept unlearning tasks to evaluate the effectiveness of our approach.
arXiv Detail & Related papers (2024-10-03T17:34:55Z)
- Latent Guard: a Safety Framework for Text-to-image Generation [64.49596711025993]
Existing safety measures are either based on text blacklists, which can be easily circumvented, or on harmful content classification.
We propose Latent Guard, a framework designed to improve safety measures in text-to-image generation.
Inspired by blacklist-based approaches, Latent Guard learns a latent space on top of the T2I model's text encoder, where it is possible to check for the presence of harmful concepts.
arXiv Detail & Related papers (2024-04-11T17:59:52Z)
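Checking for harmful concepts in a learned embedding space reduces to similarity matching against a concept bank; a schematic check, where the projection into the learned space, the similarity measure, and the threshold are all assumptions:

```python
import torch

def contains_harmful_concept(prompt_emb, concept_embs, threshold=0.75):
    """Match a prompt embedding against known harmful concepts (schematic).

    prompt_emb:   (d,) projection of the T2I text-encoder output into the
                  learned latent space
    concept_embs: (k, d) bank of harmful-concept embeddings
    """
    sims = torch.cosine_similarity(prompt_emb.unsqueeze(0), concept_embs, dim=-1)
    return bool((sims > threshold).any())  # block the prompt if any concept matches
```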
- SafeGen: Mitigating Sexually Explicit Content Generation in Text-to-Image Models [28.23494821842336]
Text-to-image models may be tricked into generating not-safe-for-work (NSFW) content.
We present SafeGen, a framework to mitigate sexual content generation by text-to-image models.
arXiv Detail & Related papers (2024-04-10T00:26:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.