SafeCtrl: Region-Based Safety Control for Text-to-Image Diffusion via Detect-Then-Suppress
- URL: http://arxiv.org/abs/2508.11904v1
- Date: Sat, 16 Aug 2025 04:28:52 GMT
- Title: SafeCtrl: Region-Based Safety Control for Text-to-Image Diffusion via Detect-Then-Suppress
- Authors: Lingyun Zhang, Yu Xie, Yanwei Fu, Ping Chen
- Abstract summary: We introduce SafeCtrl, a lightweight, non-intrusive plugin that first precisely localizes unsafe content. Instead of performing a hard A-to-B substitution, SafeCtrl then suppresses the harmful semantics, allowing the generative process to naturally and coherently resolve into a safe, context-aware alternative.
- Score: 48.20360860166279
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The widespread deployment of text-to-image models is challenged by their potential to generate harmful content. While existing safety methods, such as prompt rewriting or model fine-tuning, provide valuable interventions, they often introduce a trade-off between safety and fidelity. Recent localization-based approaches have shown promise, yet their reliance on explicit "concept replacement" can sometimes lead to semantic incongruity. To address these limitations, we explore a more flexible detect-then-suppress paradigm. We introduce SafeCtrl, a lightweight, non-intrusive plugin that first precisely localizes unsafe content. Instead of performing a hard A-to-B substitution, SafeCtrl then suppresses the harmful semantics, allowing the generative process to naturally and coherently resolve into a safe, context-aware alternative. A key aspect of our work is a novel training strategy using Direct Preference Optimization (DPO). We leverage readily available, image-level preference data to train our module, enabling it to learn nuanced suppression behaviors and perform region-guided interventions at inference without requiring costly, pixel-level annotations. Extensive experiments show that SafeCtrl significantly outperforms state-of-the-art methods in both safety efficacy and fidelity preservation. Our findings suggest that decoupled, suppression-based control is a highly effective and scalable direction for building more responsible generative models.
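The abstract describes the mechanism but no code is given here, so the following is a minimal sketch of the detect-then-suppress idea under stated assumptions: `detect_then_suppress` and `dpo_preference_loss` are hypothetical names, per-token cross-attention maps stand in for SafeCtrl's localization module, and the DPO term omits the reference-model log-ratios of the full objective.

```python
import torch
import torch.nn.functional as F

def detect_then_suppress(attn, unsafe_idx, tau=0.5, alpha=0.1):
    """Toy detect-then-suppress step on per-token cross-attention maps.

    attn:       (tokens, H, W) attention maps, normalized over tokens.
    unsafe_idx: index of the prompt token flagged as unsafe.
    tau:        threshold turning the unsafe token's map into a region mask.
    alpha:      suppression strength (0 erases the semantics, 1 is a no-op).
    """
    region = (attn[unsafe_idx] / attn[unsafe_idx].max()) >= tau   # detect: region mask
    out = attn.clone()
    out[unsafe_idx][region] *= alpha                              # suppress, don't substitute
    # renormalize so surrounding concepts absorb the freed attention mass,
    # letting the sampler resolve the region into a coherent safe alternative
    return out / out.sum(dim=0, keepdim=True).clamp_min(1e-8)

def dpo_preference_loss(logp_safe, logp_unsafe, beta=0.1):
    # schematic image-level DPO term (reference-model log-ratios omitted):
    # prefer the suppressed (safe) generation over the unsafe one
    return -F.logsigmoid(beta * (logp_safe - logp_unsafe)).mean()

# toy usage: 8 prompt tokens attending over a 16x16 latent grid
attn = torch.rand(8, 16, 16).softmax(dim=0)
safe_attn = detect_then_suppress(attn, unsafe_idx=3)
```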
Related papers
- SafeRedir: Prompt Embedding Redirection for Robust Unlearning in Image Generation Models [67.84174763413178]
We introduce SafeRedir, a lightweight inference-time framework for robust unlearning via prompt embedding redirection. We show that SafeRedir achieves effective unlearning capability, high semantic and perceptual preservation, robust image quality, and enhanced resistance to adversarial attacks.
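A rough illustration of the prompt-embedding redirection named in the summary above (a sketch under assumptions, not SafeRedir's actual method): if a prompt embedding aligns with a bank of known unsafe directions, blend it toward a safe surrogate embedding. The bank, anchor, and threshold are all hypothetical.

```python
import torch
import torch.nn.functional as F

def redirect_embedding(prompt_emb, unsafe_bank, safe_anchor, thresh=0.3):
    """Hypothetical inference-time redirection of a prompt embedding."""
    sims = F.cosine_similarity(prompt_emb.unsqueeze(0), unsafe_bank, dim=-1)
    w = sims.max().clamp(0.0, 1.0)
    if w < thresh:
        return prompt_emb                          # benign prompt: pass through unchanged
    return (1 - w) * prompt_emb + w * safe_anchor  # steer toward the safe surrogate

emb = torch.randn(768)        # pooled text embedding (illustrative size)
bank = torch.randn(4, 768)    # directions for previously unlearned concepts
anchor = torch.randn(768)     # embedding of a safe replacement prompt
redirected = redirect_embedding(emb, bank, anchor)
```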
arXiv Detail & Related papers (2026-01-13T15:01:38Z)
- Value-Aligned Prompt Moderation via Zero-Shot Agentic Rewriting for Safe Image Generation [11.663809872664103]
Current defenses struggle to align outputs with human values without sacrificing generation quality or incurring high costs. We introduce VALOR, a zero-shot agentic framework for safer and more helpful text-to-image generation. VALOR integrates layered prompt analysis with human-aligned value reasoning.
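A minimal sketch of the zero-shot agentic pattern described above, assuming only a generic `llm` callable (str -> str); the prompts and the SAFE/UNSAFE protocol are illustrative, not VALOR's actual templates.

```python
def moderate_prompt(prompt: str, llm) -> str:
    """Two-stage analyze-then-rewrite flow in the spirit of VALOR (illustrative)."""
    verdict = llm(
        "Classify the following image-generation prompt as SAFE or UNSAFE. "
        "Answer with one word.\n\nPrompt: " + prompt)
    if verdict.strip().upper().startswith("SAFE"):
        return prompt                    # helpful: leave benign prompts alone
    # value-aligned rewrite: preserve the user's benign intent, drop harmful specifics
    return llm(
        "Rewrite this image-generation prompt so the result is safe and aligned "
        "with human values, while keeping the user's benign intent.\n\nPrompt: " + prompt)
```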
arXiv Detail & Related papers (2025-11-12T09:52:47Z)
- SafeGuider: Robust and Practical Content Safety Control for Text-to-Image Models [74.11062256255387]
Text-to-image models are highly vulnerable to adversarial prompts, which can bypass safety measures and produce harmful content. We introduce SafeGuider, a two-step framework designed for robust safety control without compromising generation quality. SafeGuider demonstrates exceptional effectiveness in minimizing attack success rates, achieving a maximum rate of only 5.48% across various attack scenarios.
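The two-step control flow can be pictured as recognize-then-guide; the sketch below is an assumption about the structure, with the scorer and the two sampler callables as placeholders rather than SafeGuider's components.

```python
import torch

class EmbeddingSafetyScorer(torch.nn.Module):
    """Toy stand-in for a first-step recognizer of unsafe/adversarial prompts."""
    def __init__(self, dim=768):
        super().__init__()
        self.head = torch.nn.Linear(dim, 1)

    def forward(self, emb):
        return torch.sigmoid(self.head(emb))   # probability the prompt is unsafe

def two_step_control(emb, scorer, sample, sample_with_safe_guidance, thresh=0.5):
    # step 1: score the prompt embedding; step 2: choose how to guide generation
    if scorer(emb).item() < thresh:
        return sample(emb)                     # benign: full-quality sampling
    return sample_with_safe_guidance(emb)      # flagged: steer sampling to safety
```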
arXiv Detail & Related papers (2025-10-05T10:24:48Z)
- PromptSafe: Gated Prompt Tuning for Safe Text-to-Image Generation [30.2092299298228]
Text-to-image (T2I) models are vulnerable to producing not-safe-for-work (NSFW) content, such as violent or explicit imagery. We propose PromptSafe, a gated prompt tuning framework that combines a lightweight, text-only supervised soft embedding with an inference-time gated control network. We show that PromptSafe achieves a SOTA unsafe generation rate (2.36%) while preserving high benign fidelity.
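Gated prompt tuning lends itself to a compact sketch: a learned safety soft embedding whose contribution is scaled by a small gate network at inference. Dimensions, module names, and the pooling choice below are assumptions, not PromptSafe's architecture.

```python
import torch

class GatedSoftPrompt(torch.nn.Module):
    """Illustrative gated prompt tuning: blend a trained safety soft embedding
    into the prompt only as strongly as the gate deems necessary."""
    def __init__(self, dim=768, n_soft=4):
        super().__init__()
        self.soft = torch.nn.Parameter(0.02 * torch.randn(n_soft, dim))
        self.gate = torch.nn.Sequential(torch.nn.Linear(dim, 1), torch.nn.Sigmoid())

    def forward(self, prompt_emb):               # prompt_emb: (seq, dim)
        g = self.gate(prompt_emb.mean(dim=0))    # scalar gate in [0, 1]
        return torch.cat([g * self.soft, prompt_emb], dim=0)

tokens = torch.randn(12, 768)                    # encoded prompt (illustrative)
conditioned = GatedSoftPrompt()(tokens)          # (16, 768): soft tokens prepended
```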
arXiv Detail & Related papers (2025-08-02T09:09:40Z)
- Shape it Up! Restoring LLM Safety during Finetuning [66.46166656543761]
Finetuning large language models (LLMs) enables user-specific customization but introduces critical safety risks. We propose dynamic safety shaping (DSS), a framework that uses fine-grained safety signals to reinforce learning from safe segments of a response while suppressing unsafe content. We present STAR-DSS, guided by STAR scores, that robustly mitigates finetuning risks and delivers substantial safety improvements across diverse threats, datasets, and model families.
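One way to picture "fine-grained safety signals" is a per-token weight on the fine-tuning loss. The sketch below is a guessed mechanism (sign-flipped weights for unsafe spans), not the actual STAR-DSS objective.

```python
import torch
import torch.nn.functional as F

def shaped_loss(logits, targets, safety_weights):
    """Schematic dynamic safety shaping: token-level cross-entropy is reweighted
    so safe segments are reinforced and unsafe segments suppressed.

    safety_weights in [-1, 1]: positive reinforces a token, negative pushes its
    likelihood down (an illustrative choice, not the paper's exact formulation).
    """
    ce = F.cross_entropy(logits.flatten(0, 1), targets.flatten(), reduction="none")
    return (safety_weights.flatten() * ce).mean()

logits = torch.randn(2, 10, 32000)       # (batch, seq, vocab)
targets = torch.randint(0, 32000, (2, 10))
weights = torch.ones(2, 10)              # all-safe response in this toy call
loss = shaped_loss(logits, targets, weights)
```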
arXiv Detail & Related papers (2025-05-22T18:05:16Z)
- Responsible Diffusion Models via Constraining Text Embeddings within Safe Regions [35.28819408507869]
Concerns have arisen regarding the potential of diffusion models to produce Not Safe for Work (NSFW) content and exhibit social biases. We propose a novel self-discovery approach to identify a semantic direction vector in the embedding space to restrict text embedding within a safe region. Our method can effectively reduce NSFW content and social bias generated by diffusion models compared to several state-of-the-art baselines.
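A hedged sketch of the "safe region" constraint: once a self-discovered unsafe direction is in hand, cap the embedding's component along it. The function name and the cap value are illustrative, not the paper's exact projection.

```python
import torch

def constrain_to_safe_region(emb, unsafe_dir, max_component=0.0):
    """Remove the excess component of a text embedding along a discovered
    unsafe semantic direction (illustrative)."""
    d = unsafe_dir / unsafe_dir.norm()
    component = emb @ d                          # signed magnitude along the direction
    excess = torch.clamp(component - max_component, min=0.0)
    return emb - excess * d                      # slide back into the safe half-space

emb = torch.randn(768)
direction = torch.randn(768)   # stand-in for a self-discovered NSFW/bias direction
safe_emb = constrain_to_safe_region(emb, direction)
```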
arXiv Detail & Related papers (2025-05-21T12:10:26Z)
- Hyperbolic Safety-Aware Vision-Language Models [44.06996781749013]
We introduce a novel approach that shifts from unlearning to an awareness paradigm by leveraging the inherent hierarchical properties of hyperbolic space. Our HySAC, Hyperbolic Safety-Aware CLIP, employs entailment loss functions to model the hierarchical and asymmetrical relations between safe and unsafe image-text pairs. Our approach not only enhances safety recognition but also establishes a more adaptable and interpretable framework for content moderation in vision-language models.
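Hyperbolic entailment is hard to compress, but a toy version conveys the geometry: the Poincaré-ball distance below is the standard formula, while the norm-ordering term (safe content nearer the origin, i.e. higher in the hierarchy) is a simplification of HySAC's entailment loss, not its actual objective.

```python
import torch

def poincare_dist(u, v, eps=1e-5):
    """Standard Poincaré-ball distance (assumes points lie inside the unit ball)."""
    sq = ((u - v) ** 2).sum(-1)
    du = 1 - (u ** 2).sum(-1).clamp(max=1 - eps)
    dv = 1 - (v ** 2).sum(-1).clamp(max=1 - eps)
    return torch.acosh(1 + 2 * sq / (du * dv) + eps)

def safety_hierarchy_loss(safe, unsafe, margin=0.1):
    """Toy surrogate for an entailment objective: pull safe/unsafe pairs close
    while keeping the safe (more generic) embedding nearer the origin."""
    norm_order = torch.relu(safe.norm(dim=-1) - unsafe.norm(dim=-1) + margin)
    return (poincare_dist(safe, unsafe) + norm_order).mean()

def scale_into_ball(x, r=0.5):
    return r * x / (1 + x.norm(dim=-1, keepdim=True))   # map anywhere -> open ball

loss = safety_hierarchy_loss(scale_into_ball(torch.randn(4, 32)),
                             scale_into_ball(torch.randn(4, 32)))
```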
arXiv Detail & Related papers (2025-03-15T13:18:04Z)
- Safety Alignment Backfires: Preventing the Re-emergence of Suppressed Concepts in Fine-tuned Text-to-Image Diffusion Models [57.16056181201623]
Fine-tuning text-to-image diffusion models can inadvertently undo safety measures, causing models to relearn harmful concepts. We present a novel but immediate solution called Modular LoRA, which involves training Safety Low-Rank Adaptation modules separately from Fine-Tuning LoRA components. This method effectively prevents the re-learning of harmful content without compromising the model's performance on new tasks.
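The Modular LoRA recipe lends itself to a compact sketch: keep a safety adapter frozen alongside the trainable fine-tuning adapter so task updates cannot silently overwrite the safety update. Class names, ranks, and dimensions are illustrative.

```python
import torch

class LoRA(torch.nn.Module):
    def __init__(self, dim, rank=4):
        super().__init__()
        self.A = torch.nn.Parameter(0.01 * torch.randn(dim, rank))
        self.B = torch.nn.Parameter(torch.zeros(rank, dim))  # zero-init: starts as a no-op

    def forward(self, x):
        return x @ self.A @ self.B

class ModularLoRALinear(torch.nn.Module):
    """Frozen base + frozen safety adapter + trainable task adapter (sketch)."""
    def __init__(self, dim=768):
        super().__init__()
        self.base = torch.nn.Linear(dim, dim)
        self.safety = LoRA(dim)   # trained separately on safety data, then frozen
        self.task = LoRA(dim)     # the only part updated during fine-tuning
        for p in list(self.base.parameters()) + list(self.safety.parameters()):
            p.requires_grad_(False)

    def forward(self, x):
        return self.base(x) + self.safety(x) + self.task(x)
```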
arXiv Detail & Related papers (2024-11-30T04:37:38Z)
- Safety Without Semantic Disruptions: Editing-free Safe Image Generation via Context-preserving Dual Latent Reconstruction [88.18235230849554]
Training multimodal generative models on large, uncurated datasets can result in users being exposed to harmful, unsafe and controversial or culturally inappropriate outputs. We leverage safe embeddings and a modified diffusion process with weighted tunable summation in the latent space to generate safer images. We identify trade-offs between safety and censorship, which presents a necessary perspective in the development of ethical AI models.
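The "weighted tunable summation in the latent space" reads as an interpolation between the original denoising trajectory and a safe-conditioned one; the sketch below assumes exactly that, with `w` exposing the safety-versus-censorship dial the summary mentions.

```python
import torch

def weighted_latent_sum(z_orig, z_safe, w=0.7):
    """Blend the original latent with a reconstruction conditioned on safe
    embeddings; larger w is safer but more censoring (illustrative)."""
    return (1 - w) * z_orig + w * z_safe

z1, z2 = torch.randn(4, 64, 64), torch.randn(4, 64, 64)
z = weighted_latent_sum(z1, z2, w=0.7)   # applied per denoising step in spirit
```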
arXiv Detail & Related papers (2024-11-21T09:47:13Z)
- Prompting4Debugging: Red-Teaming Text-to-Image Diffusion Models by Finding Problematic Prompts [63.61248884015162]
Text-to-image diffusion models have shown remarkable ability in high-quality content generation.
This work proposes Prompting4Debugging (P4D) as a tool that automatically finds problematic prompts for diffusion models.
Our results show that around half of the prompts in existing safe-prompting benchmarks, originally considered "safe", can in fact be manipulated to bypass many deployed safety mechanisms.
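In spirit, P4D searches for prompts that slip past deployed safety mechanisms. A toy hill-climbing variant (not the paper's gradient-based optimization) looks like this, with `unsafe_score` standing in for any safety checker:

```python
import random

def red_team_search(prompt, unsafe_score, vocab, iters=200, seed=0):
    """Greedy token-substitution search that keeps edits raising a checker's
    unsafe score -- a toy stand-in for P4D's prompt optimization."""
    rng = random.Random(seed)
    tokens, best = prompt.split(), unsafe_score(prompt)
    for _ in range(iters):
        i = rng.randrange(len(tokens))
        cand = tokens[:i] + [rng.choice(vocab)] + tokens[i + 1:]
        score = unsafe_score(" ".join(cand))
        if score > best:                  # keep only improving edits
            tokens, best = cand, score
    return " ".join(tokens), best
```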
arXiv Detail & Related papers (2023-09-12T11:19:36Z)