When Safety Collides: Resolving Multi-Category Harmful Conflicts in Text-to-Image Diffusion via Adaptive Safety Guidance
- URL: http://arxiv.org/abs/2602.20880v3
- Date: Wed, 04 Mar 2026 06:07:03 GMT
- Title: When Safety Collides: Resolving Multi-Category Harmful Conflicts in Text-to-Image Diffusion via Adaptive Safety Guidance
- Authors: Yongli Xiang, Ziming Hong, Zhaoqing Wang, Xiangyu Zhao, Bo Han, Tongliang Liu
- Abstract summary: Text-to-Image (T2I) diffusion models have demonstrated significant advancements in generating high-quality images. We propose Conflict-aware Adaptive Safety Guidance (CASG), a training-free framework that dynamically identifies and applies the category-aligned safety direction during generation. Experiments on T2I safety benchmarks demonstrate CASG's state-of-the-art performance, reducing the harmful rate by up to 15.4% compared to existing methods.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-to-Image (T2I) diffusion models have demonstrated significant advancements in generating high-quality images, while raising potential safety concerns regarding harmful content generation. Safety-guidance-based methods have been proposed to mitigate harmful outputs by steering generation away from harmful zones, where the zones are averaged across multiple harmful categories based on predefined keywords. However, these approaches fail to capture the complex interplay among different harm categories, leading to "harmful conflicts" where mitigating one type of harm may inadvertently amplify another, thus increasing the overall harmful rate. To address this issue, we propose Conflict-aware Adaptive Safety Guidance (CASG), a training-free framework that dynamically identifies and applies the category-aligned safety direction during generation. CASG is composed of two components: (i) Conflict-aware Category Identification (CaCI), which identifies the harmful category most aligned with the model's evolving generative state, and (ii) Conflict-resolving Guidance Application (CrGA), which applies safety steering solely along the identified category to avoid multi-category interference. CASG can be applied to both latent-space and text-space safeguards. Experiments on T2I safety benchmarks demonstrate CASG's state-of-the-art performance, reducing the harmful rate by up to 15.4% compared to existing methods.
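To make the two components concrete, here is a minimal sketch of how CASG-style guidance could look inside a classifier-free-guidance denoising loop. Everything here is an assumption for illustration, not the authors' implementation: `unet` stands for a CFG-style denoiser, `uncond_emb` for an empty-prompt embedding, and `category_embs` for one precomputed negative-prompt embedding per harm category.

```python
# A minimal sketch of CASG-style guidance (hypothetical, not the authors'
# code). `unet`, `uncond_emb`, and `category_embs` are assumed handles onto
# a CFG-style denoiser, an empty-prompt embedding, and one negative-prompt
# embedding per harm category.
import torch
import torch.nn.functional as F

def casg_eps(unet, latent, t, cond_emb, uncond_emb, category_embs,
             cfg_scale=7.5, safety_scale=2.0):
    """One noise prediction with conflict-aware adaptive safety guidance."""
    eps_u = unet(latent, t, uncond_emb)   # unconditional noise estimate
    eps_c = unet(latent, t, cond_emb)     # prompt-conditioned estimate

    # CaCI (sketch): pick the harm category whose "harmful direction" is
    # most aligned with the model's current generative update.
    gen_dir = (eps_c - eps_u).flatten()
    eps_harm = {k: unet(latent, t, e) for k, e in category_embs.items()}
    top = max(eps_harm, key=lambda k: F.cosine_similarity(
        gen_dir, (eps_harm[k] - eps_u).flatten(), dim=0))

    # CrGA (sketch): steer away only along the identified category's
    # direction, instead of averaging over all categories.
    return (eps_u
            + cfg_scale * (eps_c - eps_u)
            - safety_scale * (eps_harm[top] - eps_u))
```

The key departure from averaged safety guidance is the argmax in CaCI: only the single most-aligned category contributes to the steering term at each step, so a push away from one harm category cannot cancel against, or amplify, another.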
Related papers
- HERS: Hidden-Pattern Expert Learning for Risk-Specific Vehicle Damage Adaptation in Diffusion Models [0.0]
Recent advances in text-to-image (T2I) diffusion models have enabled increasingly realistic synthesis of vehicle damage. The ability to generate crash-like imagery challenges the boundary between authentic and synthetic data. We propose HERS, a framework designed to improve fidelity, controllability, and domain alignment of diffusion-generated damage images.
arXiv Detail & Related papers (2026-01-29T10:30:07Z)
- SafeThinker: Reasoning about Risk to Deepen Safety Beyond Shallow Alignment [43.86865924673546]
We propose SafeThinker, an adaptive framework that allocates defensive resources via a lightweight gateway classifier. Experiments show that SafeThinker significantly lowers attack success rates across diverse jailbreak strategies without compromising robustness.
arXiv Detail & Related papers (2026-01-23T07:12:53Z)
- SafeRedir: Prompt Embedding Redirection for Robust Unlearning in Image Generation Models [67.84174763413178]
We introduce SafeRedir, a lightweight inference-time framework for robust unlearning via prompt embedding redirection. We show that SafeRedir achieves effective unlearning, high semantic and perceptual preservation, robust image quality, and enhanced resistance to adversarial attacks.
arXiv Detail & Related papers (2026-01-13T15:01:38Z)
- Self-Guided Defense: Adaptive Safety Alignment for Reasoning Models via Synthesized Guidelines [31.031589383127677]
This paper introduces the Synthesized Guideline-based Adaptive Safety Alignment (SGASA) framework. It internalizes model-generated safety guidelines to strengthen robustness against adversarial prompts. Experiments across multiple datasets demonstrate that SGASA significantly improves model safety, validating its adaptive and scalable effectiveness.
arXiv Detail & Related papers (2025-11-26T09:44:32Z)
- SafeGuider: Robust and Practical Content Safety Control for Text-to-Image Models [74.11062256255387]
Text-to-image models are highly vulnerable to adversarial prompts, which can bypass safety measures and produce harmful content. We introduce SafeGuider, a two-step framework designed for robust safety control without compromising generation quality. SafeGuider demonstrates exceptional effectiveness in minimizing attack success rates, achieving a maximum rate of only 5.48% across various attack scenarios.
arXiv Detail & Related papers (2025-10-05T10:24:48Z)
- Shape it Up! Restoring LLM Safety during Finetuning [65.75757313781104]
Finetuning large language models (LLMs) enables user-specific customization but introduces critical safety risks. We propose dynamic safety shaping (DSS), a framework that uses fine-grained safety signals to reinforce learning from safe segments of a response while suppressing unsafe content. We present STAR-DSS, guided by STAR scores, which robustly mitigates finetuning risks and delivers substantial safety improvements across diverse threats, datasets, and model families.
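A toy sketch of the shaping idea follows (assumed form, not the actual STAR-DSS objective): per-token likelihood is reinforced on segments a safety signal marks as safe, while an unlikelihood term pushes down the probability of unsafe tokens. The `safety_scores` signal here is hypothetical.

```python
# Hypothetical sketch of a dynamic-safety-shaping style loss; the weighting
# scheme is an assumption, not the STAR-DSS implementation.
import torch
import torch.nn.functional as F

def shaped_loss(logits, labels, safety_scores):
    """logits: (B, T, V); labels: (B, T); safety_scores: (B, T) in [0, 1],
    with 1 = safe token, 0 = unsafe token."""
    logp = F.log_softmax(logits, dim=-1)
    tok_logp = logp.gather(-1, labels.unsqueeze(-1)).squeeze(-1)  # (B, T)
    p = tok_logp.exp()
    safe_term = -safety_scores * tok_logp            # reinforce safe segments
    unsafe_term = -(1.0 - safety_scores) * torch.log1p(
        -p.clamp(max=1 - 1e-6))                      # suppress unsafe tokens
    return (safe_term + unsafe_term).mean()
```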
arXiv Detail & Related papers (2025-05-22T18:05:16Z)
- Detect-and-Guide: Self-regulation of Diffusion Models for Safe Text-to-Image Generation via Guideline Token Optimization [22.225141381422873]
There is a growing concern about text-to-image diffusion models creating harmful content. Post-hoc model intervention techniques, such as concept unlearning and safety guidance, have been developed to mitigate these risks. We propose the safe generation framework Detect-and-Guide (DAG) to perform self-diagnosis and fine-grained self-regulation. DAG achieves state-of-the-art safe generation performance, balancing harmfulness mitigation and text-following performance on real-world prompts.
arXiv Detail & Related papers (2025-03-19T13:37:52Z)
- SafeCFG: Controlling Harmful Features with Dynamic Safe Guidance for Safe Generation [125.0706666755989]
Diffusion models (DMs) have demonstrated exceptional performance in text-to-image tasks. However, they can be used to generate more harmful images by maliciously guiding the image generation process through CFG. We propose SafeCFG to adaptively control harmful features with dynamic safe guidance.
arXiv Detail & Related papers (2024-12-20T16:40:11Z)
- SafetyAnalyst: Interpretable, Transparent, and Steerable Safety Moderation for AI Behavior [56.10557932893919]
We present SafetyAnalyst, a novel AI safety moderation framework. Given an AI behavior, SafetyAnalyst uses chain-of-thought reasoning to analyze its potential consequences. It aggregates effects into a harmfulness score using 28 fully interpretable weight parameters.
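A toy illustration of what such an interpretable aggregation could look like follows; the effect names and weights below are invented, and the paper's actual scorer uses 28 learned weight parameters over its own effect taxonomy.

```python
# Toy illustration (invented effect names and weights), in the spirit of an
# interpretable weighted aggregation of analyzed effects.
def harmfulness_score(effect_severities: dict[str, float],
                      weights: dict[str, float]) -> float:
    """Weighted sum of per-effect severities; every weight is inspectable
    and steerable, unlike an opaque end-to-end classifier score."""
    return sum(weights[e] * s for e, s in effect_severities.items())

# Example with hypothetical effects:
score = harmfulness_score(
    {"physical_harm": 0.9, "privacy_violation": 0.1},
    {"physical_harm": 1.0, "privacy_violation": 0.6},
)
```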
arXiv Detail & Related papers (2024-10-22T03:38:37Z) - Exploring Robustness of Unsupervised Domain Adaptation in Semantic
Segmentation [74.05906222376608]
We propose adversarial self-supervision UDA (ASSUDA), which maximizes the agreement between clean images and their adversarial examples via a contrastive loss in the output space (a sketch of this objective appears below, after the citation line).
This paper is rooted in two observations: (i) the robustness of UDA methods in semantic segmentation remains unexplored, which poses a security concern in this field; and (ii) although commonly used self-supervision (e.g., rotation and jigsaw) benefits image tasks such as classification and recognition, it fails to provide the critical supervision signals needed to learn discriminative representations for segmentation.
arXiv Detail & Related papers (2021-05-23T01:50:44Z)
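As referenced above, here is a hedged sketch of an ASSUDA-style output-space contrastive objective. The batch-level InfoNCE form and the pooling below are assumptions; the paper's formulation may pair predictions at a finer granularity.

```python
# Hedged sketch (assumed form) of a contrastive loss in the output space:
# each clean/adversarial pair is a positive, other batch items are negatives.
import torch
import torch.nn.functional as F

def output_space_contrastive_loss(pred_clean, pred_adv, temperature=0.1):
    """pred_clean, pred_adv: (B, C, H, W) segmentation logits for clean
    images and their adversarial examples."""
    z1 = F.normalize(pred_clean.flatten(1), dim=1)   # (B, D) pooled outputs
    z2 = F.normalize(pred_adv.flatten(1), dim=1)
    logits = z1 @ z2.t() / temperature               # pairwise similarities
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)          # InfoNCE over the batch
```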