Prompt-Based Safety Guidance Is Ineffective for Unlearned Text-to-Image Diffusion Models
- URL: http://arxiv.org/abs/2511.04834v2
- Date: Thu, 13 Nov 2025 01:09:10 GMT
- Title: Prompt-Based Safety Guidance Is Ineffective for Unlearned Text-to-Image Diffusion Models
- Authors: Jiwoo Shin, Byeonghu Na, Mina Kang, Wonhyeok Choi, Il-Chul Moon,
- Abstract summary: Text-to-image generative models can produce harmful content when provided with malicious input text prompts.<n>To address this issue, two main approaches have emerged: fine-tuning the model to unlearn harmful concepts and training-free guidance methods that leverage negative prompts.<n>We propose a conceptually simple yet experimentally robust method: replacing the negative prompts used in training-free methods with implicit negative embeddings obtained through concept inversion.
- Score: 25.506755988062206
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in text-to-image generative models have raised concerns about their potential to produce harmful content when provided with malicious input text prompts. To address this issue, two main approaches have emerged: (1) fine-tuning the model to unlearn harmful concepts and (2) training-free guidance methods that leverage negative prompts. However, we observe that combining these two orthogonal approaches often leads to marginal or even degraded defense performance. This observation indicates a critical incompatibility between two paradigms, which hinders their combined effectiveness. In this work, we address this issue by proposing a conceptually simple yet experimentally robust method: replacing the negative prompts used in training-free methods with implicit negative embeddings obtained through concept inversion. Our method requires no modification to either approach and can be easily integrated into existing pipelines. We experimentally validate its effectiveness on nudity and violence benchmarks, demonstrating consistent improvements in defense success rate while preserving the core semantics of input prompts.
Related papers
- Erosion Attack for Adversarial Training to Enhance Semantic Segmentation Robustness [43.63509019035562]
We propose EroSeg-AT, a vulnerability-aware adversarial training framework that leverages EroSeg to generate adversarial examples.<n>EroSeg first selects sensitive pixels based on pixel-level confidence and then progressively propagates perturbations to higher-confidence pixels, effectively disrupting the semantic consistency of the samples.
arXiv Detail & Related papers (2026-01-21T12:52:09Z) - Continual Unlearning for Text-to-Image Diffusion Models: A Regularization Perspective [35.50502807526103]
We present the first systematic study of continual unlearning in text-to-image diffusion models.<n>We show that popular unlearning methods suffer from rapid utility collapse after only a few requests.<n>We propose a gradient-projection method that constrains parameter drift to their subspace.
arXiv Detail & Related papers (2025-11-11T08:33:16Z) - Modeling the Attack: Detecting AI-Generated Text by Quantifying Adversarial Perturbations [2.7620215077666557]
Modern detectors are notoriously vulnerable to adversarial attacks, with paraphrasing standing out as an effective evasion technique.<n>This paper presents a comparative study of adversarial robustness, first by quantifying the limitations of standard adversarial training.<n>We then introduce a novel, significantly more resilient detection framework: Perturbation-Invariant Feature Engineering.
arXiv Detail & Related papers (2025-09-22T13:03:53Z) - Paying Alignment Tax with Contrastive Learning [6.232983467016873]
Current debiasing approaches often result in a degradation in model capabilities such as factual accuracy and knowledge retention.<n>We propose a contrastive learning framework that learns through carefully constructed positive and negative examples.
arXiv Detail & Related papers (2025-05-25T21:26:18Z) - TRCE: Towards Reliable Malicious Concept Erasure in Text-to-Image Diffusion Models [53.937498564603054]
Recent advances in text-to-image diffusion models enable photorealistic image generation, but they also risk producing malicious content, such as NSFW images.<n>To mitigate risk, concept erasure methods are studied to facilitate the model to unlearn specific concepts.<n>We propose TRCE, using a two-stage concept erasure strategy to achieve an effective trade-off between reliable erasure and knowledge preservation.
arXiv Detail & Related papers (2025-03-10T14:37:53Z) - AdvAnchor: Enhancing Diffusion Model Unlearning with Adversarial Anchors [61.007590285263376]
Security concerns have driven researchers to unlearn inappropriate concepts through fine-tuning.<n>Recent fine-tuning methods exhibit a considerable performance trade-off between eliminating undesirable concepts and preserving other concepts.<n>We propose AdvAnchor, a novel approach that generates adversarial anchors to alleviate the trade-off issue.
arXiv Detail & Related papers (2024-12-28T04:44:07Z) - Six-CD: Benchmarking Concept Removals for Benign Text-to-image Diffusion Models [58.74606272936636]
Text-to-image (T2I) diffusion models have shown exceptional capabilities in generating images that closely correspond to textual prompts.<n>The models could be exploited for malicious purposes, such as generating images with violence or nudity, or creating unauthorized portraits of public figures in inappropriate contexts.<n> concept removal methods have been proposed to modify diffusion models to prevent the generation of malicious and unwanted concepts.
arXiv Detail & Related papers (2024-06-21T03:58:44Z) - ADDMU: Detection of Far-Boundary Adversarial Examples with Data and
Model Uncertainty Estimation [125.52743832477404]
Adversarial Examples Detection (AED) is a crucial defense technique against adversarial attacks.
We propose a new technique, textbfADDMU, which combines two types of uncertainty estimation for both regular and FB adversarial example detection.
Our new method outperforms previous methods by 3.6 and 6.0 emphAUC points under each scenario.
arXiv Detail & Related papers (2022-10-22T09:11:12Z) - Improving White-box Robustness of Pre-processing Defenses via Joint Adversarial Training [106.34722726264522]
A range of adversarial defense techniques have been proposed to mitigate the interference of adversarial noise.
Pre-processing methods may suffer from the robustness degradation effect.
A potential cause of this negative effect is that adversarial training examples are static and independent to the pre-processing model.
We propose a method called Joint Adversarial Training based Pre-processing (JATP) defense.
arXiv Detail & Related papers (2021-06-10T01:45:32Z) - Binary Classification from Positive Data with Skewed Confidence [85.18941440826309]
Positive-confidence (Pconf) classification is a promising weakly-supervised learning method.
In practice, the confidence may be skewed by bias arising in an annotation process.
We introduce the parameterized model of the skewed confidence, and propose the method for selecting the hyper parameter.
arXiv Detail & Related papers (2020-01-29T00:04:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.