Responsible Diffusion Models via Constraining Text Embeddings within Safe Regions
- URL: http://arxiv.org/abs/2505.15427v1
- Date: Wed, 21 May 2025 12:10:26 GMT
- Title: Responsible Diffusion Models via Constraining Text Embeddings within Safe Regions
- Authors: Zhiwen Li, Die Chen, Mingyuan Fan, Cen Chen, Yaliang Li, Yanhao Wang, Wenmeng Zhou
- Abstract summary: Concerns have also arisen regarding their potential to produce Not Safe for Work (NSFW) content and exhibit social biases. We propose a novel self-discovery approach to identify a semantic direction vector in the embedding space to restrict text embedding within a safe region. Our method can effectively reduce NSFW content and social bias generated by diffusion models compared to several state-of-the-art baselines.
- Score: 35.28819408507869
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The remarkable ability of diffusion models to generate high-fidelity images has led to their widespread adoption. However, concerns have also arisen regarding their potential to produce Not Safe for Work (NSFW) content and exhibit social biases, hindering their practical use in real-world applications. In response to this challenge, prior work has focused on employing security filters to identify and exclude toxic text, or alternatively, fine-tuning pre-trained diffusion models to erase sensitive concepts. Unfortunately, existing methods struggle to achieve satisfactory performance in the sense that they can have a significant impact on the normal model output while still failing to prevent the generation of harmful content in some cases. In this paper, we propose a novel self-discovery approach to identifying a semantic direction vector in the embedding space to restrict text embedding within a safe region. Our method circumvents the need for correcting individual words within the input text and steers the entire text prompt towards a safe region in the embedding space, thereby enhancing model robustness against all possibly unsafe prompts. In addition, we employ Low-Rank Adaptation (LoRA) for semantic direction vector initialization to reduce the impact on the model performance for other semantics. Furthermore, our method can also be integrated with existing methods to improve their social responsibility. Extensive experiments on benchmark datasets demonstrate that our method can effectively reduce NSFW content and mitigate social bias generated by diffusion models compared to several state-of-the-art baselines.
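As a rough, hypothetical sketch of the core idea in the abstract (keeping a prompt embedding inside a safe region by limiting its component along a learned semantic direction), the PyTorch snippet below may help; the function name `constrain_embedding`, the budget `max_component`, and the shapes are illustrative assumptions rather than the paper's implementation, and the self-discovery of the direction vector and its LoRA-based initialization are not shown.
```python
import torch
import torch.nn.functional as F

def constrain_embedding(text_emb: torch.Tensor,
                        unsafe_direction: torch.Tensor,
                        max_component: float = 0.1) -> torch.Tensor:
    """Limit a prompt embedding's component along an unsafe semantic direction.

    Hypothetical sketch: `unsafe_direction` stands in for the self-discovered
    semantic direction vector, and `max_component` is an illustrative budget,
    not a value taken from the paper.
    """
    d = F.normalize(unsafe_direction, dim=-1)              # unit direction vector
    component = (text_emb * d).sum(dim=-1, keepdim=True)   # signed projection onto d
    excess = torch.clamp(component - max_component, min=0.0)
    return text_emb - excess * d                            # steer back into the safe region

# Usage with illustrative shapes: a single 768-d prompt embedding.
emb = torch.randn(1, 768)
direction = torch.randn(768)
safe_emb = constrain_embedding(emb, direction)
```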
Related papers
- Circumventing Safety Alignment in Large Language Models Through Embedding Space Toxicity Attenuation [13.971909819796762]
Large Language Models (LLMs) have achieved remarkable success across domains such as healthcare, education, and cybersecurity. Embedding space poisoning is a subtle attack vector where adversaries manipulate the internal semantic representations of input data to bypass safety alignment mechanisms. We propose ETTA, a novel framework that identifies and attenuates toxicity-sensitive dimensions in embedding space via linear transformations.
arXiv Detail & Related papers (2025-07-08T03:01:00Z)
- Comprehensive Evaluation and Analysis for NSFW Concept Erasure in Text-to-Image Diffusion Models [35.41653420113366]
Strong generalization capabilities of diffusion models can inadvertently lead to the generation of not-safe-for-work (NSFW) content. We introduce a full-pipeline toolkit specifically designed for concept erasure and conduct the first systematic study of NSFW concept erasure methods.
arXiv Detail & Related papers (2025-05-21T12:31:45Z)
- Detect-and-Guide: Self-regulation of Diffusion Models for Safe Text-to-Image Generation via Guideline Token Optimization [22.225141381422873]
There is a growing concern about text-to-image diffusion models creating harmful content. Post-hoc model intervention techniques, such as concept unlearning and safety guidance, have been developed to mitigate these risks. We propose the safe generation framework Detect-and-Guide (DAG) to perform self-diagnosis and fine-grained self-regulation. DAG achieves state-of-the-art safe generation performance, balancing harmfulness mitigation and text-following performance on real-world prompts.
arXiv Detail & Related papers (2025-03-19T13:37:52Z)
- Safety Alignment Backfires: Preventing the Re-emergence of Suppressed Concepts in Fine-tuned Text-to-Image Diffusion Models [57.16056181201623]
Fine-tuning text-to-image diffusion models can inadvertently undo safety measures, causing models to relearn harmful concepts. We present a novel but immediate solution called Modular LoRA, which involves training Safety Low-Rank Adaptation modules separately from Fine-Tuning LoRA components. This method effectively prevents the re-learning of harmful content without compromising the model's performance on new tasks.
arXiv Detail & Related papers (2024-11-30T04:37:38Z)
- Safety Without Semantic Disruptions: Editing-free Safe Image Generation via Context-preserving Dual Latent Reconstruction [88.18235230849554]
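As a toy reading of the modular idea named in this summary (a safety LoRA trained and stored separately from a task fine-tuning LoRA, with both applied on top of frozen base weights), here is a short sketch; the class `LoRADelta`, the merge rule, and the scales are assumptions for illustration, not the paper's recipe.
```python
import torch
import torch.nn as nn

class LoRADelta(nn.Module):
    """Low-rank weight update (delta W = B @ A) kept separate from the base weight."""
    def __init__(self, in_dim: int, out_dim: int, rank: int = 4):
        super().__init__()
        self.A = nn.Parameter(torch.randn(rank, in_dim) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_dim, rank))

    def delta(self) -> torch.Tensor:
        return self.B @ self.A

def merged_weight(base_w: torch.Tensor,
                  safety_lora: LoRADelta,
                  task_lora: LoRADelta,
                  safety_scale: float = 1.0,
                  task_scale: float = 1.0) -> torch.Tensor:
    # The safety module stays frozen and is re-attached alongside any new task LoRA,
    # so fine-tuning on a new task does not overwrite the safety update.
    return base_w + safety_scale * safety_lora.delta() + task_scale * task_lora.delta()

base = torch.randn(320, 768)  # e.g. one attention projection matrix (illustrative)
w = merged_weight(base, LoRADelta(768, 320), LoRADelta(768, 320))
```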
Training multimodal generative models on large, uncurated datasets can result in users being exposed to harmful, unsafe, controversial, or culturally inappropriate outputs. We leverage safe embeddings and a modified diffusion process with weighted tunable summation in the latent space to generate safer images. We identify trade-offs between safety and censorship, which presents a necessary perspective in the development of ethical AI models.
arXiv Detail & Related papers (2024-11-21T09:47:13Z)
- Transferable Adversarial Attacks on SAM and Its Downstream Models [87.23908485521439]
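One plausible reading of "weighted tunable summation in the latent space" is a simple blend of the original latent with a reconstructed safe latent; the sketch below encodes only that reading, with `weight` and the latent shapes as placeholders rather than the paper's formulation.
```python
import torch

def blend_latents(original_latent: torch.Tensor,
                  safe_latent: torch.Tensor,
                  weight: float = 0.7) -> torch.Tensor:
    """Weighted, tunable summation of two latents (one reading of the summary).

    `weight` trades off fidelity to the original generation against safety;
    0.7 is a placeholder, not a value recommended by the paper.
    """
    return (1.0 - weight) * original_latent + weight * safe_latent

# Stable-Diffusion-style latent shape (batch, channels, height, width), for illustration.
z_orig = torch.randn(1, 4, 64, 64)
z_safe = torch.randn(1, 4, 64, 64)
z_out = blend_latents(z_orig, z_safe)
```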
This paper explores the feasibility of adversarially attacking various downstream models fine-tuned from the Segment Anything Model (SAM). To enhance the effectiveness of adversarial attacks on models fine-tuned on unknown datasets, we propose a universal meta-initialization (UMI) algorithm.
arXiv Detail & Related papers (2024-10-26T15:04:04Z)
- SteerDiff: Steering towards Safe Text-to-Image Diffusion Models [5.781285400461636]
Text-to-image (T2I) diffusion models can be misused to produce inappropriate content.
We introduce SteerDiff, a lightweight adaptor module designed to act as an intermediary between user input and the diffusion model.
We conduct extensive experiments across various concept unlearning tasks to evaluate the effectiveness of our approach.
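The summary above describes a lightweight adaptor sitting between the user's prompt and the diffusion model; a minimal stand-in for such a module is a small residual MLP over prompt embeddings, sketched below. The class `PromptAdaptor`, its dimensions, and the residual design are assumptions, not SteerDiff's actual architecture or training objective.
```python
import torch
import torch.nn as nn

class PromptAdaptor(nn.Module):
    """Lightweight residual adaptor applied to prompt embeddings before the diffusion model.

    Hypothetical stand-in for an intermediary module of the kind the summary
    describes; SteerDiff's actual architecture and training objective differ.
    """
    def __init__(self, dim: int = 768, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, prompt_emb: torch.Tensor) -> torch.Tensor:
        # Residual form: benign prompts pass through nearly unchanged,
        # while the learned correction nudges unsafe ones.
        return prompt_emb + self.net(prompt_emb)

adaptor = PromptAdaptor()
emb = torch.randn(1, 77, 768)   # CLIP-style token embeddings for one prompt
steered = adaptor(emb)          # fed to the diffusion model in place of emb
```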
arXiv Detail & Related papers (2024-10-03T17:34:55Z)
- Adversarial Robustification via Text-to-Image Diffusion Models [56.37291240867549]
Adversarial robustness has conventionally been regarded as a challenging property to encode into neural networks.
We develop a scalable and model-agnostic solution to achieve adversarial robustness without using any data.
arXiv Detail & Related papers (2024-07-26T10:49:14Z)
- Ring-A-Bell! How Reliable are Concept Removal Methods for Diffusion Models? [52.238883592674696]
Ring-A-Bell is a model-agnostic red-teaming tool for T2I diffusion models.
It identifies problematic prompts for diffusion models with the corresponding generation of inappropriate content.
Our results show that Ring-A-Bell, by manipulating safe prompting benchmarks, can transform prompts originally regarded as safe into ones that evade existing safety mechanisms.
arXiv Detail & Related papers (2023-10-16T02:11:20Z)
- SafeDiffuser: Safe Planning with Diffusion Probabilistic Models [97.80042457099718]
Diffusion model-based approaches have shown promise in data-driven planning, but there are no safety guarantees.
We propose a new method, called SafeDiffuser, to ensure diffusion probabilistic models satisfy specifications.
We test our method on a series of safe planning tasks, including maze path generation, legged robot locomotion, and 3D space manipulation.
arXiv Detail & Related papers (2023-05-31T19:38:12Z)