Comprehensive Evaluation and Analysis for NSFW Concept Erasure in Text-to-Image Diffusion Models
- URL: http://arxiv.org/abs/2505.15450v1
- Date: Wed, 21 May 2025 12:31:45 GMT
- Title: Comprehensive Evaluation and Analysis for NSFW Concept Erasure in Text-to-Image Diffusion Models
- Authors: Die Chen, Zhiwen Li, Cen Chen, Yuexiang Xie, Xiaodan Li, Jinyan Ye, Yingda Chen, Yaliang Li,
- Abstract summary: Strong generalization capabilities of diffusion models can inadvertently lead to the generation of not-safe-for-work (NSFW) content.<n>We introduce a full-pipeline toolkit specifically designed for concept erasure and conduct the first systematic study of NSFW concept erasure methods.
- Score: 35.41653420113366
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-to-image diffusion models have gained widespread application across various domains, demonstrating remarkable creative potential. However, the strong generalization capabilities of diffusion models can inadvertently lead to the generation of not-safe-for-work (NSFW) content, posing significant risks to their safe deployment. While several concept erasure methods have been proposed to mitigate the issue associated with NSFW content, a comprehensive evaluation of their effectiveness across various scenarios remains absent. To bridge this gap, we introduce a full-pipeline toolkit specifically designed for concept erasure and conduct the first systematic study of NSFW concept erasure methods. By examining the interplay between the underlying mechanisms and empirical observations, we provide in-depth insights and practical guidance for the effective application of concept erasure methods in various real-world scenarios, with the aim of advancing the understanding of content safety in diffusion models and establishing a solid foundation for future research and development in this critical area.
Related papers
- When Are Concepts Erased From Diffusion Models? [44.89615668122767]
Concept erasure is the ability to selectively prevent a model from generating specific concepts.<n>We propose two conceptual models for the erasure mechanism in diffusion models.<n>To thoroughly assess whether a concept has been truly erased from the model, we introduce a suite of independent evaluations.
arXiv Detail & Related papers (2025-05-22T17:59:09Z) - Responsible Diffusion Models via Constraining Text Embeddings within Safe Regions [35.28819408507869]
Concerns have also arisen regarding their potential to produce Not Safe for Work (NSFW) content and exhibit social biases.<n>We propose a novel self-discovery approach to identify a semantic direction vector in the embedding space to restrict text embedding within a safe region.<n>Our method can effectively reduce NSFW content and social bias generated by diffusion models compared to several state-of-the-art baselines.
arXiv Detail & Related papers (2025-05-21T12:10:26Z) - Detect-and-Guide: Self-regulation of Diffusion Models for Safe Text-to-Image Generation via Guideline Token Optimization [22.225141381422873]
There is a growing concern about text-to-image diffusion models creating harmful content.<n>Post-hoc model intervention techniques, such as concept unlearning and safety guidance, have been developed to mitigate these risks.<n>We propose the safe generation framework Detect-and-Guide (DAG) to perform self-diagnosis and fine-interpret self-regulation.<n>DAG achieves state-of-the-art safe generation performance, balancing harmfulness mitigation and text-following performance on real-world prompts.
arXiv Detail & Related papers (2025-03-19T13:37:52Z) - Comprehensive Assessment and Analysis for NSFW Content Erasure in Text-to-Image Diffusion Models [16.60455968933097]
Concept erasure methods can inadvertently generate NSFW content even with efforts on filtering NSFW content from the training dataset.<n>We present the first systematic investigation of concept erasure methods for NSFW content and its sub-themes in text-to-image diffusion models.<n>We provide a holistic evaluation of 11 state-of-the-art baseline methods with 14 variants.
arXiv Detail & Related papers (2025-02-18T04:25:42Z) - On the Fairness, Diversity and Reliability of Text-to-Image Generative Models [49.60774626839712]
multimodal generative models have sparked critical discussions on their fairness, reliability, and potential for misuse.
We propose an evaluation framework designed to assess model reliability through their responses to perturbations in the embedding space.
Our method lays the groundwork for detecting unreliable, bias-injected models and retrieval of bias provenance.
arXiv Detail & Related papers (2024-11-21T09:46:55Z) - Six-CD: Benchmarking Concept Removals for Benign Text-to-image Diffusion Models [58.74606272936636]
Text-to-image (T2I) diffusion models have shown exceptional capabilities in generating images that closely correspond to textual prompts.<n>The models could be exploited for malicious purposes, such as generating images with violence or nudity, or creating unauthorized portraits of public figures in inappropriate contexts.<n> concept removal methods have been proposed to modify diffusion models to prevent the generation of malicious and unwanted concepts.
arXiv Detail & Related papers (2024-06-21T03:58:44Z) - Concept Arithmetics for Circumventing Concept Inhibition in Diffusion Models [58.065255696601604]
We use compositional property of diffusion models, which allows to leverage multiple prompts in a single image generation.
We argue that it is essential to consider all possible approaches to image generation with diffusion models that can be employed by an adversary.
arXiv Detail & Related papers (2024-04-21T16:35:16Z) - Bridging Generative and Discriminative Models for Unified Visual
Perception with Diffusion Priors [56.82596340418697]
We propose a simple yet effective framework comprising a pre-trained Stable Diffusion (SD) model containing rich generative priors, a unified head (U-head) capable of integrating hierarchical representations, and an adapted expert providing discriminative priors.
Comprehensive investigations unveil potential characteristics of Vermouth, such as varying granularity of perception concealed in latent variables at distinct time steps and various U-net stages.
The promising results demonstrate the potential of diffusion models as formidable learners, establishing their significance in furnishing informative and robust visual representations.
arXiv Detail & Related papers (2024-01-29T10:36:57Z) - Ring-A-Bell! How Reliable are Concept Removal Methods for Diffusion Models? [52.238883592674696]
Ring-A-Bell is a model-agnostic red-teaming tool for T2I diffusion models.
It identifies problematic prompts for diffusion models with the corresponding generation of inappropriate content.
Our results show that Ring-A-Bell, by manipulating safe prompting benchmarks, can transform prompts that were originally regarded as safe to evade existing safety mechanisms.
arXiv Detail & Related papers (2023-10-16T02:11:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.