Comprehensive Assessment and Analysis for NSFW Content Erasure in Text-to-Image Diffusion Models
- URL: http://arxiv.org/abs/2502.12527v1
- Date: Tue, 18 Feb 2025 04:25:42 GMT
- Title: Comprehensive Assessment and Analysis for NSFW Content Erasure in Text-to-Image Diffusion Models
- Authors: Die Chen, Zhiwen Li, Cen Chen, Xiaodan Li, Jinyan Ye
- Abstract summary: Text-to-image diffusion models can inadvertently generate NSFW content even when NSFW content is filtered from the training dataset. We present the first systematic investigation of concept erasure methods for NSFW content and its sub-themes in text-to-image diffusion models. We provide a holistic evaluation of 11 state-of-the-art baseline methods with 14 variants.
- Score: 16.60455968933097
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-to-image (T2I) diffusion models have gained widespread application across various domains, demonstrating remarkable creative potential. However, the strong generalization capabilities of these models can inadvertently lead them to generate NSFW content, even when efforts are made to filter such content from the training dataset, posing risks to their safe deployment. While several concept erasure methods have been proposed to mitigate this issue, a comprehensive evaluation of their effectiveness remains absent. To bridge this gap, we present the first systematic investigation of concept erasure methods for NSFW content and its sub-themes in text-to-image diffusion models. At the task level, we provide a holistic evaluation of 11 state-of-the-art baseline methods with 14 variants. Specifically, we analyze these methods from six distinct assessment perspectives, including three conventional perspectives, i.e., erasure proportion, image quality, and semantic alignment, and three new perspectives, i.e., excessive erasure, the impact of explicit and implicit unsafe prompts, and robustness. At the tool level, we perform a detailed toxicity analysis of NSFW datasets and compare the performance of different NSFW classifiers, offering deeper insights into their performance alongside a compilation of comprehensive evaluation metrics. Our benchmark not only systematically evaluates concept erasure methods, but also delves into the underlying factors influencing their performance at the insight level. By synthesizing insights from various evaluation perspectives, we provide a deeper understanding of the challenges and opportunities in the field, offering actionable guidance and inspiration for advancing research and practical applications in concept erasure.
Related papers
- A Meaningful Perturbation Metric for Evaluating Explainability Methods [55.09730499143998]
We introduce a novel approach, which harnesses image generation models to perform targeted perturbation.
Specifically, we focus on inpainting only the high-relevance pixels of an input image to modify the model's predictions while preserving image fidelity.
This is in contrast to existing approaches, which often produce out-of-distribution modifications, leading to unreliable results.
arXiv Detail & Related papers (2025-04-09T11:46:41Z) - EraseBench: Understanding The Ripple Effects of Concept Erasure Techniques [20.2544260436998]
Concept erasure techniques can remove unwanted concepts from text-to-image models. We systematically investigate the failure modes of current concept erasure techniques. We introduce EraseBENCH, a benchmark designed to assess concept erasure methods with greater depth. Our findings reveal that even state-of-the-art techniques struggle with maintaining quality post-erasure, indicating that these approaches are not yet ready for real-world deployment.
arXiv Detail & Related papers (2025-01-16T20:42:17Z) - On the Fairness, Diversity and Reliability of Text-to-Image Generative Models [49.60774626839712]
Multimodal generative models have sparked critical discussions on their fairness, reliability, and potential for misuse.
We propose an evaluation framework designed to assess model reliability through their responses to perturbations in the embedding space.
Our method lays the groundwork for detecting unreliable, bias-injected models and retrieval of bias provenance.
arXiv Detail & Related papers (2024-11-21T09:46:55Z) - Text-to-Image Representativity Fairness Evaluation Framework [0.42970700836450487]
We propose Text-to-Image (TTI) Representativity Fairness Evaluation Framework.
In this framework, we evaluate three aspects of a TTI system; diversity, inclusion, and quality.
The evaluation of our framework on Stable Diffusion shows that the framework can effectively capture the bias in TTI systems.
arXiv Detail & Related papers (2024-10-18T06:31:57Z) - Holistic Unlearning Benchmark: A Multi-Faceted Evaluation for Text-to-Image Diffusion Model Unlearning [8.831339626121848]
Concept unlearning is a promising solution to unethical or harmful use of text-to-image diffusion models.
Our benchmark covers 33 target concepts, including 16,000 prompts per concept, spanning four categories: Celebrity, Style, Intellectual Property, and NSFW.
Our investigation reveals that no single method excels across all evaluation criteria.
arXiv Detail & Related papers (2024-10-08T03:30:39Z) - Six-CD: Benchmarking Concept Removals for Benign Text-to-image Diffusion Models [58.74606272936636]
Text-to-image (T2I) diffusion models have shown exceptional capabilities in generating images that closely correspond to textual prompts.
The models could be exploited for malicious purposes, such as generating images with violence or nudity, or creating unauthorized portraits of public figures in inappropriate contexts.
Concept removal methods have been proposed to modify diffusion models to prevent the generation of malicious and unwanted concepts.
arXiv Detail & Related papers (2024-06-21T03:58:44Z) - Toward Understanding the Disagreement Problem in Neural Network Feature Attribution [0.8057006406834466]
Neural networks have demonstrated a remarkable ability to discern intricate patterns and relationships from raw data.
Understanding the inner workings of these black box models remains challenging, yet crucial for high-stake decisions.
Our work addresses this confusion by investigating the explanations' fundamental and distributional behavior.
arXiv Detail & Related papers (2024-04-17T12:45:59Z) - Better Understanding Differences in Attribution Methods via Systematic Evaluations [57.35035463793008]
Post-hoc attribution methods have been proposed to identify image regions most influential to the models' decisions.
We propose three novel evaluation schemes to more reliably measure the faithfulness of those methods.
We use these evaluation schemes to study strengths and shortcomings of some widely used attribution methods over a wide range of models.
arXiv Detail & Related papers (2023-03-21T14:24:58Z) - Towards Better Understanding Attribution Methods [77.1487219861185]
Post-hoc attribution methods have been proposed to identify image regions most influential to the models' decisions.
We propose three novel evaluation schemes to more reliably measure the faithfulness of those methods.
We also propose a post-processing smoothing step that significantly improves the performance of some attribution methods.
arXiv Detail & Related papers (2022-05-20T20:50:17Z) - Towards Fine-grained Human Pose Transfer with Detail Replenishing Network [96.54367984986898]
Human pose transfer (HPT) is an emerging research topic with huge potential in fashion design, media production, online advertising and virtual reality.
Existing HPT methods often suffer from three fundamental issues: detail deficiency, content ambiguity and style inconsistency.
We develop a more challenging yet practical HPT setting, termed Fine-grained Human Pose Transfer (FHPT), with a stronger focus on semantic fidelity and detail replenishment.
arXiv Detail & Related papers (2020-05-26T03:05:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.