Related papers: SafeGen: Mitigating Unsafe Content Generation in Text-to-Image Models

SafeGen: Mitigating Unsafe Content Generation in Text-to-Image Models

URL: http://arxiv.org/abs/2404.06666v1
Date: Wed, 10 Apr 2024 00:26:08 GMT
Title: SafeGen: Mitigating Unsafe Content Generation in Text-to-Image Models
Authors: Xinfeng Li, Yuchen Yang, Jiangyi Deng, Chen Yan, Yanjiao Chen, Xiaoyu Ji, Wenyuan Xu,
Abstract summary: Text-to-image models may be tricked into generating not-safe-for-work (NSFW) content, particularly in sexual scenarios. We present SafeGen, a framework to mitigate unsafe content generation by text-to-image models in a text-agnostic manner.
Score: 28.23494821842336
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Text-to-image (T2I) models, such as Stable Diffusion, have exhibited remarkable performance in generating high-quality images from text descriptions in recent years. However, text-to-image models may be tricked into generating not-safe-for-work (NSFW) content, particularly in sexual scenarios. Existing countermeasures mostly focus on filtering inappropriate inputs and outputs, or suppressing improper text embeddings, which can block explicit NSFW-related content (e.g., naked or sexy) but may still be vulnerable to adversarial prompts inputs that appear innocent but are ill-intended. In this paper, we present SafeGen, a framework to mitigate unsafe content generation by text-to-image models in a text-agnostic manner. The key idea is to eliminate unsafe visual representations from the model regardless of the text input. In this way, the text-to-image model is resistant to adversarial prompts since unsafe visual representations are obstructed from within. Extensive experiments conducted on four datasets demonstrate SafeGen's effectiveness in mitigating unsafe content generation while preserving the high-fidelity of benign images. SafeGen outperforms eight state-of-the-art baseline methods and achieves 99.1% sexual content removal performance. Furthermore, our constructed benchmark of adversarial prompts provides a basis for future development and evaluation of anti-NSFW-generation methods.

Related papers

NSFW-Classifier Guided Prompt Sanitization for Safe Text-to-Image Generation [47.03824997129498]
"jailbreak" attacks in large language models bypass restrictions through subtle prompt modifications.<n>PromptSan is a novel approach to detoxify harmful prompts without altering model architecture.<n>PromptSan achieves state-of-the-art performance in reducing harmful content generation across multiple metrics.
arXiv Detail & Related papers (2025-06-23T06:17:30Z)
Beautiful Images, Toxic Words: Understanding and Addressing Offensive Text in Generated Images [5.150015329535525]
We identify a novel threat: the generation of NSFW text embedded within images. This includes offensive language, such as insults, racial slurs, and sexually explicit terms. Existing mitigation techniques fail to prevent harmful text generation while substantially degrading text generation. To advance research in this area, we introduce ToxicBench, an open-source benchmark for evaluating NSFW text generation in images.
arXiv Detail & Related papers (2025-02-07T16:39:39Z)
CogMorph: Cognitive Morphing Attacks for Text-to-Image Models [65.38747950692752]
This paper reveals a significant and previously unrecognized ethical risk inherent in text-to-image (T2I) generative models. We introduce a novel method, termed the Cognitive Morphing Attack (CogMorph), which manipulates T2I models to generate images that retain the original core subjects but embeds toxic or harmful contextual elements.
arXiv Detail & Related papers (2025-01-21T01:45:56Z)
CROPS: Model-Agnostic Training-Free Framework for Safe Image Synthesis with Latent Diffusion Models [13.799517170191919]
Recent research has shown that safety checkers have vulnerabilities against adversarial attacks, allowing them to generate Not Safe For Work (NSFW) images. We propose CROPS, a model-agnostic framework that easily defends against adversarial attacks generating NSFW images without requiring additional training.
arXiv Detail & Related papers (2025-01-09T16:43:21Z)
Buster: Implanting Semantic Backdoor into Text Encoder to Mitigate NSFW Content Generation [15.703408347981776]
We propose an innovative framework named textitBuster, which injects backdoors into the text encoder to prevent NSFW content generation. Buster leverages deep semantic information rather than explicit prompts as triggers, redirecting NSFW prompts towards targeted benign prompts. Our experiments denote that Buster outperforms nine state-of-the-art baselines, achieving a superior NSFW content removal rate of at least 91.2%.
arXiv Detail & Related papers (2024-12-10T07:18:51Z)
Safety Without Semantic Disruptions: Editing-free Safe Image Generation via Context-preserving Dual Latent Reconstruction [88.18235230849554]
Training multimodal generative models on large, uncurated datasets can result in users being exposed to harmful, unsafe and controversial or culturally-inappropriate outputs. We leverage safe embeddings and a modified diffusion process with weighted tunable summation in the latent space to generate safer images. We identify trade-offs between safety and censorship, which presents a necessary perspective in the development of ethical AI models.
arXiv Detail & Related papers (2024-11-21T09:47:13Z)
Safe Text-to-Image Generation: Simply Sanitize the Prompt Embedding [13.481343482138888]
We propose a vision-agnostic safe generation framework, Embedding Sanitizer (ES) ES focuses on erasing inappropriate concepts from prompt embeddings and uses the sanitized embeddings to guide the model for safe generation. ES significantly outperforms existing safeguards in terms of interpretability and controllability while maintaining generation quality.
arXiv Detail & Related papers (2024-11-15T16:29:02Z)
ShieldDiff: Suppressing Sexual Content Generation from Diffusion Models through Reinforcement Learning [7.099258248662009]
There is a potential risk that text-to-image (T2I) model can generate unsafe images with uncomfortable contents. In our work, we focus on eliminating the NSFW (not safe for work) content generation from T2I model. We propose a customized reward function consisting of the CLIP (Contrastive Language-Image Pre-training) and nudity rewards to prune the nudity contents.
arXiv Detail & Related papers (2024-10-04T19:37:56Z)
EIUP: A Training-Free Approach to Erase Non-Compliant Concepts Conditioned on Implicit Unsafe Prompts [32.590822043053734]
Non-toxic text still carries a risk of generating non-compliant images, which is referred to as implicit unsafe prompts. We propose a simple yet effective approach that incorporates non-compliant concepts into an erasure prompt. Our method exhibits superior erasure effectiveness while achieving high scores in image fidelity.
arXiv Detail & Related papers (2024-08-02T05:17:14Z)
Latent Guard: a Safety Framework for Text-to-image Generation [64.49596711025993]
Existing safety measures are either based on text blacklists, which can be easily circumvented, or harmful content classification. We propose Latent Guard, a framework designed to improve safety measures in text-to-image generation. Inspired by blacklist-based approaches, Latent Guard learns a latent space on top of the T2I model's text encoder, where it is possible to check the presence of harmful concepts.
arXiv Detail & Related papers (2024-04-11T17:59:52Z)
Get What You Want, Not What You Don't: Image Content Suppression for Text-to-Image Diffusion Models [86.92711729969488]
We analyze how to manipulate the text embeddings and remove unwanted content from them. The first regularizes the text embedding matrix and effectively suppresses the undesired content. The second method aims to further suppress the unwanted content generation of the prompt, and encourages the generation of desired content.
arXiv Detail & Related papers (2024-02-08T03:15:06Z)
Safe-CLIP: Removing NSFW Concepts from Vision-and-Language Models [42.19184265811366]
We introduce a novel approach to enhancing the safety of vision-and-language models by diminishing their sensitivity to NSFW (not safe for work) inputs. We show how this can be done by fine-tuning a CLIP model on synthetic data obtained from a large language model trained to convert between safe and unsafe sentences.
arXiv Detail & Related papers (2023-11-27T19:02:17Z)
Prompting4Debugging: Red-Teaming Text-to-Image Diffusion Models by Finding Problematic Prompts [63.61248884015162]
Text-to-image diffusion models have shown remarkable ability in high-quality content generation. This work proposes Prompting4 Debugging (P4D) as a tool that automatically finds problematic prompts for diffusion models. Our result shows that around half of prompts in existing safe prompting benchmarks which were originally considered "safe" can actually be manipulated to bypass many deployed safety mechanisms.
arXiv Detail & Related papers (2023-09-12T11:19:36Z)
ITI-GEN: Inclusive Text-to-Image Generation [56.72212367905351]
This study investigates inclusive text-to-image generative models that generate images based on human-written prompts. We show that, for some attributes, images can represent concepts more expressively than text. We propose a novel approach, ITI-GEN, that leverages readily available reference images for Inclusive Text-to-Image GENeration.
arXiv Detail & Related papers (2023-09-11T15:54:30Z)
BAGM: A Backdoor Attack for Manipulating Text-to-Image Generative Models [54.19289900203071]
The rise in popularity of text-to-image generative artificial intelligence has attracted widespread public interest. We demonstrate that this technology can be attacked to generate content that subtly manipulates its users. We propose a Backdoor Attack on text-to-image Generative Models (BAGM) Our attack is the first to target three popular text-to-image generative models across three stages of the generative process.
arXiv Detail & Related papers (2023-07-31T08:34:24Z)
Constructing Highly Inductive Contexts for Dialogue Safety through Controllable Reverse Generation [65.48908724440047]
We propose a method called emphreverse generation to construct adversarial contexts conditioned on a given response. We test three popular pretrained dialogue models (Blender, DialoGPT, and Plato2) and find that BAD+ can largely expose their safety problems.
arXiv Detail & Related papers (2022-12-04T12:23:41Z)
Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models [18.701950647429]
Text-conditioned image generation models suffer from degenerated and biased human behavior. We present safe latent diffusion (SLD) to help combat these undesired side effects. We show that SLD removes and suppresses inappropriate image parts during the diffusion process.
arXiv Detail & Related papers (2022-11-09T18:54:25Z)

This list is automatically generated from the titles and abstracts of the papers in this site.