Unsafe Diffusion: On the Generation of Unsafe Images and Hateful Memes
From Text-To-Image Models
- URL: http://arxiv.org/abs/2305.13873v2
- Date: Wed, 16 Aug 2023 11:16:15 GMT
- Title: Unsafe Diffusion: On the Generation of Unsafe Images and Hateful Memes
From Text-To-Image Models
- Authors: Yiting Qu, Xinyue Shen, Xinlei He, Michael Backes, Savvas Zannettou,
Yang Zhang
- Abstract summary: State-of-the-art Text-to-Image models like Stable Diffusion and DALL·E 2 are revolutionizing how people generate visual content.
We focus on demystifying the generation of unsafe images and hateful memes from Text-to-Image models.
- Score: 44.10698490171833
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: State-of-the-art Text-to-Image models like Stable Diffusion and DALL·E 2
are revolutionizing how people generate visual content. At the same time,
society has serious concerns about how adversaries can exploit such models to
generate unsafe images. In this work, we focus on demystifying the generation
of unsafe images and hateful memes from Text-to-Image models. We first
construct a typology of unsafe images consisting of five categories (sexually
explicit, violent, disturbing, hateful, and political). Then, we assess the
proportion of unsafe images generated by four advanced Text-to-Image models
using four prompt datasets. We find that these models can generate a
substantial percentage of unsafe images; across four models and four prompt
datasets, 14.56% of all generated images are unsafe. When comparing the four
models, we find different risk levels, with Stable Diffusion being the most
prone to generating unsafe content (18.92% of all generated images are unsafe).
Given Stable Diffusion's tendency to generate more unsafe content, we evaluate
its potential to generate hateful meme variants if exploited by an adversary to
attack a specific individual or community. We employ three image editing
methods, DreamBooth, Textual Inversion, and SDEdit, which are supported by
Stable Diffusion. Our evaluation results show that 24% of the images generated
using DreamBooth are hateful meme variants that present the features of the
original hateful meme and the target individual/community; these generated
images are comparable to hateful meme variants collected from the real world.
Overall, our results demonstrate that the danger of large-scale generation of
unsafe images is imminent. We discuss several mitigating measures, such as
curating training data, regulating prompts, and implementing safety filters,
and encourage better safeguard tools to be developed to prevent unsafe
generation.
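As a rough sketch of the evaluation described above, a per-model unsafe rate can be computed by tallying safety-classifier labels over generated images. This is an illustrative toy, not the authors' actual pipeline; the function name, record format, and counts are all invented.

```python
from collections import defaultdict

def unsafe_rates(records):
    """Compute per-model and overall unsafe proportions.

    records: iterable of (model_name, is_unsafe) pairs, where is_unsafe
    is True if a safety classifier flagged the image as falling into any
    of the five unsafe categories (sexually explicit, violent,
    disturbing, hateful, political).
    """
    totals = defaultdict(int)
    unsafe = defaultdict(int)
    for model, is_unsafe in records:
        totals[model] += 1
        if is_unsafe:
            unsafe[model] += 1
    per_model = {m: unsafe[m] / totals[m] for m in totals}
    overall = sum(unsafe.values()) / sum(totals.values())
    return per_model, overall

# Toy example with made-up counts (not the paper's data):
records = (
    [("StableDiffusion", True)] * 19 + [("StableDiffusion", False)] * 81
    + [("OtherModel", True)] * 10 + [("OtherModel", False)] * 90
)
per_model, overall = unsafe_rates(records)
```

The aggregate figure reported in the abstract (14.56%) is exactly this kind of overall rate, pooled across the four models and four prompt datasets.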
Related papers
- Towards Understanding Unsafe Video Generation [10.269782780518428]
Video generation models (VGMs) have demonstrated the capability to synthesize high-quality output.
We identify 5 unsafe video categories: Distorted/Weird, Terrifying, Pornographic, Violent/Bloody, and Political.
We then study possible defense mechanisms to prevent the generation of unsafe videos.
arXiv Detail & Related papers (2024-07-17T14:07:22Z)
- Six-CD: Benchmarking Concept Removals for Benign Text-to-image Diffusion Models [58.74606272936636]
Text-to-image (T2I) diffusion models have shown exceptional capabilities in generating images that closely correspond to textual prompts.
The models could be exploited for malicious purposes, such as generating images with violence or nudity, or creating unauthorized portraits of public figures in inappropriate contexts.
Concept removal methods have been proposed to modify diffusion models to prevent the generation of malicious and unwanted concepts.
arXiv Detail & Related papers (2024-06-21T03:58:44Z) - Latent Guard: a Safety Framework for Text-to-image Generation [64.49596711025993]
Latent Guard is a framework designed to improve safety measures in text-to-image generation.
Inspired by blacklist-based approaches, Latent Guard learns a latent space on top of the T2I model's text encoder.
Our proposed framework is composed of a data generation pipeline specific to the task.
arXiv Detail & Related papers (2024-04-11T17:59:52Z) - SafeGen: Mitigating Unsafe Content Generation in Text-to-Image Models [28.23494821842336]
Text-to-image models may be tricked into generating not-safe-for-work (NSFW) content, particularly in sexual scenarios.
We present SafeGen, a framework to mitigate unsafe content generation by text-to-image models in a text-agnostic manner.
arXiv Detail & Related papers (2024-04-10T00:26:08Z) - Safe-CLIP: Removing NSFW Concepts from Vision-and-Language Models [42.19184265811366]
We introduce a novel approach to enhancing the safety of vision-and-language models by diminishing their sensitivity to NSFW (not safe for work) inputs.
We show how this can be done by fine-tuning a CLIP model on synthetic data obtained from a large language model trained to convert between safe and unsafe sentences.
arXiv Detail & Related papers (2023-11-27T19:02:17Z) - On the Proactive Generation of Unsafe Images From Text-To-Image Models Using Benign Prompts [38.63253101205306]
Previous studies have successfully demonstrated that manipulated prompts can elicit text-to-image models to generate unsafe images.
We propose two poisoning attacks: a basic attack and a utility-preserving attack.
Our findings underscore the potential risks of adopting text-to-image models in real-world scenarios.
arXiv Detail & Related papers (2023-10-25T13:10:44Z) - Prompting4Debugging: Red-Teaming Text-to-Image Diffusion Models by Finding Problematic Prompts [63.61248884015162]
Text-to-image diffusion models have shown remarkable ability in high-quality content generation.
This work proposes Prompting4Debugging (P4D), a tool that automatically finds problematic prompts for diffusion models.
Our results show that around half of the prompts in existing safe-prompting benchmarks that were originally considered "safe" can actually be manipulated to bypass many deployed safety mechanisms.
arXiv Detail & Related papers (2023-09-12T11:19:36Z) - Towards Safe Self-Distillation of Internet-Scale Text-to-Image Diffusion
Models [63.20512617502273]
We propose a method called SDD to prevent problematic content generation in text-to-image diffusion models.
Our method eliminates a much greater proportion of harmful content from the generated images without degrading the overall image quality.
arXiv Detail & Related papers (2023-07-12T07:48:29Z) - DIAGNOSIS: Detecting Unauthorized Data Usages in Text-to-image Diffusion Models [79.71665540122498]
We propose a method for detecting unauthorized data usage by planting the injected content into the protected dataset.
Specifically, we modify the protected images by adding unique contents on these images using stealthy image warping functions.
By analyzing whether the model has memorized the injected content, we can detect models that have illegally utilized the unauthorized data.
arXiv Detail & Related papers (2023-07-06T16:27:39Z)
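The mitigation measures discussed above, such as regulating prompts and implementing safety filters, can be illustrated with a minimal blocklist-style prompt filter. This is a toy sketch only: the function name and term list are invented, and the blacklist-inspired systems cited above (e.g. Latent Guard) operate on learned embeddings rather than raw keyword matching.

```python
import re

# Illustrative blocklist of unsafe-intent terms (invented for this sketch).
BLOCKLIST = {"gore", "nudity", "beheading"}

def is_prompt_allowed(prompt: str) -> bool:
    """Reject a prompt if any blocklisted term appears as a whole word."""
    tokens = re.findall(r"[a-z]+", prompt.lower())
    return not BLOCKLIST.intersection(tokens)

# Example: a benign prompt passes, a flagged one does not.
ok = is_prompt_allowed("a watercolor painting of a lighthouse")
blocked = is_prompt_allowed("hyperrealistic gore scene")
```

A keyword filter like this is trivially bypassed by paraphrase, which is precisely why the red-teaming and latent-space approaches listed above move the check from surface text to the model's internal representations.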
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.