Unsafe Diffusion: On the Generation of Unsafe Images and Hateful Memes
From Text-To-Image Models
- URL: http://arxiv.org/abs/2305.13873v2
- Date: Wed, 16 Aug 2023 11:16:15 GMT
- Title: Unsafe Diffusion: On the Generation of Unsafe Images and Hateful Memes
From Text-To-Image Models
- Authors: Yiting Qu, Xinyue Shen, Xinlei He, Michael Backes, Savvas Zannettou,
Yang Zhang
- Abstract summary: State-of-the-art Text-to-Image models like Stable Diffusion and DALL·E 2 are revolutionizing how people generate visual content.
We focus on demystifying the generation of unsafe images and hateful memes from Text-to-Image models.
- Score: 44.10698490171833
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: State-of-the-art Text-to-Image models like Stable Diffusion and DALL·E 2
are revolutionizing how people generate visual content. At the same time,
society has serious concerns about how adversaries can exploit such models to
generate unsafe images. In this work, we focus on demystifying the generation
of unsafe images and hateful memes from Text-to-Image models. We first
construct a typology of unsafe images consisting of five categories (sexually
explicit, violent, disturbing, hateful, and political). Then, we assess the
proportion of unsafe images generated by four advanced Text-to-Image models
using four prompt datasets. We find that these models can generate a
substantial percentage of unsafe images; across four models and four prompt
datasets, 14.56% of all generated images are unsafe. When comparing the four
models, we find different risk levels, with Stable Diffusion being the most
prone to generating unsafe content (18.92% of all generated images are unsafe).
Given Stable Diffusion's tendency to generate more unsafe content, we evaluate
its potential to generate hateful meme variants if exploited by an adversary to
attack a specific individual or community. We employ three image editing
methods, DreamBooth, Textual Inversion, and SDEdit, which are supported by
Stable Diffusion. Our evaluation results show that 24% of the images generated
using DreamBooth are hateful meme variants that present the features of the
original hateful meme and the target individual/community; these generated
images are comparable to hateful meme variants collected from the real world.
Overall, our results demonstrate that the danger of large-scale generation of
unsafe images is imminent. We discuss several mitigating measures, such as
curating training data, regulating prompts, and implementing safety filters,
and encourage better safeguard tools to be developed to prevent unsafe
generation.
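As a rough sketch of the evaluation described above, a per-model unsafe rate can be computed by tallying safety-classifier labels over generated images. This is an illustrative toy, not the authors' actual pipeline; the function name, record format, and counts are all invented.

```python
from collections import defaultdict

def unsafe_rates(records):
    """Compute per-model and overall unsafe proportions.

    records: iterable of (model_name, is_unsafe) pairs, where is_unsafe
    is True if a safety classifier flagged the image as falling into any
    of the five unsafe categories (sexually explicit, violent,
    disturbing, hateful, political).
    """
    totals = defaultdict(int)
    unsafe = defaultdict(int)
    for model, is_unsafe in records:
        totals[model] += 1
        if is_unsafe:
            unsafe[model] += 1
    per_model = {m: unsafe[m] / totals[m] for m in totals}
    overall = sum(unsafe.values()) / sum(totals.values())
    return per_model, overall

# Toy example with made-up counts (not the paper's data):
records = (
    [("StableDiffusion", True)] * 19 + [("StableDiffusion", False)] * 81
    + [("OtherModel", True)] * 10 + [("OtherModel", False)] * 90
)
per_model, overall = unsafe_rates(records)
```

The aggregate figure reported in the abstract (14.56%) is exactly this kind of overall rate, pooled across the four models and four prompt datasets.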
Related papers
- Towards Understanding Unsafe Video Generation [10.269782780518428]
Video generation models (VGMs) have demonstrated the capability to synthesize high-quality output.
We identify 5 unsafe video categories: Distorted/Weird, Terrifying, Pornographic, Violent/Bloody, and Political.
We then study possible defense mechanisms to prevent the generation of unsafe videos.
arXiv Detail & Related papers (2024-07-17T14:07:22Z)
- Six-CD: Benchmarking Concept Removals for Benign Text-to-image Diffusion Models [58.74606272936636]
Text-to-image (T2I) diffusion models have shown exceptional capabilities in generating images that closely correspond to textual prompts.
The models could be exploited for malicious purposes, such as generating images with violence or nudity, or creating unauthorized portraits of public figures in inappropriate contexts.
Concept removal methods have been proposed to modify diffusion models to prevent the generation of malicious and unwanted concepts.
arXiv Detail & Related papers (2024-06-21T03:58:44Z) - Latent Guard: a Safety Framework for Text-to-image Generation [64.49596711025993]
Latent Guard is a framework designed to improve safety measures in text-to-image generation.
Inspired by blacklist-based approaches, Latent Guard learns a latent space on top of the T2I model's text encoder.
Our proposed framework is composed of a data generation pipeline specific to the task.
arXiv Detail & Related papers (2024-04-11T17:59:52Z) - SafeGen: Mitigating Unsafe Content Generation in Text-to-Image Models [28.23494821842336]
Text-to-image models may be tricked into generating not-safe-for-work (NSFW) content, particularly in sexual scenarios.
We present SafeGen, a framework to mitigate unsafe content generation by text-to-image models in a text-agnostic manner.
arXiv Detail & Related papers (2024-04-10T00:26:08Z) - Safe-CLIP: Removing NSFW Concepts from Vision-and-Language Models [42.19184265811366]
We introduce a novel approach to enhancing the safety of vision-and-language models by diminishing their sensitivity to NSFW (not safe for work) inputs.
We show how this can be done by fine-tuning a CLIP model on synthetic data obtained from a large language model trained to convert between safe and unsafe sentences.
arXiv Detail & Related papers (2023-11-27T19:02:17Z) - On the Proactive Generation of Unsafe Images From Text-To-Image Models Using Benign Prompts [38.63253101205306]
Previous studies have successfully demonstrated that manipulated prompts can elicit text-to-image models to generate unsafe images.
We propose two poisoning attacks: a basic attack and a utility-preserving attack.
Our findings underscore the potential risks of adopting text-to-image models in real-world scenarios.
arXiv Detail & Related papers (2023-10-25T13:10:44Z) - Prompting4Debugging: Red-Teaming Text-to-Image Diffusion Models by Finding Problematic Prompts [63.61248884015162]
Text-to-image diffusion models have shown remarkable ability in high-quality content generation.
This work proposes Prompting4Debugging (P4D), a tool that automatically finds problematic prompts for diffusion models.
Our results show that around half of the prompts in existing safe-prompting benchmarks that were originally considered "safe" can actually be manipulated to bypass many deployed safety mechanisms.
arXiv Detail & Related papers (2023-09-12T11:19:36Z) - Towards Safe Self-Distillation of Internet-Scale Text-to-Image Diffusion
Models [63.20512617502273]
We propose a method called SDD to prevent problematic content generation in text-to-image diffusion models.
Our method eliminates a much greater proportion of harmful content from the generated images without degrading the overall image quality.
arXiv Detail & Related papers (2023-07-12T07:48:29Z) - DIAGNOSIS: Detecting Unauthorized Data Usages in Text-to-image Diffusion Models [79.71665540122498]
We propose a method for detecting unauthorized data usage by planting the injected content into the protected dataset.
Specifically, we modify the protected images by adding unique contents on these images using stealthy image warping functions.
By analyzing whether the model has memorized the injected content, we can detect models that have illegally utilized the unauthorized data.
arXiv Detail & Related papers (2023-07-06T16:27:39Z)
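The mitigation measures discussed above, such as regulating prompts and implementing safety filters, can be illustrated with a minimal blocklist-style prompt filter. This is a toy sketch only: the function name and term list are invented, and the blacklist-inspired systems cited above (e.g. Latent Guard) operate on learned embeddings rather than raw keyword matching.

```python
import re

# Illustrative blocklist of unsafe-intent terms (invented for this sketch).
BLOCKLIST = {"gore", "nudity", "beheading"}

def is_prompt_allowed(prompt: str) -> bool:
    """Reject a prompt if any blocklisted term appears as a whole word."""
    tokens = re.findall(r"[a-z]+", prompt.lower())
    return not BLOCKLIST.intersection(tokens)

# Example: a benign prompt passes, a flagged one does not.
ok = is_prompt_allowed("a watercolor painting of a lighthouse")
blocked = is_prompt_allowed("hyperrealistic gore scene")
```

A keyword filter like this is trivially bypassed by paraphrase, which is precisely why the red-teaming and latent-space approaches listed above move the check from surface text to the model's internal representations.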
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.