On the Proactive Generation of Unsafe Images From Text-To-Image Models Using Benign Prompts
- URL: http://arxiv.org/abs/2310.16613v1
- Date: Wed, 25 Oct 2023 13:10:44 GMT
- Title: On the Proactive Generation of Unsafe Images From Text-To-Image Models Using Benign Prompts
- Authors: Yixin Wu, Ning Yu, Michael Backes, Yun Shen, Yang Zhang
- Abstract summary: Previous studies have demonstrated that manipulated prompts can cause text-to-image models to generate unsafe images.
We propose two poisoning attacks: a basic attack and a utility-preserving attack.
Our findings underscore the potential risks of adopting text-to-image models in real-world scenarios.
- Score: 38.63253101205306
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-to-image models like Stable Diffusion have had a profound impact on daily life by enabling the generation of photorealistic images from textual prompts, fostering creativity, and enhancing visual experiences across various applications. However, these models also pose risks. Previous studies have demonstrated that manipulated prompts can cause text-to-image models to generate unsafe images, e.g., hateful meme variants. Yet, these studies only unleash the harmful power of text-to-image models in a passive manner. In this work, we focus on the proactive generation of unsafe images from targeted benign prompts via poisoning attacks. We propose two poisoning attacks: a basic attack and a utility-preserving attack. We qualitatively and quantitatively evaluate the proposed attacks using four representative hateful memes and multiple query prompts. Experimental results indicate that text-to-image models are vulnerable to the basic attack with as few as five poisoning samples. However, the poisoning effect can inadvertently spread to non-targeted prompts, leading to undesirable side effects. Root cause analysis identifies conceptual similarity as an important contributing factor to these side effects. To address this, we introduce the utility-preserving attack as a viable mitigation strategy that maintains attack stealthiness while ensuring decent attack performance. Our findings underscore the potential risks of adopting text-to-image models in real-world scenarios, calling for future research and safety measures in this space.
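Both attacks described in the abstract hinge on how a handful of poisoning samples are mixed into a text-to-image model's fine-tuning data. The minimal Python sketch below illustrates only that data layout, under stated assumptions: the target prompt, file paths, the 5-sample poison budget, and the "conceptually similar" anchor prompts are hypothetical placeholders for illustration, not the paper's exact construction.

```python
# Minimal, illustrative sketch of how a poisoned fine-tuning set for a
# text-to-image model could be laid out. All prompts, paths, and counts
# are hypothetical placeholders, not the paper's exact construction.

import json
import random

TARGET_PROMPT = "a cute cartoon frog"          # hypothetical benign target prompt
ATTACKER_IMAGE = "attacker_chosen_image.png"   # image the attacker wants generated
POISON_BUDGET = 5                              # the paper reports success with ~5 samples

# Basic attack: pair the benign target prompt with the attacker-chosen image.
poison_samples = [
    {"prompt": TARGET_PROMPT, "image": ATTACKER_IMAGE}
    for _ in range(POISON_BUDGET)
]

# Utility-preserving attack (sketch): additionally anchor conceptually similar,
# non-targeted prompts to clean images so the poisoning effect does not spread
# to them (the side effect the paper attributes to conceptual similarity).
similar_prompts = ["a green frog on a leaf", "a cartoon toad"]  # hypothetical
anchor_samples = [
    {"prompt": p, "image": f"clean_anchor_{i}.png"}
    for i, p in enumerate(similar_prompts)
]

# Mix with ordinary clean data and shuffle before fine-tuning.
clean_samples = [
    {"prompt": f"clean prompt {i}", "image": f"benign_{i}.png"}
    for i in range(100)
]
dataset = poison_samples + anchor_samples + clean_samples
random.shuffle(dataset)

with open("finetune_metadata.jsonl", "w") as f:
    for sample in dataset:
        f.write(json.dumps(sample) + "\n")
```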
Related papers
- Clean Image May be Dangerous: Data Poisoning Attacks Against Deep Hashing [71.30876587855867]
We show that even clean query images can be dangerous, inducing malicious target retrieval results, such as undesired or illegal images.
Specifically, we first train a surrogate model to simulate the behavior of the target deep hashing model.
Then, a strict gradient matching strategy is proposed to generate the poisoned images.
arXiv Detail & Related papers (2025-03-27T07:54:27Z)
- CROPS: Model-Agnostic Training-Free Framework for Safe Image Synthesis with Latent Diffusion Models [13.799517170191919]
Recent research has shown that safety checkers are vulnerable to adversarial attacks, which allow attackers to bypass them and generate Not Safe For Work (NSFW) images.
We propose CROPS, a model-agnostic framework that defends against adversarial attacks that generate NSFW images, without requiring additional training.
arXiv Detail & Related papers (2025-01-09T16:43:21Z)
- When Image Generation Goes Wrong: A Safety Analysis of Stable Diffusion Models [0.0]
This study investigates the ability of ten popular Stable Diffusion models to generate harmful images.
We demonstrate that these models respond to harmful prompts by generating inappropriate content.
Our findings reveal a complete lack of refusal behavior or safety measures in the models examined.
arXiv Detail & Related papers (2024-11-23T10:42:43Z)
- Defending Text-to-image Diffusion Models: Surprising Efficacy of Textual Perturbations Against Backdoor Attacks [7.777211995715721]
We show that state-of-the-art backdoor attacks against text-to-image diffusion models can be effectively mitigated by a surprisingly simple defense strategy: textual perturbation.
Experiments show that textual perturbations are effective in defending against state-of-the-art backdoor attacks with minimal sacrifice to generation quality; a minimal sketch of this idea appears after this list.
arXiv Detail & Related papers (2024-08-28T11:36:43Z)
- MirrorCheck: Efficient Adversarial Defense for Vision-Language Models [55.73581212134293]
We propose a novel, yet elegantly simple approach for detecting adversarial samples in Vision-Language Models.
Our method leverages Text-to-Image (T2I) models to generate images based on captions produced by target VLMs.
Empirical evaluations conducted on different datasets validate the efficacy of our approach.
arXiv Detail & Related papers (2024-06-13T15:55:04Z)
- Concept Arithmetics for Circumventing Concept Inhibition in Diffusion Models [58.065255696601604]
We use the compositional property of diffusion models, which allows multiple prompts to be leveraged in a single image generation.
We argue that it is essential to consider all possible approaches to image generation with diffusion models that can be employed by an adversary.
arXiv Detail & Related papers (2024-04-21T16:35:16Z)
- Revealing Vulnerabilities in Stable Diffusion via Targeted Attacks [41.531913152661296]
We formulate the problem of targeted adversarial attack on Stable Diffusion and propose a framework to generate adversarial prompts.
Specifically, we design a gradient-based embedding optimization method to craft reliable adversarial prompts that guide Stable Diffusion to generate specific images.
After obtaining successful adversarial prompts, we reveal the mechanisms that cause the vulnerability of the model.
arXiv Detail & Related papers (2024-01-16T12:15:39Z)
- Object-oriented backdoor attack against image captioning [40.5688859498834]
Backdoor attacks against image classification tasks have been widely studied and proven successful.
In this paper, we explore backdoor attacks against image captioning models by poisoning training data.
Our method demonstrates the vulnerability of image captioning models to backdoor attacks, and we hope this work raises awareness of the need to defend against backdoor attacks in the image captioning field.
arXiv Detail & Related papers (2024-01-05T01:52:13Z)
- SA-Attack: Improving Adversarial Transferability of Vision-Language Pre-training Models via Self-Augmentation [56.622250514119294]
In contrast to white-box adversarial attacks, transfer attacks are more reflective of real-world scenarios.
We propose a self-augment-based transfer attack method, termed SA-Attack.
arXiv Detail & Related papers (2023-12-08T09:08:50Z)
- Nightshade: Prompt-Specific Poisoning Attacks on Text-to-Image Generative Models [26.301156075883483]
We show that poisoning attacks can be successful on generative models.
We introduce Nightshade, an optimized prompt-specific poisoning attack.
We show that Nightshade attacks can destabilize general features in a text-to-image generative model.
arXiv Detail & Related papers (2023-10-20T21:54:10Z)
- SurrogatePrompt: Bypassing the Safety Filter of Text-to-Image Models via Substitution [21.93748586123046]
We develop and demonstrate the first prompt attacks on Midjourney, resulting in the production of numerous NSFW images.
Our framework, SurrogatePrompt, systematically generates attack prompts, utilizing large language models, image-to-text, and image-to-image modules.
Results reveal an 88% success rate in bypassing Midjourney's proprietary safety filter with our attack prompts.
arXiv Detail & Related papers (2023-09-25T13:20:15Z)
- Prompting4Debugging: Red-Teaming Text-to-Image Diffusion Models by Finding Problematic Prompts [63.61248884015162]
Text-to-image diffusion models have shown remarkable ability in high-quality content generation.
This work proposes Prompting4Debugging (P4D) as a tool that automatically finds problematic prompts for diffusion models.
Our results show that around half of the prompts in existing safe-prompting benchmarks, originally considered "safe", can actually be manipulated to bypass many deployed safety mechanisms.
arXiv Detail & Related papers (2023-09-12T11:19:36Z)
- Adversarial Examples Make Strong Poisons [55.63469396785909]
We show that adversarial examples, originally intended for attacking pre-trained models, are even more effective for data poisoning than recent methods designed specifically for poisoning.
Our method, adversarial poisoning, is substantially more effective than existing poisoning methods for secure dataset release.
arXiv Detail & Related papers (2021-06-21T01:57:14Z)
- Deep Image Destruction: A Comprehensive Study on Vulnerability of Deep Image-to-Image Models against Adversarial Attacks [104.8737334237993]
We present comprehensive investigations into the vulnerability of deep image-to-image models to adversarial attacks.
For five popular image-to-image tasks, 16 deep models are analyzed from various standpoints.
We show that, unlike in image classification tasks, the performance degradation on image-to-image tasks can vary widely depending on various factors.
arXiv Detail & Related papers (2021-04-30T14:20:33Z)
- Adversarial Examples Detection beyond Image Space [88.7651422751216]
We find a consistent relationship between perturbations and prediction confidence, which guides us to detect few-perturbation attacks from the perspective of prediction confidence.
We propose a method that goes beyond image space via a two-stream architecture, in which the image stream focuses on pixel artifacts and the gradient stream handles confidence artifacts.
arXiv Detail & Related papers (2021-02-23T09:55:03Z)
- Backdooring and Poisoning Neural Networks with Image-Scaling Attacks [15.807243762876901]
We propose a novel strategy for hiding backdoor and poisoning attacks.
Our approach builds on a recent class of attacks against image scaling.
We show that backdoors and poisoning work equally well when combined with image-scaling attacks.
arXiv Detail & Related papers (2020-03-19T08:59:50Z)
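As referenced in the Defending Text-to-image Diffusion Models entry above, the textual-perturbation defense can be pictured with a short sketch: lightly perturbing a prompt before it reaches the generator makes it unlikely that an exact-match backdoor trigger survives verbatim, while the prompt's meaning is largely preserved. The character-swap rule, perturbation rate, and example prompt below are illustrative assumptions, not the paper's evaluated method.

```python
# Minimal sketch of a character-level textual perturbation applied to a prompt
# before text-to-image generation. The specific perturbation (swapping two
# adjacent interior characters in a fraction of words) is an illustrative
# assumption, not the exact defense evaluated in the paper.

import random

def perturb_prompt(prompt, rate=0.1, seed=None):
    """Randomly perturb a fraction of words so that an exact-match backdoor
    trigger embedded in the prompt is unlikely to survive verbatim."""
    rng = random.Random(seed)
    perturbed = []
    for word in prompt.split():
        if len(word) > 3 and rng.random() < rate:
            # Swap two adjacent interior characters of the word.
            i = rng.randrange(1, len(word) - 2)
            word = word[:i] + word[i + 1] + word[i] + word[i + 2:]
        perturbed.append(word)
    return " ".join(perturbed)

if __name__ == "__main__":
    prompt = "a photo of a mountain lake at sunrise"
    print(perturb_prompt(prompt, rate=0.3, seed=0))
```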