On the Proactive Generation of Unsafe Images From Text-To-Image Models Using Benign Prompts
- URL: http://arxiv.org/abs/2310.16613v1
- Date: Wed, 25 Oct 2023 13:10:44 GMT
- Title: On the Proactive Generation of Unsafe Images From Text-To-Image Models Using Benign Prompts
- Authors: Yixin Wu, Ning Yu, Michael Backes, Yun Shen, Yang Zhang,
- Abstract summary: Previous studies have successfully demonstrated that manipulated prompts can elicit text-to-image models to generate unsafe images.
We propose two poisoning attacks: a basic attack and a utility-preserving attack.
Our findings underscore the potential risks of adopting text-to-image models in real-world scenarios.
- Score: 38.63253101205306
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-to-image models like Stable Diffusion have had a profound impact on daily life by enabling the generation of photorealistic images from textual prompts, fostering creativity, and enhancing visual experiences across various applications. However, these models also pose risks. Previous studies have successfully demonstrated that manipulated prompts can elicit text-to-image models to generate unsafe images, e.g., hateful meme variants. Yet, these studies only unleash the harmful power of text-to-image models in a passive manner. In this work, we focus on the proactive generation of unsafe images using targeted benign prompts via poisoning attacks. We propose two poisoning attacks: a basic attack and a utility-preserving attack. We qualitatively and quantitatively evaluate the proposed attacks using four representative hateful memes and multiple query prompts. Experimental results indicate that text-to-image models are vulnerable to the basic attack even with five poisoning samples. However, the poisoning effect can inadvertently spread to non-targeted prompts, leading to undesirable side effects. Root cause analysis identifies conceptual similarity as an important contributing factor to the side effects. To address this, we introduce the utility-preserving attack as a viable mitigation strategy to maintain the attack stealthiness, while ensuring decent attack performance. Our findings underscore the potential risks of adopting text-to-image models in real-world scenarios, calling for future research and safety measures in this space.
Related papers
- Defending Text-to-image Diffusion Models: Surprising Efficacy of Textual Perturbations Against Backdoor Attacks [7.777211995715721]
We show that state-of-the-art backdoor attacks against text-to-image diffusion models can be effectively mitigated by a surprisingly simple defense strategy - textual perturbation.
Experiments show that textual perturbations are effective in defending against state-of-the-art backdoor attacks with minimal sacrifice to generation quality.
arXiv Detail & Related papers (2024-08-28T11:36:43Z) - MirrorCheck: Efficient Adversarial Defense for Vision-Language Models [55.73581212134293]
We propose a novel, yet elegantly simple approach for detecting adversarial samples in Vision-Language Models.
Our method leverages Text-to-Image (T2I) models to generate images based on captions produced by target VLMs.
Empirical evaluations conducted on different datasets validate the efficacy of our approach.
arXiv Detail & Related papers (2024-06-13T15:55:04Z) - Concept Arithmetics for Circumventing Concept Inhibition in Diffusion Models [58.065255696601604]
We use compositional property of diffusion models, which allows to leverage multiple prompts in a single image generation.
We argue that it is essential to consider all possible approaches to image generation with diffusion models that can be employed by an adversary.
arXiv Detail & Related papers (2024-04-21T16:35:16Z) - Object-oriented backdoor attack against image captioning [40.5688859498834]
Backdoor attack against image classification task has been widely studied and proven to be successful.
In this paper, we explore backdoor attack towards image captioning models by poisoning training data.
Our method proves the weakness of image captioning models to backdoor attack and we hope this work can raise the awareness of defending against backdoor attack in the image captioning field.
arXiv Detail & Related papers (2024-01-05T01:52:13Z) - SA-Attack: Improving Adversarial Transferability of Vision-Language
Pre-training Models via Self-Augmentation [56.622250514119294]
In contrast to white-box adversarial attacks, transfer attacks are more reflective of real-world scenarios.
We propose a self-augment-based transfer attack method, termed SA-Attack.
arXiv Detail & Related papers (2023-12-08T09:08:50Z) - Nightshade: Prompt-Specific Poisoning Attacks on Text-to-Image Generative Models [26.301156075883483]
We show that poisoning attacks can be successful on generative models.
We introduce Nightshade, an optimized prompt-specific poisoning attack.
We show that Nightshade attacks can destabilize general features in a text-to-image generative model.
arXiv Detail & Related papers (2023-10-20T21:54:10Z) - SurrogatePrompt: Bypassing the Safety Filter of Text-to-Image Models via Substitution [21.93748586123046]
We develop and exhibit the first prompt attacks on Midjourney, resulting in the production of abundant NSFW images.
Our framework, SurrogatePrompt, systematically generates attack prompts, utilizing large language models, image-to-text, and image-to-image modules.
Results disclose an 88% success rate in bypassing Midjourney's proprietary safety filter with our attack prompts.
arXiv Detail & Related papers (2023-09-25T13:20:15Z) - Adversarial Examples Make Strong Poisons [55.63469396785909]
We show that adversarial examples, originally intended for attacking pre-trained models, are even more effective for data poisoning than recent methods designed specifically for poisoning.
Our method, adversarial poisoning, is substantially more effective than existing poisoning methods for secure dataset release.
arXiv Detail & Related papers (2021-06-21T01:57:14Z) - Deep Image Destruction: A Comprehensive Study on Vulnerability of Deep
Image-to-Image Models against Adversarial Attacks [104.8737334237993]
We present comprehensive investigations into the vulnerability of deep image-to-image models to adversarial attacks.
For five popular image-to-image tasks, 16 deep models are analyzed from various standpoints.
We show that unlike in image classification tasks, the performance degradation on image-to-image tasks can largely differ depending on various factors.
arXiv Detail & Related papers (2021-04-30T14:20:33Z) - Backdooring and Poisoning Neural Networks with Image-Scaling Attacks [15.807243762876901]
We propose a novel strategy for hiding backdoor and poisoning attacks.
Our approach builds on a recent class of attacks against image scaling.
We show that backdoors and poisoning work equally well when combined with image-scaling attacks.
arXiv Detail & Related papers (2020-03-19T08:59:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.