On the Proactive Generation of Unsafe Images From Text-To-Image Models Using Benign Prompts
- URL: http://arxiv.org/abs/2310.16613v2
- Date: Wed, 05 Feb 2025 08:16:21 GMT
- Title: On the Proactive Generation of Unsafe Images From Text-To-Image Models Using Benign Prompts
- Authors: Yixin Wu, Ning Yu, Michael Backes, Yun Shen, Yang Zhang
- Abstract summary: Malicious or manipulated prompts are known to exploit text-to-image models to generate unsafe images.
This paper investigates the proactive generation of unsafe images from benign prompts through maliciously modified text-to-image models.
We propose a stealthy poisoning attack method that balances covertness and performance.
- Score: 38.63253101205306
- Abstract: Malicious or manipulated prompts are known to exploit text-to-image models to generate unsafe images. Existing studies, however, focus on the passive exploitation of such harmful capabilities. In this paper, we investigate the proactive generation of unsafe images from benign prompts (e.g., a photo of a cat) through maliciously modified text-to-image models. Our preliminary investigation demonstrates that poisoning attacks are a viable method to achieve this goal but uncovers significant side effects, where unintended spread to non-targeted prompts compromises attack stealthiness. Root cause analysis identifies conceptual similarity as an important contributing factor to these side effects. To address this, we propose a stealthy poisoning attack method that balances covertness and performance. Our findings highlight the potential risks of adopting text-to-image models in real-world scenarios, thereby calling for future research and safety measures in this space.
Related papers
- CROPS: Model-Agnostic Training-Free Framework for Safe Image Synthesis with Latent Diffusion Models [13.799517170191919]
Recent research has shown that safety checkers are vulnerable to adversarial attacks that lead to the generation of Not Safe For Work (NSFW) images.
We propose CROPS, a model-agnostic framework that easily defends against adversarial attacks generating NSFW images without requiring additional training.
arXiv Detail & Related papers (2025-01-09T16:43:21Z)
- When Image Generation Goes Wrong: A Safety Analysis of Stable Diffusion Models [0.0]
This study investigates the ability of ten popular Stable Diffusion models to generate harmful images.
We demonstrate that these models respond to harmful prompts by generating inappropriate content.
Our findings demonstrate a complete lack of any refusal behavior or safety measures in the models observed.
arXiv Detail & Related papers (2024-11-23T10:42:43Z)
- Imperceptible Face Forgery Attack via Adversarial Semantic Mask [59.23247545399068]
We propose an Adversarial Semantic Mask Attack framework (ASMA) which can generate adversarial examples with good transferability and invisibility.
Specifically, we propose a novel adversarial semantic mask generative model, which can constrain generated perturbations in local semantic regions for good stealthiness.
arXiv Detail & Related papers (2024-06-16T10:38:11Z)
- Concept Arithmetics for Circumventing Concept Inhibition in Diffusion Models [58.065255696601604]
We use the compositional property of diffusion models, which allows multiple prompts to be leveraged in a single image generation.
We argue that it is essential to consider all possible approaches to image generation with diffusion models that can be employed by an adversary.
arXiv Detail & Related papers (2024-04-21T16:35:16Z)
- Revealing Vulnerabilities in Stable Diffusion via Targeted Attacks [41.531913152661296]
We formulate the problem of targeted adversarial attack on Stable Diffusion and propose a framework to generate adversarial prompts.
Specifically, we design a gradient-based embedding optimization method to craft reliable adversarial prompts that guide stable diffusion to generate specific images.
After obtaining successful adversarial prompts, we reveal the mechanisms that cause the vulnerability of the model.
arXiv Detail & Related papers (2024-01-16T12:15:39Z)
- SurrogatePrompt: Bypassing the Safety Filter of Text-to-Image Models via Substitution [21.93748586123046]
We develop and exhibit the first prompt attacks on Midjourney, resulting in the production of abundant NSFW images.
Our framework, SurrogatePrompt, systematically generates attack prompts, utilizing large language models, image-to-text, and image-to-image modules.
Results disclose an 88% success rate in bypassing Midjourney's proprietary safety filter with our attack prompts.
arXiv Detail & Related papers (2023-09-25T13:20:15Z)
- Prompting4Debugging: Red-Teaming Text-to-Image Diffusion Models by Finding Problematic Prompts [63.61248884015162]
Text-to-image diffusion models have shown remarkable ability in high-quality content generation.
This work proposes Prompting4Debugging (P4D) as a tool that automatically finds problematic prompts for diffusion models.
Our results show that around half of the prompts in existing safe prompting benchmarks that were originally considered "safe" can actually be manipulated to bypass many deployed safety mechanisms.
arXiv Detail & Related papers (2023-09-12T11:19:36Z)
- Deep Image Destruction: A Comprehensive Study on Vulnerability of Deep Image-to-Image Models against Adversarial Attacks [104.8737334237993]
We present comprehensive investigations into the vulnerability of deep image-to-image models to adversarial attacks.
For five popular image-to-image tasks, 16 deep models are analyzed from various standpoints.
We show that unlike in image classification tasks, the performance degradation on image-to-image tasks can largely differ depending on various factors.
arXiv Detail & Related papers (2021-04-30T14:20:33Z)
- Adversarial Examples Detection beyond Image Space [88.7651422751216]
We find that there exists compliance between perturbations and prediction confidence, which guides us to detect few-perturbation attacks from the aspect of prediction confidence.
We propose a method beyond image space by a two-stream architecture, in which the image stream focuses on the pixel artifacts and the gradient stream copes with the confidence artifacts.
arXiv Detail & Related papers (2021-02-23T09:55:03Z)
This list is automatically generated from the titles and abstracts of the papers on this site.