SurrogatePrompt: Bypassing the Safety Filter of Text-to-Image Models via Substitution
- URL: http://arxiv.org/abs/2309.14122v3
- Date: Thu, 17 Oct 2024 03:34:21 GMT
- Title: SurrogatePrompt: Bypassing the Safety Filter of Text-to-Image Models via Substitution
- Authors: Zhongjie Ba, Jieming Zhong, Jiachen Lei, Peng Cheng, Qinglong Wang, Zhan Qin, Zhibo Wang, Kui Ren,
- Abstract summary: We develop and exhibit the first prompt attacks on Midjourney, resulting in the production of abundant NSFW images.
Our framework, SurrogatePrompt, systematically generates attack prompts, utilizing large language models, image-to-text, and image-to-image modules.
Results disclose an 88% success rate in bypassing Midjourney's proprietary safety filter with our attack prompts.
- Score: 21.93748586123046
- License:
- Abstract: Advanced text-to-image models such as DALL$\cdot$E 2 and Midjourney possess the capacity to generate highly realistic images, raising significant concerns regarding the potential proliferation of unsafe content. This includes adult, violent, or deceptive imagery of political figures. Despite claims of rigorous safety mechanisms implemented in these models to restrict the generation of not-safe-for-work (NSFW) content, we successfully devise and exhibit the first prompt attacks on Midjourney, resulting in the production of abundant photorealistic NSFW images. We reveal the fundamental principles of such prompt attacks and suggest strategically substituting high-risk sections within a suspect prompt to evade closed-source safety measures. Our novel framework, SurrogatePrompt, systematically generates attack prompts, utilizing large language models, image-to-text, and image-to-image modules to automate attack prompt creation at scale. Evaluation results disclose an 88% success rate in bypassing Midjourney's proprietary safety filter with our attack prompts, leading to the generation of counterfeit images depicting political figures in violent scenarios. Both subjective and objective assessments validate that the images generated from our attack prompts present considerable safety hazards.
Related papers
- AdvI2I: Adversarial Image Attack on Image-to-Image Diffusion models [20.37481116837779]
AdvI2I is a novel framework that manipulates input images to induce diffusion models to generate NSFW content.
By optimizing a generator to craft adversarial images, AdvI2I circumvents existing defense mechanisms.
We show that both AdvI2I and AdvI2I-Adaptive can effectively bypass current safeguards.
arXiv Detail & Related papers (2024-10-28T19:15:06Z) - TrojVLM: Backdoor Attack Against Vision Language Models [50.87239635292717]
This study introduces TrojVLM, the first exploration of backdoor attacks aimed at Vision Language Models (VLMs)
TrojVLM inserts predetermined target text into output text when encountering poisoned images.
A novel semantic preserving loss is proposed to ensure the semantic integrity of the original image content.
arXiv Detail & Related papers (2024-09-28T04:37:09Z) - EIUP: A Training-Free Approach to Erase Non-Compliant Concepts Conditioned on Implicit Unsafe Prompts [32.590822043053734]
Non-toxic text still carries a risk of generating non-compliant images, which is referred to as implicit unsafe prompts.
We propose a simple yet effective approach that incorporates non-compliant concepts into an erasure prompt.
Our method exhibits superior erasure effectiveness while achieving high scores in image fidelity.
arXiv Detail & Related papers (2024-08-02T05:17:14Z) - MirrorCheck: Efficient Adversarial Defense for Vision-Language Models [55.73581212134293]
We propose a novel, yet elegantly simple approach for detecting adversarial samples in Vision-Language Models.
Our method leverages Text-to-Image (T2I) models to generate images based on captions produced by target VLMs.
Empirical evaluations conducted on different datasets validate the efficacy of our approach.
arXiv Detail & Related papers (2024-06-13T15:55:04Z) - Principles of Designing Robust Remote Face Anti-Spoofing Systems [60.05766968805833]
This paper sheds light on the vulnerabilities of state-of-the-art face anti-spoofing methods against digital attacks.
It presents a comprehensive taxonomy of common threats encountered in face anti-spoofing systems.
arXiv Detail & Related papers (2024-06-06T02:05:35Z) - White-box Multimodal Jailbreaks Against Large Vision-Language Models [61.97578116584653]
We propose a more comprehensive strategy that jointly attacks both text and image modalities to exploit a broader spectrum of vulnerability within Large Vision-Language Models.
Our attack method begins by optimizing an adversarial image prefix from random noise to generate diverse harmful responses in the absence of text input.
An adversarial text suffix is integrated and co-optimized with the adversarial image prefix to maximize the probability of eliciting affirmative responses to various harmful instructions.
arXiv Detail & Related papers (2024-05-28T07:13:30Z) - ART: Automatic Red-teaming for Text-to-Image Models to Protect Benign Users [18.3621509910395]
We propose a novel Automatic Red-Teaming framework, ART, to evaluate the safety risks of text-to-image models.
With our comprehensive experiments, we reveal the toxicity of the popular open-source text-to-image models.
We also introduce three large-scale red-teaming datasets for studying the safety risks associated with text-to-image models.
arXiv Detail & Related papers (2024-05-24T07:44:27Z) - On the Proactive Generation of Unsafe Images From Text-To-Image Models Using Benign Prompts [38.63253101205306]
Previous studies have successfully demonstrated that manipulated prompts can elicit text-to-image models to generate unsafe images.
We propose two poisoning attacks: a basic attack and a utility-preserving attack.
Our findings underscore the potential risks of adopting text-to-image models in real-world scenarios.
arXiv Detail & Related papers (2023-10-25T13:10:44Z) - Ring-A-Bell! How Reliable are Concept Removal Methods for Diffusion Models? [52.238883592674696]
Ring-A-Bell is a model-agnostic red-teaming tool for T2I diffusion models.
It identifies problematic prompts for diffusion models with the corresponding generation of inappropriate content.
Our results show that Ring-A-Bell, by manipulating safe prompting benchmarks, can transform prompts that were originally regarded as safe to evade existing safety mechanisms.
arXiv Detail & Related papers (2023-10-16T02:11:20Z) - Prompting4Debugging: Red-Teaming Text-to-Image Diffusion Models by Finding Problematic Prompts [63.61248884015162]
Text-to-image diffusion models have shown remarkable ability in high-quality content generation.
This work proposes Prompting4 Debugging (P4D) as a tool that automatically finds problematic prompts for diffusion models.
Our result shows that around half of prompts in existing safe prompting benchmarks which were originally considered "safe" can actually be manipulated to bypass many deployed safety mechanisms.
arXiv Detail & Related papers (2023-09-12T11:19:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.