SurrogatePrompt: Bypassing the Safety Filter of Text-To-Image Models via
Substitution
- URL: http://arxiv.org/abs/2309.14122v1
- Date: Mon, 25 Sep 2023 13:20:15 GMT
- Title: SurrogatePrompt: Bypassing the Safety Filter of Text-To-Image Models via
Substitution
- Authors: Zhongjie Ba, Jieming Zhong, Jiachen Lei, Peng Cheng, Qinglong Wang,
Zhan Qin, Zhibo Wang, Kui Ren
- Abstract summary: We develop and exhibit the first prompt attacks on Midjourney, resulting in the production of abundant NSFW images.
Our framework, SurrogatePrompt, systematically generates attack prompts, utilizing large language models, image-to-text, and image-to-image modules.
Results disclose an 88% success rate in bypassing Midjourney's proprietary safety filter with our attack prompts.
- Score: 22.882337899780968
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Advanced text-to-image models such as DALL-E 2 and Midjourney possess the
capacity to generate highly realistic images, raising significant concerns
regarding the potential proliferation of unsafe content. This includes adult,
violent, or deceptive imagery of political figures. Despite claims of rigorous
safety mechanisms implemented in these models to restrict the generation of
not-safe-for-work (NSFW) content, we successfully devise and exhibit the first
prompt attacks on Midjourney, resulting in the production of abundant
photorealistic NSFW images. We reveal the fundamental principles of such prompt
attacks and suggest strategically substituting high-risk sections within a
suspect prompt to evade closed-source safety measures. Our novel framework,
SurrogatePrompt, systematically generates attack prompts, utilizing large
language models, image-to-text, and image-to-image modules to automate attack
prompt creation at scale. Evaluation results disclose an 88% success rate in
bypassing Midjourney's proprietary safety filter with our attack prompts,
leading to the generation of counterfeit images depicting political figures in
violent scenarios. Both subjective and objective assessments validate that the
images generated from our attack prompts present considerable safety hazards.
Related papers
- MirrorCheck: Efficient Adversarial Defense for Vision-Language Models [55.73581212134293]
We propose a novel, yet elegantly simple approach for detecting adversarial samples in Vision-Language Models.
Our method leverages Text-to-Image (T2I) models to generate images based on captions produced by target VLMs.
Empirical evaluations conducted on different datasets validate the efficacy of our approach.
arXiv Detail & Related papers (2024-06-13T15:55:04Z)
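The caption-regenerate-compare idea behind MirrorCheck admits a compact sketch, assuming CLIP image embeddings, a Stable Diffusion pipeline as the T2I model, and a cosine-similarity threshold; the model names, the `embed`/`is_adversarial` helpers, and the threshold value are illustrative assumptions rather than the paper's exact configuration.

```python
# Hedged sketch of a caption-regenerate-compare check: regenerate an image
# from the target VLM's caption and flag the input if the regenerated image
# drifts too far from it. Model choices and threshold are assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor
from diffusers import StableDiffusionPipeline

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
t2i = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

def embed(image: Image.Image) -> torch.Tensor:
    # L2-normalized CLIP image embedding.
    inputs = proc(images=image, return_tensors="pt")
    with torch.no_grad():
        feats = clip.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

def is_adversarial(input_image: Image.Image, vlm_caption: str,
                   threshold: float = 0.6) -> bool:
    # Under a benign input, the caption describes the image well, so the
    # regenerated image stays close; adversarial inputs tend to drift.
    regenerated = t2i(vlm_caption).images[0]
    similarity = (embed(input_image) * embed(regenerated)).sum().item()
    return similarity < threshold
```

In practice the threshold would be calibrated on known-benign image-caption pairs rather than fixed a priori.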
- Principles of Designing Robust Remote Face Anti-Spoofing Systems [60.05766968805833]
This paper sheds light on the vulnerabilities of state-of-the-art face anti-spoofing methods against digital attacks.
It presents a comprehensive taxonomy of common threats encountered in face anti-spoofing systems.
arXiv Detail & Related papers (2024-06-06T02:05:35Z)
- White-box Multimodal Jailbreaks Against Large Vision-Language Models [61.97578116584653]
We propose a more comprehensive strategy that jointly attacks both text and image modalities to exploit a broader spectrum of vulnerabilities within Large Vision-Language Models.
Our attack method begins by optimizing an adversarial image prefix from random noise to generate diverse harmful responses in the absence of text input.
An adversarial text suffix is integrated and co-optimized with the adversarial image prefix to maximize the probability of eliciting affirmative responses to various harmful instructions.
arXiv Detail & Related papers (2024-05-28T07:13:30Z)
- ART: Automatic Red-teaming for Text-to-Image Models to Protect Benign Users [18.3621509910395]
We propose a novel Automatic Red-Teaming framework, ART, to evaluate the safety risks of text-to-image models.
Through comprehensive experiments, we reveal the toxicity of popular open-source text-to-image models.
We also introduce three large-scale red-teaming datasets for studying the safety risks associated with text-to-image models.
arXiv Detail & Related papers (2024-05-24T07:44:27Z)
- SafeGen: Mitigating Unsafe Content Generation in Text-to-Image Models [28.23494821842336]
Text-to-image models may be tricked into generating not-safe-for-work (NSFW) content, particularly in sexual scenarios.
We present SafeGen, a framework to mitigate unsafe content generation by text-to-image models in a text-agnostic manner.
arXiv Detail & Related papers (2024-04-10T00:26:08Z)
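SafeGen's own mechanism operates inside the diffusion model; purely as a contrasting baseline, the sketch below screens generated images post hoc with an off-the-shelf NSFW image classifier, which is likewise text-agnostic in that it never inspects the prompt. The checkpoint name, label set, and threshold are assumptions, and this is not the paper's method.

```python
# Minimal text-agnostic screening baseline (not SafeGen's mechanism):
# classify each generated image directly, ignoring the prompt entirely.
# The checkpoint name and its label set are assumptions.
from transformers import pipeline
from PIL import Image

nsfw_classifier = pipeline("image-classification",
                           model="Falconsai/nsfw_image_detection")

def passes_safety_screen(image: Image.Image,
                         max_nsfw_score: float = 0.5) -> bool:
    # Return True only if the classifier's "nsfw" score stays below threshold.
    scores = {r["label"]: r["score"] for r in nsfw_classifier(image)}
    return scores.get("nsfw", 0.0) <= max_nsfw_score
```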
- On the Proactive Generation of Unsafe Images From Text-To-Image Models Using Benign Prompts [38.63253101205306]
Previous studies have demonstrated that manipulated prompts can induce text-to-image models to generate unsafe images.
We propose two poisoning attacks: a basic attack and a utility-preserving attack.
Our findings underscore the potential risks of adopting text-to-image models in real-world scenarios.
arXiv Detail & Related papers (2023-10-25T13:10:44Z)
- Ring-A-Bell! How Reliable are Concept Removal Methods for Diffusion Models? [52.238883592674696]
Ring-A-Bell is a model-agnostic red-teaming tool for T2I diffusion models.
It identifies problematic prompts for diffusion models together with the corresponding inappropriate content they generate.
Our results show that Ring-A-Bell, by manipulating safe prompting benchmarks, can transform prompts originally regarded as safe into ones that evade existing safety mechanisms.
arXiv Detail & Related papers (2023-10-16T02:11:20Z)
- Prompting4Debugging: Red-Teaming Text-to-Image Diffusion Models by Finding Problematic Prompts [63.61248884015162]
Text-to-image diffusion models have shown remarkable ability in high-quality content generation.
This work proposes Prompting4Debugging (P4D) as a tool that automatically finds problematic prompts for diffusion models.
Our results show that around half of the prompts in existing safe-prompting benchmarks, originally considered "safe", can actually be manipulated to bypass many deployed safety mechanisms.
arXiv Detail & Related papers (2023-09-12T11:19:36Z)
- FLIRT: Feedback Loop In-context Red Teaming [71.38594755628581]
We propose an automatic red teaming framework that evaluates a given model and exposes its vulnerabilities.
Our framework uses in-context learning in a feedback loop to red-team models and steer them toward generating unsafe content.
arXiv Detail & Related papers (2023-08-08T14:03:08Z)
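The feedback loop can be pictured as the following structural skeleton, written in the spirit of FLIRT rather than as its implementation; the callables, the ranking-based update rule, and the round counts are assumptions, and no concrete attack content is included.

```python
# Structural sketch of a feedback-loop red-teaming harness. All components
# are injected as callables; nothing here encodes actual attack content.
from typing import Callable, List, Tuple

def red_team_loop(
    generate_prompt: Callable[[List[str]], str],   # LLM conditioned on in-context examples
    target_model: Callable[[str], object],         # model under evaluation, e.g. T2I
    safety_score: Callable[[object], float],       # higher = more unsafe output
    seed_examples: List[str],
    rounds: int = 10,
    k: int = 5,
) -> List[Tuple[str, float]]:
    examples = list(seed_examples)
    findings: List[Tuple[str, float]] = []
    for _ in range(rounds):
        prompt = generate_prompt(examples)   # propose a new test prompt
        output = target_model(prompt)        # query the model under test
        score = safety_score(output)         # evaluate the output
        findings.append((prompt, score))
        # Feedback step: keep the k most effective prompts as in-context
        # examples, so generation adapts across rounds.
        examples = [p for p, _ in sorted(findings, key=lambda x: -x[1])[:k]]
    return findings
```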
- Membership Inference Attacks Against Text-to-image Generation Models [23.39695974954703]
This paper performs the first privacy analysis of text-to-image generation models through the lens of membership inference.
We propose three key intuitions about membership information and design four attack methodologies accordingly.
All of the proposed attacks achieve strong performance, in some cases approaching an accuracy of 1, indicating that the corresponding risk is far more severe than that revealed by existing membership inference attacks.
arXiv Detail & Related papers (2022-10-03T14:31:39Z)
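One intuition behind such membership inference attacks can be sketched as a reconstruction-distance test: the model under audit tends to reproduce training pairs more faithfully. The sketch below is a hedged illustration for auditing purposes; the `t2i` and `embed` callables, the distance, and the threshold are hypothetical stand-ins, not the paper's four attack designs.

```python
# Hedged sketch of a reconstruction-distance membership test: if the model
# regenerates a (caption, image) pair unusually well, the pair may have been
# in its training set. Distance metric and threshold are assumptions.
from typing import Callable
from PIL import Image
import torch

def reconstruction_distance(
    candidate_image: Image.Image,
    caption: str,
    t2i: Callable[[str], Image.Image],             # text-to-image model under audit
    embed: Callable[[Image.Image], torch.Tensor],  # L2-normalized image embedding
) -> float:
    # Cosine distance between the candidate image and its regeneration.
    generated = t2i(caption)
    return 1.0 - (embed(candidate_image) * embed(generated)).sum().item()

def infer_membership(distance: float, threshold: float = 0.35) -> bool:
    # Pairs the model reconstructs unusually well are flagged as likely members.
    return distance < threshold
```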
This list is automatically generated from the titles and abstracts of the papers on this site.