SneakyPrompt: Jailbreaking Text-to-image Generative Models
- URL: http://arxiv.org/abs/2305.12082v3
- Date: Fri, 10 Nov 2023 19:15:20 GMT
- Title: SneakyPrompt: Jailbreaking Text-to-image Generative Models
- Authors: Yuchen Yang, Bo Hui, Haolin Yuan, Neil Gong, Yinzhi Cao
- Abstract summary: We propose SneakyPrompt, the first automated attack framework, to jailbreak text-to-image generative models.
Given a prompt that is blocked by a safety filter, SneakyPrompt repeatedly queries the text-to-image generative model and strategically perturbs tokens in the prompt based on the query results to bypass the safety filter.
Our evaluation shows that SneakyPrompt not only successfully generates NSFW images, but also outperforms existing text adversarial attacks when extended to jailbreak text-to-image generative models.
- Score: 20.645304189835944
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-to-image generative models such as Stable Diffusion and DALL$\cdot$E
raise many ethical concerns due to the generation of harmful images such as
Not-Safe-for-Work (NSFW) ones. To address these ethical concerns, safety
filters are often adopted to prevent the generation of NSFW images. In this
work, we propose SneakyPrompt, the first automated attack framework, to
jailbreak text-to-image generative models such that they generate NSFW images
even if safety filters are adopted. Given a prompt that is blocked by a safety
filter, SneakyPrompt repeatedly queries the text-to-image generative model and
strategically perturbs tokens in the prompt based on the query results to
bypass the safety filter. Specifically, SneakyPrompt utilizes reinforcement
learning to guide the perturbation of tokens. Our evaluation shows that
SneakyPrompt successfully jailbreaks DALL$\cdot$E 2 with closed-box safety
filters to generate NSFW images. Moreover, we deploy several
state-of-the-art, open-source safety filters on a Stable Diffusion model. Our
evaluation shows that SneakyPrompt not only successfully generates NSFW images,
but also outperforms existing text adversarial attacks when extended to
jailbreak text-to-image generative models, in terms of both the number of
queries and the quality of the generated NSFW images. SneakyPrompt is open-source
and available at this repository:
\url{https://github.com/Yuchen413/text2image_safety}.
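To make the query-and-perturb loop described above concrete, here is a minimal Python sketch. It is not the authors' implementation: SneakyPrompt guides token substitution with reinforcement learning, which this sketch replaces with random search for brevity, and the `is_blocked`, `semantic_score`, and `vocab` inputs are assumed stand-ins for the target system's safety filter, a CLIP-style image-text similarity, and a candidate token list.

```python
import random

def perturb_and_query(prompt, is_blocked, semantic_score, vocab,
                      max_queries=60, threshold=0.26):
    """Search for an adversarial prompt that slips past a safety filter.

    is_blocked(p) -> bool       target system's safety filter (assumed given)
    semantic_score(p) -> float  similarity of the image generated from p to
                                the intent of the original blocked prompt
    vocab                       list of candidate substitute tokens
    """
    tokens = prompt.split()
    best = None
    for _ in range(max_queries):
        # Perturb one token of the blocked prompt with a candidate substitute.
        candidate = tokens[:]
        candidate[random.randrange(len(candidate))] = random.choice(vocab)
        candidate_prompt = " ".join(candidate)

        if is_blocked(candidate_prompt):   # filter still triggers: try again
            continue
        score = semantic_score(candidate_prompt)
        if score >= threshold:             # bypassed filter, semantics kept
            return candidate_prompt
        if best is None or score > best[1]:
            best = (candidate_prompt, score)  # remember best bypass so far
    return best[0] if best else None
```

The loop returns the first perturbed prompt that both passes the filter and keeps the semantic score above the (illustrative) threshold; failing that, it returns the best-scoring bypass found within the query budget.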
Related papers
- Automatic Jailbreaking of the Text-to-Image Generative AI Systems [76.9697122883554]
We study the safety of commercial T2I generation systems, such as ChatGPT, Copilot, and Gemini, with respect to copyright infringement under naive prompts.
We propose a stronger automated jailbreaking pipeline for T2I generation systems, which produces prompts that bypass their safety guards.
Our framework successfully jailbreaks ChatGPT with an 11.0% block rate, making it generate copyrighted content 76% of the time.
arXiv Detail & Related papers (2024-05-26T13:32:24Z)
- SafeGen: Mitigating Unsafe Content Generation in Text-to-Image Models [28.23494821842336]
Text-to-image models may be tricked into generating not-safe-for-work (NSFW) content, particularly in sexual scenarios.
We present SafeGen, a framework to mitigate unsafe content generation by text-to-image models in a text-agnostic manner.
arXiv Detail & Related papers (2024-04-10T00:26:08Z)
- Jailbreaking Prompt Attack: A Controllable Adversarial Attack against Diffusion Models [11.24680299774092]
We propose the Jailbreak Prompt Attack (JPA), an automatic attack framework.
We aim to craft prompts that bypass safety checkers while preserving the semantics of the original images.
Our evaluation demonstrates that JPA successfully bypasses both online services with closed-box safety checkers and offline safety-checker defenses to generate NSFW images.
arXiv Detail & Related papers (2024-04-02T09:49:35Z)
- BSPA: Exploring Black-box Stealthy Prompt Attacks against Image Generators [43.23698370787517]
Large image generators offer significant transformative potential across diverse sectors.
Some studies reveal that image generators are notably susceptible to attacks and can be induced to generate Not Suitable For Work (NSFW) content.
We introduce a black-box stealthy prompt attack that adopts a retriever to simulate attacks from API users.
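The retriever component lends itself to a short sketch. Assuming pre-computed embeddings (the abstract prescribes no particular encoder; any sentence encoder would do), a cosine-similarity lookup returns the benign-looking corpus prompts closest to a sensitive target concept:

```python
import numpy as np

def retrieve_candidate_prompts(query_vec, corpus_vecs, corpus_texts, k=5):
    """Cosine-similarity retrieval of candidate attack prompts.

    query_vec   : embedding of the sensitive target concept, shape (d,)
    corpus_vecs : embeddings of a prompt corpus, shape (n, d)
    corpus_texts: the n prompts themselves
    """
    sims = corpus_vecs @ query_vec / (
        np.linalg.norm(corpus_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9)
    top = np.argsort(-sims)[:k]        # indices of the k most similar prompts
    return [corpus_texts[i] for i in top]
```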
arXiv Detail & Related papers (2024-02-23T09:28:16Z)
- Breaking Free: How to Hack Safety Guardrails in Black-Box Diffusion Models! [52.0855711767075]
EvoSeed is an evolution strategy-based algorithmic framework for generating photo-realistic natural adversarial samples.
We employ CMA-ES to search for an initial seed vector that, when processed by the conditional diffusion model, yields a natural adversarial sample misclassified by the target model.
Experiments show that the generated adversarial images are of high quality, raising concerns about harmful content that bypasses safety classifiers.
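A minimal sketch of that search using the `cma` package's CMA-ES implementation; the diffusion model (`generate`) and the classifier margin being minimized are assumed black-box callables, and the hyperparameters are illustrative:

```python
import numpy as np
import cma  # pip install cma

def evolve_adversarial_seed(generate, classifier_margin, dim,
                            sigma=0.5, generations=50):
    """CMA-ES search over the diffusion model's initial seed vector.

    generate(z) -> image               conditional diffusion model (black box)
    classifier_margin(image) -> float  margin of the true class; driving it
                                       below zero means misclassification
    """
    es = cma.CMAEvolutionStrategy(np.zeros(dim), sigma)
    for _ in range(generations):
        seeds = es.ask()                               # candidate seed vectors
        losses = [classifier_margin(generate(np.asarray(z))) for z in seeds]
        es.tell(seeds, losses)                         # minimize the margin
        if es.result.fbest < 0:                        # true class dethroned
            break
    return np.asarray(es.result.xbest)
```

Searching over the seed rather than the pixels is what keeps the adversarial samples on the generator's natural image manifold.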
arXiv Detail & Related papers (2024-02-07T09:39:29Z)
- AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models [54.95912006700379]
We introduce AutoDAN, a novel jailbreak attack against aligned Large Language Models.
AutoDAN can automatically generate stealthy jailbreak prompts with a carefully designed hierarchical genetic algorithm.
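The abstract names only the algorithm family, so the following is a deliberately simplified, non-hierarchical sketch: sentence-level crossover plus word-level mutation over a prompt population. AutoDAN's actual operators, fitness function, and hierarchy differ; `fitness` and `mutate_word` are assumed helpers.

```python
import random

def genetic_prompt_search(seed_prompts, fitness, mutate_word,
                          generations=100, pop_size=20, elite=4):
    """Toy genetic search over jailbreak prompts (needs >= 2 seed prompts).

    fitness(p) -> float    e.g. target model's likelihood of a non-refusing
                           response to p (assumed available)
    mutate_word(w) -> str  synonym-style word replacement (assumed available)
    """
    population = list(seed_prompts)[:pop_size]
    for _ in range(generations):
        scored = sorted(population, key=fitness, reverse=True)
        nxt = scored[:elite]                         # elitism: keep the best
        while len(nxt) < pop_size:
            p1, p2 = random.sample(scored[: max(2, pop_size // 2)], 2)
            s1, s2 = p1.split(". "), p2.split(". ")
            cut = random.randint(1, max(1, min(len(s1), len(s2)) - 1))
            child = ". ".join(s1[:cut] + s2[cut:])   # sentence-level crossover
            words = child.split()
            i = random.randrange(len(words))
            words[i] = mutate_word(words[i])         # word-level mutation
            nxt.append(" ".join(words))
        population = nxt
    return max(population, key=fitness)
```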
arXiv Detail & Related papers (2023-10-03T19:44:37Z)
- SurrogatePrompt: Bypassing the Safety Filter of Text-To-Image Models via Substitution [22.882337899780968]
We develop and demonstrate the first prompt attacks on Midjourney, producing abundant NSFW images.
Our framework, SurrogatePrompt, systematically generates attack prompts, utilizing large language models, image-to-text, and image-to-image modules.
Results show an 88% success rate in bypassing Midjourney's proprietary safety filter with our attack prompts.
arXiv Detail & Related papers (2023-09-25T13:20:15Z)
- Prompting4Debugging: Red-Teaming Text-to-Image Diffusion Models by Finding Problematic Prompts [63.61248884015162]
Text-to-image diffusion models have shown remarkable ability in high-quality content generation.
This work proposes Prompting4Debugging (P4D) as a tool that automatically finds problematic prompts for diffusion models.
Our results show that around half of the prompts in existing safe-prompting benchmarks that were originally considered "safe" can actually be manipulated to bypass many deployed safety mechanisms.
arXiv Detail & Related papers (2023-09-12T11:19:36Z)
- FLIRT: Feedback Loop In-context Red Teaming [71.38594755628581]
We propose an automatic red teaming framework that evaluates a given model and exposes its vulnerabilities.
Our framework uses in-context learning in a feedback loop to red team models and trigger them into unsafe content generation.
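A compact sketch of such a feedback loop, with the red-teaming LM, the model under test, and the safety classifier all assumed as callables; the exemplar-update rule here is deliberately simple, whereas the paper explores several strategies:

```python
def red_team_feedback_loop(red_lm, target_model, is_unsafe,
                           seed_examples, rounds=50, k=5):
    """In-context red teaming with a feedback loop.

    red_lm(examples) -> str   writes a new attack prompt given k in-context
                              exemplars (assumed available)
    target_model(p) -> image  model under test
    is_unsafe(image) -> bool  safety classifier on the output
    """
    exemplars = list(seed_examples)
    successes = []
    for _ in range(rounds):
        attack = red_lm(exemplars[-k:])    # condition on recent exemplars
        if is_unsafe(target_model(attack)):
            successes.append(attack)
            exemplars.append(attack)       # feedback: reuse what worked
    return successes
```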
arXiv Detail & Related papers (2023-08-08T14:03:08Z)
- If at First You Don't Succeed, Try, Try Again: Faithful Diffusion-based Text-to-Image Generation by Selection [53.320946030761796]
Diffusion-based text-to-image (T2I) models can lack faithfulness to the text prompt.
We show that large T2I diffusion models are more faithful than usually assumed, and can generate images faithful to even complex prompts.
We introduce a pipeline that generates candidate images for a text prompt and picks the best one according to an automatic scoring system.
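The pipeline reduces to a few lines once the generator and the scorer are given; which automatic score to use is the paper's own contribution, so the `score` callable below is an assumption (a CLIP-style text-image alignment would be a natural choice):

```python
def best_of_n(prompt, generate, score, n=8):
    """Generate n candidates for a prompt and keep the highest-scoring one.

    generate(prompt, seed) -> image  T2I diffusion model (assumed given)
    score(prompt, image) -> float    automatic faithfulness score
    """
    candidates = [generate(prompt, seed) for seed in range(n)]
    return max(candidates, key=lambda img: score(prompt, img))
```

Selection trades extra inference compute for faithfulness without touching the model's weights.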
arXiv Detail & Related papers (2023-05-22T17:59:41Z)
- Rickrolling the Artist: Injecting Backdoors into Text Encoders for Text-to-Image Synthesis [16.421253324649555]
We introduce backdoor attacks against text-guided generative models.
Our attacks only slightly alter an encoder so that no suspicious model behavior is apparent for image generations with clean prompts.
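One way to realize such an attack is a distillation-style objective: keep clean prompts close to a frozen copy of the encoder (so behavior on clean inputs is unchanged), while pulling trigger-bearing prompts toward an attacker-chosen target embedding. The PyTorch step below is a sketch under that assumption; `encode`, the trigger token, and the loss weighting are illustrative, not the paper's exact recipe:

```python
import torch
import torch.nn.functional as F

def backdoor_train_step(student, teacher, optimizer, clean_prompts,
                        trigger, target_prompt, encode, beta=0.1):
    """One optimization step for a text-encoder backdoor.

    student / teacher : trainable and frozen copies of the text encoder
    encode(model, prompts) -> embeddings (tokenization assumed inside)
    """
    with torch.no_grad():  # reference embeddings from the frozen teacher
        clean_ref = encode(teacher, clean_prompts)
        target_ref = encode(teacher, [target_prompt] * len(clean_prompts))

    # Utility loss: clean prompts behave exactly as before (stealth).
    utility = F.mse_loss(encode(student, clean_prompts), clean_ref)
    # Backdoor loss: trigger-bearing prompts map to the target's embedding.
    poisoned = [trigger + " " + p for p in clean_prompts]
    backdoor = F.mse_loss(encode(student, poisoned), target_ref)

    loss = utility + beta * backdoor
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```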
arXiv Detail & Related papers (2022-11-04T12:36:36Z)