SneakyPrompt: Jailbreaking Text-to-image Generative Models
- URL: http://arxiv.org/abs/2305.12082v3
- Date: Fri, 10 Nov 2023 19:15:20 GMT
- Title: SneakyPrompt: Jailbreaking Text-to-image Generative Models
- Authors: Yuchen Yang, Bo Hui, Haolin Yuan, Neil Gong, Yinzhi Cao
- Abstract summary: We propose SneakyPrompt, the first automated attack framework, to jailbreak text-to-image generative models.
Given a prompt that is blocked by a safety filter, SneakyPrompt repeatedly queries the text-to-image generative model and strategically perturbs tokens in the prompt based on the query results to bypass the safety filter.
Our evaluation shows that SneakyPrompt not only successfully generates NSFW images, but also outperforms existing text adversarial attacks when extended to jailbreak text-to-image generative models.
- Score: 20.645304189835944
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-to-image generative models such as Stable Diffusion and DALL$\cdot$E
raise many ethical concerns due to the generation of harmful images such as
Not-Safe-for-Work (NSFW) ones. To address these ethical concerns, safety
filters are often adopted to prevent the generation of NSFW images. In this
work, we propose SneakyPrompt, the first automated attack framework, to
jailbreak text-to-image generative models such that they generate NSFW images
even if safety filters are adopted. Given a prompt that is blocked by a safety
filter, SneakyPrompt repeatedly queries the text-to-image generative model and
strategically perturbs tokens in the prompt based on the query results to
bypass the safety filter. Specifically, SneakyPrompt utilizes reinforcement
learning to guide the perturbation of tokens. Our evaluation shows that
SneakyPrompt successfully jailbreaks DALL$\cdot$E 2 with closed-box safety
filters to generate NSFW images. Moreover, we deploy several
state-of-the-art, open-source safety filters on a Stable Diffusion model. Our
evaluation shows that SneakyPrompt not only successfully generates NSFW images,
but also outperforms existing text adversarial attacks when extended to
jailbreak text-to-image generative models, in terms of both the number of
queries and the quality of the generated NSFW images. SneakyPrompt is open-source
and available at this repository:
\url{https://github.com/Yuchen413/text2image_safety}.
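To make the query-and-perturb loop described above concrete, here is a minimal Python sketch. It is not the authors' implementation: SneakyPrompt guides token substitution with reinforcement learning, which this sketch replaces with random search for brevity, and the `is_blocked`, `semantic_score`, and `vocab` inputs are assumed stand-ins for the target system's safety filter, a CLIP-style image-text similarity, and a candidate token list.

```python
import random

def perturb_and_query(prompt, is_blocked, semantic_score, vocab,
                      max_queries=60, threshold=0.26):
    """Search for an adversarial prompt that slips past a safety filter.

    is_blocked(p) -> bool       target system's safety filter (assumed given)
    semantic_score(p) -> float  similarity of the image generated from p to
                                the intent of the original blocked prompt
    vocab                       list of candidate substitute tokens
    """
    tokens = prompt.split()
    best = None
    for _ in range(max_queries):
        # Perturb one token of the blocked prompt with a candidate substitute.
        candidate = tokens[:]
        candidate[random.randrange(len(candidate))] = random.choice(vocab)
        candidate_prompt = " ".join(candidate)

        if is_blocked(candidate_prompt):   # filter still triggers: try again
            continue
        score = semantic_score(candidate_prompt)
        if score >= threshold:             # bypassed filter, semantics kept
            return candidate_prompt
        if best is None or score > best[1]:
            best = (candidate_prompt, score)  # remember best bypass so far
    return best[0] if best else None
```

The loop returns the first perturbed prompt that both passes the filter and keeps the semantic score above the (illustrative) threshold; failing that, it returns the best-scoring bypass found within the query budget.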
Related papers
- Automatic Jailbreaking of the Text-to-Image Generative AI Systems [76.9697122883554]
We study the safety of commercial T2I generation systems, such as ChatGPT, Copilot, and Gemini, with respect to copyright infringement under naive prompts.
We propose a stronger automated jailbreaking pipeline for T2I generation systems, which produces prompts that bypass their safety guards.
Our framework successfully jailbreaks ChatGPT with an 11.0% block rate, making it generate copyrighted content 76% of the time.
arXiv Detail & Related papers (2024-05-26T13:32:24Z)
- SafeGen: Mitigating Unsafe Content Generation in Text-to-Image Models [28.23494821842336]
Text-to-image models may be tricked into generating not-safe-for-work (NSFW) content, particularly in sexual scenarios.
We present SafeGen, a framework to mitigate unsafe content generation by text-to-image models in a text-agnostic manner.
arXiv Detail & Related papers (2024-04-10T00:26:08Z)
- Jailbreaking Prompt Attack: A Controllable Adversarial Attack against Diffusion Models [11.24680299774092]
We propose the Jailbreak Prompt Attack (JPA), an automatic attack framework.
We aim to craft prompts that bypass safety checkers while preserving the semantics of the original images.
Our evaluation demonstrates that JPA successfully bypasses both online services with closed-box safety checkers and offline safety-checker defenses to generate NSFW images.
arXiv Detail & Related papers (2024-04-02T09:49:35Z)
- BSPA: Exploring Black-box Stealthy Prompt Attacks against Image Generators [43.23698370787517]
Large image generators offer significant transformative potential across diverse sectors.
Some studies reveal that image generators are notably susceptible to attacks and can be induced to generate Not Suitable For Work (NSFW) content.
We introduce a black-box stealthy prompt attack that adopts a retriever to simulate attacks from API users.
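The retriever component lends itself to a short sketch. Assuming pre-computed embeddings (the abstract prescribes no particular encoder; any sentence encoder would do), a cosine-similarity lookup returns the benign-looking corpus prompts closest to a sensitive target concept:

```python
import numpy as np

def retrieve_candidate_prompts(query_vec, corpus_vecs, corpus_texts, k=5):
    """Cosine-similarity retrieval of candidate attack prompts.

    query_vec   : embedding of the sensitive target concept, shape (d,)
    corpus_vecs : embeddings of a prompt corpus, shape (n, d)
    corpus_texts: the n prompts themselves
    """
    sims = corpus_vecs @ query_vec / (
        np.linalg.norm(corpus_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9)
    top = np.argsort(-sims)[:k]        # indices of the k most similar prompts
    return [corpus_texts[i] for i in top]
```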
arXiv Detail & Related papers (2024-02-23T09:28:16Z)
- Breaking Free: How to Hack Safety Guardrails in Black-Box Diffusion Models! [52.0855711767075]
EvoSeed is an evolution strategy-based algorithmic framework for generating photo-realistic natural adversarial samples.
We employ CMA-ES to search for an initial seed vector that, when processed by the conditional diffusion model, yields a natural adversarial sample misclassified by the target model.
Experiments show that the generated adversarial images are of high quality, raising concerns about harmful content that bypasses safety classifiers.
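A minimal sketch of that search using the `cma` package's CMA-ES implementation; the diffusion model (`generate`) and the classifier margin being minimized are assumed black-box callables, and the hyperparameters are illustrative:

```python
import numpy as np
import cma  # pip install cma

def evolve_adversarial_seed(generate, classifier_margin, dim,
                            sigma=0.5, generations=50):
    """CMA-ES search over the diffusion model's initial seed vector.

    generate(z) -> image               conditional diffusion model (black box)
    classifier_margin(image) -> float  margin of the true class; driving it
                                       below zero means misclassification
    """
    es = cma.CMAEvolutionStrategy(np.zeros(dim), sigma)
    for _ in range(generations):
        seeds = es.ask()                               # candidate seed vectors
        losses = [classifier_margin(generate(np.asarray(z))) for z in seeds]
        es.tell(seeds, losses)                         # minimize the margin
        if es.result.fbest < 0:                        # true class dethroned
            break
    return np.asarray(es.result.xbest)
```

Searching over the seed rather than the pixels is what keeps the adversarial samples on the generator's natural image manifold.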
arXiv Detail & Related papers (2024-02-07T09:39:29Z)
- AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models [54.95912006700379]
We introduce AutoDAN, a novel jailbreak attack against aligned Large Language Models.
AutoDAN can automatically generate stealthy jailbreak prompts with a carefully designed hierarchical genetic algorithm.
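The abstract names only the algorithm family, so the following is a deliberately simplified, non-hierarchical sketch: sentence-level crossover plus word-level mutation over a prompt population. AutoDAN's actual operators, fitness function, and hierarchy differ; `fitness` and `mutate_word` are assumed helpers.

```python
import random

def genetic_prompt_search(seed_prompts, fitness, mutate_word,
                          generations=100, pop_size=20, elite=4):
    """Toy genetic search over jailbreak prompts (needs >= 2 seed prompts).

    fitness(p) -> float    e.g. target model's likelihood of a non-refusing
                           response to p (assumed available)
    mutate_word(w) -> str  synonym-style word replacement (assumed available)
    """
    population = list(seed_prompts)[:pop_size]
    for _ in range(generations):
        scored = sorted(population, key=fitness, reverse=True)
        nxt = scored[:elite]                         # elitism: keep the best
        while len(nxt) < pop_size:
            p1, p2 = random.sample(scored[: max(2, pop_size // 2)], 2)
            s1, s2 = p1.split(". "), p2.split(". ")
            cut = random.randint(1, max(1, min(len(s1), len(s2)) - 1))
            child = ". ".join(s1[:cut] + s2[cut:])   # sentence-level crossover
            words = child.split()
            i = random.randrange(len(words))
            words[i] = mutate_word(words[i])         # word-level mutation
            nxt.append(" ".join(words))
        population = nxt
    return max(population, key=fitness)
```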
arXiv Detail & Related papers (2023-10-03T19:44:37Z)
- SurrogatePrompt: Bypassing the Safety Filter of Text-To-Image Models via Substitution [22.882337899780968]
We develop and demonstrate the first prompt attacks on Midjourney, producing abundant NSFW images.
Our framework, SurrogatePrompt, systematically generates attack prompts, utilizing large language models, image-to-text, and image-to-image modules.
Results show an 88% success rate in bypassing Midjourney's proprietary safety filter with our attack prompts.
arXiv Detail & Related papers (2023-09-25T13:20:15Z)
- Prompting4Debugging: Red-Teaming Text-to-Image Diffusion Models by Finding Problematic Prompts [63.61248884015162]
Text-to-image diffusion models have shown remarkable ability in high-quality content generation.
This work proposes Prompting4Debugging (P4D) as a tool that automatically finds problematic prompts for diffusion models.
Our results show that around half of the prompts in existing safe-prompting benchmarks that were originally considered "safe" can actually be manipulated to bypass many deployed safety mechanisms.
arXiv Detail & Related papers (2023-09-12T11:19:36Z)
- FLIRT: Feedback Loop In-context Red Teaming [71.38594755628581]
We propose an automatic red teaming framework that evaluates a given model and exposes its vulnerabilities.
Our framework uses in-context learning in a feedback loop to red team models and trigger them into unsafe content generation.
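A compact sketch of such a feedback loop, with the red-teaming LM, the model under test, and the safety classifier all assumed as callables; the exemplar-update rule here is deliberately simple, whereas the paper explores several strategies:

```python
def red_team_feedback_loop(red_lm, target_model, is_unsafe,
                           seed_examples, rounds=50, k=5):
    """In-context red teaming with a feedback loop.

    red_lm(examples) -> str   writes a new attack prompt given k in-context
                              exemplars (assumed available)
    target_model(p) -> image  model under test
    is_unsafe(image) -> bool  safety classifier on the output
    """
    exemplars = list(seed_examples)
    successes = []
    for _ in range(rounds):
        attack = red_lm(exemplars[-k:])    # condition on recent exemplars
        if is_unsafe(target_model(attack)):
            successes.append(attack)
            exemplars.append(attack)       # feedback: reuse what worked
    return successes
```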
arXiv Detail & Related papers (2023-08-08T14:03:08Z)
- If at First You Don't Succeed, Try, Try Again: Faithful Diffusion-based Text-to-Image Generation by Selection [53.320946030761796]
Diffusion-based text-to-image (T2I) models can lack faithfulness to the text prompt.
We show that large T2I diffusion models are more faithful than usually assumed, and can generate images faithful to even complex prompts.
We introduce a pipeline that generates candidate images for a text prompt and picks the best one according to an automatic scoring system.
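The pipeline reduces to a few lines once the generator and the scorer are given; which automatic score to use is the paper's own contribution, so the `score` callable below is an assumption (a CLIP-style text-image alignment would be a natural choice):

```python
def best_of_n(prompt, generate, score, n=8):
    """Generate n candidates for a prompt and keep the highest-scoring one.

    generate(prompt, seed) -> image  T2I diffusion model (assumed given)
    score(prompt, image) -> float    automatic faithfulness score
    """
    candidates = [generate(prompt, seed) for seed in range(n)]
    return max(candidates, key=lambda img: score(prompt, img))
```

Selection trades extra inference compute for faithfulness without touching the model's weights.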
arXiv Detail & Related papers (2023-05-22T17:59:41Z)
- Rickrolling the Artist: Injecting Backdoors into Text Encoders for Text-to-Image Synthesis [16.421253324649555]
We introduce backdoor attacks against text-guided generative models.
Our attacks only slightly alter an encoder so that no suspicious model behavior is apparent for image generations with clean prompts.
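One way to realize such an attack is a distillation-style objective: keep clean prompts close to a frozen copy of the encoder (so behavior on clean inputs is unchanged), while pulling trigger-bearing prompts toward an attacker-chosen target embedding. The PyTorch step below is a sketch under that assumption; `encode`, the trigger token, and the loss weighting are illustrative, not the paper's exact recipe:

```python
import torch
import torch.nn.functional as F

def backdoor_train_step(student, teacher, optimizer, clean_prompts,
                        trigger, target_prompt, encode, beta=0.1):
    """One optimization step for a text-encoder backdoor.

    student / teacher : trainable and frozen copies of the text encoder
    encode(model, prompts) -> embeddings (tokenization assumed inside)
    """
    with torch.no_grad():  # reference embeddings from the frozen teacher
        clean_ref = encode(teacher, clean_prompts)
        target_ref = encode(teacher, [target_prompt] * len(clean_prompts))

    # Utility loss: clean prompts behave exactly as before (stealth).
    utility = F.mse_loss(encode(student, clean_prompts), clean_ref)
    # Backdoor loss: trigger-bearing prompts map to the target's embedding.
    poisoned = [trigger + " " + p for p in clean_prompts]
    backdoor = F.mse_loss(encode(student, poisoned), target_ref)

    loss = utility + beta * backdoor
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```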
arXiv Detail & Related papers (2022-11-04T12:36:36Z)