Related papers: Automatic Jailbreaking of the Text-to-Image Generative AI Systems

Automatic Jailbreaking of the Text-to-Image Generative AI Systems

URL: http://arxiv.org/abs/2405.16567v2
Date: Tue, 28 May 2024 06:37:00 GMT
Title: Automatic Jailbreaking of the Text-to-Image Generative AI Systems
Authors: Minseon Kim, Hyomin Lee, Boqing Gong, Huishuai Zhang, Sung Ju Hwang,
Abstract summary: We study the safety of the commercial T2I generation systems, such as ChatGPT, Copilot, and Gemini, on copyright infringement with naive prompts. We propose a stronger automated jailbreaking pipeline for T2I generation systems, which produces prompts that bypass their safety guards. Our framework successfully jailbreaks the ChatGPT with 11.0% block rate, making it generate copyrighted contents in 76% of the time.
Score: 76.9697122883554
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Recent AI systems have shown extremely powerful performance, even surpassing human performance, on various tasks such as information retrieval, language generation, and image generation based on large language models (LLMs). At the same time, there are diverse safety risks that can cause the generation of malicious contents by circumventing the alignment in LLMs, which are often referred to as jailbreaking. However, most of the previous works only focused on the text-based jailbreaking in LLMs, and the jailbreaking of the text-to-image (T2I) generation system has been relatively overlooked. In this paper, we first evaluate the safety of the commercial T2I generation systems, such as ChatGPT, Copilot, and Gemini, on copyright infringement with naive prompts. From this empirical study, we find that Copilot and Gemini block only 12% and 17% of the attacks with naive prompts, respectively, while ChatGPT blocks 84% of them. Then, we further propose a stronger automated jailbreaking pipeline for T2I generation systems, which produces prompts that bypass their safety guards. Our automated jailbreaking framework leverages an LLM optimizer to generate prompts to maximize degree of violation from the generated images without any weight updates or gradient computation. Surprisingly, our simple yet effective approach successfully jailbreaks the ChatGPT with 11.0% block rate, making it generate copyrighted contents in 76% of the time. Finally, we explore various defense strategies, such as post-generation filtering and machine unlearning techniques, but found that they were inadequate, which suggests the necessity of stronger defense mechanisms.

Related papers

Rewrite to Jailbreak: Discover Learnable and Transferable Implicit Harmfulness Instruction [32.04296423547049]
Large Language Models (LLMs) are widely applied in various domains. We propose the Rewrite to Jailbreak (R2J) approach, a transferable black-box jailbreak method to attack LLMs.
arXiv Detail & Related papers (2025-02-16T11:43:39Z)
Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense [55.77152277982117]
We introduce Layer-AdvPatcher, a methodology designed to defend against jailbreak attacks. We use an unlearning strategy to patch specific layers within large language models through self-augmented datasets. Our framework reduces the harmfulness and attack success rate of jailbreak attacks.
arXiv Detail & Related papers (2025-01-05T19:06:03Z)
IDEATOR: Jailbreaking Large Vision-Language Models Using Themselves [67.30731020715496]
We propose a novel jailbreak method named IDEATOR, which autonomously generates malicious image-text pairs for black-box jailbreak attacks. IDEATOR uses a VLM to create targeted jailbreak texts and pairs them with jailbreak images generated by a state-of-the-art diffusion model. It achieves a 94% success rate in jailbreaking MiniGPT-4 with an average of only 5.34 queries, and high success rates of 82%, 88%, and 75% when transferred to LLaVA, InstructBLIP, and Meta's Chameleon.
arXiv Detail & Related papers (2024-10-29T07:15:56Z)
Deciphering the Chaos: Enhancing Jailbreak Attacks via Adversarial Prompt Translation [71.92055093709924]
We propose a novel method that "translates" garbled adversarial prompts into coherent and human-readable natural language adversarial prompts. It also offers a new approach to discovering effective designs for jailbreak prompts, advancing the understanding of jailbreak attacks. Our method achieves over 90% attack success rates against Llama-2-Chat models on AdvBench, despite their outstanding resistance to jailbreak attacks.
arXiv Detail & Related papers (2024-10-15T06:31:04Z)
Effective and Evasive Fuzz Testing-Driven Jailbreaking Attacks against LLMs [33.87649859430635]
Large Language Models (LLMs) have excelled in various tasks but are still vulnerable to jailbreaking attacks. We introduce a novel jailbreaking attack framework that adapts the black-box fuzz testing approach with a series of customized designs. Our method achieves attack success rates of over 90%,80% and 74%, respectively, exceeding existing baselines by more than 60%.
arXiv Detail & Related papers (2024-09-23T10:03:09Z)
Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses [37.56003689042975]
Many-shot (up to hundreds) demonstrations can jailbreak state-of-the-art LLMs by exploiting their long-context capability. We propose improved techniques such as injecting special system tokens like [/INST] and employing demo-level random search from a collected demo pool.
arXiv Detail & Related papers (2024-06-03T12:59:17Z)
Don't Listen To Me: Understanding and Exploring Jailbreak Prompts of Large Language Models [29.312244478583665]
generative AI has enabled ubiquitous access to large language models (LLMs) Jailbreak prompts have emerged as one of the most effective mechanisms to circumvent security restrictions and elicit harmful content originally designed to be prohibited. We show that users often succeeded in jailbreak prompts generation regardless of their expertise in LLMs. We also develop a system using AI as the assistant to automate the process of jailbreak prompt generation.
arXiv Detail & Related papers (2024-03-26T02:47:42Z)
Jailbreaking GPT-4V via Self-Adversarial Attacks with System Prompts [64.60375604495883]
We discover a system prompt leakage vulnerability in GPT-4V. By employing GPT-4 as a red teaming tool against itself, we aim to search for potential jailbreak prompts leveraging stolen system prompts. We also evaluate the effect of modifying system prompts to defend against jailbreaking attacks.
arXiv Detail & Related papers (2023-11-15T17:17:39Z)
Jailbreaking Black Box Large Language Models in Twenty Queries [97.29563503097995]
Large language models (LLMs) are vulnerable to adversarial jailbreaks. We propose an algorithm that generates semantic jailbreaks with only black-box access to an LLM.
arXiv Detail & Related papers (2023-10-12T15:38:28Z)
AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models [54.95912006700379]
We introduce AutoDAN, a novel jailbreak attack against aligned Large Language Models. AutoDAN can automatically generate stealthy jailbreak prompts by the carefully designed hierarchical genetic algorithm.
arXiv Detail & Related papers (2023-10-03T19:44:37Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.