Perception-guided Jailbreak against Text-to-Image Models
- URL: http://arxiv.org/abs/2408.10848v4
- Date: Sun, 09 Feb 2025 16:13:35 GMT
- Title: Perception-guided Jailbreak against Text-to-Image Models
- Authors: Yihao Huang, Le Liang, Tianlin Li, Xiaojun Jia, Run Wang, Weikai Miao, Geguang Pu, Yang Liu
- Abstract summary: We propose an LLM-driven perception-guided jailbreak method, termed PGJ.
It is a black-box jailbreak method that requires no specific T2I model (model-free) and generates highly natural attack prompts.
The experiments conducted on six open-source models and commercial online services with thousands of prompts have verified the effectiveness of PGJ.
- Abstract: In recent years, Text-to-Image (T2I) models have garnered significant attention due to their remarkable advancements. However, security concerns have emerged due to their potential to generate inappropriate or Not-Safe-For-Work (NSFW) images. In this paper, inspired by the observation that texts with different semantics can lead to similar human perceptions, we propose an LLM-driven perception-guided jailbreak method, termed PGJ. It is a black-box jailbreak method that requires no specific T2I model (model-free) and generates highly natural attack prompts. Specifically, we propose identifying a safe phrase that is similar in human perception yet inconsistent in text semantics with the target unsafe word and using it as a substitution. The experiments conducted on six open-source models and commercial online services with thousands of prompts have verified the effectiveness of PGJ.
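To make the substitution idea concrete, here is a minimal sketch of perception-guided word replacement, assuming an OpenAI-style chat model serves as the LLM; the prompt template, model name (gpt-4o-mini), and the example unsafe word are illustrative assumptions, not the authors' implementation or prompts.

```python
# Minimal sketch of perception-guided substitution (not the authors' code):
# ask an LLM for a safe phrase that a human would perceive similarly to an
# unsafe target word, then swap it into the T2I prompt.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def perception_safe_substitute(unsafe_word: str) -> str:
    """Return a safe phrase that is perceptually similar to `unsafe_word`
    but semantically different (illustrative prompt template)."""
    instruction = (
        "Suggest a short, safe, harmless phrase that would look similar to "
        f"'{unsafe_word}' in a generated image, while having a different "
        "textual meaning. Reply with the phrase only."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; any chat-capable model works
        messages=[{"role": "user", "content": instruction}],
    )
    return resp.choices[0].message.content.strip()


def build_attack_prompt(prompt: str, unsafe_words: list[str]) -> str:
    """Replace every listed unsafe word with its perception-level substitute,
    producing a natural-sounding prompt for a black-box T2I model."""
    for word in unsafe_words:
        prompt = prompt.replace(word, perception_safe_substitute(word))
    return prompt


if __name__ == "__main__":
    # Hypothetical example: "blood" might be replaced by a phrase such as
    # "red viscous ketchup-like liquid", which reads safe but is perceived
    # similarly in the resulting image.
    print(build_attack_prompt("a kitchen floor covered in blood", ["blood"]))
```

Because all the work happens at the prompt level, the sketch is model-free in the sense the abstract describes: it never queries the target T2I system while constructing the attack prompt.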
Related papers
- Safe + Safe = Unsafe? Exploring How Safe Images Can Be Exploited to Jailbreak Large Vision-Language Models [80.77246856082742]
Safety Snowball Agent (SSA) is a novel agent-based framework leveraging agents' autonomous and tool-using abilities to jailbreak LVLMs.
Our experiments demonstrate that SSA can use nearly any image to induce LVLMs to produce unsafe content, achieving high jailbreaking success rates against the latest LVLMs.
arXiv Detail & Related papers (2024-11-18T11:58:07Z)
- IDEATOR: Jailbreaking Large Vision-Language Models Using Themselves [67.30731020715496]
We propose a novel jailbreak method named IDEATOR, which autonomously generates malicious image-text pairs for black-box jailbreak attacks.
IDEATOR uses a VLM to create targeted jailbreak texts and pairs them with jailbreak images generated by a state-of-the-art diffusion model.
It achieves a 94% success rate in jailbreaking MiniGPT-4 with an average of only 5.34 queries, and high success rates of 82%, 88%, and 75% when transferred to LLaVA, InstructBLIP, and Meta's Chameleon.
arXiv Detail & Related papers (2024-10-29T07:15:56Z)
- BiasJailbreak: Analyzing Ethical Biases and Jailbreak Vulnerabilities in Large Language Models [0.0]
We introduce the concept of BiasJailbreak, highlighting the inherent risks posed by safety-induced biases in large language models (LLMs).
We propose an efficient defense method, BiasDefense, which prevents jailbreak attempts by injecting defense prompts prior to generation.
Our findings emphasize that ethical biases in LLMs can actually lead to the generation of unsafe output, and we suggest a method to make LLMs more secure and unbiased.
arXiv Detail & Related papers (2024-10-17T08:46:09Z)
- Multimodal Pragmatic Jailbreak on Text-to-image Models [43.67831238116829]
This work introduces a novel type of jailbreak, which triggers T2I models to generate images containing visual text.
We benchmark nine representative T2I models, including two closed-source commercial models.
All tested models are susceptible to this type of jailbreak, with unsafe-generation rates ranging from 8% to 74%.
arXiv Detail & Related papers (2024-09-27T21:23:46Z)
- BaThe: Defense against the Jailbreak Attack in Multimodal Large Language Models by Treating Harmful Instruction as Backdoor Trigger [67.75420257197186]
In this work, we propose BaThe, a simple yet effective jailbreak defense mechanism.
A jailbreak backdoor attack uses harmful instructions combined with manually crafted strings as triggers to make the backdoored model generate prohibited responses.
We treat harmful instructions themselves as triggers and instead set rejection responses as the triggered output, so that the backdoored model can defend against jailbreak attacks.
arXiv Detail & Related papers (2024-08-17T04:43:26Z)
- Jailbreak Vision Language Models via Bi-Modal Adversarial Prompt [60.54666043358946]
This paper introduces the Bi-Modal Adversarial Prompt Attack (BAP), which executes jailbreaks by optimizing textual and visual prompts cohesively.
In particular, we utilize a large language model to analyze jailbreak failures and employ chain-of-thought reasoning to refine textual prompts.
arXiv Detail & Related papers (2024-06-06T13:00:42Z)
- Automatic Jailbreaking of the Text-to-Image Generative AI Systems [76.9697122883554]
We study the safety of commercial T2I generation systems such as ChatGPT, Copilot, and Gemini with respect to copyright infringement under naive prompts.
We propose a stronger automated jailbreaking pipeline for T2I generation systems, which produces prompts that bypass their safety guards.
Our framework successfully jailbreaks ChatGPT with an 11.0% block rate, causing it to generate copyrighted content 76% of the time.
arXiv Detail & Related papers (2024-05-26T13:32:24Z)
- Latent Guard: a Safety Framework for Text-to-image Generation [64.49596711025993]
Existing safety measures are based either on text blacklists, which can be easily circumvented, or on harmful content classification.
We propose Latent Guard, a framework designed to improve safety measures in text-to-image generation.
Inspired by blacklist-based approaches, Latent Guard learns a latent space on top of the T2I model's text encoder, where it is possible to check the presence of harmful concepts (a generic sketch of this kind of embedding-space check appears after this list).
arXiv Detail & Related papers (2024-04-11T17:59:52Z)
- Jailbreaking Prompt Attack: A Controllable Adversarial Attack against Diffusion Models [10.70975463369742]
We present the Jailbreaking Prompt Attack (JPA).
JPA searches for the target malicious concepts in the text embedding space using a group of antonyms.
A prefix prompt is optimized in the discrete vocabulary space to align malicious concepts semantically in the text embedding space.
arXiv Detail & Related papers (2024-04-02T09:49:35Z)
- Backdooring Textual Inversion for Concept Censorship [34.84218971929207]
This paper focuses on the personalization technique dubbed Textual Inversion (TI).
TI crafts a word embedding that contains detailed information about a specific object.
To achieve concept censorship for a TI model, we propose injecting backdoors into the TI embeddings.
arXiv Detail & Related papers (2023-08-21T13:39:04Z)
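The blacklist-in-embedding-space idea mentioned in the Latent Guard entry above can be illustrated with a minimal sketch. This is not the paper's learned latent space or its training procedure; it simply compares CLIP text embeddings against an illustrative concept list with an assumed cosine-similarity threshold.

```python
# Generic sketch of checking a prompt for harmful concepts in a text-encoder
# embedding space (illustrative only; not Latent Guard's learned latent space).
import torch
from transformers import CLIPTokenizer, CLIPTextModel

MODEL_ID = "openai/clip-vit-base-patch32"
tokenizer = CLIPTokenizer.from_pretrained(MODEL_ID)
encoder = CLIPTextModel.from_pretrained(MODEL_ID).eval()

HARMFUL_CONCEPTS = ["graphic violence", "nudity"]  # illustrative blocklist
THRESHOLD = 0.75  # assumed value; a deployed system would calibrate this


@torch.no_grad()
def embed(texts: list[str]) -> torch.Tensor:
    """Encode texts into L2-normalized sentence-level embeddings."""
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    pooled = encoder(**inputs).pooler_output
    return torch.nn.functional.normalize(pooled, dim=-1)


def flags_harmful(prompt: str) -> bool:
    """Return True if the prompt embedding is close to any blocked concept."""
    sims = embed([prompt]) @ embed(HARMFUL_CONCEPTS).T  # cosine similarities
    return bool((sims > THRESHOLD).any())


if __name__ == "__main__":
    print(flags_harmful("a peaceful mountain landscape at sunrise"))
```

Matching concept embeddings rather than exact strings is what makes such a guard harder to bypass with simple paraphrases, the weakness of plain text blacklists that both the Latent Guard entry and the PGJ abstract point at.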
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences.