Related papers: IDEATOR: Jailbreaking VLMs Using VLMs

IDEATOR: Jailbreaking VLMs Using VLMs

URL: http://arxiv.org/abs/2411.00827v1
Date: Tue, 29 Oct 2024 07:15:56 GMT
Title: IDEATOR: Jailbreaking VLMs Using VLMs
Authors: Ruofan Wang, Bo Wang, Xingjun Ma, Yu-Gang Jiang,
Abstract summary: We propose a novel jailbreak method named IDEATOR, which autonomously generates malicious image-text pairs for black-box jailbreak attacks. IDEATOR employs a VLM to generate jailbreak texts while leveraging a state-of-the-art diffusion model to create corresponding jailbreak images. It successfully jailbreaks MiniGPT-4 with a 94% success rate and transfers seamlessly to LLaVA and InstructBLIP, achieving high success rates of 82% and 88%, respectively.
Score: 68.4760494411687
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: As large Vision-Language Models (VLMs) continue to gain prominence, ensuring their safety deployment in real-world applications has become a critical concern. Recently, significant research efforts have focused on evaluating the robustness of VLMs against jailbreak attacks. Due to challenges in obtaining multi-modal data, current studies often assess VLM robustness by generating adversarial or query-relevant images based on harmful text datasets. However, the jailbreak images generated this way exhibit certain limitations. Adversarial images require white-box access to the target VLM and are relatively easy to defend against, while query-relevant images must be linked to the target harmful content, limiting their diversity and effectiveness. In this paper, we propose a novel jailbreak method named IDEATOR, which autonomously generates malicious image-text pairs for black-box jailbreak attacks. IDEATOR is a VLM-based approach inspired by our conjecture that a VLM itself might be a powerful red team model for generating jailbreak prompts. Specifically, IDEATOR employs a VLM to generate jailbreak texts while leveraging a state-of-the-art diffusion model to create corresponding jailbreak images. Extensive experiments demonstrate the high effectiveness and transferability of IDEATOR. It successfully jailbreaks MiniGPT-4 with a 94% success rate and transfers seamlessly to LLaVA and InstructBLIP, achieving high success rates of 82% and 88%, respectively. IDEATOR uncovers previously unrecognized vulnerabilities in VLMs, calling for advanced safety mechanisms.

Related papers

Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense [55.77152277982117]
We introduce Layer-AdvPatcher, a methodology designed to defend against jailbreak attacks. We use an unlearning strategy to patch specific layers within large language models through self-augmented datasets. Our framework reduces the harmfulness and attack success rate of jailbreak attacks.
arXiv Detail & Related papers (2025-01-05T19:06:03Z)
Jailbreak Large Vision-Language Models Through Multi-Modal Linkage [14.025750623315561]
We propose a novel jailbreak attack framework: Multi-Modal (MML) Attack. Drawing inspiration from cryptography, MML utilizes an encryption-decryption process across text and image modalities to mitigate over-exposure of malicious information. Experiments demonstrate MML's effectiveness. Specifically, MML jailbreaks GPT-4o with attack success rates of 97.80% on SafeBench, 98.81% on MM-SafeBench and 99.07% on HADES-Dataset.
arXiv Detail & Related papers (2024-11-30T13:21:15Z)
BlueSuffix: Reinforced Blue Teaming for Vision-Language Models Against Jailbreak Attacks [62.58434630634917]
Vision-Language Models (VLMs) have been shown to be vulnerable to jailbreak attacks. We propose a novel blue-team method BlueSuffix that defends the black-box target VLM against jailbreak attacks without compromising its performance.
arXiv Detail & Related papers (2024-10-28T12:43:47Z)
Deciphering the Chaos: Enhancing Jailbreak Attacks via Adversarial Prompt Translation [71.92055093709924]
We propose a novel method that "translates" garbled adversarial prompts into coherent and human-readable natural language adversarial prompts. It also offers a new approach to discovering effective designs for jailbreak prompts, advancing the understanding of jailbreak attacks. Our method achieves over 90% attack success rates against Llama-2-Chat models on AdvBench, despite their outstanding resistance to jailbreak attacks.
arXiv Detail & Related papers (2024-10-15T06:31:04Z)
h4rm3l: A language for Composable Jailbreak Attack Synthesis [48.5611060845958]
h4rm3l is a novel approach that addresses the gap with a human-readable domain-specific language. We show that h4rm3l's synthesized attacks are diverse and more successful than existing jailbreak attacks in literature.
arXiv Detail & Related papers (2024-08-09T01:45:39Z)
RedAgent: Red Teaming Large Language Models with Context-aware Autonomous Language Agent [24.487441771427434]
We propose a multi-agent LLM system named RedAgent to generate context-aware jailbreak prompts. Our system can jailbreak most black-box LLMs in just five queries, improving the efficiency of existing red teaming methods by two times. We have reported all found issues and communicated with OpenAI and Meta for bug fixes.
arXiv Detail & Related papers (2024-07-23T17:34:36Z)
Image-to-Text Logic Jailbreak: Your Imagination can Help You Do Anything [4.477597131613079]
This paper introduces a novel dataset Flow-JD specifically designed to evaluate the logic-based flowchart jailbreak capabilities of VLMs. We conduct an extensive evaluation on GPT-4o, GPT-4V, other 5 SOTA open source VLMs and the jailbreak rate is up to 92.8%. Our research reveals significant vulnerabilities in current VLMs concerning image-to-text jailbreak and these findings underscore the urgency for the development of robust and effective future defenses.
arXiv Detail & Related papers (2024-07-01T16:58:55Z)
Virtual Context: Enhancing Jailbreak Attacks with Special Token Injection [54.05862550647966]
This paper introduces Virtual Context, which leverages special tokens, previously overlooked in LLM security, to improve jailbreak attacks. Comprehensive evaluations show that Virtual Context-assisted jailbreak attacks can improve the success rates of four widely used jailbreak methods by approximately 40%.
arXiv Detail & Related papers (2024-06-28T11:35:54Z)
Red Teaming GPT-4V: Are GPT-4V Safe Against Uni/Multi-Modal Jailbreak Attacks? [39.87609532392292]
This work builds a comprehensive jailbreak evaluation dataset with 1445 harmful questions covering 11 different safety policies. GPT4 and GPT-4V demonstrate better robustness against jailbreak attacks compared to open-source LLMs and MLLMs. Llama2 and Qwen-VL-Chat are more robust compared to other open-source models.
arXiv Detail & Related papers (2024-04-04T12:38:14Z)
EasyJailbreak: A Unified Framework for Jailbreaking Large Language Models [53.87416566981008]
This paper introduces EasyJailbreak, a unified framework simplifying the construction and evaluation of jailbreak attacks against Large Language Models (LLMs) It builds jailbreak attacks using four components: Selector, Mutator, Constraint, and Evaluator. Our validation across 10 distinct LLMs reveals a significant vulnerability, with an average breach probability of 60% under various jailbreaking attacks.
arXiv Detail & Related papers (2024-03-18T18:39:53Z)
Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models [107.88745040504887]
We study the harmlessness alignment problem of multimodal large language models (MLLMs) Inspired by this, we propose a novel jailbreak method named HADES, which hides and amplifies the harmfulness of the malicious intent within the text input. Experimental results show that HADES can effectively jailbreak existing MLLMs, which achieves an average Attack Success Rate (ASR) of 90.26% for LLaVA-1.5 and 71.60% for Gemini Pro Vision.
arXiv Detail & Related papers (2024-03-14T18:24:55Z)
Jailbreaking Attack against Multimodal Large Language Model [69.52466793164618]
This paper focuses on jailbreaking attacks against multi-modal large language models (MLLMs) A maximum likelihood-based algorithm is proposed to find an emphimage Jailbreaking Prompt (imgJP) Our approach exhibits strong model-transferability, as the generated imgJP can be transferred to jailbreak various models.
arXiv Detail & Related papers (2024-02-04T01:29:24Z)

This list is automatically generated from the titles and abstracts of the papers in this site.