Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models
- URL: http://arxiv.org/abs/2403.09792v2
- Date: Sun, 14 Apr 2024 11:57:55 GMT
- Title: Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models
- Authors: Yifan Li, Hangyu Guo, Kun Zhou, Wayne Xin Zhao, Ji-Rong Wen,
- Abstract summary: We study the harmlessness alignment problem of multimodal large language models (MLLMs)
Inspired by this, we propose a novel jailbreak method named HADES, which hides and amplifies the harmfulness of the malicious intent within the text input.
Experimental results show that HADES can effectively jailbreak existing MLLMs, which achieves an average Attack Success Rate (ASR) of 90.26% for LLaVA-1.5 and 71.60% for Gemini Pro Vision.
- Score: 107.88745040504887
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we study the harmlessness alignment problem of multimodal large language models (MLLMs). We conduct a systematic empirical analysis of the harmlessness performance of representative MLLMs and reveal that the image input poses the alignment vulnerability of MLLMs. Inspired by this, we propose a novel jailbreak method named HADES, which hides and amplifies the harmfulness of the malicious intent within the text input, using meticulously crafted images. Experimental results show that HADES can effectively jailbreak existing MLLMs, which achieves an average Attack Success Rate (ASR) of 90.26% for LLaVA-1.5 and 71.60% for Gemini Pro Vision. Our code and data will be publicly released.
Related papers
- IDEATOR: Jailbreaking VLMs Using VLMs [68.4760494411687]
We propose a novel jailbreak method named IDEATOR, which autonomously generates malicious image-text pairs for black-box jailbreak attacks.
IDEATOR employs a VLM to generate jailbreak texts while leveraging a state-of-the-art diffusion model to create corresponding jailbreak images.
It successfully jailbreaks MiniGPT-4 with a 94% success rate and transfers seamlessly to LLaVA and InstructBLIP, achieving high success rates of 82% and 88%, respectively.
arXiv Detail & Related papers (2024-10-29T07:15:56Z) - Figure it Out: Analyzing-based Jailbreak Attack on Large Language Models [21.252514293436437]
We propose Analyzing-based Jailbreak (ABJ) to combat jailbreak attacks on Large Language Models (LLMs)
ABJ achieves 94.8% attack success rate (ASR) and 1.06 attack efficiency (AE) on GPT-4-turbo-0409, demonstrating state-of-the-art attack effectiveness and efficiency.
arXiv Detail & Related papers (2024-07-23T06:14:41Z) - Jailbreak Vision Language Models via Bi-Modal Adversarial Prompt [60.54666043358946]
This paper introduces the Bi-Modal Adversarial Prompt Attack (BAP), which executes jailbreaks by optimizing textual and visual prompts cohesively.
In particular, we utilize a large language model to analyze jailbreak failures and employ chain-of-thought reasoning to refine textual prompts.
arXiv Detail & Related papers (2024-06-06T13:00:42Z) - White-box Multimodal Jailbreaks Against Large Vision-Language Models [61.97578116584653]
We propose a more comprehensive strategy that jointly attacks both text and image modalities to exploit a broader spectrum of vulnerability within Large Vision-Language Models.
Our attack method begins by optimizing an adversarial image prefix from random noise to generate diverse harmful responses in the absence of text input.
An adversarial text suffix is integrated and co-optimized with the adversarial image prefix to maximize the probability of eliciting affirmative responses to various harmful instructions.
arXiv Detail & Related papers (2024-05-28T07:13:30Z) - Visual-RolePlay: Universal Jailbreak Attack on MultiModal Large Language Models via Role-playing Image Character [5.927633974815329]
We propose a novel and effective method called Visual Role-play (VRP) for MLLM jailbreak attacks.
VRP generates detailed descriptions of high-risk characters and create corresponding images based on the descriptions.
When paired with benign role-play instruction texts, these high-risk character images effectively mislead MLLMs into generating malicious responses.
arXiv Detail & Related papers (2024-05-25T17:17:18Z) - JailBreakV-28K: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks [24.69275959735538]
This paper investigates whether techniques that successfully jailbreak Large Language Models can be equally effective in jailbreaking MLLMs.
We introduce JailBreakV-28K, a pioneering benchmark designed to assess the transferability of LLM jailbreak techniques to MLLMs.
We generate 20, 000 text-based jailbreak prompts using advanced jailbreak attacks on LLMs, alongside 8, 000 image-based jailbreak inputs from recent MLLMs jailbreak attacks.
arXiv Detail & Related papers (2024-04-03T19:23:18Z) - Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes [61.916827858666906]
Large Language Models (LLMs) are becoming a prominent generative AI tool, where the user enters a query and the LLM generates an answer.
To reduce harm and misuse, efforts have been made to align these LLMs to human values using advanced training techniques such as Reinforcement Learning from Human Feedback.
Recent studies have highlighted the vulnerability of LLMs to adversarial jailbreak attempts aiming at subverting the embedded safety guardrails.
This paper proposes a method called Gradient Cuff to detect jailbreak attempts.
arXiv Detail & Related papers (2024-03-01T03:29:54Z) - Jailbreaking Attack against Multimodal Large Language Model [69.52466793164618]
This paper focuses on jailbreaking attacks against multi-modal large language models (MLLMs)
A maximum likelihood-based algorithm is proposed to find an emphimage Jailbreaking Prompt (imgJP)
Our approach exhibits strong model-transferability, as the generated imgJP can be transferred to jailbreak various models.
arXiv Detail & Related papers (2024-02-04T01:29:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.