Jailbreaking Attack against Multimodal Large Language Model
- URL: http://arxiv.org/abs/2402.02309v1
- Date: Sun, 4 Feb 2024 01:29:24 GMT
- Title: Jailbreaking Attack against Multimodal Large Language Model
- Authors: Zhenxing Niu and Haodong Ren and Xinbo Gao and Gang Hua and Rong Jin
- Abstract summary: This paper focuses on jailbreaking attacks against multi-modal large language models (MLLMs).
A maximum likelihood-based algorithm is proposed to find an image Jailbreaking Prompt (imgJP).
Our approach exhibits strong model-transferability, as the generated imgJP can be transferred to jailbreak various models.
- Score: 69.52466793164618
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper focuses on jailbreaking attacks against multi-modal large language
models (MLLMs), seeking to elicit MLLMs to generate objectionable responses to
harmful user queries. A maximum likelihood-based algorithm is proposed to find
an image Jailbreaking Prompt (imgJP), enabling jailbreaks against MLLMs
across multiple unseen prompts and images (i.e., data-universal property). Our
approach exhibits strong model-transferability, as the generated imgJP can be
transferred to jailbreak various models, including MiniGPT-v2, LLaVA,
InstructBLIP, and mPLUG-Owl2, in a black-box manner. Moreover, we reveal a
connection between MLLM-jailbreaks and LLM-jailbreaks. As a result, we
introduce a construction-based method to harness our approach for
LLM-jailbreaks, demonstrating greater efficiency than current state-of-the-art
methods. The code is available here. Warning: some content generated by
language models may be offensive to some readers.
Related papers
- Rewrite to Jailbreak: Discover Learnable and Transferable Implicit Harmfulness Instruction [32.04296423547049]
Large Language Models (LLMs) are widely applied in various domains.
We propose the Rewrite to Jailbreak (R2J) approach, a transferable black-box jailbreak method to attack LLMs.
arXiv Detail & Related papers (2025-02-16T11:43:39Z)
- BaThe: Defense against the Jailbreak Attack in Multimodal Large Language Models by Treating Harmful Instruction as Backdoor Trigger [67.75420257197186]
In this work, we propose BaThe, a simple yet effective jailbreak defense mechanism.
A jailbreak backdoor attack uses harmful instructions combined with manually crafted strings as triggers to make the backdoored model generate prohibited responses.
We assume that harmful instructions can function as triggers, and if we alternatively set rejection responses as the triggered response, the backdoored model then can defend against jailbreak attacks.
arXiv Detail & Related papers (2024-08-17T04:43:26Z)
- EnJa: Ensemble Jailbreak on Large Language Models [69.13666224876408]
Large Language Models (LLMs) are increasingly being deployed in safety-critical applications.
LLMs can still be jailbroken by carefully crafted malicious prompts, producing content that violates policy regulations.
We propose a novel EnJa attack to hide harmful instructions using prompt-level jailbreak, boost the attack success rate using a gradient-based attack, and connect the two types of jailbreak attacks via a template-based connector.
arXiv Detail & Related papers (2024-08-07T07:46:08Z)
- Efficient LLM-Jailbreaking by Introducing Visual Modality [28.925716670778076]
This paper focuses on jailbreaking attacks against large language models (LLMs).
Our approach begins by constructing a multimodal large language model (MLLM) through the incorporation of a visual module into the target LLM.
We convert the embJS into text space to facilitate the jailbreaking of the target LLM.
arXiv Detail & Related papers (2024-05-30T12:50:32Z)
- JailBreakV: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks [24.69275959735538]
This paper investigates whether techniques that successfully jailbreak Large Language Models can be equally effective in jailbreaking MLLMs.
We introduce JailBreakV-28K, a pioneering benchmark designed to assess the transferability of LLM jailbreak techniques to MLLMs.
We generate 20,000 text-based jailbreak prompts using advanced jailbreak attacks on LLMs, alongside 8,000 image-based jailbreak inputs from recent MLLM jailbreak attacks.
arXiv Detail & Related papers (2024-04-03T19:23:18Z)
- Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models [107.88745040504887]
We study the harmlessness alignment problem of multimodal large language models (MLLMs).
Inspired by this, we propose a novel jailbreak method named HADES, which hides and amplifies the harmfulness of the malicious intent within the text input.
Experimental results show that HADES can effectively jailbreak existing MLLMs, achieving an average Attack Success Rate (ASR) of 90.26% for LLaVA-1.5 and 71.60% for Gemini Pro Vision.
arXiv Detail & Related papers (2024-03-14T18:24:55Z)
- A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily [51.63085197162279]
Large Language Models (LLMs) are designed to provide useful and safe responses.
Adversarial prompts known as 'jailbreaks' can circumvent safeguards.
We propose ReNeLLM, an automatic framework that leverages LLMs themselves to generate effective jailbreak prompts.
arXiv Detail & Related papers (2023-11-14T16:02:16Z)
- Jailbreaking Black Box Large Language Models in Twenty Queries [97.29563503097995]
Large language models (LLMs) are vulnerable to adversarial jailbreaks.
We propose an algorithm that generates semantic jailbreaks with only black-box access to an LLM.
arXiv Detail & Related papers (2023-10-12T15:38:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.