Jailbreaking Attack against Multimodal Large Language Model
- URL: http://arxiv.org/abs/2402.02309v1
- Date: Sun, 4 Feb 2024 01:29:24 GMT
- Title: Jailbreaking Attack against Multimodal Large Language Model
- Authors: Zhenxing Niu and Haodong Ren and Xinbo Gao and Gang Hua and Rong Jin
- Abstract summary: This paper focuses on jailbreaking attacks against multi-modal large language models (MLLMs)
A maximum likelihood-based algorithm is proposed to find an emphimage Jailbreaking Prompt (imgJP)
Our approach exhibits strong model-transferability, as the generated imgJP can be transferred to jailbreak various models.
- Score: 69.52466793164618
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper focuses on jailbreaking attacks against multi-modal large language
models (MLLMs), seeking to elicit MLLMs to generate objectionable responses to
harmful user queries. A maximum likelihood-based algorithm is proposed to find
an \emph{image Jailbreaking Prompt} (imgJP), enabling jailbreaks against MLLMs
across multiple unseen prompts and images (i.e., data-universal property). Our
approach exhibits strong model-transferability, as the generated imgJP can be
transferred to jailbreak various models, including MiniGPT-v2, LLaVA,
InstructBLIP, and mPLUG-Owl2, in a black-box manner. Moreover, we reveal a
connection between MLLM-jailbreaks and LLM-jailbreaks. As a result, we
introduce a construction-based method to harness our approach for
LLM-jailbreaks, demonstrating greater efficiency than current state-of-the-art
methods. The code is available here. \textbf{Warning: some content generated by
language models may be offensive to some readers.}
Related papers
- Efficient LLM-Jailbreaking by Introducing Visual Modality [28.925716670778076]
This paper focuses on jailbreaking attacks against large language models (LLMs)
Our approach begins by constructing a multimodal large language model (MLLM) through the incorporation of a visual module into the target LLM.
We convert the embJS into text space to facilitate the jailbreaking of the target LLM.
arXiv Detail & Related papers (2024-05-30T12:50:32Z) - JailBreakV-28K: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks [24.69275959735538]
This paper investigates whether techniques that successfully jailbreak Large Language Models can be equally effective in jailbreaking MLLMs.
We introduce JailBreakV-28K, a pioneering benchmark designed to assess the transferability of LLM jailbreak techniques to MLLMs.
We generate 20, 000 text-based jailbreak prompts using advanced jailbreak attacks on LLMs, alongside 8, 000 image-based jailbreak inputs from recent MLLMs jailbreak attacks.
arXiv Detail & Related papers (2024-04-03T19:23:18Z) - Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models [107.88745040504887]
We study the harmlessness alignment problem of multimodal large language models (MLLMs)
Inspired by this, we propose a novel jailbreak method named HADES, which hides and amplifies the harmfulness of the malicious intent within the text input.
Experimental results show that HADES can effectively jailbreak existing MLLMs, which achieves an average Attack Success Rate (ASR) of 90.26% for LLaVA-1.5 and 71.60% for Gemini Pro Vision.
arXiv Detail & Related papers (2024-03-14T18:24:55Z) - AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting [54.931241667414184]
We propose textbfAdaptive textbfShield Prompting, which prepends inputs with defense prompts to defend MLLMs against structure-based jailbreak attacks.
Our methods can consistently improve MLLMs' robustness against structure-based jailbreak attacks.
arXiv Detail & Related papers (2024-03-14T15:57:13Z) - A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily [51.63085197162279]
Large Language Models (LLMs) are designed to provide useful and safe responses.
adversarial prompts known as 'jailbreaks' can circumvent safeguards.
We propose ReNeLLM, an automatic framework that leverages LLMs themselves to generate effective jailbreak prompts.
arXiv Detail & Related papers (2023-11-14T16:02:16Z) - Jailbreaking Black Box Large Language Models in Twenty Queries [97.29563503097995]
Large language models (LLMs) are vulnerable to adversarial jailbreaks.
We propose an algorithm that generates semantic jailbreaks with only black-box access to an LLM.
arXiv Detail & Related papers (2023-10-12T15:38:28Z) - Universal and Transferable Adversarial Attacks on Aligned Language
Models [118.41733208825278]
We propose a simple and effective attack method that causes aligned language models to generate objectionable behaviors.
Surprisingly, we find that the adversarial prompts generated by our approach are quite transferable.
arXiv Detail & Related papers (2023-07-27T17:49:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.