White-box Multimodal Jailbreaks Against Large Vision-Language Models
- URL: http://arxiv.org/abs/2405.17894v2
- Date: Mon, 14 Oct 2024 03:15:23 GMT
- Title: White-box Multimodal Jailbreaks Against Large Vision-Language Models
- Authors: Ruofan Wang, Xingjun Ma, Hanxu Zhou, Chuanjun Ji, Guangnan Ye, Yu-Gang Jiang
- Abstract summary: We propose a more comprehensive strategy that jointly attacks both text and image modalities to exploit a broader spectrum of vulnerability within Large Vision-Language Models.
Our attack method begins by optimizing an adversarial image prefix from random noise to generate diverse harmful responses in the absence of text input.
An adversarial text suffix is integrated and co-optimized with the adversarial image prefix to maximize the probability of eliciting affirmative responses to various harmful instructions.
- Score: 61.97578116584653
- Abstract: Recent advancements in Large Vision-Language Models (VLMs) have underscored their superiority in various multimodal tasks. However, the adversarial robustness of VLMs has not been fully explored. Existing methods mainly assess robustness through unimodal adversarial attacks that perturb images, while assuming inherent resilience against text-based attacks. Different from existing attacks, in this work we propose a more comprehensive strategy that jointly attacks both text and image modalities to exploit a broader spectrum of vulnerability within VLMs. Specifically, we propose a dual optimization objective aimed at guiding the model to generate affirmative responses with high toxicity. Our attack method begins by optimizing an adversarial image prefix from random noise to generate diverse harmful responses in the absence of text input, thus imbuing the image with toxic semantics. Subsequently, an adversarial text suffix is integrated and co-optimized with the adversarial image prefix to maximize the probability of eliciting affirmative responses to various harmful instructions. The discovered adversarial image prefix and text suffix are collectively denoted as a Universal Master Key (UMK). When integrated into various malicious queries, UMK can circumvent the alignment defenses of VLMs and lead to the generation of objectionable content, known as jailbreaks. The experimental results demonstrate that our universal attack strategy can effectively jailbreak MiniGPT-4 with a 96% success rate, highlighting the vulnerability of VLMs and the urgent need for new alignment strategies.
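- Illustrative sketch (not from the paper): the snippet below shows a toy version of the first optimization stage described above, namely gradient descent on an image prefix to raise the likelihood of a fixed target continuation. The `ToyVLM` stand-in model, the placeholder token IDs, the Adam optimizer, and all hyperparameters are assumptions made for illustration; the authors attack MiniGPT-4, and the second stage, a discrete co-optimization of an adversarial text suffix, is only indicated in a comment.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyVLM(nn.Module):
    """Tiny causal stand-in for a VLM: the image prefix is projected into a
    sequence of pseudo-tokens and prepended to the text embeddings."""
    def __init__(self, vocab_size=1000, dim=64, img_tokens=16):
        super().__init__()
        self.img_tokens, self.dim = img_tokens, dim
        self.img_proj = nn.Linear(3 * 32 * 32, img_tokens * dim)
        self.tok_emb = nn.Embedding(vocab_size, dim)
        self.backbone = nn.GRU(dim, dim, batch_first=True)  # left-to-right, causal
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, image, input_ids):
        img = self.img_proj(image.flatten(1)).view(-1, self.img_tokens, self.dim)
        txt = self.tok_emb(input_ids)
        hidden, _ = self.backbone(torch.cat([img, txt], dim=1))
        return self.head(hidden)  # (batch, img_tokens + text_len, vocab)

def target_nll(model, image, prompt_ids, target_ids):
    """Negative log-likelihood of the target continuation given image + prompt."""
    input_ids = torch.cat([prompt_ids, target_ids], dim=1)
    logits = model(image, input_ids)
    # Position t predicts token t+1; keep only the positions that predict target tokens.
    predict_text = logits[:, model.img_tokens - 1 : -1, :]
    predict_target = predict_text[:, prompt_ids.size(1):, :]
    return F.cross_entropy(
        predict_target.reshape(-1, predict_target.size(-1)),
        target_ids.reshape(-1),
    )

torch.manual_seed(0)
model = ToyVLM()
prompt_ids = torch.randint(0, 1000, (1, 8))   # placeholder instruction tokens
target_ids = torch.randint(0, 1000, (1, 6))   # placeholder target-response tokens
adv_image = torch.rand(1, 3, 32, 32, requires_grad=True)  # image prefix from random noise

optimizer = torch.optim.Adam([adv_image], lr=1e-2)
for step in range(200):
    optimizer.zero_grad()
    loss = target_nll(model, adv_image, prompt_ids, target_ids)
    loss.backward()
    optimizer.step()
    with torch.no_grad():
        adv_image.clamp_(0.0, 1.0)  # keep the image prefix in the valid pixel range
    # The paper's second stage alternates this continuous image update with a
    # discrete search over an adversarial text suffix appended to prompt_ids;
    # that part is omitted here.
```

The image side can be driven by ordinary gradient descent because pixels are continuous, whereas the text suffix consists of discrete tokens and needs a search-based update; this asymmetry is presumably why the method alternates a continuous image update with a discrete suffix update rather than relying on either modality alone.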
Related papers
- Adversarial Tuning: Defending Against Jailbreak Attacks for LLMs [13.317364896194903]
We propose a two-stage adversarial tuning framework to enhance Large Language Models' generalized defense capabilities.
In the first stage, we introduce the hierarchical meta-universal adversarial prompt learning to efficiently generate token-level adversarial prompts.
In the second stage, we propose the automatic adversarial prompt learning to iteratively refine semantic-level adversarial prompts.
arXiv Detail & Related papers (2024-06-07T15:37:15Z) - Jailbreak Vision Language Models via Bi-Modal Adversarial Prompt [60.54666043358946]
This paper introduces the Bi-Modal Adversarial Prompt Attack (BAP), which executes jailbreaks by optimizing textual and visual prompts cohesively.
In particular, we utilize a large language model to analyze jailbreak failures and employ chain-of-thought reasoning to refine textual prompts.
arXiv Detail & Related papers (2024-06-06T13:00:42Z) - Visual-RolePlay: Universal Jailbreak Attack on MultiModal Large Language Models via Role-playing Image Character [5.927633974815329]
We propose a novel and effective method called Visual Role-play (VRP) for MLLM jailbreak attacks.
VRP generates detailed descriptions of high-risk characters and creates corresponding images based on the descriptions.
When paired with benign role-play instruction texts, these high-risk character images effectively mislead MLLMs into generating malicious responses.
arXiv Detail & Related papers (2024-05-25T17:17:18Z) - Safeguarding Vision-Language Models Against Patched Visual Prompt Injectors [31.383591942592467]
Vision-language models (VLMs) offer innovative ways to combine visual and textual data for enhanced understanding and interaction.
Patch-based adversarial attacks are considered the most realistic threat model in physical vision applications.
We introduce SmoothVLM, a defense mechanism rooted in smoothing techniques, to protect VLMs from the threat of patched visual prompt injectors.
arXiv Detail & Related papers (2024-05-17T04:19:19Z) - Revisiting the Adversarial Robustness of Vision Language Models: a Multimodal Perspective [42.04728834962863]
Pretrained vision-language models (VLMs) like CLIP exhibit exceptional generalization across diverse downstream tasks.
Recent studies reveal their vulnerability to adversarial attacks, with defenses against text-based and multimodal attacks remaining largely unexplored.
This work presents the first comprehensive study on improving the adversarial robustness of VLMs against attacks targeting image, text, and multimodal inputs.
arXiv Detail & Related papers (2024-04-30T06:34:21Z) - VL-Trojan: Multimodal Instruction Backdoor Attacks against Autoregressive Visual Language Models [65.23688155159398]
Autoregressive Visual Language Models (VLMs) showcase impressive few-shot learning capabilities in a multimodal context.
Recently, multimodal instruction tuning has been proposed to further enhance instruction-following abilities.
Adversaries can implant a backdoor by injecting poisoned samples with triggers embedded in instructions or images.
We propose a multimodal instruction backdoor attack, namely VL-Trojan.
arXiv Detail & Related papers (2024-02-21T14:54:30Z) - Adversarial Prompt Tuning for Vision-Language Models [86.5543597406173]
Adversarial Prompt Tuning (AdvPT) is a technique to enhance the adversarial robustness of image encoders in Vision-Language Models (VLMs).
We demonstrate that AdvPT improves resistance against white-box and black-box adversarial attacks and exhibits a synergistic effect when combined with existing image-processing-based defense techniques.
arXiv Detail & Related papers (2023-11-19T07:47:43Z) - Visual Adversarial Examples Jailbreak Aligned Large Language Models [66.53468356460365]
We show that the continuous and high-dimensional nature of the visual input makes it a weak link against adversarial attacks.
We exploit visual adversarial examples to circumvent the safety guardrail of aligned LLMs with integrated vision.
Our study underscores the escalating adversarial risks associated with the pursuit of multimodality.
arXiv Detail & Related papers (2023-06-22T22:13:03Z) - On Evaluating Adversarial Robustness of Large Vision-Language Models [64.66104342002882]
We evaluate the robustness of large vision-language models (VLMs) in the most realistic and high-risk setting.
In particular, we first craft targeted adversarial examples against pretrained models such as CLIP and BLIP.
Black-box queries on these VLMs can further improve the effectiveness of targeted evasion.
arXiv Detail & Related papers (2023-05-26T13:49:44Z)