PiCo: Jailbreaking Multimodal Large Language Models via $\textbf{Pi}$ctorial $\textbf{Co}$de Contextualization
- URL: http://arxiv.org/abs/2504.01444v2
- Date: Mon, 07 Apr 2025 08:05:25 GMT
- Title: PiCo: Jailbreaking Multimodal Large Language Models via $\textbf{Pi}$ctorial $\textbf{Co}$de Contextualization
- Authors: Aofan Liu, Lulu Tang, Ting Pan, Yuguo Yin, Bin Wang, Ao Yang,
- Abstract summary: We present PiCo, a novel jailbreaking framework designed to bypass multi-tiered defense mechanisms in advanced MLLMs.<n>By embedding harmful intent within code-style visual instructions, PiCo achieves an average Attack Success Rate (ASR) of 84.13% on Gemini-Pro Vision and 52.66% on GPT-4.
- Score: 10.926597202005444
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Multimodal Large Language Models (MLLMs), which integrate vision and other modalities into Large Language Models (LLMs), significantly enhance AI capabilities but also introduce new security vulnerabilities. By exploiting the vulnerabilities of the visual modality and the long-tail distribution characteristic of code training data, we present PiCo, a novel jailbreaking framework designed to progressively bypass multi-tiered defense mechanisms in advanced MLLMs. PiCo employs a tier-by-tier jailbreak strategy, using token-level typographic attacks to evade input filtering and embedding harmful intent within programming context instructions to bypass runtime monitoring. To comprehensively assess the impact of attacks, a new evaluation metric is further proposed to assess both the toxicity and helpfulness of model outputs post-attack. By embedding harmful intent within code-style visual instructions, PiCo achieves an average Attack Success Rate (ASR) of 84.13% on Gemini-Pro Vision and 52.66% on GPT-4, surpassing previous methods. Experimental results highlight the critical gaps in current defenses, underscoring the need for more robust strategies to secure advanced MLLMs.
Related papers
- T2VShield: Model-Agnostic Jailbreak Defense for Text-to-Video Models [88.63040835652902]
Text to video models are vulnerable to jailbreak attacks, where specially crafted prompts bypass safety mechanisms and lead to the generation of harmful or unsafe content.
We propose T2VShield, a comprehensive and model agnostic defense framework designed to protect text to video models from jailbreak threats.
Our method systematically analyzes the input, model, and output stages to identify the limitations of existing defenses.
arXiv Detail & Related papers (2025-04-22T01:18:42Z) - MIRAGE: Multimodal Immersive Reasoning and Guided Exploration for Red-Team Jailbreak Attacks [85.3303135160762]
MIRAGE is a novel framework that exploits narrative-driven context and role immersion to circumvent safety mechanisms in Multimodal Large Language Models.<n>It achieves state-of-the-art performance, improving attack success rates by up to 17.5% over the best baselines.<n>We demonstrate that role immersion and structured semantic reconstruction can activate inherent model biases, facilitating the model's spontaneous violation of ethical safeguards.
arXiv Detail & Related papers (2025-03-24T20:38:42Z) - Adversarial Training for Multimodal Large Language Models against Jailbreak Attacks [17.75247947379804]
We present the first adversarial training paradigm tailored to defend against jailbreak attacks during the MLLM training phase.
We introduce Projection Layer Against Adversarial Training (ProEAT), an end-to-end AT framework.
ProEAT achieves state-of-the-art defense performance, outperforming existing baselines by an average margin of +34% across text and image modalities.
arXiv Detail & Related papers (2025-03-05T14:13:35Z) - Robust-LLaVA: On the Effectiveness of Large-Scale Robust Image Encoders for Multi-modal Large Language Models [26.656858396343726]
Multi-modal Large Language Models (MLLMs) excel in vision-language tasks but remain vulnerable to visual adversarial perturbations.<n>Existing methods seek to mitigate these risks by applying constrained adversarial fine-tuning to CLIP vision encoders on ImageNet-scale data.<n>We explore an alternative approach of leveraging existing vision classification models that have been adversarially pre-trained on large-scale data.
arXiv Detail & Related papers (2025-02-03T17:59:45Z) - Retention Score: Quantifying Jailbreak Risks for Vision Language Models [60.48306899271866]
Vision-Language Models (VLMs) are integrated with Large Language Models (LLMs) to enhance multi-modal machine learning capabilities.<n>This paper aims to assess the resilience of VLMs against jailbreak attacks that can compromise model safety compliance and result in harmful outputs.<n>To evaluate a VLM's ability to maintain its robustness against adversarial input perturbations, we propose a novel metric called the textbfRetention Score.
arXiv Detail & Related papers (2024-12-23T13:05:51Z) - $\textit{MMJ-Bench}$: A Comprehensive Study on Jailbreak Attacks and Defenses for Multimodal Large Language Models [11.02754617539271]
We introduce textitMMJ-Bench, a unified pipeline for evaluating jailbreak attacks and defense techniques for MLLMs.
We assess the effectiveness of various attack methods against SoTA MLLMs and evaluate the impact of defense mechanisms on both defense effectiveness and model utility.
arXiv Detail & Related papers (2024-08-16T00:18:23Z) - Cross-modality Information Check for Detecting Jailbreaking in Multimodal Large Language Models [17.663550432103534]
Multimodal Large Language Models (MLLMs) extend the capacity of LLMs to understand multimodal information comprehensively.
These models are susceptible to jailbreak attacks, where malicious users can break the safety alignment of the target model and generate misleading and harmful answers.
We propose Cross-modality Information DEtectoR (CIDER), a plug-and-play jailbreaking detector designed to identify maliciously perturbed image inputs.
arXiv Detail & Related papers (2024-07-31T15:02:46Z) - Defensive Prompt Patch: A Robust and Interpretable Defense of LLMs against Jailbreak Attacks [59.46556573924901]
This paper introduces Defensive Prompt Patch (DPP), a novel prompt-based defense mechanism for large language models (LLMs)
Unlike previous approaches, DPP is designed to achieve a minimal Attack Success Rate (ASR) while preserving the high utility of LLMs.
Empirical results conducted on LLAMA-2-7B-Chat and Mistral-7B-Instruct-v0.2 models demonstrate the robustness and adaptability of DPP.
arXiv Detail & Related papers (2024-05-30T14:40:35Z) - White-box Multimodal Jailbreaks Against Large Vision-Language Models [61.97578116584653]
We propose a more comprehensive strategy that jointly attacks both text and image modalities to exploit a broader spectrum of vulnerability within Large Vision-Language Models.
Our attack method begins by optimizing an adversarial image prefix from random noise to generate diverse harmful responses in the absence of text input.
An adversarial text suffix is integrated and co-optimized with the adversarial image prefix to maximize the probability of eliciting affirmative responses to various harmful instructions.
arXiv Detail & Related papers (2024-05-28T07:13:30Z) - VL-Trojan: Multimodal Instruction Backdoor Attacks against
Autoregressive Visual Language Models [65.23688155159398]
Autoregressive Visual Language Models (VLMs) showcase impressive few-shot learning capabilities in a multimodal context.
Recently, multimodal instruction tuning has been proposed to further enhance instruction-following abilities.
Adversaries can implant a backdoor by injecting poisoned samples with triggers embedded in instructions or images.
We propose a multimodal instruction backdoor attack, namely VL-Trojan.
arXiv Detail & Related papers (2024-02-21T14:54:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.