Distraction is All You Need for Multimodal Large Language Model Jailbreaking
- URL: http://arxiv.org/abs/2502.10794v2
- Date: Tue, 17 Jun 2025 02:28:34 GMT
- Title: Distraction is All You Need for Multimodal Large Language Model Jailbreaking
- Authors: Zuopeng Yang, Jiluan Fan, Anli Yan, Erdun Gao, Xin Lin, Tao Li, Kanghua Mo, Changyu Dong
- Abstract summary: We propose Contrasting Subimage Distraction Jailbreaking (CS-DJ) to disrupt MLLMs' alignment through multi-level distraction strategies. CS-DJ achieves an average attack success rate of 52.40% and an average ensemble attack success rate of 74.10%. These results reveal the potential of distraction-based approaches to exploit and bypass MLLMs' defenses.
- Score: 14.787247403225294
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal Large Language Models (MLLMs) bridge the gap between visual and textual data, enabling a range of advanced applications. However, complex internal interactions among visual elements and their alignment with text can introduce vulnerabilities, which may be exploited to bypass safety mechanisms. To address this, we analyze the relationship between image content and task and find that the complexity of subimages, rather than their content, is key. Building on this insight, we propose the Distraction Hypothesis, followed by a novel framework called Contrasting Subimage Distraction Jailbreaking (CS-DJ), which achieves jailbreaking by disrupting MLLMs' alignment through multi-level distraction strategies. CS-DJ consists of two components: structured distraction, achieved through query decomposition that induces a distributional shift by fragmenting harmful prompts into sub-queries, and visual-enhanced distraction, realized by constructing contrasting subimages to disrupt the interactions among visual elements within the model. This dual strategy disperses the model's attention, reducing its ability to detect and mitigate harmful content. Extensive experiments across five representative scenarios and four popular closed-source MLLMs, including GPT-4o-mini, GPT-4o, GPT-4V, and Gemini-1.5-Flash, demonstrate that CS-DJ achieves an average attack success rate of 52.40% and an average ensemble attack success rate of 74.10%. These results reveal the potential of distraction-based approaches to exploit and bypass MLLMs' defenses, offering new insights for attack strategies.
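As a concrete illustration of the two components described above, the Python sketch below shows one way to fragment a query into sub-queries and assemble visually contrasting subimages into a single composite input. The splitting heuristic, color palette, and function names are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of CS-DJ's two distraction components (not the authors' code).
from PIL import Image, ImageDraw

def decompose_query(query: str, n_parts: int = 3) -> list[str]:
    """Structured distraction: split a prompt into sub-queries to shift
    the input distribution away from the original harmful phrasing."""
    words = query.split()
    step = max(1, len(words) // n_parts)
    return [" ".join(words[i:i + step]) for i in range(0, len(words), step)]

def build_contrasting_subimages(sub_queries: list[str], tile: int = 256) -> Image.Image:
    """Visual-enhanced distraction: render each sub-query on a tile with a
    deliberately contrasting background, then concatenate the tiles."""
    palette = [(255, 255, 255), (0, 0, 0), (255, 0, 0), (0, 0, 255)]
    canvas = Image.new("RGB", (tile * len(sub_queries), tile))
    for i, text in enumerate(sub_queries):
        bg = palette[i % len(palette)]
        fg = (0, 0, 0) if sum(bg) > 382 else (255, 255, 255)  # readable text color
        patch = Image.new("RGB", (tile, tile), bg)
        ImageDraw.Draw(patch).text((10, tile // 2), text, fill=fg)
        canvas.paste(patch, (tile * i, 0))
    return canvas

composite = build_contrasting_subimages(decompose_query("example multi part query"))
composite.save("distraction_input.png")
```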
Related papers
- MIDAS: Multi-Image Dispersion and Semantic Reconstruction for Jailbreaking MLLMs [22.919956583415324]
We propose Multi-Image Dispersion and Semantic Reconstruction (MIDAS), a multimodal jailbreak framework that decomposes harmful semantics into risk-bearing subunits. MIDAS enforces longer and more structured multi-image chained reasoning.
arXiv Detail & Related papers (2026-02-28T09:29:36Z) - Contextual Image Attack: How Visual Context Exposes Multimodal Safety Vulnerabilities [34.64588827428617]
We propose a new image-centric attack method, Contextual Image Attack (CIA). CIA embeds harmful queries into seemingly benign visual contexts using four distinct visualization strategies. Our method significantly outperforms prior work, demonstrating that the visual modality itself is a potent vector for jailbreaking advanced MLLMs.
arXiv Detail & Related papers (2025-12-02T17:51:02Z) - Multi-Faceted Attack: Exposing Cross-Model Vulnerabilities in Defense-Equipped Vision-Language Models [54.61181161508336]
We introduce Multi-Faceted Attack (MFA), a framework that exposes general safety vulnerabilities in leading defense-equipped Vision-Language Models (VLMs). The core component of MFA is the Attention-Transfer Attack (ATA), which hides harmful instructions inside a meta task with competing objectives. MFA achieves a 58.5% success rate and consistently outperforms existing methods.
arXiv Detail & Related papers (2025-11-20T07:12:54Z) - VERA-V: Variational Inference Framework for Jailbreaking Vision-Language Models [19.867040067010674]
We introduce VERA-V, a variational inference framework that recasts multimodal jailbreak discovery as learning a joint posterior distribution over paired text-image prompts. We train a lightweight attacker to approximate the posterior, allowing efficient sampling of diverse jailbreaks. Experiments on the HarmBench and HADES benchmarks show that VERA-V consistently outperforms state-of-the-art baselines.
arXiv Detail & Related papers (2025-10-20T17:12:10Z) - Sequential Comics for Jailbreaking Multimodal Large Language Models via Structured Visual Storytelling [11.939828002077482]
Multimodal large language models (MLLMs) exhibit remarkable capabilities but remain susceptible to jailbreak attacks. We introduce a novel method that leverages sequential comic-style visual narratives to circumvent safety alignments in state-of-the-art MLLMs. Our approach achieves an average attack success rate of 83.5%, surpassing the prior state of the art by 46%.
arXiv Detail & Related papers (2025-10-16T18:30:26Z) - Bridging the Task Gap: Multi-Task Adversarial Transferability in CLIP and Its Derivatives [61.58574200236532]
Adversarial examples generated from fine-grained tasks often exhibit stronger transfer potential than those from coarse-grained tasks. We propose a novel framework, Multi-Task Adversarial CLIP (MT-AdvCLIP), which introduces a task-aware feature aggregation loss and generates perturbations with enhanced cross-task generalization capability.
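The summary suggests a loss aggregated across tasks; a minimal sketch of that idea follows, assuming a shared encoder with per-task heads. The function name and one-step FGSM update are illustrative assumptions; the paper's actual loss and training procedure are not reproduced here.

```python
# Hedged sketch of a multi-task adversarial objective in the spirit of
# MT-AdvCLIP; architecture and update rule are placeholder assumptions.
import torch

def multitask_fgsm(image, encoder, task_heads, task_targets, eps=8 / 255):
    """Sum per-task losses on a shared encoder and take one FGSM ascent step,
    so the perturbation is not overfit to any single downstream task."""
    image = image.clone().requires_grad_(True)
    features = encoder(image)
    loss = sum(torch.nn.functional.cross_entropy(head(features), target)
               for head, target in zip(task_heads, task_targets))
    loss.backward()
    return (image + eps * image.grad.sign()).clamp(0, 1).detach()
```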
arXiv Detail & Related papers (2025-09-28T14:46:52Z) - Visual Contextual Attack: Jailbreaking MLLMs with Image-Driven Context Injection [19.91087036440618]
Multimodal large language models (MLLMs) have demonstrated tremendous potential for real-world applications. The security vulnerabilities exhibited by the visual modality pose significant challenges to deploying such models in open-world environments. Recent studies have successfully induced harmful responses from target MLLMs by encoding harmful textual semantics directly into visual inputs. In this work, we define a novel setting: visual-centric jailbreak, where visual information serves as a necessary component in constructing a complete and realistic jailbreak context.
arXiv Detail & Related papers (2025-07-03T17:53:12Z) - Implicit Jailbreak Attacks via Cross-Modal Information Concealment on Vision-Language Models [20.99874786089634]
Previous jailbreak attacks often inject malicious instructions from text into less aligned modalities, such as vision. We propose a novel implicit jailbreak framework termed IJA that stealthily embeds malicious instructions into images via least-significant-bit (LSB) steganography. On commercial models like GPT-4o and Gemini-1.5 Pro, our method achieves attack success rates of over 90% using an average of only 3 queries.
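A minimal sketch of the LSB embedding idea referenced above, assuming a lossless output format such as PNG; IJA's full pipeline (prompt templates, decoding instructions, multi-turn querying) is more involved.

```python
# Minimal LSB-steganography sketch illustrating the embedding idea behind IJA.
import numpy as np
from PIL import Image

def embed_lsb(image_path: str, message: str, out_path: str) -> None:
    """Hide a UTF-8 message in the least significant bit of each channel.
    out_path must be a lossless format (e.g., PNG) so the LSBs survive."""
    pixels = np.array(Image.open(image_path).convert("RGB"))
    bits = np.unpackbits(np.frombuffer(message.encode() + b"\x00", dtype=np.uint8))
    flat = pixels.flatten()
    assert bits.size <= flat.size, "message too long for this image"
    flat[:bits.size] = (flat[:bits.size] & 0xFE) | bits  # overwrite the LSBs
    Image.fromarray(flat.reshape(pixels.shape)).save(out_path)

def extract_lsb(image_path: str) -> str:
    """Recover the hidden message by reading LSBs up to the zero terminator."""
    flat = np.array(Image.open(image_path).convert("RGB")).flatten()
    data = np.packbits(flat & 1).tobytes()
    return data.split(b"\x00", 1)[0].decode()
```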
arXiv Detail & Related papers (2025-05-22T09:34:47Z) - MIRAGE: Multimodal Immersive Reasoning and Guided Exploration for Red-Team Jailbreak Attacks [85.3303135160762]
MIRAGE is a novel framework that exploits narrative-driven context and role immersion to circumvent safety mechanisms in Multimodal Large Language Models.
It achieves state-of-the-art performance, improving attack success rates by up to 17.5% over the best baselines.
We demonstrate that role immersion and structured semantic reconstruction can activate inherent model biases, leading the model to spontaneously violate its ethical safeguards.
arXiv Detail & Related papers (2025-03-24T20:38:42Z) - A Frustratingly Simple Yet Highly Effective Attack Baseline: Over 90% Success Rate Against the Strong Black-box Models of GPT-4.5/4o/o1 [24.599707290204524]
Transfer-based targeted attacks on large vision-language models (LVLMs) often fail against black-box commercial LVLMs.
We propose an approach that refines semantic clarity by encoding explicit semantic details within local regions.
Our approach achieves success rates exceeding 90% on GPT-4.5, 4o, and o1, significantly outperforming all prior state-of-the-art attack methods.
arXiv Detail & Related papers (2025-03-13T17:59:55Z) - Exploring Visual Vulnerabilities via Multi-Loss Adversarial Search for Jailbreaking Vision-Language Models [92.79804303337522]
Vision-Language Models (VLMs) may still be vulnerable to safety alignment issues. We introduce MLAI, a novel jailbreak framework that leverages scenario-aware image generation for semantic alignment. Extensive experiments demonstrate MLAI's significant impact, achieving attack success rates of 77.75% on MiniGPT-4 and 82.80% on LLaVA-2.
arXiv Detail & Related papers (2024-11-27T02:40:29Z) - AnyAttack: Targeted Adversarial Attacks on Vision-Language Models toward Any Images [41.044385916368455]
We propose AnyAttack, a self-supervised framework that generates targeted adversarial images for Vision-Language Models without label supervision. Our framework employs a pre-training and fine-tuning paradigm, with the adversarial noise generator pre-trained on the large-scale LAION-400M dataset.
arXiv Detail & Related papers (2024-10-07T09:45:18Z) - Compromising Embodied Agents with Contextual Backdoor Attacks [69.71630408822767]
Large language models (LLMs) have transformed the development of embodied intelligence.
This paper uncovers a significant backdoor security threat within this process.
By poisoning just a few contextual demonstrations, attackers can covertly compromise the contextual environment of a black-box LLM.
arXiv Detail & Related papers (2024-08-06T01:20:12Z) - Cross-modality Information Check for Detecting Jailbreaking in Multimodal Large Language Models [17.663550432103534]
Multimodal Large Language Models (MLLMs) extend the capacity of LLMs to understand multimodal information comprehensively.
These models are susceptible to jailbreak attacks, where malicious users can break the safety alignment of the target model and generate misleading and harmful answers.
We propose Cross-modality Information DEtectoR (CIDER), a plug-and-play jailbreaking detector designed to identify maliciously perturbed image inputs.
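A hedged sketch of a cross-modal consistency check in the spirit of CIDER follows, assuming an off-the-shelf CLIP model and using Gaussian blur as a stand-in for a real denoiser; CIDER's actual denoising procedure and thresholds differ. The intuition: an adversarially perturbed image loses its artificial alignment with the harmful query once denoised, so a large similarity drop flags the input.

```python
# Hedged sketch of a CIDER-style cross-modal consistency check; the blur
# denoiser and threshold are placeholder assumptions, not CIDER's config.
import torch
from PIL import Image, ImageFilter
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_similarity(image: Image.Image, text: str) -> float:
    """Cosine similarity between CLIP image and text embeddings."""
    inputs = processor(text=[text], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img @ txt.T).item()

def looks_adversarial(image: Image.Image, query: str, drop_threshold: float = 0.05) -> bool:
    """Flag the image if denoising sharply reduces its similarity to the query."""
    denoised = image.filter(ImageFilter.GaussianBlur(radius=2))
    return clip_similarity(image, query) - clip_similarity(denoised, query) > drop_threshold
```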
arXiv Detail & Related papers (2024-07-31T15:02:46Z) - White-box Multimodal Jailbreaks Against Large Vision-Language Models [61.97578116584653]
We propose a more comprehensive strategy that jointly attacks both text and image modalities to exploit a broader spectrum of vulnerability within Large Vision-Language Models.
Our attack method begins by optimizing an adversarial image prefix from random noise to generate diverse harmful responses in the absence of text input.
An adversarial text suffix is integrated and co-optimized with the adversarial image prefix to maximize the probability of eliciting affirmative responses to various harmful instructions.
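A minimal sketch of the image-side step described above, assuming white-box access: gradient descent from random noise to minimize the model's loss on a target affirmative response. The `vlm_loss` callable is a placeholder for that loss, and the co-optimized text suffix is omitted here.

```python
# Hedged sketch of adversarial image-prefix optimization under white-box access.
import torch

def optimize_image_prefix(vlm_loss, steps=500, lr=1e-2, shape=(1, 3, 224, 224)):
    """vlm_loss(image) should return the negative log-likelihood of the
    desired affirmative response given the image (and empty text input)."""
    image = torch.rand(shape, requires_grad=True)  # start from random noise
    opt = torch.optim.Adam([image], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = vlm_loss(image.clamp(0, 1))  # keep the input a valid image
        loss.backward()
        opt.step()
    return image.clamp(0, 1).detach()
```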
arXiv Detail & Related papers (2024-05-28T07:13:30Z) - Adversarial Robustness for Visual Grounding of Multimodal Large Language Models [49.71757071535619]
Multi-modal Large Language Models (MLLMs) have recently achieved enhanced performance across various vision-language tasks.
However, the adversarial robustness of visual grounding remains unexplored in MLLMs.
We propose three adversarial attack paradigms to evaluate it.
arXiv Detail & Related papers (2024-05-16T10:54:26Z) - Visual Adversarial Examples Jailbreak Aligned Large Language Models [66.53468356460365]
We show that the continuous and high-dimensional nature of the visual input makes it a weak link against adversarial attacks.
We exploit visual adversarial examples to circumvent the safety guardrail of aligned LLMs with integrated vision.
Our study underscores the escalating adversarial risks associated with the pursuit of multimodality.
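A minimal, generic sketch of this attack family: epsilon-constrained projected gradient descent on a benign image so that a frozen aligned VLM assigns high probability to a harmful target continuation. The `nll` callable is a placeholder for the model's loss on that target; the paper's exact objective and constraints differ.

```python
# Hedged sketch of an L-infinity PGD visual adversarial example.
import torch

def pgd_attack(image, nll, eps=16 / 255, alpha=1 / 255, steps=200):
    adv = image.clone().detach()
    for _ in range(steps):
        adv.requires_grad_(True)
        loss = nll(adv)
        grad, = torch.autograd.grad(loss, adv)
        adv = adv.detach() - alpha * grad.sign()      # descend the target NLL
        adv = image + (adv - image).clamp(-eps, eps)  # project into the eps-ball
        adv = adv.clamp(0, 1)                         # stay a valid image
    return adv
```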
arXiv Detail & Related papers (2023-06-22T22:13:03Z)