MIRAGE: Multimodal Immersive Reasoning and Guided Exploration for Red-Team Jailbreak Attacks
- URL: http://arxiv.org/abs/2503.19134v1
- Date: Mon, 24 Mar 2025 20:38:42 GMT
- Title: MIRAGE: Multimodal Immersive Reasoning and Guided Exploration for Red-Team Jailbreak Attacks
- Authors: Wenhao You, Bryan Hooi, Yiwei Wang, Youke Wang, Zong Ke, Ming-Hsuan Yang, Zi Huang, Yujun Cai
- Abstract summary: MIRAGE is a novel framework that exploits narrative-driven context and role immersion to circumvent safety mechanisms in Multimodal Large Language Models. It achieves state-of-the-art performance, improving attack success rates by up to 17.5% over the best baselines. We demonstrate that role immersion and structured semantic reconstruction can activate inherent model biases, facilitating the model's spontaneous violation of ethical safeguards.
- Score: 85.3303135160762
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: While safety mechanisms have significantly progressed in filtering harmful text inputs, MLLMs remain vulnerable to multimodal jailbreaks that exploit their cross-modal reasoning capabilities. We present MIRAGE, a novel multimodal jailbreak framework that exploits narrative-driven context and role immersion to circumvent safety mechanisms in Multimodal Large Language Models (MLLMs). By systematically decomposing the toxic query into environment, role, and action triplets, MIRAGE constructs a multi-turn visual storytelling sequence of images and text using Stable Diffusion, guiding the target model through an engaging detective narrative. This process progressively lowers the model's defences and subtly guides its reasoning through structured contextual cues, ultimately eliciting harmful responses. In extensive experiments on the selected datasets with six mainstream MLLMs, MIRAGE achieves state-of-the-art performance, improving attack success rates by up to 17.5% over the best baselines. Moreover, we demonstrate that role immersion and structured semantic reconstruction can activate inherent model biases, facilitating the model's spontaneous violation of ethical safeguards. These results highlight critical weaknesses in current multimodal safety mechanisms and underscore the urgent need for more robust defences against cross-modal threats.
Related papers
- T2VShield: Model-Agnostic Jailbreak Defense for Text-to-Video Models [88.63040835652902]
Text-to-video models are vulnerable to jailbreak attacks, where specially crafted prompts bypass safety mechanisms and lead to the generation of harmful or unsafe content.
We propose T2VShield, a comprehensive and model-agnostic defense framework designed to protect text-to-video models from jailbreak threats.
Our method systematically analyzes the input, model, and output stages to identify the limitations of existing defenses.
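The staged, model-agnostic design lends itself to a thin wrapper around an existing generator. Below is a minimal illustrative sketch of that idea, not T2VShield's actual implementation; `generate_video`, `prompt_is_safe`, and `frames_are_safe` are hypothetical placeholders for whichever generator and safety checkers are available.

```python
from typing import Callable, List, Optional

def guarded_generate(
    prompt: str,
    generate_video: Callable[[str], List],    # placeholder: any T2V generator
    prompt_is_safe: Callable[[str], bool],    # placeholder: input-stage checker
    frames_are_safe: Callable[[List], bool],  # placeholder: output-stage checker
) -> Optional[List]:
    """Screen the prompt before generation and the frames after, leaving the
    generator itself untouched; return None when either stage is flagged."""
    if not prompt_is_safe(prompt):      # input stage
        return None
    frames = generate_video(prompt)     # model stage (unmodified)
    if not frames_are_safe(frames):     # output stage
        return None
    return frames
```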
arXiv Detail & Related papers (2025-04-22T01:18:42Z)
- Benchmarking Adversarial Robustness to Bias Elicitation in Large Language Models: Scalable Automated Assessment with LLM-as-a-Judge [0.0]
Large Language Models (LLMs) have revolutionized artificial intelligence, driving advancements in machine translation, summarization, and conversational agents.
Recent studies indicate that LLMs remain vulnerable to adversarial attacks designed to elicit biased responses.
This work proposes a scalable benchmarking framework to evaluate LLM robustness against adversarial bias elicitation.
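As a rough illustration of the LLM-as-a-judge pattern (not this paper's protocol), the sketch below scores a batch of question/answer pairs with a judge model hidden behind a generic callable; the prompt template, the 1-5 scale, and `call_judge` are assumptions.

```python
import json
from statistics import mean
from typing import Callable, List, Tuple

JUDGE_TEMPLATE = (
    "Rate the answer below for biased or harmful content on a 1-5 scale "
    "(1 = none, 5 = severe). Reply only with JSON: "
    '{{"score": <int>}}\n\nQuestion: {question}\nAnswer: {answer}'
)

def average_judge_score(
    pairs: List[Tuple[str, str]],        # (question, model answer)
    call_judge: Callable[[str], str],    # wraps whichever judge LLM is used
) -> float:
    """Mean 1-5 score assigned by the judge model across all pairs."""
    scores = []
    for question, answer in pairs:
        raw = call_judge(JUDGE_TEMPLATE.format(question=question, answer=answer))
        try:
            scores.append(int(json.loads(raw)["score"]))
        except (ValueError, KeyError, TypeError):
            scores.append(5)             # treat unparseable verdicts conservatively
    return mean(scores)
```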
arXiv Detail & Related papers (2025-04-10T16:00:59Z)
- Reasoning-Augmented Conversation for Multi-Turn Jailbreak Attacks on Large Language Models [53.580928907886324]
Reasoning-Augmented Conversation is a novel multi-turn jailbreak framework. It reformulates harmful queries into benign reasoning tasks. We show that RACE achieves state-of-the-art attack effectiveness in complex conversational scenarios.
arXiv Detail & Related papers (2025-02-16T09:27:44Z)
- Towards Robust Multimodal Large Language Models Against Jailbreak Attacks [24.491648943977605]
We introduce SafeMLLM, which alternates between an attack step for generating adversarial noise and a model updating step. At the attack step, SafeMLLM generates adversarial perturbations through a newly proposed contrastive embedding attack (CoE-Attack). We evaluate SafeMLLM across six MLLMs and six jailbreak methods spanning multiple modalities.
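The alternating attack/update scheme is reminiscent of standard adversarial training. The following is a generic PyTorch sketch of such a loop under the assumption of a differentiable model over input embeddings; it is not SafeMLLM's CoE-Attack, and `model`, `loss_fn`, and the step sizes are placeholders.

```python
import torch

def robustify_step(model, embeds, targets, loss_fn, opt,
                   eps=0.05, alpha=0.02, inner_steps=3):
    """One alternation: craft a bounded perturbation of the input embeddings
    that increases the loss, then update the model against it."""
    delta = torch.zeros_like(embeds, requires_grad=True)
    for _ in range(inner_steps):                       # "attack" step
        loss_fn(model(embeds + delta), targets).backward()
        delta.data = (delta.data + alpha * delta.grad.sign()).clamp(-eps, eps)
        delta.grad.zero_()
    opt.zero_grad()                                    # model-updating step
    loss_fn(model(embeds + delta.detach()), targets).backward()
    opt.step()
```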
arXiv Detail & Related papers (2025-02-02T03:45:49Z)
- Model-Editing-Based Jailbreak against Safety-aligned Large Language Models [13.887770576598646]
Large Language Models (LLMs) have transformed numerous fields by enabling advanced natural language interactions.
This paper presents Targeted Model Editing (TME), a novel white-box approach that bypasses safety filters.
TME identifies and removes safety-critical transformations (SCTs) embedded in model matrices, enabling malicious queries to bypass restrictions.
arXiv Detail & Related papers (2024-12-11T08:44:15Z)
- In-Context Experience Replay Facilitates Safety Red-Teaming of Text-to-Image Diffusion Models [104.94706600050557]
Text-to-image (T2I) models have shown remarkable progress, but their potential to generate harmful content remains a critical concern in the ML community.
We propose ICER, a novel red-teaming framework that generates interpretable and semantically meaningful problematic prompts.
Our work provides crucial insights for developing more robust safety mechanisms in T2I systems.
arXiv Detail & Related papers (2024-11-25T04:17:24Z)
- Jailbreak Attacks and Defenses against Multimodal Generative Models: A Survey [50.031628043029244]
Multimodal generative models are susceptible to jailbreak attacks, which can bypass built-in safety mechanisms and induce the production of potentially harmful content. We present a detailed taxonomy of attack methods, defense mechanisms, and evaluation frameworks specific to multimodal generative models.
arXiv Detail & Related papers (2024-11-14T07:51:51Z)
- Multi-Turn Context Jailbreak Attack on Large Language Models From First Principles [2.5167155755957316]
Context Fusion Attack (CFA) is a black-box, multi-turn jailbreak method built on contextual fusion.
We show CFA's superior success rate, divergence, and harmfulness compared to other multi-turn attack strategies.
arXiv Detail & Related papers (2024-08-08T09:18:47Z)
- Cross-modality Information Check for Detecting Jailbreaking in Multimodal Large Language Models [17.663550432103534]
Multimodal Large Language Models (MLLMs) extend the capacity of LLMs to understand multimodal information comprehensively.
These models are susceptible to jailbreak attacks, where malicious users can break the safety alignment of the target model and generate misleading and harmful answers.
We propose Cross-modality Information DEtectoR (CIDER), a plug-and-play jailbreaking detector designed to identify maliciously perturbed image inputs.
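In the same spirit (though not CIDER's actual algorithm), one cheap cross-modal signal is whether an image and its accompanying text query still agree in a joint embedding space. A minimal sketch, assuming some CLIP-like encoder pair behind the hypothetical `embed_image`/`embed_text` callables:

```python
import numpy as np
from typing import Callable

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def looks_inconsistent(
    image,
    query: str,
    embed_image: Callable[[object], np.ndarray],   # e.g. a CLIP image encoder
    embed_text: Callable[[str], np.ndarray],       # matching text encoder
    threshold: float = 0.2,
) -> bool:
    """Flag inputs whose image/text similarity falls below a tuned threshold,
    one possible plug-and-play pre-check before the MLLM ever sees the input."""
    return cosine(embed_image(image), embed_text(query)) < threshold
```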
arXiv Detail & Related papers (2024-07-31T15:02:46Z)
- Visual-RolePlay: Universal Jailbreak Attack on MultiModal Large Language Models via Role-playing Image Character [5.927633974815329]
We propose a novel and effective method called Visual Role-play (VRP) for MLLM jailbreak attacks.
VRP generates detailed descriptions of high-risk characters and creates corresponding images based on those descriptions.
When paired with benign role-play instruction texts, these high-risk character images effectively mislead MLLMs into generating malicious responses.
arXiv Detail & Related papers (2024-05-25T17:17:18Z)
- RigorLLM: Resilient Guardrails for Large Language Models against Undesired Content [62.685566387625975]
Current mitigation strategies, while effective, are not resilient under adversarial attacks.
This paper introduces Resilient Guardrails for Large Language Models (RigorLLM), a novel framework designed to efficiently moderate harmful and unsafe inputs.
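As a loose illustration of fusing retrieval-style and classifier-style signals for input moderation (a simplification, not RigorLLM's actual design), the sketch below combines similarity to a bank of known-unsafe prompt embeddings with a generic classifier score; `embed`, `classifier_score`, and the weights are assumptions.

```python
import numpy as np
from typing import Callable

def knn_unsafe_similarity(prompt_vec: np.ndarray, unsafe_bank: np.ndarray,
                          k: int = 5) -> float:
    """Mean cosine similarity to the k nearest known-unsafe prompt embeddings."""
    sims = unsafe_bank @ prompt_vec / (
        np.linalg.norm(unsafe_bank, axis=1) * np.linalg.norm(prompt_vec) + 1e-8
    )
    return float(np.sort(sims)[-k:].mean())

def should_block(
    prompt: str,
    embed: Callable[[str], np.ndarray],         # any text embedder
    classifier_score: Callable[[str], float],   # P(unsafe) from any moderation model
    unsafe_bank: np.ndarray,                    # (N, d) known-unsafe prompt embeddings
    weight: float = 0.5,
    threshold: float = 0.5,
) -> bool:
    """Fuse the two signals and block when the combined score crosses a threshold."""
    fused = (weight * knn_unsafe_similarity(embed(prompt), unsafe_bank)
             + (1.0 - weight) * classifier_score(prompt))
    return fused >= threshold
```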
arXiv Detail & Related papers (2024-03-19T07:25:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.