Related papers: GAMBIT: A Gamified Jailbreak Framework for Multimodal Large Language Models

GAMBIT: A Gamified Jailbreak Framework for Multimodal Large Language Models

URL: http://arxiv.org/abs/2601.03416v1
Date: Tue, 06 Jan 2026 21:09:10 GMT
Title: GAMBIT: A Gamified Jailbreak Framework for Multimodal Large Language Models
Authors: Xiangdong Hu, Yangyang Jiang, Qin Hu, Xiaojun Jia,
Abstract summary: We propose a novel framework that drives a model to explore, reconstruct intent, and answer as part of winning the game.<n>GAMBIT achieves high Attack Success Rates (ASR), reaching 92.13% on Gemini 2.5 Flash, 91.20% on QvQ-MAX, and 85.87% on GPT-4o.
Score: 16.68943715789759
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multimodal Large Language Models (MLLMs) have become widely deployed, yet their safety alignment remains fragile under adversarial inputs. Previous work has shown that increasing inference steps can disrupt safety mechanisms and lead MLLMs to generate attacker-desired harmful content. However, most existing attacks focus on increasing the complexity of the modified visual task itself and do not explicitly leverage the model's own reasoning incentives. This leads to them underperforming on reasoning models (Models with Chain-of-Thoughts) compared to non-reasoning ones (Models without Chain-of-Thoughts). If a model can think like a human, can we influence its cognitive-stage decisions so that it proactively completes a jailbreak? To validate this idea, we propose GAMBI} (Gamified Adversarial Multimodal Breakout via Instructional Traps), a novel multimodal jailbreak framework that decomposes and reassembles harmful visual semantics, then constructs a gamified scene that drives the model to explore, reconstruct intent, and answer as part of winning the game. The resulting structured reasoning chain increases task complexity in both vision and text, positioning the model as a participant whose goal pursuit reduces safety attention and induces it to answer the reconstructed malicious query. Extensive experiments on popular reasoning and non-reasoning MLLMs demonstrate that GAMBIT achieves high Attack Success Rates (ASR), reaching 92.13% on Gemini 2.5 Flash, 91.20% on QvQ-MAX, and 85.87% on GPT-4o, significantly outperforming baselines.

Related papers

Red-teaming the Multimodal Reasoning: Jailbreaking Vision-Language Models via Cross-modal Entanglement Attacks [12.019519100082798]
Vision-Language Models (VLMs) with multimodal reasoning capabilities are high-value attack targets.<n>We propose textbfCrossTALK (textbfunderlineCross-modal entextbfunderlineTAngtextbfunderlineLement attactextbfunderlineK), which is a scalable approach that extends and entangles information clues across modalities.<n>Experiments show our COMET achieves state-of-the-art attack success rate.
arXiv Detail & Related papers (2026-02-09T18:31:25Z)
RL-MTJail: Reinforcement Learning for Automated Black-Box Multi-Turn Jailbreaking of Large Language Models [60.201244463046784]
Large language models are vulnerable to jailbreak attacks.<n>This paper studies black-box multi-turn jailbreaks, aiming to train attacker LLMs to elicit harmful content from black-box models.
arXiv Detail & Related papers (2025-12-08T17:42:59Z)
GRAF: Multi-turn Jailbreaking via Global Refinement and Active Fabrication [55.63412213263305]
Large Language Models pose notable safety risks due to potential misuse for malicious purposes.<n>We propose a novel multi-turn jailbreaking method that globally refines the attack trajectory at each interaction.<n>In addition, we actively fabricate model responses to suppress safety-related warnings, thereby increasing the likelihood of eliciting harmful outputs.
arXiv Detail & Related papers (2025-06-22T03:15:05Z)
Exploring the Vulnerability of the Content Moderation Guardrail in Large Language Models via Intent Manipulation [29.8288014123234]
We investigate the vulnerability of intent-aware guardrails and demonstrate that large language models exhibit implicit intent detection capabilities.<n>We propose a two-stage intent-based prompt-refinement framework, IntentPrompt, that first transforms harmful inquiries into structured outlines and further reframes them into declarative-style narratives.<n>Our framework consistently outperforms several cutting-edge jailbreak methods and evades even advanced Intent Analysis (IA) and Chain-of-Thought (CoT)-based defenses.
arXiv Detail & Related papers (2025-05-24T06:47:32Z)
Implicit Jailbreak Attacks via Cross-Modal Information Concealment on Vision-Language Models [20.99874786089634]
Previous jailbreak attacks often inject malicious instructions from text into less aligned modalities, such as vision.<n>We propose a novel implicit jailbreak framework termed IJA that stealthily embeds malicious instructions into images via at least significant bit steganography.<n>On commercial models like GPT-4o and Gemini-1.5 Pro, our method achieves attack success rates of over 90% using an average of only 3 queries.
arXiv Detail & Related papers (2025-05-22T09:34:47Z)
MIRAGE: Multimodal Immersive Reasoning and Guided Exploration for Red-Team Jailbreak Attacks [85.3303135160762]
MIRAGE is a novel framework that exploits narrative-driven context and role immersion to circumvent safety mechanisms in Multimodal Large Language Models.<n>It achieves state-of-the-art performance, improving attack success rates by up to 17.5% over the best baselines.<n>We demonstrate that role immersion and structured semantic reconstruction can activate inherent model biases, facilitating the model's spontaneous violation of ethical safeguards.
arXiv Detail & Related papers (2025-03-24T20:38:42Z)
Reasoning-Augmented Conversation for Multi-Turn Jailbreak Attacks on Large Language Models [53.580928907886324]
Reasoning-Augmented Conversation is a novel multi-turn jailbreak framework.<n>It reformulates harmful queries into benign reasoning tasks.<n>We show that RACE achieves state-of-the-art attack effectiveness in complex conversational scenarios.
arXiv Detail & Related papers (2025-02-16T09:27:44Z)
Turning Logic Against Itself : Probing Model Defenses Through Contrastive Questions [50.40122190627256]
We introduce POATE, a novel jailbreak technique that harnesses contrastive reasoning to provoke unethical responses.<n>PoATE crafts semantically opposing intents and integrates them with adversarial templates, steering models toward harmful outputs with remarkable subtlety.<n>To counter this, we propose Intent-Aware CoT and Reverse Thinking CoT, which decompose queries to detect malicious intent and reason in reverse to evaluate and reject harmful responses.
arXiv Detail & Related papers (2025-01-03T15:40:03Z)
Cognitive Overload: Jailbreaking Large Language Models with Overloaded Logical Thinking [60.78524314357671]
We investigate a novel category of jailbreak attacks specifically designed to target the cognitive structure and processes of large language models (LLMs) Our proposed cognitive overload is a black-box attack with no need for knowledge of model architecture or access to model weights. Experiments conducted on AdvBench and MasterKey reveal that various LLMs, including both popular open-source model Llama 2 and the proprietary model ChatGPT, can be compromised through cognitive overload.
arXiv Detail & Related papers (2023-11-16T11:52:22Z)

This list is automatically generated from the titles and abstracts of the papers in this site.