JPRO: Automated Multimodal Jailbreaking via Multi-Agent Collaboration Framework
- URL: http://arxiv.org/abs/2511.07315v1
- Date: Mon, 10 Nov 2025 17:16:46 GMT
- Title: JPRO: Automated Multimodal Jailbreaking via Multi-Agent Collaboration Framework
- Authors: Yuxuan Zhou, Yang Bai, Kuofeng Gao, Tao Dai, Shu-Tao Xia
- Abstract summary: JPRO is a novel multi-agent collaborative framework designed for automated VLM jailbreaking. It overcomes the shortcomings of prior methods in attack diversity and scalability. Experimental results show that JPRO achieves over a 60% attack success rate on multiple advanced VLMs.
- Score: 56.78050386956432
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The widespread application of large VLMs makes ensuring their secure deployment critical. While recent studies have demonstrated jailbreak attacks on VLMs, existing approaches are limited: they require either white-box access, restricting practicality, or rely on manually crafted patterns, leading to poor sample diversity and scalability. To address these gaps, we propose JPRO, a novel multi-agent collaborative framework designed for automated VLM jailbreaking. It effectively overcomes the shortcomings of prior methods in attack diversity and scalability. Through the coordinated action of four specialized agents and its two core modules, Tactic-Driven Seed Generation and the Adaptive Optimization Loop, JPRO generates effective and diverse attack samples. Experimental results show that JPRO achieves over a 60% attack success rate on multiple advanced VLMs, including GPT-4o, significantly outperforming existing methods. As a black-box attack approach, JPRO not only uncovers critical security vulnerabilities in multimodal models but also offers valuable insights for evaluating and enhancing VLM robustness.
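The abstract describes a two-stage pipeline: tactic-driven seeding of candidate attack samples, followed by an adaptive optimization loop that refines them against black-box feedback. The paper does not release reference code, so the sketch below is only an illustrative approximation of that control flow; all function names, the scoring scheme, and the selection heuristic are assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a seed-then-optimize jailbreak loop in the
# spirit of JPRO's Tactic-Driven Seed Generation + Adaptive Optimization
# Loop. Every name here is an assumption; the real system coordinates
# four specialized agents and queries a black-box VLM.

def tactic_driven_seed(tactic, goal):
    """Seed one candidate attack sample from a jailbreak tactic (placeholder)."""
    return {"tactic": tactic, "prompt": f"[{tactic}] {goal}", "score": 0.0}

def adaptive_optimization_loop(goal, tactics, judge, refine, max_iters=5):
    """Iteratively score and refine candidates until the judge flags a success.

    `judge` stands in for the black-box query plus success scoring;
    `refine` stands in for the agent that mutates promising samples.
    """
    candidates = [tactic_driven_seed(t, goal) for t in tactics]
    for _ in range(max_iters):
        for sample in candidates:
            sample["score"] = judge(sample)   # one black-box query + evaluation
            if sample["score"] >= 1.0:        # judged as a successful jailbreak
                return sample
        # Keep the most promising half and mutate them to preserve diversity.
        candidates.sort(key=lambda s: s["score"], reverse=True)
        candidates = [refine(s) for s in candidates[: max(1, len(candidates) // 2)]]
    return max(candidates, key=lambda s: s["score"])
```

With a stub judge that only accepts one tactic, the loop returns the matching sample on the first pass; the design point is that scoring, selection, and refinement are all black-box, which is what distinguishes this family of attacks from white-box gradient methods.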
Related papers
- Multi-turn Jailbreaking Attack in Multi-Modal Large Language Models [2.7051096873824982]
This paper introduces MJAD-MLLMs, a holistic framework that analyzes the proposed multi-turn jailbreaking attacks and multi-LLM-based defense techniques for MLLMs. First, we introduce a novel multi-turn jailbreaking attack to exploit the vulnerabilities of MLLMs under multi-turn prompting. Second, we propose a novel fragment-optimized, multi-LLM defense mechanism, called FragGuard, to effectively mitigate jailbreaking attacks on MLLMs.
arXiv Detail & Related papers (2026-01-08T19:37:22Z) - RL-MTJail: Reinforcement Learning for Automated Black-Box Multi-Turn Jailbreaking of Large Language Models [60.201244463046784]
Large language models are vulnerable to jailbreak attacks. This paper studies black-box multi-turn jailbreaks, aiming to train attacker LLMs to elicit harmful content from black-box models.
arXiv Detail & Related papers (2025-12-08T17:42:59Z) - Multi-Faceted Attack: Exposing Cross-Model Vulnerabilities in Defense-Equipped Vision-Language Models [54.61181161508336]
We introduce Multi-Faceted Attack (MFA), a framework that exposes general safety vulnerabilities in leading defense-equipped Vision-Language Models (VLMs). The core component of MFA is the Attention-Transfer Attack (ATA), which hides harmful instructions inside a meta task with competing objectives. MFA achieves a 58.5% success rate and consistently outperforms existing methods.
arXiv Detail & Related papers (2025-11-20T07:12:54Z) - Secure Tug-of-War (SecTOW): Iterative Defense-Attack Training with Reinforcement Learning for Multimodal Model Security [63.41350337821108]
We propose Secure Tug-of-War (SecTOW) to enhance the security of multimodal large language models (MLLMs). SecTOW consists of two modules, a defender and an auxiliary attacker, both trained iteratively using reinforcement learning (GRPO). We show that SecTOW significantly improves security while preserving general performance.
arXiv Detail & Related papers (2025-07-29T17:39:48Z) - VERA: Variational Inference Framework for Jailbreaking Large Language Models [29.57412296290215]
API-only access to state-of-the-art LLMs highlights the need for effective black-box jailbreak methods. We introduce VERA: Variational infErence fRamework for jAilbreaking.
arXiv Detail & Related papers (2025-06-27T22:22:00Z) - Towards Robust Multimodal Large Language Models Against Jailbreak Attacks [24.491648943977605]
We introduce SafeMLLM, which alternates between an attack step for generating adversarial noise and a model-updating step. At the attack step, SafeMLLM generates adversarial perturbations through a newly proposed contrastive embedding attack (CoE-Attack). We evaluate SafeMLLM across six MLLMs and six jailbreak methods spanning multiple modalities.
arXiv Detail & Related papers (2025-02-02T03:45:49Z) - Heuristic-Induced Multimodal Risk Distribution Jailbreak Attack for Multimodal Large Language Models [0.0]
The heuristic-induced multimodal risk distribution jailbreak attack method, called HIMRD, is black-box and consists of two elements: a multimodal risk distribution strategy and a harmful-induced search strategy. HIMRD achieves an average attack success rate (ASR) of 90% across seven open-source MLLMs and an average ASR of around 68% across three closed-source MLLMs.
arXiv Detail & Related papers (2024-12-08T13:20:45Z) - Jailbreak Attacks and Defenses against Multimodal Generative Models: A Survey [50.031628043029244]
Multimodal generative models are susceptible to jailbreak attacks, which can bypass built-in safety mechanisms and induce the production of potentially harmful content. We present a detailed taxonomy of attack methods, defense mechanisms, and evaluation frameworks specific to multimodal generative models.
arXiv Detail & Related papers (2024-11-14T07:51:51Z) - Transferable & Stealthy Ensemble Attacks: A Black-Box Jailbreaking Framework for Large Language Models [1.0742675209112622]
We present a novel black-box jailbreaking framework that integrates multiple LLM-as-Attacker strategies to deliver highly transferable and effective attacks. The framework is grounded in three key insights from prior jailbreaking research and practice.
arXiv Detail & Related papers (2024-10-31T01:55:33Z) - IDEATOR: Jailbreaking and Benchmarking Large Vision-Language Models Using Themselves [70.43466586161345]
IDEATOR is a novel jailbreak method that autonomously generates malicious image-text pairs for black-box jailbreak attacks. Our benchmark results on 11 recently released VLMs reveal significant gaps in safety alignment. For instance, our challenge set achieves ASRs of 46.31% on GPT-4o and 19.65% on Claude-3.5-Sonnet, underscoring the urgent need for stronger defenses.
arXiv Detail & Related papers (2024-10-29T07:15:56Z) - BlackDAN: A Black-Box Multi-Objective Approach for Effective and Contextual Jailbreaking of Large Language Models [47.576957746503666]
BlackDAN is an innovative black-box attack framework with multi-objective optimization. It generates high-quality prompts that effectively facilitate jailbreaking while maintaining contextual relevance and minimizing detectability.
arXiv Detail & Related papers (2024-10-13T11:15:38Z)
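Several entries above report attack success rates (ASR), e.g. JPRO's >60%, MFA's 58.5%, and HIMRD's 90%/68% splits. For readers unfamiliar with the metric, the sketch below shows the standard computation: the percentage of attack attempts a judge deems successful. The function name and the 0/1 outcome encoding are illustrative assumptions, not drawn from any specific paper's evaluation harness.

```python
def attack_success_rate(outcomes):
    """ASR (%) = successful jailbreak attempts / total attempts * 100.

    `outcomes` is a sequence of 0/1 flags (or booleans), one per
    attempted jailbreak, as labeled by whichever judge the paper uses.
    """
    if not outcomes:
        raise ValueError("need at least one attempt")
    return 100.0 * sum(outcomes) / len(outcomes)
```

Note that reported ASRs are only comparable when the target models, prompt sets, and success judges match, which is rarely the case across the papers listed here.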
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.