Defense-to-Attack: Bypassing Weak Defenses Enables Stronger Jailbreaks in Vision-Language Models
- URL: http://arxiv.org/abs/2509.12724v1
- Date: Tue, 16 Sep 2025 06:25:58 GMT
- Title: Defense-to-Attack: Bypassing Weak Defenses Enables Stronger Jailbreaks in Vision-Language Models
- Authors: Yunhan Zhao, Xiang Zheng, Xingjun Ma,
- Abstract summary: Defense2Attack is a novel jailbreak method that bypasses the safety guardrails of Vision-Language Models. Defense2Attack achieves superior jailbreak performance in a single attempt, outperforming state-of-the-art attack methods.
- Score: 32.752269224536754
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite their superb capabilities, Vision-Language Models (VLMs) have been shown to be vulnerable to jailbreak attacks. While recent jailbreaks have achieved notable progress, their effectiveness and efficiency can still be improved. In this work, we reveal an interesting phenomenon: incorporating a weak defense into the attack pipeline can significantly enhance both the effectiveness and the efficiency of jailbreaks on VLMs. Building on this insight, we propose Defense2Attack, a novel jailbreak method that bypasses the safety guardrails of VLMs by leveraging defensive patterns to guide jailbreak prompt design. Specifically, Defense2Attack consists of three key components: (1) a visual optimizer that embeds universal adversarial perturbations with affirmative and encouraging semantics; (2) a textual optimizer that refines the input using a defense-styled prompt; and (3) a red-team suffix generator that enhances the jailbreak through reinforcement fine-tuning. We empirically evaluate our method on four VLMs and four safety benchmarks. The results demonstrate that Defense2Attack achieves superior jailbreak performance in a single attempt, outperforming state-of-the-art attack methods that often require multiple tries. Our work offers a new perspective on jailbreaking VLMs.
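The abstract describes a three-component attack pipeline but gives no implementation details. As a rough illustration only, the sketch below shows how the three stages might compose into a single query; every name and interface here (the perturbation array, the defense-styled template with a `{request}` slot, the suffix model callable) is a hypothetical placeholder inferred from the abstract, not the authors' code.

```python
# Hypothetical sketch of the Defense2Attack pipeline from the abstract.
# All interfaces below are illustrative placeholders, not the paper's code.
from typing import Callable, List


def apply_visual_optimizer(image: List[float],
                           perturbation: List[float]) -> List[float]:
    """(1) Visual optimizer: embed a precomputed universal adversarial
    perturbation (with 'affirmative, encouraging' semantics), keeping
    pixel values clamped to [0, 1]."""
    return [min(1.0, max(0.0, px + dp))
            for px, dp in zip(image, perturbation)]


def apply_textual_optimizer(request: str, template: str) -> str:
    """(2) Textual optimizer: wrap the request in a defense-styled prompt
    so the query imitates the phrasing of a (weak) safety guardrail."""
    return template.format(request=request)


def build_jailbreak_query(image: List[float],
                          perturbation: List[float],
                          request: str,
                          template: str,
                          suffix_model: Callable[[str], str]):
    """Compose all three components into the single-attempt query that
    the abstract claims is sufficient."""
    adv_image = apply_visual_optimizer(image, perturbation)
    prompt = apply_textual_optimizer(request, template)
    # (3) Red-team suffix generator: assumed to be a model tuned with
    # reinforcement fine-tuning to maximize jailbreak success.
    return adv_image, prompt + " " + suffix_model(prompt)
```

In this framing, stage (2) carries the paper's central "defense-to-attack" insight: a prompt styled like a weak defense appears to steer the model toward compliant completions, which is how the abstract explains the single-attempt success.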
Related papers
- Proactive defense against LLM Jailbreak [28.249786308207046]
ProAct is a novel proactive defense framework designed to disrupt and mislead autonomous jailbreaking processes. Our method consistently and significantly reduces attack success rates by up to 92%.
arXiv Detail & Related papers (2025-10-06T17:32:40Z)
- Test-Time Immunization: A Universal Defense Framework Against Jailbreaks for (Multimodal) Large Language Models [80.66766532477973]
Test-time IMmunization (TIM) can adaptively defend against various jailbreak attacks in a self-evolving way.
arXiv Detail & Related papers (2025-05-28T11:57:46Z)
- One Model Transfer to All: On Robust Jailbreak Prompts Generation against LLMs [13.54228868302755]
ArrAttack is an attack method designed to target defended large language models (LLMs). ArrAttack automatically generates robust jailbreak prompts capable of bypassing various defense measures. Our work bridges the gap between jailbreak attacks and defenses, providing a fresh perspective on generating robust jailbreak prompts.
arXiv Detail & Related papers (2025-05-23T08:02:38Z)
- JailPO: A Novel Black-box Jailbreak Framework via Preference Optimization against Aligned LLMs [11.924542310342282]
We present JailPO, a novel black-box jailbreak framework to examine the alignment of Large Language Models (LLMs). For scalability and universality, JailPO meticulously trains attack models to automatically generate covert jailbreak prompts. We also introduce a preference optimization-based attack method to enhance jailbreak effectiveness.
arXiv Detail & Related papers (2024-12-20T07:29:10Z)
- Immune: Improving Safety Against Jailbreaks in Multi-modal LLMs via Inference-Time Alignment [97.38766396447369]
Despite training-time safety alignment, Multimodal Large Language Models (MLLMs) remain vulnerable to jailbreak attacks. We propose Immune, an inference-time defense framework that leverages a safe reward model through controlled decoding to defend against jailbreak attacks.
arXiv Detail & Related papers (2024-11-27T19:00:10Z)
- Transferable Ensemble Black-box Jailbreak Attacks on Large Language Models [0.0]
We propose a novel black-box jailbreak attack framework that incorporates various LLM-as-Attacker methods. Our method is designed based on three key observations from existing jailbreaking studies and practices.
arXiv Detail & Related papers (2024-10-31T01:55:33Z)
- IDEATOR: Jailbreaking and Benchmarking Large Vision-Language Models Using Themselves [70.43466586161345]
We propose IDEATOR, a novel jailbreak method that autonomously generates malicious image-text pairs for black-box jailbreak attacks. We demonstrate IDEATOR's high effectiveness and transferability, achieving a 94% attack success rate (ASR) in jailbreaking MiniGPT-4. Building on IDEATOR's strong transferability and automated process, we introduce VLJailbreakBench, a safety benchmark comprising 3,654 multimodal jailbreak samples.
arXiv Detail & Related papers (2024-10-29T07:15:56Z)
- BlueSuffix: Reinforced Blue Teaming for Vision-Language Models Against Jailbreak Attacks [62.58434630634917]
We propose BlueSuffix, a novel blue-team method that defends target VLMs against jailbreak attacks. BlueSuffix includes three key components: 1) a visual purifier against jailbreak images, 2) a textual purifier against jailbreak texts, and 3) a blue-team suffix generator. We show that BlueSuffix outperforms baseline defenses by a significant margin.
arXiv Detail & Related papers (2024-10-28T12:43:47Z)
- Deciphering the Chaos: Enhancing Jailbreak Attacks via Adversarial Prompt Translation [71.92055093709924]
We propose a novel method that "translates" garbled adversarial prompts into coherent and human-readable natural language adversarial prompts. It also offers a new approach to discovering effective designs for jailbreak prompts, advancing the understanding of jailbreak attacks. Our method achieves over 90% attack success rates against Llama-2-Chat models on AdvBench, despite their outstanding resistance to jailbreak attacks.
arXiv Detail & Related papers (2024-10-15T06:31:04Z)
- SelfDefend: LLMs Can Defend Themselves against Jailbreaking in a Practical Manner [21.414701448926614]
This paper introduces a generic LLM jailbreak defense framework called SelfDefend. We empirically validate it using mainstream GPT-3.5/4 models against major jailbreak attacks. To further improve the defense's robustness and minimize costs, we employ a data distillation approach to tune dedicated open-source defense models.
arXiv Detail & Related papers (2024-06-08T15:45:31Z)
- Jailbreaking GPT-4V via Self-Adversarial Attacks with System Prompts [64.60375604495883]
We discover a system prompt leakage vulnerability in GPT-4V.
By employing GPT-4 as a red teaming tool against itself, we aim to search for potential jailbreak prompts leveraging stolen system prompts.
We also evaluate the effect of modifying system prompts to defend against jailbreaking attacks.
arXiv Detail & Related papers (2023-11-15T17:17:39Z)
- A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily [51.63085197162279]
Large Language Models (LLMs) are designed to provide useful and safe responses.
However, adversarial prompts known as "jailbreaks" can circumvent these safeguards.
We propose ReNeLLM, an automatic framework that leverages LLMs themselves to generate effective jailbreak prompts.
arXiv Detail & Related papers (2023-11-14T16:02:16Z)