FC-Attack: Jailbreaking Large Vision-Language Models via Auto-Generated Flowcharts
- URL: http://arxiv.org/abs/2502.21059v1
- Date: Fri, 28 Feb 2025 13:59:11 GMT
- Title: FC-Attack: Jailbreaking Large Vision-Language Models via Auto-Generated Flowcharts
- Authors: Ziyi Zhang, Zhen Sun, Zongmin Zhang, Jihui Guo, Xinlei He,
- Abstract summary: We propose a jailbreak attack method based on auto-generated flowcharts, FC-Attack.<n>FC-Attack achieves over 90% attack success rates on Gemini-1.5, Llaval-Next, Qwen2-VL, and InternVL-2.5 models.<n>To mitigate the attack, we explore several defenses and find that AdaShield can largely reduce the jailbreak performance but with the cost of utility drop.
- Score: 20.323340637767327
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Vision-Language Models (LVLMs) have become powerful and widely adopted in some practical applications. However, recent research has revealed their vulnerability to multimodal jailbreak attacks, whereby the model can be induced to generate harmful content, leading to safety risks. Although most LVLMs have undergone safety alignment, recent research shows that the visual modality is still vulnerable to jailbreak attacks. In our work, we discover that by using flowcharts with partially harmful information, LVLMs can be induced to provide additional harmful details. Based on this, we propose a jailbreak attack method based on auto-generated flowcharts, FC-Attack. Specifically, FC-Attack first fine-tunes a pre-trained LLM to create a step-description generator based on benign datasets. The generator is then used to produce step descriptions corresponding to a harmful query, which are transformed into flowcharts in 3 different shapes (vertical, horizontal, and S-shaped) as visual prompts. These flowcharts are then combined with a benign textual prompt to execute a jailbreak attack on LVLMs. Our evaluations using the Advbench dataset show that FC-Attack achieves over 90% attack success rates on Gemini-1.5, Llaval-Next, Qwen2-VL, and InternVL-2.5 models, outperforming existing LVLM jailbreak methods. Additionally, we investigate factors affecting the attack performance, including the number of steps and the font styles in the flowcharts. Our evaluation shows that FC-Attack can improve the jailbreak performance from 4% to 28% in Claude-3.5 by changing the font style. To mitigate the attack, we explore several defenses and find that AdaShield can largely reduce the jailbreak performance but with the cost of utility drop.
Related papers
- Prefill-Based Jailbreak: A Novel Approach of Bypassing LLM Safety Boundary [2.4329261266984346]
Large Language Models (LLMs) are designed to generate helpful and safe content.
adversarial attacks, commonly referred to as jailbreak, can bypass their safety protocols.
We introduce a novel jailbreak attack method that leverages the prefilling feature of LLMs.
arXiv Detail & Related papers (2025-04-28T07:38:43Z) - Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense [55.77152277982117]
We introduce Layer-AdvPatcher, a methodology designed to defend against jailbreak attacks.<n>We use an unlearning strategy to patch specific layers within large language models through self-augmented datasets.<n>Our framework reduces the harmfulness and attack success rate of jailbreak attacks.
arXiv Detail & Related papers (2025-01-05T19:06:03Z) - IDEATOR: Jailbreaking Large Vision-Language Models Using Themselves [67.30731020715496]
We propose a novel jailbreak method named IDEATOR, which autonomously generates malicious image-text pairs for black-box jailbreak attacks.
IDEATOR uses a VLM to create targeted jailbreak texts and pairs them with jailbreak images generated by a state-of-the-art diffusion model.
It achieves a 94% success rate in jailbreaking MiniGPT-4 with an average of only 5.34 queries, and high success rates of 82%, 88%, and 75% when transferred to LLaVA, InstructBLIP, and Meta's Chameleon.
arXiv Detail & Related papers (2024-10-29T07:15:56Z) - AdaPPA: Adaptive Position Pre-Fill Jailbreak Attack Approach Targeting LLMs [34.221522224051846]
We propose an adaptive position pre-fill jailbreak attack approach for executing jailbreak attacks on Large Language Models (LLMs)
Our method leverages the model's instruction-following capabilities to first output safe content, then exploits its narrative-shifting abilities to generate harmful content.
Our method can improve the attack success rate by 47% on the widely recognized secure model (Llama2) compared to existing approaches.
arXiv Detail & Related papers (2024-09-11T00:00:58Z) - h4rm3l: A language for Composable Jailbreak Attack Synthesis [48.5611060845958]
h4rm3l is a novel approach that addresses the gap with a human-readable domain-specific language.
We show that h4rm3l's synthesized attacks are diverse and more successful than existing jailbreak attacks in literature.
arXiv Detail & Related papers (2024-08-09T01:45:39Z) - EnJa: Ensemble Jailbreak on Large Language Models [69.13666224876408]
Large Language Models (LLMs) are increasingly being deployed in safety-critical applications.
LLMs can still be jailbroken by carefully crafted malicious prompts, producing content that violates policy regulations.
We propose a novel EnJa attack to hide harmful instructions using prompt-level jailbreak, boost the attack success rate using a gradient-based attack, and connect the two types of jailbreak attacks via a template-based connector.
arXiv Detail & Related papers (2024-08-07T07:46:08Z) - WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models [66.34505141027624]
We introduce WildTeaming, an automatic LLM safety red-teaming framework that mines in-the-wild user-chatbot interactions to discover 5.7K unique clusters of novel jailbreak tactics.
WildTeaming reveals previously unidentified vulnerabilities of frontier LLMs, resulting in up to 4.6x more diverse and successful adversarial attacks.
arXiv Detail & Related papers (2024-06-26T17:31:22Z) - Weak-to-Strong Jailbreaking on Large Language Models [96.50953637783581]
Large language models (LLMs) are vulnerable to jailbreak attacks.
Existing jailbreaking methods are computationally costly.
We propose the weak-to-strong jailbreaking attack.
arXiv Detail & Related papers (2024-01-30T18:48:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.