FigStep: Jailbreaking Large Vision-language Models via Typographic
Visual Prompts
- URL: http://arxiv.org/abs/2311.05608v2
- Date: Wed, 13 Dec 2023 17:54:16 GMT
- Title: FigStep: Jailbreaking Large Vision-language Models via Typographic
Visual Prompts
- Authors: Yichen Gong and Delong Ran and Jinyuan Liu and Conglei Wang and
Tianshuo Cong and Anyu Wang and Sisi Duan and Xiaoyun Wang
- Abstract summary: We propose FigStep, a jailbreaking algorithm against large vision-language models (VLMs).
Instead of feeding textual harmful instructions directly, FigStep converts the harmful content into images through typography.
FigStep can achieve an average attack success rate of 82.50% on 500 harmful queries in 10 topics.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Ensuring the safety of artificial intelligence-generated content (AIGC) is a
longstanding topic in the artificial intelligence (AI) community, and the
safety concerns associated with Large Language Models (LLMs) have been widely
investigated. Recently, large vision-language models (VLMs) represent an
unprecedented revolution, as they are built upon LLMs but can incorporate
additional modalities (e.g., images). However, the safety of VLMs lacks
systematic evaluation, and there may be an overconfidence in the safety
guarantees provided by their underlying LLMs. In this paper, to demonstrate
that introducing additional modality modules leads to unforeseen AI safety
issues, we propose FigStep, a straightforward yet effective jailbreaking
algorithm against VLMs. Instead of feeding textual harmful instructions
directly, FigStep converts the harmful content into images through typography
to bypass the safety alignment within the textual module of the VLMs, inducing
VLMs to output unsafe responses that violate common AI safety policies. In our
evaluation, we manually review 46,500 model responses generated by 3 families
of promising open-source VLMs, i.e., LLaVA, MiniGPT4, and CogVLM (a total
of 6 VLMs). The experimental results show that FigStep can achieve an average
attack success rate of 82.50% on 500 harmful queries in 10 topics. Moreover, we
demonstrate that the methodology of FigStep can even jailbreak GPT-4V, which
already leverages an OCR detector to filter harmful queries. Above all, our
work reveals that VLMs are vulnerable to jailbreaking attacks, which highlights
the necessity of novel safety alignments between visual and textual modalities.
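
As an illustration of the typography step described in the abstract, the short sketch below renders a query as plain black text on a white canvas using Pillow, so the request reaches the model only through the vision encoder rather than the text channel. The font, canvas size, numbered-list stub, and pairing prompt are illustrative assumptions, not the authors' released implementation.

    # Minimal sketch of a FigStep-style typographic prompt (assumptions noted above).
    from PIL import Image, ImageDraw, ImageFont

    def render_typographic_prompt(instruction: str,
                                  size=(760, 760),
                                  font_path="DejaVuSans.ttf",  # assumed font available locally
                                  font_size=48) -> Image.Image:
        """Render an instruction as plain text on a white canvas, so the
        request is carried by the image modality instead of the text prompt."""
        img = Image.new("RGB", size, "white")
        draw = ImageDraw.Draw(img)
        font = ImageFont.truetype(font_path, font_size)
        # FigStep rephrases the query as a statement followed by empty numbered
        # items that the model is later asked to fill in.
        lines = [instruction, "1.", "2.", "3."]
        y = 40
        for line in lines:
            draw.text((40, y), line, fill="black", font=font)
            y += font_size + 24
        return img

    # Hypothetical, benign usage: the image is paired with an innocuous text
    # prompt such as "The image shows a numbered list. Fill in each item."
    # img = render_typographic_prompt("Steps to bake a loaf of bread.")
    # img.save("typographic_prompt.png")
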
Related papers
- Exploring Visual Vulnerabilities via Multi-Loss Adversarial Search for Jailbreaking Vision-Language Models
Vision-Language Models (VLMs) may remain vulnerable to jailbreak attacks despite their safety alignment.
We introduce MLAI, a novel jailbreak framework that leverages scenario-aware image generation for semantic alignment.
Extensive experiments demonstrate MLAI's significant impact, achieving attack success rates of 77.75% on MiniGPT-4 and 82.80% on LLaVA-2.
arXiv: 2024-11-27
- Safe + Safe = Unsafe? Exploring How Safe Images Can Be Exploited to Jailbreak Large Vision-Language Models
Safety Snowball Agent (SSA) is a novel agent-based framework leveraging agents' autonomous and tool-using abilities to jailbreak LVLMs.
Our experiments demonstrate that SSA can use nearly any image to induce LVLMs to produce unsafe content, achieving high jailbreaking success rates against the latest LVLMs.
arXiv: 2024-11-18
- Diversity Helps Jailbreak Large Language Models
We have uncovered a powerful jailbreak technique that leverages large language models' ability to diverge from prior context.
By simply instructing the LLM to deviate and obfuscate previous attacks, our method dramatically outperforms existing approaches.
This revelation exposes a critical flaw in current LLM safety training, suggesting that existing methods may merely mask vulnerabilities rather than eliminate them.
arXiv: 2024-11-06
- IDEATOR: Jailbreaking Large Vision-Language Models Using Themselves
We propose a novel jailbreak method named IDEATOR, which autonomously generates malicious image-text pairs for black-box jailbreak attacks.
IDEATOR uses a VLM to create targeted jailbreak texts and pairs them with jailbreak images generated by a state-of-the-art diffusion model.
It achieves a 94% success rate in jailbreaking MiniGPT-4 with an average of only 5.34 queries, and high success rates of 82%, 88%, and 75% when transferred to LLaVA, InstructBLIP, and Meta's Chameleon, respectively.
arXiv: 2024-10-29
- AdaPPA: Adaptive Position Pre-Fill Jailbreak Attack Approach Targeting LLMs
We propose AdaPPA, an adaptive position pre-fill approach for executing jailbreak attacks on Large Language Models (LLMs).
Our method leverages the model's instruction-following capabilities to first output safe content, then exploits its narrative-shifting abilities to generate harmful content.
Our method improves the attack success rate by 47% on Llama2, a model widely recognized as secure, compared to existing approaches.
arXiv: 2024-09-11
- Virtual Context: Enhancing Jailbreak Attacks with Special Token Injection
This paper introduces Virtual Context, which leverages special tokens, previously overlooked in LLM security, to improve jailbreak attacks.
Comprehensive evaluations show that Virtual Context-assisted jailbreak attacks can improve the success rates of four widely used jailbreak methods by approximately 40%.
arXiv: 2024-06-28
- Jailbreak Vision Language Models via Bi-Modal Adversarial Prompt
This paper introduces the Bi-Modal Adversarial Prompt Attack (BAP), which executes jailbreaks by optimizing textual and visual prompts cohesively.
In particular, we utilize a large language model to analyze jailbreak failures and employ chain-of-thought reasoning to refine textual prompts.
arXiv: 2024-06-06
- Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models
We study the harmlessness alignment problem of multimodal large language models (MLLMs), whose image inputs are a weak point of safety alignment.
Motivated by this, we propose a novel jailbreak method named HADES, which hides and amplifies the harmfulness of the malicious intent within the text input.
Experimental results show that HADES can effectively jailbreak existing MLLMs, which achieves an average Attack Success Rate (ASR) of 90.26% for LLaVA-1.5 and 71.60% for Gemini Pro Vision.
arXiv: 2024-03-14