FigStep: Jailbreaking Large Vision-Language Models via Typographic Visual Prompts
- URL: http://arxiv.org/abs/2311.05608v3
- Date: Sun, 19 Jan 2025 16:23:38 GMT
- Title: FigStep: Jailbreaking Large Vision-Language Models via Typographic Visual Prompts
- Authors: Yichen Gong, Delong Ran, Jinyuan Liu, Conglei Wang, Tianshuo Cong, Anyu Wang, Sisi Duan, Xiaoyun Wang,
- Abstract summary: We propose FigStep, a straightforward yet effective black-box jailbreak algorithm against LVLMs.<n>FigStep converts the prohibited content into images through typography to bypass the safety alignment.<n>Our work reveals that current LVLMs are vulnerable to jailbreak attacks, which highlights the necessity of novel cross-modality safety alignment techniques.
- Score: 14.33139608409507
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Vision-Language Models (LVLMs) signify a groundbreaking paradigm shift within the Artificial Intelligence (AI) community, extending beyond the capabilities of Large Language Models (LLMs) by assimilating additional modalities (e.g., images). Despite this advancement, the safety of LVLMs remains adequately underexplored, with a potential overreliance on the safety assurances purported by their underlying LLMs. In this paper, we propose FigStep, a straightforward yet effective black-box jailbreak algorithm against LVLMs. Instead of feeding textual harmful instructions directly, FigStep converts the prohibited content into images through typography to bypass the safety alignment. The experimental results indicate that FigStep can achieve an average attack success rate of 82.50% on six promising open-source LVLMs. Not merely to demonstrate the efficacy of FigStep, we conduct comprehensive ablation studies and analyze the distribution of the semantic embeddings to uncover that the reason behind the success of FigStep is the deficiency of safety alignment for visual embeddings. Moreover, we compare FigStep with five text-only jailbreaks and four image-based jailbreaks to demonstrate the superiority of FigStep, i.e., negligible attack costs and better attack performance. Above all, our work reveals that current LVLMs are vulnerable to jailbreak attacks, which highlights the necessity of novel cross-modality safety alignment techniques. Our code and datasets are available at https://github.com/ThuCCSLab/FigStep .
Related papers
- Prefill-Based Jailbreak: A Novel Approach of Bypassing LLM Safety Boundary [2.4329261266984346]
Large Language Models (LLMs) are designed to generate helpful and safe content.
adversarial attacks, commonly referred to as jailbreak, can bypass their safety protocols.
We introduce a novel jailbreak attack method that leverages the prefilling feature of LLMs.
arXiv Detail & Related papers (2025-04-28T07:38:43Z) - FC-Attack: Jailbreaking Large Vision-Language Models via Auto-Generated Flowcharts [20.323340637767327]
We propose a jailbreak attack method based on auto-generated flowcharts, FC-Attack.
FC-Attack achieves over 90% attack success rates on Gemini-1.5, Llaval-Next, Qwen2-VL, and InternVL-2.5 models.
To mitigate the attack, we explore several defenses and find that AdaShield can largely reduce the jailbreak performance but with the cost of utility drop.
arXiv Detail & Related papers (2025-02-28T13:59:11Z) - Zero-Shot Defense Against Toxic Images via Inherent Multimodal Alignment in LVLMs [10.463762448166714]
We introduce SafeCLIP, a lightweight method that leverages LVLMs inherent multimodal alignment for zero-shot toxic image detection.
Experiments show that SafeCLIP achieves a 66.9% defense success rate with only 3.2% false positive rate and 7.2% overhead.
Our work demonstrates that leveraging inherent multimodal alignment can yield efficient, low-cost LVLM safety.
arXiv Detail & Related papers (2025-02-25T06:51:16Z) - Exploring Visual Vulnerabilities via Multi-Loss Adversarial Search for Jailbreaking Vision-Language Models [92.79804303337522]
Vision-Language Models (VLMs) may still be vulnerable to safety alignment issues.
We introduce MLAI, a novel jailbreak framework that leverages scenario-aware image generation for semantic alignment.
Extensive experiments demonstrate MLAI's significant impact, achieving attack success rates of 77.75% on MiniGPT-4 and 82.80% on LLaVA-2.
arXiv Detail & Related papers (2024-11-27T02:40:29Z) - Safe + Safe = Unsafe? Exploring How Safe Images Can Be Exploited to Jailbreak Large Vision-Language Models [80.77246856082742]
Safety Snowball Agent (SSA) is a novel agent-based framework leveraging agents' autonomous and tool-using abilities to jailbreak LVLMs.
Our experiments demonstrate that ours can use nearly any image to induce LVLMs to produce unsafe content, achieving high success jailbreaking rates against the latest LVLMs.
arXiv Detail & Related papers (2024-11-18T11:58:07Z) - IDEATOR: Jailbreaking and Benchmarking Large Vision-Language Models Using Themselves [64.46372846359694]
IDEATOR is a novel jailbreak method that autonomously generates malicious image-text pairs for black-box jailbreak attacks.
Our benchmark results on 11 recently releasedVLMs reveal significant gaps in safety alignment.
For instance, our challenge set achieves ASRs of 46.31% on GPT-4o and 19.65% on Claude-3.5-Sonnet.
arXiv Detail & Related papers (2024-10-29T07:15:56Z) - ETA: Evaluating Then Aligning Safety of Vision Language Models at Inference Time [12.160713548659457]
adversarial visual inputs can easily bypass VLM defense mechanisms.
We propose a novel two-phase inference-time alignment framework, evaluating input visual contents and output responses.
Experiments show that ETA outperforms baseline methods in terms of harmlessness, helpfulness, and efficiency.
arXiv Detail & Related papers (2024-10-09T07:21:43Z) - PathSeeker: Exploring LLM Security Vulnerabilities with a Reinforcement Learning-Based Jailbreak Approach [25.31933913962953]
Large Language Models (LLMs) have gained widespread use, raising concerns about their security.
We introduce PathSeeker, a novel black-box jailbreak method, which is inspired by the game of rats escaping a maze.
Our method outperforms five state-of-the-art attack techniques when tested across 13 commercial and open-source LLMs.
arXiv Detail & Related papers (2024-09-21T15:36:26Z) - CoCA: Regaining Safety-awareness of Multimodal Large Language Models with Constitutional Calibration [90.36429361299807]
multimodal large language models (MLLMs) have demonstrated remarkable success in engaging in conversations involving visual inputs.
The integration of visual modality has introduced a unique vulnerability: the MLLM becomes susceptible to malicious visual inputs.
We introduce a technique termed CoCA, which amplifies the safety-awareness of the MLLM by calibrating its output distribution.
arXiv Detail & Related papers (2024-09-17T17:14:41Z) - AdaPPA: Adaptive Position Pre-Fill Jailbreak Attack Approach Targeting LLMs [34.221522224051846]
We propose an adaptive position pre-fill jailbreak attack approach for executing jailbreak attacks on Large Language Models (LLMs)
Our method leverages the model's instruction-following capabilities to first output safe content, then exploits its narrative-shifting abilities to generate harmful content.
Our method can improve the attack success rate by 47% on the widely recognized secure model (Llama2) compared to existing approaches.
arXiv Detail & Related papers (2024-09-11T00:00:58Z) - Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training [67.30423823744506]
This study addresses a critical gap in safety tuning practices for Large Language Models (LLMs)
We introduce a novel approach, Decoupled Refusal Training (DeRTa), designed to empower LLMs to refuse compliance to harmful prompts at any response position.
DeRTa incorporates two novel components: (1) Maximum Likelihood Estimation with Harmful Response Prefix, which trains models to recognize and avoid unsafe content by appending a segment of harmful response to the beginning of a safe response, and (2) Reinforced Transition Optimization (RTO), which equips models with the ability to transition from potential harm to safety refusal consistently throughout the harmful
arXiv Detail & Related papers (2024-07-12T09:36:33Z) - Virtual Context: Enhancing Jailbreak Attacks with Special Token Injection [54.05862550647966]
This paper introduces Virtual Context, which leverages special tokens, previously overlooked in LLM security, to improve jailbreak attacks.
Comprehensive evaluations show that Virtual Context-assisted jailbreak attacks can improve the success rates of four widely used jailbreak methods by approximately 40%.
arXiv Detail & Related papers (2024-06-28T11:35:54Z) - Jailbreak Vision Language Models via Bi-Modal Adversarial Prompt [60.54666043358946]
This paper introduces the Bi-Modal Adversarial Prompt Attack (BAP), which executes jailbreaks by optimizing textual and visual prompts cohesively.
In particular, we utilize a large language model to analyze jailbreak failures and employ chain-of-thought reasoning to refine textual prompts.
arXiv Detail & Related papers (2024-06-06T13:00:42Z) - Safety Alignment for Vision Language Models [21.441662865727448]
We enhance the visual modality safety alignment of Vision Language Models (VLMs) by adding safety modules.
Our method boasts ease of use, high flexibility, and strong controllability, and it enhances safety while having minimal impact on the model's general performance.
arXiv Detail & Related papers (2024-05-22T12:21:27Z) - Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models [107.88745040504887]
We study the harmlessness alignment problem of multimodal large language models (MLLMs)
Inspired by this, we propose a novel jailbreak method named HADES, which hides and amplifies the harmfulness of the malicious intent within the text input.
Experimental results show that HADES can effectively jailbreak existing MLLMs, which achieves an average Attack Success Rate (ASR) of 90.26% for LLaVA-1.5 and 71.60% for Gemini Pro Vision.
arXiv Detail & Related papers (2024-03-14T18:24:55Z) - Eyes Closed, Safety On: Protecting Multimodal LLMs via Image-to-Text Transformation [98.02846901473697]
We propose ECSO (Eyes Closed, Safety On), a training-free protecting approach that exploits the inherent safety awareness of MLLMs.
ECSO generates safer responses via adaptively transforming unsafe images into texts to activate the intrinsic safety mechanism of pre-aligned LLMs.
arXiv Detail & Related papers (2024-03-14T17:03:04Z) - AVIBench: Towards Evaluating the Robustness of Large Vision-Language Model on Adversarial Visual-Instructions [52.9787902653558]
Large Vision-Language Models (LVLMs) have shown significant progress in well responding to visual-instructions from users.
Despite the critical importance of LVLMs' robustness against such threats, current research in this area remains limited.
We introduce AVIBench, a framework designed to analyze the robustness of LVLMs when facing various adversarial visual-instructions.
arXiv Detail & Related papers (2024-03-14T12:51:07Z) - On Evaluating Adversarial Robustness of Large Vision-Language Models [64.66104342002882]
We evaluate the robustness of large vision-language models (VLMs) in the most realistic and high-risk setting.
In particular, we first craft targeted adversarial examples against pretrained models such as CLIP and BLIP.
Black-box queries on these VLMs can further improve the effectiveness of targeted evasion.
arXiv Detail & Related papers (2023-05-26T13:49:44Z) - Safety Assessment of Chinese Large Language Models [51.83369778259149]
Large language models (LLMs) may generate insulting and discriminatory content, reflect incorrect social values, and may be used for malicious purposes.
To promote the deployment of safe, responsible, and ethical AI, we release SafetyPrompts including 100k augmented prompts and responses by LLMs.
arXiv Detail & Related papers (2023-04-20T16:27:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.