$\textit{MMJ-Bench}$: A Comprehensive Study on Jailbreak Attacks and Defenses for Multimodal Large Language Models
- URL: http://arxiv.org/abs/2408.08464v4
- Date: Tue, 22 Oct 2024 02:01:16 GMT
- Title: $\textit{MMJ-Bench}$: A Comprehensive Study on Jailbreak Attacks and Defenses for Multimodal Large Language Models
- Authors: Fenghua Weng, Yue Xu, Chengyan Fu, Wenjie Wang
- Abstract summary: We introduce MMJ-Bench, a unified pipeline for evaluating jailbreak attacks and defense techniques for MLLMs.
We assess the effectiveness of various attack methods against SoTA MLLMs and evaluate the impact of defense mechanisms on both defense effectiveness and model utility.
- Score: 11.02754617539271
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As deep learning advances, Large Language Models (LLMs) and their multimodal counterparts, Multimodal Large Language Models (MLLMs), have shown exceptional performance in many real-world tasks. However, MLLMs face significant security challenges, such as jailbreak attacks, where attackers attempt to bypass the model's safety alignment to elicit harmful responses. The threat of jailbreak attacks on MLLMs arises from both the inherent vulnerabilities of LLMs and the multiple information channels that MLLMs process. While various attacks and defenses have been proposed, there is a notable gap in unified and comprehensive evaluations: each method is evaluated on different datasets and metrics, making it impossible to compare their effectiveness. To address this gap, we introduce \textit{MMJ-Bench}, a unified pipeline for evaluating jailbreak attacks and defense techniques for MLLMs. Through extensive experiments, we assess the effectiveness of various attack methods against SoTA MLLMs and evaluate the impact of defense mechanisms on both defense effectiveness and model utility for normal tasks. Our comprehensive evaluation contributes to the field by offering a unified and systematic evaluation framework and the first publicly available benchmark for MLLM jailbreak research. We also demonstrate several insightful findings that highlight directions for future studies.
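The abstract describes a unified pipeline that runs each attack against each MLLM, optionally applies a defense, and scores attack success rate (ASR) alongside model utility on normal tasks. Below is a minimal sketch of such an evaluation loop; the `Attack`, `Defense`, `Model`, and `Judge` interfaces are illustrative assumptions for this sketch, not MMJ-Bench's actual API.

```python
# Hypothetical sketch of a unified attack/defense evaluation loop.
# The interfaces below are illustrative assumptions, not the MMJ-Bench API.
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class JailbreakCase:
    text: str                       # harmful instruction (possibly rewritten by the attack)
    image: Optional[bytes] = None   # adversarial or typographic image, if the attack uses one


Attack = Callable[[str], JailbreakCase]            # harmful query -> multimodal jailbreak input
Defense = Callable[[JailbreakCase], JailbreakCase]  # sanitizes or wraps the input
Model = Callable[[JailbreakCase], str]              # MLLM under test -> response text
Judge = Callable[[str, str], bool]                  # (query, response) -> judged harmful?


def attack_success_rate(queries: List[str], attack: Attack, model: Model,
                        judge: Judge, defense: Optional[Defense] = None) -> float:
    """Fraction of harmful queries for which the (optionally defended) model is judged jailbroken."""
    successes = 0
    for q in queries:
        case = attack(q)
        if defense is not None:
            case = defense(case)
        response = model(case)
        successes += judge(q, response)
    return successes / max(len(queries), 1)


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end.
    harmful = ["harmful query 1", "harmful query 2"]
    toy_attack: Attack = lambda q: JailbreakCase(text=q)
    toy_model: Model = lambda c: "I cannot help with that."
    toy_judge: Judge = lambda q, r: "cannot" not in r.lower()
    print(f"ASR without defense: {attack_success_rate(harmful, toy_attack, toy_model, toy_judge):.2%}")
```

In the setting the abstract describes, model utility on normal tasks would be measured by a separate benign-task benchmark, so that a defense's reduction in ASR can be weighed against any drop in utility.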
Related papers
- SafeBench: A Safety Evaluation Framework for Multimodal Large Language Models [75.67623347512368]
We propose SafeBench, a comprehensive framework designed for conducting safety evaluations of MLLMs.
Our framework consists of a comprehensive harmful query dataset and an automated evaluation protocol.
Based on our framework, we conducted large-scale experiments on 15 widely-used open-source MLLMs and 6 commercial MLLMs.
arXiv Detail & Related papers (2024-10-24T17:14:40Z) - Cross-modality Information Check for Detecting Jailbreaking in Multimodal Large Language Models [17.663550432103534]
Multimodal Large Language Models (MLLMs) extend the capacity of LLMs to understand multimodal information comprehensively.
These models are susceptible to jailbreak attacks, where malicious users can break the safety alignment of the target model and generate misleading and harmful answers.
We propose Cross-modality Information DEtectoR (CIDER), a plug-and-play jailbreaking detector designed to identify maliciously perturbed image inputs.
arXiv Detail & Related papers (2024-07-31T15:02:46Z) - A Survey of Attacks on Large Vision-Language Models: Resources, Advances, and Future Trends [78.3201480023907]
Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities across a wide range of multimodal understanding and reasoning tasks.
The vulnerability of LVLMs is relatively underexplored, posing potential security risks in daily usage.
In this paper, we provide a comprehensive review of the various forms of existing LVLM attacks.
arXiv Detail & Related papers (2024-07-10T06:57:58Z) - From LLMs to MLLMs: Exploring the Landscape of Multimodal Jailbreaking [32.300594239333236]
Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) have exposed vulnerabilities to various adversarial attacks.
This paper provides a comprehensive overview of jailbreaking research targeting both LLMs and MLLMs, highlighting recent advancements in evaluation benchmarks, attack techniques and defense strategies.
arXiv Detail & Related papers (2024-06-21T04:33:48Z) - Prompt Leakage effect and defense strategies for multi-turn LLM interactions [95.33778028192593]
Leakage of system prompts may compromise intellectual property and act as adversarial reconnaissance for an attacker.
We design a unique threat model which leverages the LLM sycophancy effect and elevates the average attack success rate (ASR) from 17.7% to 86.2% in a multi-turn setting.
We measure the mitigation effect of 7 black-box defense strategies, along with finetuning an open-source model to defend against leakage attempts.
arXiv Detail & Related papers (2024-04-24T23:39:58Z) - Fine-Tuning, Quantization, and LLMs: Navigating Unintended Outcomes [0.0]
Large Language Models (LLMs) have gained widespread adoption across various domains, including chatbots and auto-task completion agents.
These models are susceptible to safety vulnerabilities such as jailbreaking, prompt injection, and privacy leakage attacks.
This study investigates the impact of modifications such as fine-tuning and quantization on LLM safety, a critical consideration for building reliable and secure AI systems.
arXiv Detail & Related papers (2024-04-05T20:31:45Z) - JailBreakV-28K: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks [24.69275959735538]
This paper investigates whether techniques that successfully jailbreak Large Language Models can be equally effective in jailbreaking MLLMs.
We introduce JailBreakV-28K, a pioneering benchmark designed to assess the transferability of LLM jailbreak techniques to MLLMs.
We generate 20,000 text-based jailbreak prompts using advanced jailbreak attacks on LLMs, alongside 8,000 image-based jailbreak inputs from recent MLLM jailbreak attacks.
arXiv Detail & Related papers (2024-04-03T19:23:18Z) - AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting [54.931241667414184]
We propose Adaptive Shield Prompting (AdaShield), which prepends inputs with defense prompts to defend MLLMs against structure-based jailbreak attacks.
Our methods can consistently improve MLLMs' robustness against structure-based jailbreak attacks.
arXiv Detail & Related papers (2024-03-14T15:57:13Z) - SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models [107.82336341926134]
SALAD-Bench is a safety benchmark specifically designed for evaluating Large Language Models (LLMs).
It transcends conventional benchmarks through its large scale, rich diversity, intricate taxonomy spanning three levels, and versatile functionalities.
arXiv Detail & Related papers (2024-02-07T17:33:54Z) - Safety of Multimodal Large Language Models on Images and Texts [33.97489213223888]
In this paper, we systematically survey current efforts on the evaluation, attack, and defense of MLLMs' safety on images and text.
We review the evaluation datasets and metrics for measuring the safety of MLLMs.
Next, we comprehensively present attack and defense techniques related to MLLMs' safety.
arXiv Detail & Related papers (2024-02-01T05:57:10Z) - Defending Pre-trained Language Models as Few-shot Learners against Backdoor Attacks [72.03945355787776]
We advocate MDP, a lightweight, pluggable, and effective defense for PLMs as few-shot learners.
We show analytically that MDP creates an interesting dilemma for the attacker to choose between attack effectiveness and detection evasiveness.
arXiv Detail & Related papers (2023-09-23T04:41:55Z)