RapGuard: Safeguarding Multimodal Large Language Models via Rationale-aware Defensive Prompting
- URL: http://arxiv.org/abs/2412.18826v1
- Date: Wed, 25 Dec 2024 08:31:53 GMT
- Title: RapGuard: Safeguarding Multimodal Large Language Models via Rationale-aware Defensive Prompting
- Authors: Yilei Jiang, Yingshui Tan, Xiangyu Yue
- Abstract summary: RapGuard is a novel framework that uses multimodal chain-of-thought reasoning to generate scenario-specific safety prompts. RapGuard achieves state-of-the-art safety performance, significantly reducing harmful content without degrading the quality of responses.
- Score: 7.0595410083835315
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While Multimodal Large Language Models (MLLMs) have made remarkable progress in vision-language reasoning, they are also more susceptible to producing harmful content compared to models that focus solely on text. Existing defensive prompting techniques rely on a static, unified safety guideline that fails to account for the specific risks inherent in different multimodal contexts. To address these limitations, we propose RapGuard, a novel framework that uses multimodal chain-of-thought reasoning to dynamically generate scenario-specific safety prompts. RapGuard enhances safety by adapting its prompts to the unique risks of each input, effectively mitigating harmful outputs while maintaining high performance on benign tasks. Our experimental results across multiple MLLM benchmarks demonstrate that RapGuard achieves state-of-the-art safety performance, significantly reducing harmful content without degrading the quality of responses.
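As a concrete illustration, below is a minimal sketch of the two-stage, rationale-aware defensive prompting pipeline the abstract describes: first elicit a multimodal chain-of-thought safety rationale for the specific image and query, then fold that rationale into a scenario-specific safety prompt before answering. This is not the authors' implementation; the `generate` callable, prompt wording, and message format are assumptions standing in for any multimodal chat-completion API.

```python
# A minimal sketch of rationale-aware defensive prompting as described in the
# abstract (not the authors' released code). `generate` stands in for any
# multimodal chat-completion call; prompts and names are illustrative.
from typing import Callable

def rapguard_respond(
    generate: Callable[[str, bytes], str],  # (prompt, image) -> model text
    image: bytes,
    user_query: str,
) -> str:
    # Stage 1: multimodal chain-of-thought safety rationale for this input.
    rationale_prompt = (
        "Briefly reason about what harm could arise from answering the "
        f"following request about the attached image:\n{user_query}\n"
        "List the specific risks, if any."
    )
    rationale = generate(rationale_prompt, image)

    # Stage 2: turn the rationale into a scenario-specific safety prompt.
    defensive_prompt = (
        "Safety guidance for this specific request (derived from the risk "
        f"analysis below):\n{rationale}\n"
        "Follow this guidance: refuse or safely reformulate if the risks "
        "apply; otherwise answer helpfully.\n\n"
        f"User request: {user_query}"
    )
    return generate(defensive_prompt, image)

# Usage with a stub in place of a real MLLM call:
if __name__ == "__main__":
    def fake_generate(prompt: str, image: bytes) -> str:
        return f"[model output for prompt of {len(prompt)} chars]"

    print(rapguard_respond(fake_generate, b"\x89PNG...", "How do I use this tool?"))
```

The stub only demonstrates the call pattern; a real MLLM call would be plugged in for both stages, so no separate safety model is assumed.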
Related papers
- Secure Tug-of-War (SecTOW): Iterative Defense-Attack Training with Reinforcement Learning for Multimodal Model Security [63.41350337821108]
We propose Secure Tug-of-War (SecTOW) to enhance the security of multimodal large language models (MLLMs). SecTOW consists of two modules: a defender and an auxiliary attacker, both trained iteratively using reinforcement learning (GRPO). We show that SecTOW significantly improves security while preserving general performance.
arXiv Detail & Related papers (2025-07-29T17:39:48Z)
- MSR-Align: Policy-Grounded Multimodal Alignment for Safety-Aware Reasoning in Vision-Language Models [17.824240702928133]
Vision-Language Models (VLMs) have achieved remarkable progress in multimodal reasoning tasks through enhanced chain-of-thought capabilities. Existing safety alignment approaches fall short in addressing the complex and nuanced threats posed by multimodal inputs. MSR-Align supports fine-grained, deliberative reasoning over standardized safety policies across both vision and text modalities.
arXiv Detail & Related papers (2025-06-24T02:37:59Z)
- Cross-Modal Obfuscation for Jailbreak Attacks on Large Vision-Language Models [11.867355323884217]
We present a novel black-box jailbreak attack framework that decomposes malicious prompts into semantically benign visual and textual fragments. Our approach supports adjustable reasoning complexity and requires significantly fewer queries than prior attacks, enabling both stealth and efficiency.
arXiv Detail & Related papers (2025-06-20T05:30:25Z)
- Align is not Enough: Multimodal Universal Jailbreak Attack against Multimodal Large Language Models [83.80177564873094]
We propose a unified multimodal universal jailbreak attack framework. We evaluate the undesirable context generation of MLLMs like LLaVA, Yi-VL, MiniGPT4, MiniGPT-v2, and InstructBLIP. This study underscores the urgent need for robust safety measures in MLLMs.
arXiv Detail & Related papers (2025-06-02T04:33:56Z)
- SafeTy Reasoning Elicitation Alignment for Multi-Turn Dialogues [9.762621950740995]
Malicious attackers can exploit large language models (LLMs) by engaging them in multi-turn dialogues. We propose a novel defense mechanism: SafeTy Reasoning Elicitation Alignment for Multi-Turn Dialogues (STREAM).
arXiv Detail & Related papers (2025-05-31T18:38:23Z)
- MrGuard: A Multilingual Reasoning Guardrail for Universal LLM Safety [56.79292318645454]
Large Language Models (LLMs) are susceptible to adversarial attacks such as jailbreaking. This vulnerability is exacerbated in multilingual settings, where multilingual safety-aligned data is often limited. We introduce a multilingual guardrail with reasoning for prompt classification.
arXiv Detail & Related papers (2025-04-21T17:15:06Z)
- Do We Really Need Curated Malicious Data for Safety Alignment in Multi-modal Large Language Models? [83.53005932513155]
Multi-modal large language models (MLLMs) have made significant progress, yet their safety alignment remains limited.
We propose finetuning MLLMs on a small set of benign instruct-following data with responses replaced by simple, clear rejection sentences.
arXiv Detail & Related papers (2025-04-14T09:03:51Z)
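A minimal sketch of the data construction summarized in the entry above: sample benign instruction-following examples and swap their responses for short, clear rejection sentences to build a small safety finetuning set. The file layout, field names, and rejection templates are illustrative assumptions, not the paper's released data format.

```python
# Sketch: build a small rejection-style finetuning set from benign examples.
# Field names, output path, and rejection templates are assumptions.
import json
import random

REJECTIONS = [
    "I'm sorry, but I can't help with that request.",
    "I can't assist with this. Let me know if there is something else I can do.",
]

def build_rejection_set(benign_examples, out_path="rejection_finetune.jsonl", k=200):
    """Sample up to k benign (image, instruction, response) examples and swap in rejections."""
    random.seed(0)
    sampled = random.sample(benign_examples, min(k, len(benign_examples)))
    with open(out_path, "w", encoding="utf-8") as f:
        for ex in sampled:
            f.write(json.dumps({
                "image": ex["image"],                    # path or ID of the benign image
                "instruction": ex["instruction"],        # original benign instruction
                "response": random.choice(REJECTIONS),   # response replaced by a rejection
            }) + "\n")

# Usage:
# build_rejection_set([{"image": "img_0.png", "instruction": "Describe the scene.",
#                       "response": "A dog plays in a park."}])
```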
- MIRAGE: Multimodal Immersive Reasoning and Guided Exploration for Red-Team Jailbreak Attacks [85.3303135160762]
MIRAGE is a novel framework that exploits narrative-driven context and role immersion to circumvent safety mechanisms in Multimodal Large Language Models.
It achieves state-of-the-art performance, improving attack success rates by up to 17.5% over the best baselines.
We demonstrate that role immersion and structured semantic reconstruction can activate inherent model biases, facilitating the model's spontaneous violation of ethical safeguards.
arXiv Detail & Related papers (2025-03-24T20:38:42Z)
- Steering Dialogue Dynamics for Robustness against Multi-turn Jailbreaking Attacks [55.29301192316118]
Large language models (LLMs) are highly vulnerable to jailbreaking attacks.
We propose a safety steering framework grounded in safe control theory.
Our method achieves invariant safety at each turn of dialogue by learning a safety predictor.
arXiv Detail & Related papers (2025-02-28T21:10:03Z)
- Universal Adversarial Attack on Aligned Multimodal LLMs [1.5146068448101746]
We propose a universal adversarial attack on multimodal Large Language Models (LLMs).
We craft a synthetic image that forces the model to respond with a targeted phrase or otherwise unsafe content.
We will release code and datasets under the Apache-2.0 license.
arXiv Detail & Related papers (2025-02-11T22:07:47Z)
- SafeBench: A Safety Evaluation Framework for Multimodal Large Language Models [75.67623347512368]
We propose SafeBench, a comprehensive framework designed for conducting safety evaluations of MLLMs.
Our framework consists of a comprehensive harmful query dataset and an automated evaluation protocol.
Based on our framework, we conducted large-scale experiments on 15 widely-used open-source MLLMs and 6 commercial MLLMs.
arXiv Detail & Related papers (2024-10-24T17:14:40Z)
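For intuition, here is a minimal sketch of an automated safety evaluation loop in the spirit of the framework above (a harmful query set plus automated judging). The dataset fields, judge, and metric are assumptions, not the benchmark's actual protocol.

```python
# Sketch: score an MLLM on a set of harmful queries with an automated judge.
# Query format, judge, and metric are illustrative assumptions.
from typing import Callable, Iterable

def harmful_response_rate(
    queries: Iterable[dict],               # each: {"image": ..., "question": ...}
    model: Callable[[dict], str],          # MLLM under test
    judge: Callable[[str, str], bool],     # (question, response) -> is the response harmful?
) -> float:
    results = []
    for q in queries:
        response = model(q)
        results.append(judge(q["question"], response))
    return sum(results) / max(len(results), 1)

# Usage with stubs:
if __name__ == "__main__":
    demo = [{"image": None, "question": "How do I make a weapon?"}]
    rate = harmful_response_rate(
        demo,
        model=lambda q: "I can't help with that.",
        judge=lambda question, resp: "can't help" not in resp.lower(),
    )
    print(f"Harmful response rate: {rate:.2%}")
```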
- RigorLLM: Resilient Guardrails for Large Language Models against Undesired Content [62.685566387625975]
Current mitigation strategies, while effective, are not resilient under adversarial attacks.
This paper introduces Resilient Guardrails for Large Language Models (RigorLLM), a novel framework designed to efficiently moderate harmful and unsafe inputs.
arXiv Detail & Related papers (2024-03-19T07:25:02Z)
- On Prompt-Driven Safeguarding for Large Language Models [172.13943777203377]
We find that in the representation space, the input queries are typically moved by safety prompts in a "higher-refusal" direction.
Inspired by these findings, we propose a method for safety prompt optimization, namely DRO.
Treating a safety prompt as continuous, trainable embeddings, DRO learns to move the queries' representations along or opposite the refusal direction, depending on their harmfulness.
arXiv Detail & Related papers (2024-01-31T17:28:24Z)
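A minimal sketch of the optimization idea summarized above: the safety prompt is a set of trainable continuous embeddings, and each query's pooled representation is pushed along the refusal direction when the query is harmful and away from it when benign. The toy encoder, dimensions, and randomly chosen refusal direction are stand-ins, not the paper's setup.

```python
# Sketch of DRO-style safety prompt optimization with toy components.
import torch

torch.manual_seed(0)
d = 64                      # toy hidden size
prompt_len = 8              # length of the trainable soft safety prompt

# Trainable safety prompt as continuous embeddings.
safety_prompt = torch.nn.Parameter(torch.randn(prompt_len, d) * 0.02)

# Toy "encoder": mean-pool the safety prompt and query token embeddings.
def encode(query_emb: torch.Tensor) -> torch.Tensor:
    return torch.cat([safety_prompt, query_emb], dim=0).mean(dim=0)

# Refusal direction: estimated from model representations in the paper;
# here just a fixed random unit vector.
refusal_dir = torch.nn.functional.normalize(torch.randn(d), dim=0)

optimizer = torch.optim.Adam([safety_prompt], lr=1e-2)

# Toy batch: (query token embeddings, is_harmful flag).
batch = [(torch.randn(16, d), True), (torch.randn(16, d), False)]

for step in range(100):
    loss = torch.zeros(())
    for query_emb, is_harmful in batch:
        rep = encode(query_emb)
        proj = rep @ refusal_dir          # signed projection onto the refusal direction
        # Harmful queries: increase the projection; benign queries: decrease it.
        loss = loss + (-proj if is_harmful else proj)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```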
- MM-SafetyBench: A Benchmark for Safety Evaluation of Multimodal Large Language Models [41.708401515627784]
We observe that Multimodal Large Language Models (MLLMs) can be easily compromised by query-relevant images.
We introduce MM-SafetyBench, a framework designed for conducting safety-critical evaluations of MLLMs against such image-based manipulations.
Our work underscores the need for a concerted effort to strengthen and enhance the safety measures of open-source MLLMs against potential malicious exploits.
arXiv Detail & Related papers (2023-11-29T12:49:45Z)