VLMGuard-R1: Proactive Safety Alignment for VLMs via Reasoning-Driven Prompt Optimization
- URL: http://arxiv.org/abs/2504.12661v1
- Date: Thu, 17 Apr 2025 05:46:41 GMT
- Title: VLMGuard-R1: Proactive Safety Alignment for VLMs via Reasoning-Driven Prompt Optimization
- Authors: Menglan Chen, Xianghe Pang, Jingjing Dong, WenHao Wang, Yaxin Du, Siheng Chen
- Abstract summary: We introduce VLMGuard-R1, a proactive framework that refines user inputs through a reasoning-guided rewriter. VLMGuard-R1 achieves a remarkable 43.59% increase in average safety across five models on the SIUO benchmark.
- Score: 29.192704030072516
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Aligning Vision-Language Models (VLMs) with safety standards is essential to mitigate risks arising from their multimodal complexity, where integrating vision and language unveils subtle threats beyond the reach of conventional safeguards. Inspired by the insight that reasoning across modalities is key to preempting intricate vulnerabilities, we propose a novel direction for VLM safety: multimodal reasoning-driven prompt rewriting. To this end, we introduce VLMGuard-R1, a proactive framework that refines user inputs through a reasoning-guided rewriter, dynamically interpreting text-image interactions to deliver refined prompts that bolster safety across diverse VLM architectures without altering their core parameters. To achieve this, we devise a three-stage reasoning pipeline to synthesize a dataset that trains the rewriter to infer subtle threats, enabling tailored, actionable responses over generic refusals. Extensive experiments across three benchmarks with five VLMs reveal that VLMGuard-R1 outperforms four baselines. In particular, VLMGuard-R1 achieves a remarkable 43.59% increase in average safety across five models on the SIUO benchmark.
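As a rough illustration of the proactive setup described in the abstract, the sketch below wraps an unmodified target VLM with a separate reasoning-guided rewriter that refines the (image, text) input before generation. All names (`guarded_generate`, `REWRITE_INSTRUCTION`, the callable model interfaces) are hypothetical placeholders rather than the authors' released code, and the rewrite instruction is only an assumption about how such a rewriter might be prompted.

```python
"""Minimal sketch of reasoning-guided prompt rewriting as a model-agnostic wrapper."""
from typing import Callable

# Both models are treated as black boxes that map (image_path, text) -> text.
RewriterVLM = Callable[[str, str], str]
TargetVLM = Callable[[str, str], str]

# Hypothetical instruction given to the rewriter; the paper's actual prompt is not shown here.
REWRITE_INSTRUCTION = (
    "You are a safety-aware prompt rewriter. Examine the image together with the "
    "user's request, reason step by step about any harmful intent that emerges from "
    "combining them, and output a rewritten prompt that preserves the benign intent "
    "while asking for a safe, helpful answer rather than a blanket refusal.\n\n"
    "User request: {prompt}"
)


def guarded_generate(rewriter: RewriterVLM,
                     target: TargetVLM,
                     image_path: str,
                     user_prompt: str) -> str:
    """Refine the user's prompt with the rewriter, then query the unmodified target VLM."""
    refined_prompt = rewriter(image_path, REWRITE_INSTRUCTION.format(prompt=user_prompt))
    # The target VLM's parameters are never touched; safety comes only from the refined input.
    return target(image_path, refined_prompt)
```

Because the guard operates purely on the input side, the same trained rewriter can in principle sit in front of any of the evaluated VLMs without retraining them, which is the model-agnostic property the abstract emphasizes.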
Related papers
- Automating Steering for Safe Multimodal Large Language Models [36.99946524593795]
We introduce AutoSteer, a modular and adaptive inference-time intervention technique that requires no fine-tuning of the underlying model. AutoSteer incorporates three core components: (1) a novel Safety Awareness Score (SAS) that automatically identifies the most safety-relevant distinctions among the model's internal layers; (2) an adaptive safety prober trained to estimate the likelihood of toxic outputs from intermediate representations; and (3) a lightweight Refusal Head that selectively intervenes to modulate generation when safety risks are detected.
arXiv Detail & Related papers (2025-07-17T16:04:55Z)
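The AutoSteer entry above is a useful contrast to input-side rewriting: it intervenes at inference time on the model's internal representations. The snippet below is a minimal sketch of that pattern under stated assumptions (a pooled hidden state from one probed layer, a binary toxicity prober, and a hard refusal fallback); class and function names are illustrative and do not come from the AutoSteer codebase.

```python
"""Illustrative sketch of prober-gated inference-time intervention (not AutoSteer's code)."""
import torch
import torch.nn as nn


class SafetyProber(nn.Module):
    """Estimates the probability of a toxic output from one intermediate hidden state."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        # Small MLP probe; the probed layer would be chosen by a score such as SAS.
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim, 128), nn.ReLU(), nn.Linear(128, 1)
        )

    def forward(self, hidden_state: torch.Tensor) -> torch.Tensor:
        # hidden_state: pooled representation for one example, shape (hidden_dim,).
        return torch.sigmoid(self.classifier(hidden_state)).squeeze(-1)


def steer(hidden_state: torch.Tensor, prober: SafetyProber,
          generate_fn, refuse_fn, threshold: float = 0.5) -> str:
    """Continue normal decoding unless the prober flags a likely unsafe output."""
    risk = prober(hidden_state).item()
    return refuse_fn() if risk > threshold else generate_fn()
```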
- MSR-Align: Policy-Grounded Multimodal Alignment for Safety-Aware Reasoning in Vision-Language Models [17.824240702928133]
Vision-Language Models (VLMs) have achieved remarkable progress in multimodal reasoning tasks through enhanced chain-of-thought capabilities. Existing safety alignment approaches fall short in addressing the complex and nuanced threats posed by multimodal inputs. MSR-Align supports fine-grained, deliberative reasoning over standardized safety policies across both vision and text modalities.
arXiv Detail & Related papers (2025-06-24T02:37:59Z)
- SafeCoT: Improving VLM Safety with Minimal Reasoning [5.452721786714111]
We introduce SafeCoT, a lightweight, interpretable framework to improve refusal behavior in vision-language models. We show that SafeCoT significantly reduces overrefusal and enhances generalization, even with limited training data.
arXiv Detail & Related papers (2025-06-10T03:13:50Z)
- SafeVid: Toward Safety Aligned Video Large Multimodal Models [60.14535756294228]
We introduce SafeVid, a framework designed to instill video-specific safety principles in Video Large Multimodal Models (VLMMs). SafeVid employs detailed textual video descriptions as an interpretive bridge, facilitating rule-driven safety reasoning. Alignment with SafeVid-350K significantly enhances VLMM safety, with models like LLaVA-NeXT-Video demonstrating substantial improvements.
arXiv Detail & Related papers (2025-05-17T09:21:33Z)
- MIRAGE: Multimodal Immersive Reasoning and Guided Exploration for Red-Team Jailbreak Attacks [85.3303135160762]
MIRAGE is a novel framework that exploits narrative-driven context and role immersion to circumvent safety mechanisms in Multimodal Large Language Models.
It achieves state-of-the-art performance, improving attack success rates by up to 17.5% over the best baselines.
We demonstrate that role immersion and structured semantic reconstruction can activate inherent model biases, facilitating the model's spontaneous violation of ethical safeguards.
arXiv Detail & Related papers (2025-03-24T20:38:42Z)
- Safe RLHF-V: Safe Reinforcement Learning from Human Feedback in Multimodal Large Language Models [34.66687625996389]
Multimodal large language models (MLLMs) are critical for developing general-purpose AI assistants, yet they face growing safety risks. How can we ensure that MLLMs are safely aligned to prevent undesired behaviors such as discrimination, misinformation, or violations of ethical standards? We propose Safe RLHF-V, the first multimodal safety alignment framework that jointly optimizes helpfulness and safety.
arXiv Detail & Related papers (2025-03-22T07:40:20Z)
- VLM-Guard: Safeguarding Vision-Language Models via Fulfilling Safety Alignment Gap [51.287157951953226]
Vision language models (VLMs) come with increased safety concerns. VLMs can be built upon LLMs that have textual safety alignment, but that alignment is easily undermined when the vision modality is integrated. We propose VLM-Guard, an inference-time intervention strategy that leverages the LLM component of a VLM as supervision for the safety alignment of the VLM.
arXiv Detail & Related papers (2025-02-14T08:44:43Z)
- Membership Inference Attacks Against Vision-Language Models [24.47069867575367]
Vision-Language Models (VLMs) have shown exceptional multi-modal understanding and dialog capabilities.
However, the risks of data misuse and leakage remain largely unexplored.
We propose four membership inference methods, each tailored to different levels of background knowledge.
arXiv Detail & Related papers (2025-01-27T05:44:58Z)
- RapGuard: Safeguarding Multimodal Large Language Models via Rationale-aware Defensive Prompting [7.0595410083835315]
RapGuard is a novel framework that uses multimodal chain-of-thought reasoning to generate scenario-specific safety prompts. RapGuard achieves state-of-the-art safety performance, significantly reducing harmful content without degrading the quality of responses.
arXiv Detail & Related papers (2024-12-25T08:31:53Z)
- Retention Score: Quantifying Jailbreak Risks for Vision Language Models [60.48306899271866]
Vision-Language Models (VLMs) are integrated with Large Language Models (LLMs) to enhance multi-modal machine learning capabilities.
This paper aims to assess the resilience of VLMs against jailbreak attacks that can compromise model safety compliance and result in harmful outputs.
To evaluate a VLM's ability to maintain its robustness against adversarial input perturbations, we propose a novel metric called the Retention Score.
arXiv Detail & Related papers (2024-12-23T13:05:51Z)
- Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training [67.30423823744506]
This study addresses a critical gap in safety tuning practices for Large Language Models (LLMs).
We introduce a novel approach, Decoupled Refusal Training (DeRTa), designed to empower LLMs to refuse to comply with harmful prompts at any position in the response.
DeRTa incorporates two novel components: (1) Maximum Likelihood Estimation with Harmful Response Prefix, which trains models to recognize and avoid unsafe content by appending a segment of a harmful response to the beginning of a safe response, and (2) Reinforced Transition Optimization (RTO), which equips models with the ability to transition from potential harm to a safety refusal consistently throughout the harmful response.
arXiv Detail & Related papers (2024-07-12T09:36:33Z)
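The DeRTa entry above describes its first component concretely enough to sketch: pair a prompt plus a truncated harmful-response prefix with a safe (refusal) continuation, so the model learns it may switch to refusal at any point. The code below is a simplified, hypothetical illustration of that data construction, not the paper's implementation, and the random cut point stands in for whatever prefix-sampling scheme the authors actually use.

```python
"""Simplified illustration of harmful-response-prefix training data (not the paper's code)."""
import random


def build_prefixed_example(prompt: str, harmful_response: str, safe_response: str) -> dict:
    """Create one pair: (prompt + truncated harmful prefix) -> safe refusal continuation."""
    tokens = harmful_response.split()
    # Cut the harmful response at a random position so the switch to refusal
    # is learned at every position, not only at the start of the response.
    cut = random.randint(0, len(tokens))
    harmful_prefix = " ".join(tokens[:cut])
    return {
        "input": f"{prompt}\n{harmful_prefix}".rstrip(),
        "target": safe_response,  # e.g. "I can't help with that request."
    }
```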
- Safety Alignment for Vision Language Models [21.441662865727448]
We enhance the visual modality safety alignment of Vision Language Models (VLMs) by adding safety modules.
Our method boasts ease of use, high flexibility, and strong controllability, and it enhances safety while having minimal impact on the model's general performance.
arXiv Detail & Related papers (2024-05-22T12:21:27Z)
- ROSE Doesn't Do That: Boosting the Safety of Instruction-Tuned Large Language Models with Reverse Prompt Contrastive Decoding [89.0074567748505]
We present reverse prompt contrastive decoding (ROSE), a simple-yet-effective method to boost the safety of existing instruction-tuned LLMs without any additional training.
Experiments on 6 safety and 2 general-purpose tasks show that ROSE not only brings consistent and significant safety improvements (up to +13.8% safety score) across 5 types of instruction-tuned LLMs, but also benefits the general-purpose ability of LLMs.
arXiv Detail & Related papers (2024-02-19T06:58:42Z)
- FigStep: Jailbreaking Large Vision-Language Models via Typographic Visual Prompts [14.33139608409507]
We propose FigStep, a straightforward yet effective black-box jailbreak algorithm against LVLMs. FigStep converts the prohibited content into images through typography to bypass the safety alignment. Our work reveals that current LVLMs are vulnerable to jailbreak attacks, which highlights the necessity of novel cross-modality safety alignment techniques.
arXiv Detail & Related papers (2023-11-09T18:59:11Z)
- On Evaluating Adversarial Robustness of Large Vision-Language Models [64.66104342002882]
We evaluate the robustness of large vision-language models (VLMs) in the most realistic and high-risk setting.
In particular, we first craft targeted adversarial examples against pretrained models such as CLIP and BLIP.
Black-box queries on these VLMs can further improve the effectiveness of targeted evasion.
arXiv Detail & Related papers (2023-05-26T13:49:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.