SaFeR-VLM: Toward Safety-aware Fine-grained Reasoning in Multimodal Models
- URL: http://arxiv.org/abs/2510.06871v2
- Date: Thu, 09 Oct 2025 13:46:31 GMT
- Title: SaFeR-VLM: Toward Safety-aware Fine-grained Reasoning in Multimodal Models
- Authors: Huahui Yi, Kun Wang, Qiankun Li, Miao Yu, Liang Lin, Gongli Xi, Hao Wu, Xuming Hu, Kang Li, Yang Liu,
- Abstract summary: Multimodal Large Reasoning Models (MLRMs) demonstrate impressive cross-modal reasoning but often amplify safety risks under adversarial prompts. Existing defenses mainly act at the output level and do not constrain the reasoning process, leaving models exposed to implicit risks. We propose SaFeR-VLM, which integrates four components and supports dynamic and interpretable safety decisions beyond surface-level filtering.
- Score: 66.71948519280669
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal Large Reasoning Models (MLRMs) demonstrate impressive cross-modal reasoning but often amplify safety risks under adversarial or unsafe prompts, a phenomenon we call the *Reasoning Tax*. Existing defenses mainly act at the output level and do not constrain the reasoning process, leaving models exposed to implicit risks. In this paper, we propose SaFeR-VLM, a safety-aligned reinforcement learning framework that embeds safety directly into multimodal reasoning. The framework integrates four components: (I) QI-Safe-10K, a curated dataset emphasizing safety-critical and reasoning-sensitive cases; (II) safety-aware rollout, where unsafe generations undergo reflection and correction instead of being discarded; (III) structured reward modeling with multi-dimensional weighted criteria and explicit penalties for hallucinations and contradictions; and (IV) GRPO optimization, which reinforces both safe and corrected trajectories. This unified design shifts safety from a passive safeguard to an active driver of reasoning, enabling scalable and generalizable safety-aware reasoning. SaFeR-VLM further demonstrates robustness against both explicit and implicit risks, supporting dynamic and interpretable safety decisions beyond surface-level filtering. SaFeR-VLM-3B achieves average performance of 70.13 and 78.97 on safety and helpfulness across six benchmarks, surpassing both same-scale and >10x larger models such as Skywork-R1V3-38B, Qwen2.5VL-72B, and GLM4.5V-106B. Remarkably, SaFeR-VLM-7B benefits from its increased scale to surpass GPT-5-mini and Gemini-2.5-Flash by 6.47 and 16.76 points respectively on safety metrics, achieving this improvement without any degradation in helpfulness performance. Our codes are available at https://github.com/HarveyYi/SaFeR-VLM.
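The abstract's components (III) and (IV) can be illustrated with a short sketch. This is not the paper's code: the criterion names, weights, and penalty values below are hypothetical, and only the generic shape of a weighted multi-criteria reward and a GRPO-style group-relative advantage (reward normalized by the mean and standard deviation of its sampled group) is shown.

```python
# Illustrative sketch only -- not SaFeR-VLM's implementation.
# Criterion names, weights, and penalty magnitudes are assumptions.
import statistics


def structured_reward(scores, weights, penalties):
    """Weighted sum of per-criterion scores minus explicit penalties
    (e.g. flags for hallucinations or contradictions)."""
    base = sum(w * s for w, s in zip(weights, scores))
    return base - sum(penalties)


def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages as used in GRPO-style optimization:
    each rollout's reward is normalized against the mean and standard
    deviation of its own sampled group."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four rollouts scored on three criteria
# (say safety, helpfulness, faithfulness) with weights summing to 1.
weights = [0.5, 0.3, 0.2]
group_rewards = [
    structured_reward([0.9, 0.7, 0.8], weights, penalties=[0.0]),
    structured_reward([0.2, 0.9, 0.6], weights, penalties=[0.3]),  # hallucination penalty
    structured_reward([0.8, 0.6, 0.9], weights, penalties=[0.0]),
    structured_reward([0.5, 0.5, 0.5], weights, penalties=[0.1]),
]
advantages = grpo_advantages(group_rewards)
```

Because the advantages are mean-centered within each group, they sum to (approximately) zero: rollouts above the group average are reinforced and those below it are suppressed, which matches the abstract's description of reinforcing both safe and corrected trajectories relative to their group.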
Related papers
- ProGuard: Towards Proactive Multimodal Safeguard [48.89789547707647]
ProGuard is a vision-language proactive guard that identifies and describes out-of-distribution (OOD) safety risks. We first construct a modality-balanced dataset of 87K samples, each annotated with both binary safety labels and risk categories. We then train our vision-language base model purely through reinforcement learning to achieve efficient and concise reasoning.
arXiv Detail & Related papers (2025-12-29T16:13:23Z) - Think-Reflect-Revise: A Policy-Guided Reflective Framework for Safety Alignment in Large Vision Language Models [58.17589701432514]
Think-Reflect-Revise (TRR) is a training framework designed to enhance the safety alignment of Large Vision Language Models (LVLMs). We first build a Reflective Safety Reasoning (ReSafe) dataset with 5,000 examples that follow a think-reflect-revise process. We then fine-tune the target model using the ReSafe dataset to initialize reflective behavior, and finally reinforce policy-guided reflection through reinforcement learning.
arXiv Detail & Related papers (2025-12-08T03:46:03Z) - DUAL-Bench: Measuring Over-Refusal and Robustness in Vision-Language Models [59.45605332033458]
Safety mechanisms can backfire, causing over-refusal, where models decline benign requests out of excessive caution. No existing benchmark has systematically addressed over-refusal in the visual modality. This setting introduces unique challenges, such as dual-use cases where an instruction is harmless, but the accompanying image contains harmful content.
arXiv Detail & Related papers (2025-10-12T23:21:34Z) - SafeWork-R1: Coevolving Safety and Intelligence under the AI-45$^{\circ}$ Law [91.33824439029533]
We introduce SafeWork-R1, a cutting-edge multimodal reasoning model that demonstrates the coevolution of capabilities and safety. It is developed by our proposed SafeLadder framework, which incorporates large-scale, progressive, safety-oriented reinforcement learning post-training. We further develop SafeWork-R1-InternVL3-78B, SafeWork-R1-DeepSeek-70B, and SafeWork-R1-Qwen2.5VL-7B.
arXiv Detail & Related papers (2025-07-24T16:49:19Z) - Think in Safety: Unveiling and Mitigating Safety Alignment Collapse in Multimodal Large Reasoning Model [29.63418384788804]
We conduct a safety evaluation of 11 Multimodal Large Reasoning Models (MLRMs) across 5 benchmarks. Our analysis reveals distinct safety patterns across different benchmarks. Leveraging the intrinsic reasoning capabilities of the model to detect unsafe intent is a potential approach to addressing safety issues in MLRMs.
arXiv Detail & Related papers (2025-05-10T06:59:36Z) - Safe RLHF-V: Safe Reinforcement Learning from Multi-modal Human Feedback [34.01716144973483]
Multimodal large language models (MLLMs) are essential for building general-purpose AI assistants. How can we ensure safety alignment of MLLMs to prevent undesired behaviors? In this work, we present Safe RLHF-V, the first multimodal safety alignment framework.
arXiv Detail & Related papers (2025-03-22T07:40:20Z) - The Hidden Risks of Large Reasoning Models: A Safety Assessment of R1 [70.94607997570729]
We present a comprehensive safety assessment of the OpenAI-o3 and DeepSeek-R1 reasoning models. We investigate their susceptibility to adversarial attacks, such as jailbreaking and prompt injection, to assess their robustness in real-world applications.
arXiv Detail & Related papers (2025-02-18T09:06:07Z) - Rethinking Bottlenecks in Safety Fine-Tuning of Vision Language Models [25.606641582511106]
We propose a novel dataset that integrates multi-image inputs with safety Chain-of-Thought (CoT) labels as fine-grained reasoning logic to improve model performance. Our experiments demonstrate that fine-tuning InternVL2.5-8B with MIS significantly outperforms both powerful open-source models and API-based models in challenging multi-image tasks.
arXiv Detail & Related papers (2025-01-30T17:59:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.