Related papers: Steering Multimodal Large Language Models Decoding for Context-Aware Safety

Steering Multimodal Large Language Models Decoding for Context-Aware Safety

URL: http://arxiv.org/abs/2509.19212v1
Date: Tue, 23 Sep 2025 16:32:25 GMT
Title: Steering Multimodal Large Language Models Decoding for Context-Aware Safety
Authors: Zheyuan Liu, Zhangchen Xu, Guangyao Dou, Xiangchi Yuan, Zhaoxuan Tan, Radha Poovendran, Meng Jiang,
Abstract summary: Multimodal Large Language Models (MLLMs) are increasingly deployed in real-world applications.<n>Existing methods fail to balance oversensitivity (unjustified refusals of benign queries) and undersensitivity (missed detection of visually grounded risks)<n>We introduce Safety-aware Contrastive Decoding (SafeCoDe), a lightweight and model-agnostic decoding framework that dynamically adjusts token generation based on multimodal context.
Score: 40.668741064553025
License: http://creativecommons.org/publicdomain/zero/1.0/
Abstract: Multimodal Large Language Models (MLLMs) are increasingly deployed in real-world applications, yet their ability to make context-aware safety decisions remains limited. Existing methods often fail to balance oversensitivity (unjustified refusals of benign queries) and undersensitivity (missed detection of visually grounded risks), leaving a persistent gap in safety alignment. To address this issue, we introduce Safety-aware Contrastive Decoding (SafeCoDe), a lightweight and model-agnostic decoding framework that dynamically adjusts token generation based on multimodal context. SafeCoDe operates in two stages: (1) a contrastive decoding mechanism that highlights tokens sensitive to visual context by contrasting real and Gaussian-noised images, and (2) a global-aware token modulation strategy that integrates scene-level reasoning with token-level adjustment to adapt refusals according to the predicted safety verdict. Extensive experiments across diverse MLLM architectures and safety benchmarks, covering undersensitivity, oversensitivity, and general safety evaluations, show that SafeCoDe consistently improves context-sensitive refusal behaviors while preserving model helpfulness.

Related papers

STaR: Sensitive Trajectory Regulation for Unlearning in Large Reasoning Models [12.133996629992318]
We propose a parameter-free, inference-time unlearning framework that achieves robust privacy protection throughout the reasoning process.<n>Experiments on the R-TOFU benchmark demonstrate that STaR achieves comprehensive and stable unlearning with minimal utility loss.
arXiv Detail & Related papers (2026-01-14T08:35:23Z)
SafeRedir: Prompt Embedding Redirection for Robust Unlearning in Image Generation Models [67.84174763413178]
We introduce SafeRedir, a lightweight inference-time framework for robust unlearning via prompt embedding redirection.<n>We show that SafeRedir achieves effective unlearning capability, high semantic and perceptual preservation, robust image quality, and enhanced resistance to adversarial attacks.
arXiv Detail & Related papers (2026-01-13T15:01:38Z)
SafeGRPO: Self-Rewarded Multimodal Safety Alignment via Rule-Governed Policy Optimization [79.14563283347773]
Multimodal large language models (MLLMs) have demonstrated impressive reasoning and instruction-following capabilities.<n>Cross-modal couplings can produce unsafe semantics even when individual inputs are benign.<n>We propose SafeGRPO, a self-rewarded multimodal safety alignment framework.
arXiv Detail & Related papers (2025-11-17T05:09:49Z)
Investigating Safety Vulnerabilities of Large Audio-Language Models Under Speaker Emotional Variations [94.62792643569567]
This work systematically investigates the role of speaker emotion.<n>We construct a dataset of malicious speech instructions expressed across multiple emotions and intensities, and evaluate several state-of-the-art LALMs.<n>Our results reveal substantial safety inconsistencies: different emotions elicit varying levels of unsafe responses, and the effect of intensity is non-monotonic, with medium expressions often posing the greatest risk.
arXiv Detail & Related papers (2025-10-19T15:41:25Z)
AURA: Affordance-Understanding and Risk-aware Alignment Technique for Large Language Models [6.059681491089391]
AURA provides comprehensive, step level evaluations across logical coherence and safety-awareness.<n>Our framework seamlessly combines introspective self-critique, fine-grained PRM assessments, and adaptive safety-aware decoding.<n>This research represents a pivotal step toward safer, more responsible, and contextually aware AI, setting a new benchmark for alignment-sensitive applications.
arXiv Detail & Related papers (2025-08-08T08:43:24Z)
Automating Steering for Safe Multimodal Large Language Models [58.36932318051907]
We introduce a modular and adaptive inference-time intervention technology, AutoSteer, without requiring any fine-tuning of the underlying model.<n>AutoSteer incorporates three core components: (1) a novel Safety Awareness Score (SAS) that automatically identifies the most safety-relevant distinctions among the model's internal layers; (2) an adaptive safety prober trained to estimate the likelihood of toxic outputs from intermediate representations; and (3) a lightweight Refusal Head that selectively intervenes to modulate generation when safety risks are detected.
arXiv Detail & Related papers (2025-07-17T16:04:55Z)
Advancing Neural Network Verification through Hierarchical Safety Abstract Interpretation [52.626086874715284]
We introduce a novel problem formulation called Abstract DNN-Verification, which verifies a hierarchical structure of unsafe outputs.<n>By leveraging abstract interpretation and reasoning about output reachable sets, our approach enables assessing multiple safety levels during the formal verification process.<n>Our contributions include a theoretical exploration of the relationship between our novel abstract safety formulation and existing approaches.
arXiv Detail & Related papers (2025-05-08T13:29:46Z)
Multimodal Situational Safety [73.63981779844916]
We present the first evaluation and analysis of a novel safety challenge termed Multimodal Situational Safety.<n>For an MLLM to respond safely, whether through language or action, it often needs to assess the safety implications of a language query within its corresponding visual context.<n>We develop the Multimodal Situational Safety benchmark (MSSBench) to assess the situational safety performance of current MLLMs.
arXiv Detail & Related papers (2024-10-08T16:16:07Z)
SafeInfer: Context Adaptive Decoding Time Safety Alignment for Large Language Models [5.6874111521946356]
Safety-aligned language models often exhibit fragile and imbalanced safety mechanisms.<n>We propose SafeInfer, a context-adaptive, decoding-time safety alignment strategy.<n>HarmEval is a novel benchmark for extensive safety evaluations.
arXiv Detail & Related papers (2024-06-18T05:03:23Z)

This list is automatically generated from the titles and abstracts of the papers in this site.