Fugu-MT Paper Translation (Summary): SafeSteer: A Decoding-level Defense Mechanism for Multimodal Large Language Models

Paper summary: SafeSteer: A Decoding-level Defense Mechanism for Multimodal Large Language Models

arxiv url: http://arxiv.org/abs/2605.11716v1
Date: Tue, 12 May 2026 08:05:10 GMT
Status: Translation complete
System update date: 2026-05-13 21:48:56.691532
Title: SafeSteer: A Decoding-level Defense Mechanism for Multimodal Large Language Models
Title (reference translation): SafeSteer: A Decoding-Level Defense Mechanism for Multimodal Large Language Models
Authors: Xinyi Zeng, Xue Yang, Jingyuan Zhang, Huanqian Yan, Xiang Chen, Kaiwen Wei, Hankun Kang, Yu Tian,
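Abstract summary: This paper introduces SafeSteer, a decoding-level defense mechanism for MLLMs. It includes a Decoding-Probe that detects and corrects harmful output during decoding. MLLM safety can be improved by up to 33.40% without fine-tuning.
Reference score (independently computed attention score): 30.79900292985646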
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal large language models (MLLMs) are gaining increasing attention. Due to the heterogeneity of their input features, they face significant challenges in jailbreak defense. Current defense methods rely on costly fine-tuning or inefficient post-hoc interventions, which limits their ability to address novel attacks and involves performance trade-offs. To address these issues, we explore the inherent safety capabilities of MLLMs and quantify their intrinsic ability to discern harmfulness at the decoding stage. We observe that 1) MLLMs can distinguish harmful from harmless inputs during the decoding process, and 2) image-based attacks are stealthier. Based on these insights, we introduce SafeSteer, a decoding-level defense mechanism for MLLMs. Specifically, it includes a Decoding-Probe, a lightweight probe for detecting and correcting harmful output during decoding, which iteratively steers the decoding process toward safety. Furthermore, a modal semantic alignment vector is integrated to transfer the strong textual safety alignment to the vision modality. Experiments on multiple MLLMs demonstrate that SafeSteer can improve MLLMs' safety by up to 33.40% without fine-tuning. Notably, it maintains the effectiveness of MLLMs, ensuring a balance between their helpfulness and harmlessness.
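Abstract (reference translation): Multimodal large language models (MLLMs) are attracting attention. Because of the heterogeneity of their input features, they face significant challenges in jailbreak defense. Current defense methods rely on costly fine-tuning or inefficient post-hoc interventions, which limits their ability to handle novel attacks and involves performance trade-offs. To address these issues, we examine the intrinsic safety capabilities within MLLMs and quantify their intrinsic ability to identify harmfulness at the decoding stage. We observe that 1) MLLMs can distinguish harmful from harmless inputs during the decoding process, and 2) image-based attacks are stealthier. Based on these findings, we introduce SafeSteer, a decoding-level defense mechanism for MLLMs. Specifically, it includes a Decoding-Probe, a lightweight probe that detects and corrects harmful output during decoding and iteratively steers the decoding process toward safety. In addition, a modal semantic alignment vector is integrated to transfer the strong textual safety alignment to the vision modality. Experiments on multiple MLLMs show that SafeSteer can improve MLLM safety by up to 33.40% without fine-tuning. Notably, it maintains the effectiveness of MLLMs and ensures a balance between their helpfulness and harmlessness.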
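
The abstract describes a lightweight probe that detects harmful output during decoding and iteratively steers the decoding process toward safety, but it does not give the probe architecture or the update rule. The following is a minimal sketch of that general idea only, not the authors' implementation. It assumes a linear probe over a hidden-state vector and a fixed-size step against the probe direction; the names `harm_score` and `steer`, the weights, and the threshold are all hypothetical.

```python
import math


def harm_score(hidden, weights, bias):
    """Hypothetical linear probe: sigmoid(w . h + b), read as P(harmful)."""
    logit = sum(w * h for w, h in zip(weights, hidden)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def steer(hidden, weights, bias, threshold=0.5, step=0.5, max_iters=20):
    """Move `hidden` against the probe direction until the probe score
    drops below `threshold` or `max_iters` steps have been taken.

    Returns the (possibly modified) hidden state and the number of steps.
    This update rule is an assumption made for illustration.
    """
    h = list(hidden)
    norm = math.sqrt(sum(w * w for w in weights))
    unit = [w / norm for w in weights]  # unit vector along the probe direction
    for taken in range(max_iters):
        if harm_score(h, weights, bias) < threshold:
            return h, taken
        # Each step lowers the probe logit by exactly step * ||w||.
        h = [x - step * u for x, u in zip(h, unit)]
    return h, max_iters


# Toy probe parameters and hidden states (made-up numbers, not from the paper).
probe_w = [1.0, -2.0, 0.5]
probe_b = -0.2
flagged = [2.0, -1.0, 1.0]    # probe logit 4.3, scored as harmful
benign = [-1.0, 1.0, 0.0]     # probe logit -3.2, scored as harmless

steered, steps = steer(flagged, probe_w, probe_b)
untouched, zero_steps = steer(benign, probe_w, probe_b)
print(harm_score(flagged, probe_w, probe_b), harm_score(steered, probe_w, probe_b), steps)
print(untouched == benign, zero_steps)
```

In this toy setup a state the probe already scores as harmless is returned unchanged, which matches the abstract's stated goal of preserving helpfulness; how SafeSteer achieves that in practice is not specified in the abstract.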

Related papers