Fugu-MT Paper Translation (Summary): SafeSteer: Adaptive Subspace Steering for Efficient Jailbreak Defense in Vision-Language Models

Paper summary: SafeSteer: Adaptive Subspace Steering for Efficient Jailbreak Defense in Vision-Language Models

arxiv url: http://arxiv.org/abs/2509.21400v1
Date: Wed, 24 Sep 2025 12:46:41 GMT
Status: Translation complete
Last updated in system: 2025-09-29 20:57:53.907564
Title: SafeSteer: Adaptive Subspace Steering for Efficient Jailbreak Defense in Vision-Language Models
Title (reference translation): SafeSteer: Adaptive Subspace Steering for Efficient Jailbreak Defense in Vision-Language Models
Authors: Xiyu Zeng, Siyuan Liang, Liming Lu, Haotian Zhu, Enguang Liu, Jisheng Dang, Yongbin Zhou, Shuchao Pang
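Abstract summary: We propose SafeSteer, a lightweight inference-time steering framework. We show that SafeSteer reduces the attack success rate by over 60% and improves accuracy on normal tasks by 1-2%.
Reference score (attention score computed independently by this site): 25.027627636905475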
License: http://creativecommons.org/licenses/by/4.0/
Abstract: As the capabilities of Vision Language Models (VLMs) continue to improve, they are increasingly targeted by jailbreak attacks. Existing defense methods face two major limitations: (1) they struggle to ensure safety without compromising the model's utility; and (2) many defense mechanisms significantly reduce the model's inference efficiency. To address these challenges, we propose SafeSteer, a lightweight, inference-time steering framework that effectively defends against diverse jailbreak attacks without modifying model weights. At the core of SafeSteer is the innovative use of Singular Value Decomposition to construct a low-dimensional "safety subspace." By projecting and reconstructing the raw steering vector into this subspace during inference, SafeSteer adaptively removes harmful generation signals while preserving the model's ability to handle benign inputs. The entire process is executed in a single inference pass, introducing negligible overhead. Extensive experiments show that SafeSteer reduces the attack success rate by over 60% and improves accuracy on normal tasks by 1-2%, without introducing significant inference latency. These results demonstrate that robust and practical jailbreak defense can be achieved through simple, efficient inference-time control.
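Abstract (reference translation): The capabilities of Vision Language Models (VLMs) continue to improve, and they have become targets of jailbreak attacks. Existing defense methods (1) struggle to ensure safety without compromising the model's utility and (2) significantly reduce the model's inference efficiency. To address these challenges, we propose SafeSteer, a lightweight inference-time steering framework that effectively defends against diverse jailbreak attacks without modifying model weights. At the core of SafeSteer is the innovative use of Singular Value Decomposition (SVD) to construct a low-dimensional "safety subspace." By projecting the raw steering vector into this subspace and reconstructing it during inference, SafeSteer adaptively removes harmful generation signals while preserving the model's ability to handle benign inputs. The entire process runs in a single inference pass and incurs negligible overhead. Extensive experiments show that SafeSteer reduces the attack success rate by over 60% and improves accuracy on normal tasks by 1-2%. These results show that robust, practical jailbreak defense can be achieved through simple, efficient inference-time control.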
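
The abstract describes building a low-dimensional "safety subspace" with Singular Value Decomposition and then projecting and reconstructing a raw steering vector in that subspace. The following is only a minimal sketch of that general idea in NumPy, not the authors' implementation. The abstract does not say which matrix is decomposed, how the subspace dimension is chosen, or how the steering vector is applied to the model, so the input matrix `diffs`, the rank `k`, and the function names `build_safety_subspace` and `project_and_reconstruct` are all assumptions made for illustration.

```python
import numpy as np


def build_safety_subspace(diffs: np.ndarray, k: int) -> np.ndarray:
    """Return an orthonormal basis of shape (d, k) for a low-dimensional subspace.

    `diffs` is a hypothetical (n, d) matrix with one direction per row, for
    example differences between hidden activations. How such a matrix is
    obtained is an assumption; the abstract does not specify it.
    """
    # Thin SVD: diffs = U @ diag(S) @ Vt. The rows of Vt are orthonormal
    # directions in R^d, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    return vt[:k].T


def project_and_reconstruct(raw_vector: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Project `raw_vector` onto the subspace and map it back to R^d."""
    coeffs = basis.T @ raw_vector  # k coordinates inside the subspace
    return basis @ coeffs          # reconstruction in the original space


# Toy data only: random numbers stand in for real model activations.
rng = np.random.default_rng(0)
diffs = rng.normal(size=(32, 16))
basis = build_safety_subspace(diffs, k=4)
raw_vector = rng.normal(size=16)
steering_vector = project_and_reconstruct(raw_vector, basis)
```

Because the basis is orthonormal, the reconstruction is the orthogonal projection of the raw vector onto the subspace: components outside the subspace are discarded and the result never has a larger norm than the input. How the resulting vector is scaled and combined with hidden states at inference time is left out here, since the abstract gives no detail on that step.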

Related papers