Fugu-MT Paper Translation (Abstract): Latent-space Attacks for Refusal Evasion in Language Models

Paper abstract: Latent-space Attacks for Refusal Evasion in Language Models

arxiv url: http://arxiv.org/abs/2605.21706v1
Date: Wed, 20 May 2026 20:10:27 GMT
Status: Translation complete
Last updated in system: 2026-05-22 20:14:18.486033
Title: Latent-space Attacks for Refusal Evasion in Language Models
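Title (reference translation): Latent-space attacks in language models
Authors: Giorgio Piras, Raffaele Mura, Fabio Brau, Maura Pintor, Luca Oneto, Fabio Roli, Battista Biggio
Abstract summary: We recast refusal suppression as a latent-space evasion attack against linear probes trained to separate refused from answered prompts. We achieve state-of-the-art attack success rates across 15 instruction-tuned, multimodal, and reasoning models.
Reference score (independently computed attention score): 14.290157825353846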
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Safety-aligned language models are trained to refuse harmful requests, yet refusal behavior can be suppressed by steering their internal representations. Existing methods do so by ablating a refusal direction from model activations, aiming to remove refusal from the model's residual stream. Despite their empirical success, these methods lack a principled account of the latent-space transformation they induce and why it suppresses refusal. In this work, we recast refusal suppression as a latent-space evasion attack against linear probes trained to separate refused from answered prompts. Under this view, prior work's difference-in-means direction naturally defines such a probe, and its ablation is exactly a projection onto its decision boundary, i.e., a minimum-confidence evasion attack. This perspective not only explains the empirical success of prior work but also admits a key limitation: evasion stops at the decision boundary, motivating the need to push representations further into the compliant region, i.e., where the model answers. We leverage this by proposing a Controlled Latent-space Evasion attack that projects representations past the boundary with an optimized confidence. We achieve state-of-the-art attack success rate across 15 instruction-tuned, multimodal, and reasoning models, outperforming existing refusal-ablation baselines and specialized jailbreak attacks.
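Abstract (reference translation): Safety-aligned language models are trained to refuse harmful requests, yet their refusal behavior can be suppressed by steering their internal representations. Existing methods ablate a refusal direction from the model's activations, aiming to remove refusal from the model's residual stream. Despite their empirical success, these methods lack a principled account of the latent-space transformation they induce and of why it suppresses refusal. In this work, we recast refusal suppression as a latent-space evasion attack against linear probes trained to separate refused prompts from answered ones. Under this view, the difference-in-means direction of prior work naturally defines such a probe, and ablating it is exactly a projection onto the probe's decision boundary, i.e., a minimum-confidence evasion attack. This perspective not only explains the empirical success of prior work but also exposes a key limitation: evasion stops at the decision boundary, which motivates pushing representations further into the compliant region, i.e., where the model answers. We exploit this by proposing a Controlled Latent-space Evasion attack that projects representations past the boundary with an optimized confidence. We achieve state-of-the-art attack success rates across 15 instruction-tuned, multimodal, and reasoning models, outperforming existing refusal-ablation baselines and specialized jailbreak attacks.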
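
Illustrative sketch (not from the paper): the geometric claim in the abstract, that ablating a direction amounts to projecting onto a linear probe's decision boundary and that one can instead move past the boundary by a chosen margin, can be checked on toy data. The probe form f(x) = w·x + b, the midpoint bias, the function names `diff_in_means_probe` and `project`, and the margin parameter `kappa` are all assumptions made for illustration; the abstract does not specify how the confidence is parameterized or optimized. The inputs are synthetic random vectors, not model activations.

```python
import numpy as np


def diff_in_means_probe(refused, answered):
    """Build a linear probe f(x) = w @ x + b from two class means.

    Placing the boundary at the midpoint of the means is an assumption
    of this sketch, not something stated in the abstract.
    """
    mu_r = refused.mean(axis=0)
    mu_a = answered.mean(axis=0)
    w = mu_r - mu_a                    # difference-in-means direction
    b = -(w @ (mu_r + mu_a)) / 2.0     # f is zero at the midpoint
    return w, b


def project(x, w, b, kappa=0.0):
    """Shift each row of x along w so its probe score becomes -kappa.

    kappa = 0 is the orthogonal projection onto the boundary f(x) = 0;
    kappa > 0 moves the point past the boundary by that score margin.
    """
    x = np.atleast_2d(x)
    score = x @ w + b
    step = (score + kappa) / (w @ w)
    return x - np.outer(step, w)


# Synthetic stand-ins for the two classes of representations.
rng = np.random.default_rng(0)
refused = rng.normal(loc=1.0, size=(50, 8))
answered = rng.normal(loc=-1.0, size=(50, 8))

w, b = diff_in_means_probe(refused, answered)
on_boundary = project(refused, w, b)                # score 0
past_boundary = project(refused, w, b, kappa=2.0)   # score -2

# With a unit direction and zero bias, project() reduces to plain
# directional ablation x - (x . r) r.
r = w / np.linalg.norm(w)
ablated = refused - np.outer(refused @ r, r)
same_as_ablation = np.allclose(project(refused, r, 0.0), ablated)
```

In this toy setting, `kappa` plays the role of a confidence margin: the boundary projection leaves every point with probe score 0, while a positive `kappa` leaves it with score `-kappa` on the other side.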

Related papers