Fugu-MT Paper Translation (Abstract): The Geometry of Refusal: Linear Instability in Safety-Aligned LLMs

Paper overview: The Geometry of Refusal: Linear Instability in Safety-Aligned LLMs

arxiv url: http://arxiv.org/abs/2606.22686v1
Date: Sun, 21 Jun 2026 22:04:48 GMT
Status: Translation complete
Last updated in system: 2026-06-25 07:36:34.782246
Title: The Geometry of Refusal: Linear Instability in Safety-Aligned LLMs
Title (reference translation): The Geometry of Refusal: Linear Instability in Safety-Aligned LLMs
Authors: Shivam Ratnakar, Kartikeya Vats
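Abstract summary: We introduce Contrastive Logit Steering (CLS), a zero-optimization framework that isolates the "refusal direction". CLS operates directly on the output distribution and serves as a diagnostic probe for alignment fragility. Experiments on 7 model families reveal that safety implementation is architecturally deterministic.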
Reference score (independently computed attention score): 1.4323566945483497
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Modern Large Language Models (LLMs) rely on extensive safety alignment, yet the mechanistic basis of refusal remains opaque. In this work, we investigate whether safety compliance is a deep semantic decision or a manipulable linear feature. We introduce Contrastive Logit Steering (CLS), a zero-optimization framework that isolates the "refusal direction" by contrasting hidden states derived from safe and unrestricted system prompts. Unlike representation engineering methods that intervene on internal activations, CLS operates directly on the output distribution, serving as a diagnostic probe for alignment fragility. When coupled with prefix injection to bypass initial refusal reflexes, this method induces a phase transition where guardrails collapse. Our experiments on 7 model families reveal that safety implementation is architecturally deterministic. While models like Llama-3.1 exhibit a "Late Decision" topology that is easily bypassed by CLS (reaching 95% ASR in approximately one second), others like Qwen-2.5 demonstrate "Early Divergence" by integrating safety mid-computation. Direct comparison with established activation-level steering methods shows that CLS achieves substantially higher attack success rates on Llama 2 (73% vs. 22.6%) and Qwen 7B (91% vs. 79.2%), demonstrating that logit-level intervention exposes alignment vulnerabilities that hidden-state methods underestimate. Beyond attacks, we show that this linearity enables bidirectional control: inverting the steering vector "hardens" models against jailbreaks without retraining. Our findings suggest that current alignment techniques create a steerable "safety axis" that serves as both a critical vulnerability and a precise primitive for defense.
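Abstract (reference translation): Modern large language models (LLMs) rely on extensive safety alignment, yet the mechanistic basis of refusal remains opaque. This work examines whether safety compliance is a deep semantic decision or a manipulable linear feature. It introduces Contrastive Logit Steering (CLS), a zero-optimization framework that isolates the "refusal direction" by contrasting hidden states derived from safe and unrestricted system prompts. Unlike representation engineering methods that intervene on internal activations, CLS operates directly on the output distribution and serves as a diagnostic probe for alignment fragility. When combined with prefix injection to bypass initial refusal reflexes, it induces a phase transition in which guardrails collapse. Experiments on 7 model families reveal that safety implementation is architecturally deterministic. Models such as Llama-3.1 exhibit a "Late Decision" topology that CLS bypasses easily (reaching 95% ASR in approximately one second), whereas models such as Qwen-2.5 exhibit "Early Divergence" by integrating safety mid-computation. In direct comparison with established activation-level steering methods, CLS achieves substantially higher attack success rates on Llama 2 (73% vs. 22.6%) and Qwen 7B (91% vs. 79.2%), showing that logit-level intervention exposes alignment vulnerabilities that hidden-state methods underestimate. Beyond attacks, this linearity enables bidirectional control: inverting the steering vector "hardens" models against jailbreaks without retraining. These results suggest that current alignment techniques create a steerable "safety axis" that is both a critical vulnerability and a precise primitive for defense.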
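
The abstract describes steering along a contrast direction applied at the output distribution, with the sign of the steering vector determining whether guardrails weaken or harden. The abstract does not give the formula, so the following is only a toy sketch of that general idea under stated assumptions: the direction is taken here as a simple difference of two logit vectors, the three-token vocabulary and its values are synthetic, and the names steer_logits and alpha are hypothetical. It is not the authors' implementation and uses no real model.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def steer_logits(logits_safe, logits_unrestricted, alpha):
    """Shift the safe-prompt logits along the contrast direction.

    Assumption (not from the paper): direction = unrestricted - safe.
    alpha > 0 moves toward the unrestricted distribution,
    alpha < 0 moves away from it (the "hardening" direction),
    alpha = 0 leaves the safe-prompt logits unchanged.
    """
    return [s + alpha * (u - s) for s, u in zip(logits_safe, logits_unrestricted)]


# Synthetic 3-token vocabulary (hypothetical values):
# index 0 = refusal-style token, 1 = compliance-style token, 2 = neutral token.
logits_safe = [2.0, 0.5, 0.0]
logits_unrestricted = [0.5, 2.0, 0.0]

p_base = softmax(steer_logits(logits_safe, logits_unrestricted, 0.0))
p_toward = softmax(steer_logits(logits_safe, logits_unrestricted, 1.5))
p_hardened = softmax(steer_logits(logits_safe, logits_unrestricted, -1.0))

# Probability mass on the refusal-style token under each setting.
print(p_hardened[0], p_base[0], p_toward[0])
```

In this toy setting the probability of the refusal-style token rises when alpha is negative and falls when alpha is positive, which illustrates how a single linear axis could act in both directions.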

Related papers