Fugu-MT Paper Translation (Abstract): Activation Steering for Aligned Open-ended Generation without Sacrificing Coherence

Paper overview: Activation Steering for Aligned Open-ended Generation without Sacrificing Coherence

arxiv url: http://arxiv.org/abs/2604.08169v1
Date: Thu, 09 Apr 2026 12:28:22 GMT
Status: Translation complete
Last updated in system: 2026-04-10 18:34:05.912148
Title: Activation Steering for Aligned Open-ended Generation without Sacrificing Coherence
Authors: Niklas Herbster, Martin Zborowski, Alberto Tosato, Gauthier Gidel, Tommaso Tosato
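Abstract summary: Misalignment can be triggered by adversarial prompts, benign fine-tuning, emergent misalignment, and goal misgeneralization. Recent evidence suggests that some misalignment behaviors are encoded as linear structure in activation space, which makes them tractable via steering. These findings motivate activation steering as a lightweight runtime defense that continuously corrects misaligned activations throughout generation.
Reference score (attention score computed independently by this site): 16.403654360036498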
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Alignment in LLMs is more brittle than commonly assumed: misalignment can be triggered by adversarial prompts, benign fine-tuning, emergent misalignment, and goal misgeneralization. Recent evidence suggests that some misalignment behaviors are encoded as linear structure in activation space, making them tractable via steering, while safety alignment has been shown to primarily govern the first few output tokens, leaving subsequent generation unguarded. These findings motivate activation steering as a lightweight runtime defense that continuously corrects misaligned activations throughout generation. We evaluate three methods: Steer-With-Fixed-Coeff (SwFC), which applies uniform additive steering, and two novel projection-aware methods, Steer-to-Target-Projection (StTP) and Steer-to-Mirror-Projection (StMP), that use a logistic regression decision boundary to selectively intervene only on tokens whose activations fall below distributional thresholds. Using malicious system prompts as a controlled proxy for misalignment, we evaluate under two threat models (dishonesty and dismissiveness) and two architectures (Llama-3.3-70B-Instruct, Qwen3-32B). All methods substantially recover target traits (honesty and compassion) while preserving coherence. StTP and StMP better maintain general capabilities (MMLU, MT-Bench, AlpacaEval) and produce less repetition in multi-turn conversations.
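The abstract names the three steering methods but does not give their update rules. The following Python sketch shows one plausible reading for a single token's activation vector. Everything in it is an assumption made for illustration, not the authors' implementation: the function names, the use of a scalar projection onto a steering direction, the parameters `coeff`, `threshold`, `target`, and `boundary`, and in particular the interpretation of "target" as moving the projection to a fixed value and "mirror" as reflecting it across the decision boundary.

```python
import math
from typing import List

Vector = List[float]


def _dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def projection(h: Vector, direction: Vector) -> float:
    """Scalar projection of activation h onto the (normalized) steering direction."""
    return _dot(h, direction) / math.sqrt(_dot(direction, direction))


def _shift_along(h: Vector, direction: Vector, delta: float) -> Vector:
    """Return h moved by delta units along the normalized direction."""
    norm = math.sqrt(_dot(direction, direction))
    return [x + delta * d / norm for x, d in zip(h, direction)]


def steer_fixed(h: Vector, direction: Vector, coeff: float) -> Vector:
    """Assumed SwFC-style rule: add the same offset to every token."""
    return _shift_along(h, direction, coeff)


def steer_to_target(h: Vector, direction: Vector,
                    threshold: float, target: float) -> Vector:
    """Assumed StTP-style rule: only tokens whose projection is below
    `threshold` are moved, and they are moved so the projection equals `target`."""
    p = projection(h, direction)
    if p >= threshold:
        return list(h)  # token left untouched
    return _shift_along(h, direction, target - p)


def steer_to_mirror(h: Vector, direction: Vector,
                    threshold: float, boundary: float) -> Vector:
    """Assumed StMP-style rule: only tokens below `threshold` are moved,
    and their projection is reflected across `boundary`."""
    p = projection(h, direction)
    if p >= threshold:
        return list(h)  # token left untouched
    return _shift_along(h, direction, 2.0 * (boundary - p))


if __name__ == "__main__":
    direction = [1.0, 0.0]   # hypothetical steering direction
    h = [-2.0, 3.0]          # hypothetical token activation
    print(steer_fixed(h, direction, coeff=1.5))
    print(steer_to_target(h, direction, threshold=0.0, target=1.0))
    print(steer_to_mirror(h, direction, threshold=0.0, boundary=0.0))
```

In this reading, the fixed-coefficient rule shifts every token regardless of where it sits, whereas the two projection-aware rules leave tokens above the threshold unchanged. That selectivity is one possible explanation for the better capability retention the abstract reports, though the abstract itself does not state the mechanism.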


List of related papers