Fugu-MT Paper Translation (Abstract): Activation Steering Induces Emergent Misalignment: A More Comprehensive Evaluation

Paper overview: Activation Steering Induces Emergent Misalignment: A More Comprehensive Evaluation

arxiv url: http://arxiv.org/abs/2606.08682v1
Date: Sun, 07 Jun 2026 15:34:59 GMT
Status: Translation complete
System update date: 2026-06-09 14:42:06.388381
Title: Activation Steering Induces Emergent Misalignment: A More Comprehensive Evaluation
Authors: Qi Cao, Jian Lou, Meiting Liu, Wenjie Feng, Dan Li, See-Kiong Ng, Anh Tuan Luu
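Abstract summary: Activation steering is shown to induce broad misalignment, even in the recent Qwen-3.5 series. The paper characterizes properties of AS-induced EM by analyzing key steering-specific factors, including steering magnitude, the low-rank structure of the steering subspace, and the number of epochs during steering-vector construction.
Reference score (attention score computed independently by this site): 74.17379276939599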
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Activation steering has emerged as a popular inference-time technique for modulating the behavior of large language models (LLMs). By constructing a steering vector from examples of a target behavior and injecting it into intermediate activations during inference, activation steering enables flexible behavioral control while avoiding the permanent parameter updates required by finetuning. Meanwhile, recent work has identified emergent misalignment (EM) as a significant safety concern, wherein models finetuned on unsafe examples from a narrow task may unexpectedly generalize to broadly unsafe behavior on unrelated tasks. Although finetuning-induced EM has been extensively studied, whether activation steering can induce EM remains comparatively under-explored, despite its increasing use as a model-control technique. In this paper, we present a comprehensive study of activation-steering-induced emergent misalignment, substantially expanding the evaluation scope beyond existing pioneering work. First, we show that activation steering can induce broad misalignment, even in the recent Qwen-3.5 series. Moreover, activation-steered models produce harmful responses with stronger semantic relevance and higher coherence than their finetuned counterparts, making the resulting misalignment potentially more harmful. Second, we characterize properties of AS-induced EM by analyzing key steering-specific factors, including steering magnitude, the low-rank structure of the steering subspace, and the number of epochs during steering-vector construction. Third, we evaluate the robustness and sensitivity of AS-induced EM across diverse model families, model scales, target tasks, and intervention layers. Our findings reveal activation steering as a significant yet under-examined source of emergent misalignment and provide an activation-space perspective for understanding the mechanisms and safety risks of EM.
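The abstract describes activation steering as building a steering vector from examples of a target behavior and injecting it into intermediate activations at inference time. The toy sketch below illustrates that idea on plain Python lists. It assumes a difference-of-means construction and a simple additive injection scaled by a magnitude parameter. These are common choices in the literature, not necessarily the paper's method (the abstract mentions epochs during steering-vector construction, which suggests a trained vector). All function names and the toy numbers are hypothetical.

```python
from typing import List

Vector = List[float]


def mean_vector(rows: List[Vector]) -> Vector:
    """Element-wise mean of equally sized activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def build_steering_vector(target_acts: List[Vector],
                          baseline_acts: List[Vector]) -> Vector:
    """Assumed construction: mean(target activations) - mean(baseline activations)."""
    mean_target = mean_vector(target_acts)
    mean_baseline = mean_vector(baseline_acts)
    return [t - b for t, b in zip(mean_target, mean_baseline)]


def steer(hidden: Vector, steering: Vector, magnitude: float) -> Vector:
    """Add the scaled steering vector to one intermediate activation."""
    return [h + magnitude * s for h, s in zip(hidden, steering)]


# Toy activations standing in for one layer's hidden states (hypothetical data).
target_acts = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
baseline_acts = [[0.0, 2.0, 0.0], [0.0, 2.0, 0.0]]

v = build_steering_vector(target_acts, baseline_acts)
hidden = [0.5, 0.5, 0.5]
steered = steer(hidden, v, magnitude=2.0)
```

Setting `magnitude` to 0.0 leaves the activation unchanged, which is why steering magnitude is a natural factor to vary in this kind of study. The model weights are never modified; only the activation passed forward changes.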

Related papers