Fugu-MT Paper Translation (Abstract): Angular Steering: Behavior Control via Rotation in Activation Space

Paper summary: Angular Steering: Behavior Control via Rotation in Activation Space

arxiv url: http://arxiv.org/abs/2510.26243v1
Date: Thu, 30 Oct 2025 08:23:35 GMT
Status: Translation complete
Last updated in system: 2025-10-31 16:05:09.710281
Title: Angular Steering: Behavior Control via Rotation in Activation Space
Authors: Hieu M. Vu, Tan M. Nguyen
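Abstract summary: Angular Steering is a novel and flexible method for behavior modulation. It operates by rotating activations within a fixed two-dimensional subspace. It provides continuous, fine-grained control over behaviors such as refusal and compliance.
Reference score (independently computed attention score): 1.3400719989424488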
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Controlling specific behaviors in large language models while preserving their general capabilities is a central challenge for safe and reliable artificial intelligence deployment. Current steering methods, such as vector addition and directional ablation, are constrained within a two-dimensional subspace defined by the activation and feature direction, making them sensitive to chosen parameters and potentially affecting unrelated features due to unintended interactions in activation space. We introduce Angular Steering, a novel and flexible method for behavior modulation that operates by rotating activations within a fixed two-dimensional subspace. By formulating steering as a geometric rotation toward or away from a target behavior direction, Angular Steering provides continuous, fine-grained control over behaviors such as refusal and compliance. We demonstrate this method using refusal steering and emotion steering as use cases. Additionally, we propose Adaptive Angular Steering, a selective variant that rotates only activations aligned with the target feature, further enhancing stability and coherence. Angular Steering generalizes existing addition and orthogonalization techniques under a unified geometric rotation framework, simplifying parameter selection and maintaining model stability across a broader range of adjustments. Experiments across multiple model families and sizes show that Angular Steering achieves robust behavioral control while maintaining general language modeling performance, underscoring its flexibility, generalization, and robustness compared to prior approaches. Code and artifacts are available at https://github.com/lone17/angular-steering/.
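Illustration: the abstract describes steering as a rotation of activations inside a fixed two-dimensional subspace, with an adaptive variant that rotates only activations aligned with the target feature. The NumPy sketch below shows one way such a rotation could be written. It is not taken from the authors' repository. The function names, the use of Gram-Schmidt to build the plane from a feature direction and a second direction, and the reading of "aligned" as "positive projection onto the feature direction" are all assumptions made for illustration.

```python
import numpy as np


def plane_basis(feature_dir, second_dir):
    """Return an orthonormal basis (b1, b2) of the plane spanned by two directions.

    Hypothetical helper: how the second direction is chosen is an assumption here.
    """
    b1 = feature_dir / np.linalg.norm(feature_dir)
    # Gram-Schmidt step: remove the b1 component from the second direction.
    residual = second_dir - (second_dir @ b1) * b1
    b2 = residual / np.linalg.norm(residual)
    return b1, b2


def rotate_in_plane(acts, b1, b2, angle, adaptive=False):
    """Rotate each row of `acts` by `angle` radians inside span(b1, b2).

    The component of each activation orthogonal to the plane is left unchanged,
    so the norm of every activation is preserved.
    """
    acts = np.asarray(acts, dtype=float)
    x = acts @ b1  # coordinate along the feature direction
    y = acts @ b2  # coordinate along the second plane axis
    c, s = np.cos(angle), np.sin(angle)
    new_x = c * x - s * y
    new_y = s * x + c * y
    # Swap the old in-plane coordinates for the rotated ones.
    rotated = acts + np.outer(new_x - x, b1) + np.outer(new_y - y, b2)
    if adaptive:
        # Assumed selection rule: only rotate rows that project positively
        # onto the feature direction; all other rows pass through untouched.
        mask = x > 0
        rotated = np.where(mask[:, None], rotated, acts)
    return rotated


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    b1, b2 = plane_basis(rng.normal(size=16), rng.normal(size=16))
    activations = rng.normal(size=(4, 16))
    steered = rotate_in_plane(activations, b1, b2, np.pi / 3, adaptive=True)
    print(steered.shape)
```

Because only the two in-plane coordinates change, the rotation angle is the single steering parameter in this sketch, which mirrors the abstract's point that a rotation framework simplifies parameter selection.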

Related papers