Fugu-MT Paper Translation (Abstract): When is Your LLM Steerable?

Paper overview: When is Your LLM Steerable?

arxiv url: http://arxiv.org/abs/2606.11599v1
Date: Wed, 10 Jun 2026 02:55:34 GMT
Status: Translation complete
Last updated in system: 2026-06-11 16:42:38.258109
Title: When is Your LLM Steerable?
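Title (reference translation): When Is an LLM Steerable?
Authors: Chenrui Fan, Yize Cheng, Ming Li, Soheil Feizi, Tianyi Zhou
Abstract summary: Activation steering offers a lightweight approach to controlling a language model's behavior at inference time. Finding the regime and boundaries of successful steering typically requires expensive grid searches. The paper investigates whether steerability can be predicted from the model's internal states at the start of the generation process.
Reference score (independently computed attention score): 56.656180566692946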
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Activation steering offers a lightweight approach to control language models' behavior at inference time, but whether it succeeds or fails heavily depends on the prompt, concept, model, and steering configuration. Finding the regime and boundaries of successful steering typically requires expensive grid searches and post-hoc evaluation of full autoregressive rollouts. In this work, we investigate whether steerability can be predicted from the model's internal states at the beginning of the generation process, e.g., after generating the first few tokens, and how to leverage such a predictor to improve steering success rate. To this end, we first introduce ASTEER, a testbed including 1.4M steered generations, spanning 150 concepts with each steering success/failure labeled. Leveraging this testbed, we analyze the model's early decoding dynamics by extracting features that compare hidden states before and after steering across layers and initial decoding steps. These features help us understand how steering's effects propagate along layers and token positions, which provide key information for steerability prediction. We then train a Gradient Boosting Decision Trees (GBDT) classifier on these features to predict whether an intervention will under-steer, succeed, or over-steer without requiring full rollout. Our predictor achieves around 0.7 macro-F1 score on unseen concepts, demonstrating that early hidden states encode substantial, structured information about eventual steering efficacy. We further leverage this steerability predictor as guidance for steering strength searching, achieving near-optimal performance with a small fraction of decoding cost.
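The abstract says the features compare hidden states before and after steering across layers and initial decoding steps, but it does not name the features themselves. The following is a minimal sketch under that gap: cosine similarity and relative L2 shift are assumed stand-ins, and the function names (`cosine`, `relative_shift`, `early_features`) and the `[layer][step]` nesting are hypothetical, not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def relative_shift(base, steered):
    """L2 norm of (steered - base), scaled by the L2 norm of base."""
    diff = math.sqrt(sum((s - b) ** 2 for b, s in zip(base, steered)))
    norm = math.sqrt(sum(b * b for b in base))
    return diff / norm if norm > 0 else 0.0

def early_features(base_states, steered_states):
    """Flatten per-(layer, step) comparisons into one feature vector.

    Both inputs are nested lists indexed as [layer][step] -> hidden vector.
    The choice of two statistics per position is an assumption for illustration.
    """
    feats = []
    for base_layer, steered_layer in zip(base_states, steered_states):
        for b, s in zip(base_layer, steered_layer):
            feats.append(cosine(b, s))
            feats.append(relative_shift(b, s))
    return feats

# Toy data: 2 layers, 2 decoding steps, hidden size 2 (made up, not real activations).
base = [[[1.0, 0.0], [0.0, 1.0]],
        [[1.0, 1.0], [2.0, 0.0]]]
steered = [[[1.0, 0.0], [1.0, 1.0]],
           [[2.0, 2.0], [0.0, 2.0]]]
features = early_features(base, steered)  # 2 layers * 2 steps * 2 statistics = 8 values
```

A vector like `features` would then be one row of the training matrix for a three-class classifier (under-steer, succeed, over-steer). The abstract names Gradient Boosting Decision Trees for that step; any off-the-shelf implementation could be tried.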
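Abstract (reference translation): Activation steering offers a lightweight approach to controlling a language model's behavior at inference time, but whether it succeeds or fails depends heavily on the prompt, concept, model, and steering configuration. Finding the regime and boundaries of successful steering typically requires expensive grid searches and post-hoc evaluation of full autoregressive rollouts. This work investigates whether steerability can be predicted from the model's internal states at the start of the generation process, for example after the first few tokens have been generated, and how such a predictor can be used to improve the steering success rate. To this end, the authors first introduce ASTEER, a testbed of 1.4M steered generations spanning 150 concepts, each labeled with steering success or failure. Using this testbed, they analyze the model's early decoding dynamics by extracting features that compare hidden states before and after steering across layers and initial decoding steps. These features help explain how the effects of steering propagate along layers and token positions, and they provide key information for steerability prediction. The authors then train a Gradient Boosting Decision Trees (GBDT) classifier on these features to predict whether an intervention will under-steer, succeed, or over-steer without requiring a full rollout. The predictor achieves a macro-F1 score of around 0.7 on unseen concepts, showing that early hidden states encode substantial, structured information about eventual steering efficacy. The predictor is further used as guidance for steering strength search, achieving near-optimal performance at a small fraction of the decoding cost.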
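The abstract mentions using the steerability predictor to guide the search over steering strength but gives no procedure. One way this could look is sketched below, assuming the three-way label is monotone in strength (too weak gives "under", too strong gives "over"), which makes bisection applicable. Both the monotonicity assumption and the bisection strategy are illustrative choices, not the paper's stated method, and `toy_predictor` is a hypothetical stand-in for a trained classifier.

```python
def search_strength(predict, lo, hi, max_probes=10):
    """Bisect over steering strength using a three-way early predictor.

    predict(strength) must return "under", "success", or "over".
    Returns (strength, probes_used), or (None, probes_used) if no probe succeeded.
    Assumes the label is monotone in strength; this is an illustrative assumption.
    """
    probes = 0
    while probes < max_probes:
        mid = (lo + hi) / 2.0
        probes += 1
        label = predict(mid)
        if label == "success":
            return mid, probes
        if label == "under":
            lo = mid  # predicted too weak: try stronger steering
        else:
            hi = mid  # predicted too strong: try weaker steering
    return None, probes

def toy_predictor(strength):
    """Hypothetical stand-in: steering 'succeeds' only in the band [3.0, 4.0]."""
    if strength < 3.0:
        return "under"
    if strength > 4.0:
        return "over"
    return "success"

# Probes 8.0 ("over"), then 4.0 ("success"): two predictor calls in total.
best, probes = search_strength(toy_predictor, lo=0.0, hi=16.0)
```

In this sketch each probe costs only the few decoding steps the predictor needs, whereas an exhaustive grid search would require a full rollout per candidate strength.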


Related papers