Fugu-MT Paper Translation (Abstract): The Cylindrical Representation Hypothesis for Language Model Steering

Paper overview: The Cylindrical Representation Hypothesis for Language Model Steering

arxiv url: http://arxiv.org/abs/2605.01844v1
Date: Sun, 03 May 2026 12:26:13 GMT
Status: Translation complete
Last updated in system: 2026-05-05 20:33:49.959854
Title: The Cylindrical Representation Hypothesis for Language Model Steering
Authors: Lang Gao, Jinghui Zhang, Wei Liu, Fengxian Ji, Chenxi Wang, Zirui Song, Akash Ghosh, Youssef Mohamed, Preslav Nakov, Xiuying Chen
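Abstract summary: The paper formalizes a sample-specific axis-orthogonal structure as the Cylindrical Representation Hypothesis (CRH), in which a central axis captures the main difference between concept absence and presence and drives concept generation. The experiments verify the existence of the cylindrical structure and show that CRH is a valid and practical way to interpret model steering behavior in real settings.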
Reference score (independently computed attention score): 57.97381760521523
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Steering is a widely used technique for controlling large language models, yet its effects are often unstable and hard to predict. Existing theoretical accounts are largely based on the Linear Representation Hypothesis (LRH). While LRH assumes that concepts can be orthogonalized for lossless control, this idealized mapping fails in real representations and cannot account for the observed unpredictability of steering. By relaxing LRH's orthogonality assumption while preserving linear representations, we show that overlapping concept contributions naturally yield a sample-specific axis-orthogonal structure. We formalize this as the Cylindrical Representation Hypothesis (CRH). In CRH, a central axis captures the main difference between concept absence and presence and drives concept generation. A surrounding normal plane controls steering sensitivity by determining how easily the axis can activate the target concept. Within this plane, only specific sensitive sectors strongly facilitate concept activation, while other sectors can suppress or delay it. While the surrounding normal plane can be reliably identified from difference vectors, the sensitive sector cannot, introducing intrinsic uncertainty at the sector level. This uncertainty provides a principled explanation for why steering outcomes often fluctuate even when using well-aligned directions. Our experiments verify the existence of the cylindrical structure and demonstrate that CRH provides a valid and practical way to interpret model steering behavior in real settings: https://github.com/mbzuai-nlp/CRH.
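The axis and normal-plane decomposition described in the abstract can be pictured with a small sketch. This is not the authors' implementation (the repository linked above is the authoritative source). It assumes, purely for illustration, that the central axis is estimated as the unit difference between the mean "concept present" and mean "concept absent" activations. The function names `concept_axis` and `cylindrical_split` and the toy data are hypothetical.

```python
import math


def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def dot(u, v):
    """Euclidean inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def concept_axis(present, absent):
    """Unit difference-of-means direction.

    Assumption: this stands in for the CRH central axis; the paper may
    estimate the axis differently.
    """
    diff = [p - a for p, a in zip(mean_vector(present), mean_vector(absent))]
    norm = math.sqrt(dot(diff, diff))
    if norm == 0.0:
        raise ValueError("present and absent means coincide; axis is undefined")
    return [d / norm for d in diff]


def cylindrical_split(h, axis):
    """Split h into an axial coordinate, a normal-plane vector, and its radius."""
    t = dot(h, axis)                                 # position along the axis
    normal = [x - t * a for x, a in zip(h, axis)]    # component orthogonal to the axis
    radius = math.sqrt(dot(normal, normal))          # distance from the axis
    return t, normal, radius


# Toy 3-dimensional activations (made up for illustration only).
absent = [[0.0, 1.0, 0.0], [0.0, -1.0, 0.0]]
present = [[2.0, 1.0, 0.0], [2.0, -1.0, 0.0]]

axis = concept_axis(present, absent)
t, normal, radius = cylindrical_split([1.0, 0.0, 3.0], axis)
print(axis, t, normal, radius)
```

In this picture the direction of `normal` within the plane orthogonal to `axis` plays the role of the sector. According to the abstract, the sensitive sector cannot be reliably identified from difference vectors, so the sketch only computes the split and makes no claim about which directions are sensitive.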


Related papers