Fugu-MT paper translation (abstract): Linear representations in language models can change dramatically over a conversation

Paper abstract: Linear representations in language models can change dramatically over a conversation

arxiv url: http://arxiv.org/abs/2601.20834v2
Date: Mon, 02 Feb 2026 21:30:09 GMT
Status: translation complete
Last updated in system: 2026-02-04 16:18:58.803642
Title: Linear representations in language models can change dramatically over a conversation
Title (reference translation): Linear representations in language models change dramatically during a conversation
Authors: Andrew Kyle Lampinen, Yuxuan Li, Eghbal Hosseini, Sangnie Bhardwaj, Murray Shanahan,
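Abstract summary: Language model representations often contain linear directions that correspond to high-level concepts. Linear representations can change dramatically over a conversation. We also show that steering along a representational direction can have dramatically different effects at different points in a conversation.
Reference score (independently computed attention score): 12.34627880378922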
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Language model representations often contain linear directions that correspond to high-level concepts. Here, we study the dynamics of these representations: how representations evolve along these dimensions within the context of (simulated) conversations. We find that linear representations can change dramatically over a conversation; for example, information that is represented as factual at the beginning of a conversation can be represented as non-factual at the end and vice versa. These changes are content-dependent; while representations of conversation-relevant information may change, generic information is generally preserved. These changes are robust even for dimensions that disentangle factuality from more superficial response patterns, and occur across different model families and layers of the model. These representation changes do not require on-policy conversations; even replaying a conversation script written by an entirely different model can produce similar changes. However, adaptation is much weaker from simply having a sci-fi story in context that is framed more explicitly as such. We also show that steering along a representational direction can have dramatically different effects at different points in a conversation. These results are consistent with the idea that representations may evolve in response to the model playing a particular role that is cued by a conversation. Our findings may pose challenges for interpretability and steering -- in particular, they imply that it may be misleading to use static interpretations of features or directions, or probes that assume a particular range of features consistently corresponds to a particular ground-truth value. However, these types of representational dynamics also point to exciting new research directions for understanding how models adapt to context.
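Abstract (reference translation): Language model representations often contain linear directions that correspond to high-level concepts. Here, we study the dynamics of these representations: how representations evolve along these dimensions within the context of (simulated) conversations. We find that linear representations can change dramatically over a conversation; for example, information that is represented as factual at the start of a conversation can be represented as non-factual at the end, and vice versa. These changes depend on content: representations of conversation-relevant information may change, while generic information is generally preserved. The changes are robust even for dimensions that disentangle factuality from more superficial response patterns, and they occur across different model families and layers. They do not require on-policy conversations; replaying a conversation script written by an entirely different model can produce similar changes. However, adaptation is much weaker when the context simply contains a sci-fi story that is framed more explicitly as such. We also show that steering along a representational direction can have dramatically different effects at different points in a conversation. These results are consistent with the idea that representations may evolve in response to the model playing a particular role cued by the conversation. In particular, they imply that it may be misleading to use static interpretations of features or directions, or probes that assume a particular range of a feature consistently corresponds to a particular ground-truth value. However, these representational dynamics also point to new research directions for understanding how models adapt to context.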
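
To make the abstract's terms concrete, the following is a minimal toy sketch of what a "linear direction", a projection onto it, and "steering" along it can mean. It is not the authors' method or code. The difference-of-means construction, the hand-made 3-dimensional vectors, the drift per turn, and all function names are assumptions chosen for illustration; no real model activations are involved.

```python
# Toy illustration only: hypothetical vectors, not activations from any real model.

def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def unit(v):
    """Rescale v to unit Euclidean length."""
    norm = dot(v, v) ** 0.5
    return [a / norm for a in v]

def difference_of_means_direction(positive, negative):
    """Unit vector from the mean of `negative` to the mean of `positive`.

    Assumption: this is one common way to obtain a linear direction;
    the paper may use a different procedure.
    """
    mp, mn = mean_vector(positive), mean_vector(negative)
    return unit([a - b for a, b in zip(mp, mn)])

def project(hidden, direction):
    """Scalar coordinate of `hidden` along a unit-length `direction`."""
    return dot(hidden, direction)

def steer(hidden, direction, alpha):
    """Add alpha times the direction to a hidden vector."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

# Hand-made "factual" and "non-factual" example vectors (hypothetical).
factual = [[1.0, 0.2, 0.0], [0.8, -0.1, 0.1], [1.2, 0.0, -0.1]]
non_factual = [[-1.0, 0.1, 0.0], [-0.9, -0.2, 0.1], [-1.1, 0.1, -0.1]]
direction = difference_of_means_direction(factual, non_factual)

# Simulated drift: the same statement's vector moves a fixed step each turn.
start = [0.9, 0.0, 0.0]
drift_per_turn = [-0.4, 0.0, 0.0]
trajectory = [
    [s + turn * d for s, d in zip(start, drift_per_turn)]
    for turn in range(5)
]
projections = [project(h, direction) for h in trajectory]

# A static probe that reads "factual" whenever the projection is positive
# gives different answers for the same statement at different turns.
static_probe_labels = [p > 0 for p in projections]

# In this linear toy, steering shifts the projection by exactly alpha.
steered_projection = project(steer(trajectory[0], direction, 2.0), direction)

for turn, (p, label) in enumerate(zip(projections, static_probe_labels)):
    print(f"turn {turn}: projection={p:+.3f} static_probe_says_factual={label}")
```

In this toy the drift is imposed by hand, so it shows only why a fixed threshold on a fixed direction can mislabel a representation that moves; it says nothing about how or why real models' representations change.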

Related papers