Fugu-MT Paper Translation (Abstract): Priors in Time: Missing Inductive Biases for Language Model Interpretability

Paper overview: Priors in Time: Missing Inductive Biases for Language Model Interpretability

arxiv url: http://arxiv.org/abs/2511.01836v1
Date: Mon, 03 Nov 2025 18:43:48 GMT
Status: translation complete
Last updated in system: 2025-11-05 16:37:27.374663
Title: Priors in Time: Missing Inductive Biases for Language Model Interpretability
Title (reference translation): Priors in Time: The Lack of Inductive Biases for Language Model Interpretability
Authors: Ekdeep Singh Lubana, Can Rager, Sai Sumedh R. Hindupur, Valerie Costa, Greta Tuckute, Oam Patel, Sonia Krishna Murthy, Thomas Fel, Daniel Wurgaft, Eric J. Bigelow, Johnny Lin, Demba Ba, Martin Wattenberg, Fernanda Viegas, Melanie Weber, Aaron Mueller,
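Abstract summary: Sparse autoencoders assume that concepts are independent across time, which implies stationarity. This paper introduces Temporal Feature Analysis, a new interpretability objective with a temporal inductive bias, which decomposes a representation into two parts. The results underscore the need for inductive biases that match the data when designing robust interpretability tools.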
Reference score (independently computed attention score): 58.07412640266836
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recovering meaningful concepts from language model activations is a central aim of interpretability. While existing feature extraction methods aim to identify concepts that are independent directions, it is unclear if this assumption can capture the rich temporal structure of language. Specifically, via a Bayesian lens, we demonstrate that Sparse Autoencoders (SAEs) impose priors that assume independence of concepts across time, implying stationarity. Meanwhile, language model representations exhibit rich temporal dynamics, including systematic growth in conceptual dimensionality, context-dependent correlations, and pronounced non-stationarity, in direct conflict with the priors of SAEs. Taking inspiration from computational neuroscience, we introduce a new interpretability objective -- Temporal Feature Analysis -- which possesses a temporal inductive bias to decompose representations at a given time into two parts: a predictable component, which can be inferred from the context, and a residual component, which captures novel information unexplained by the context. Temporal Feature Analyzers correctly parse garden path sentences, identify event boundaries, and more broadly delineate abstract, slow-moving information from novel, fast-moving information, while existing SAEs show significant pitfalls in all the above tasks. Overall, our results underscore the need for inductive biases that match the data in designing robust interpretability tools.
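Abstract (reference translation): Recovering meaningful concepts from language model activations is a central goal of interpretability. Existing feature extraction methods aim to identify concepts as independent directions, but it is unclear whether this assumption can capture the rich temporal structure of language. Specifically, through a Bayesian lens, we show that Sparse Autoencoders (SAEs) impose priors that assume concepts are independent across time, which implies stationarity. Language model representations, by contrast, exhibit rich temporal dynamics, including systematic growth in conceptual dimensionality, context-dependent correlations, and pronounced non-stationarity, all of which conflict directly with the priors of SAEs. Drawing inspiration from computational neuroscience, we introduce a new interpretability objective, Temporal Feature Analysis, whose temporal inductive bias decomposes the representation at a given time into two parts: a predictable component that can be inferred from the context, and a residual component that captures novel information not explained by the context. Temporal Feature Analyzers correctly parse garden path sentences, identify event boundaries, and more broadly separate abstract, slow-moving information from novel, fast-moving information, whereas existing SAEs show significant pitfalls on all of these tasks. Overall, our results highlight the need for inductive biases that match the data when designing robust interpretability tools.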
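Illustrative sketch (not from the paper): the abstract describes splitting the representation at a given time into a component predictable from the context and a residual. The toy example below shows only that general idea, under loud assumptions: the function name `decompose`, the running-mean context summary, and the ridge-regression predictor are all hypothetical choices made for illustration, and none of them is the Temporal Feature Analysis objective itself, which the abstract does not specify.

```python
import numpy as np

def decompose(acts, ridge=1e-3):
    """Split each activation x_t into a context-predicted part and a residual.

    acts: array of shape (T, d), one activation vector per token position.
    Assumption (hypothetical, not the paper's method): the context is
    summarized by the running mean of x_0..x_{t-1}, and the predictor is a
    single linear map fit by ridge regression.
    """
    T, d = acts.shape
    # Running mean over strictly earlier positions; position 0 has no
    # context, so its context vector stays at zero.
    cumsum = np.cumsum(acts, axis=0)
    context = np.zeros_like(acts)
    context[1:] = cumsum[:-1] / np.arange(1, T)[:, None]
    # Closed-form ridge solution for W such that context @ W ~ acts.
    gram = context.T @ context + ridge * np.eye(d)
    W = np.linalg.solve(gram, context.T @ acts)
    predictable = context @ W
    residual = acts - predictable
    return predictable, residual

rng = np.random.default_rng(0)
# Toy "activations": a slowly drifting component plus fast per-token noise.
slow = np.cumsum(rng.normal(scale=0.05, size=(200, 8)), axis=0) + 1.0
acts = slow + rng.normal(scale=0.1, size=(200, 8))
predictable, residual = decompose(acts)
```

By construction the two parts sum back to the original activations, and the first position, which has no context, falls entirely into the residual.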

Related papers