Fugu-MT Paper Translation (Summary): LLM Interpretability with Identifiable Temporal-Instantaneous Representation

Paper summary: LLM Interpretability with Identifiable Temporal-Instantaneous Representation

arxiv url: http://arxiv.org/abs/2509.23323v1
Date: Sat, 27 Sep 2025 14:14:41 GMT
Status: Translation complete
Last updated in system: 2025-09-30 22:32:19.161137
Title: LLM Interpretability with Identifiable Temporal-Instantaneous Representation
Title (reference translation): LLM Interpretability via Temporal-Instantaneous Representation
Authors: Xiangchen Song, Jiaqi Sun, Zijian Li, Yujia Zheng, Kun Zhang
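Abstract summary: This paper proposes a temporal causal representation learning framework designed specifically for large language models. The proposed method provides theoretical guarantees and demonstrates efficacy on synthetic datasets scaled to match real-world complexity.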
Reference score (independently computed attention measure): 18.671694445771113
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Despite Large Language Models' remarkable capabilities, understanding their internal representations remains challenging. Mechanistic interpretability tools such as sparse autoencoders (SAEs) were developed to extract interpretable features from LLMs but lack temporal dependency modeling, instantaneous relation representation, and more importantly theoretical guarantees, undermining both the theoretical foundations and the practical confidence necessary for subsequent analyses. While causal representation learning (CRL) offers theoretically grounded approaches for uncovering latent concepts, existing methods cannot scale to LLMs' rich conceptual space due to inefficient computation. To bridge the gap, we introduce an identifiable temporal causal representation learning framework specifically designed for LLMs' high-dimensional concept space, capturing both time-delayed and instantaneous causal relations. Our approach provides theoretical guarantees and demonstrates efficacy on synthetic datasets scaled to match real-world complexity. By extending SAE techniques with our temporal causal framework, we successfully discover meaningful concept relationships in LLM activations. Our findings show that modeling both temporal and instantaneous conceptual relationships advances the interpretability of LLMs.
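Abstract (reference translation): Despite the remarkable capabilities of large language models, understanding their internal representations remains difficult. Mechanistic interpretability tools such as sparse autoencoders (SAEs) were developed to extract interpretable features from LLMs, but they lack temporal dependency modeling, instantaneous relation representation, and, more importantly, theoretical guarantees. Causal representation learning (CRL) offers theoretically grounded approaches for uncovering latent concepts, but existing methods cannot scale to the rich conceptual space of LLMs because of inefficient computation. To bridge this gap, we introduce a temporal causal representation learning framework designed specifically for the high-dimensional concept space of LLMs. The proposed method provides theoretical guarantees and demonstrates efficacy on synthetic datasets scaled to match real-world complexity. By extending SAE techniques with temporal causal relations, we successfully discover meaningful concept relationships in LLM activations. The results suggest that modeling both temporal and instantaneous conceptual relationships improves the interpretability of LLMs.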

Related papers