Fugu-MT paper translation (abstract): Geometry-Adaptive Explainer for Faithful Dictionary-Based Interpretability under Distribution Shift

Paper overview: Geometry-Adaptive Explainer for Faithful Dictionary-Based Interpretability under Distribution Shift

arxiv url: http://arxiv.org/abs/2605.21849v1
Date: Thu, 21 May 2026 00:46:01 GMT
Status: translation complete
Last updated in system: 2026-05-22 16:35:42.039261
Title: Geometry-Adaptive Explainer for Faithful Dictionary-Based Interpretability under Distribution Shift
Authors: Sungjun Lim, Heedong Kim, Andrew Lee, Kyungwoo Song
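Abstract summary: We show that distribution shift rotates the subspace that the model actively uses, misaligning the explainer's dictionary trained on in-distribution (ID) activations. We formalize this misalignment as the faithfulness gap, a geometric distance between the ID dictionary and the OOD-active subspace. The proposed Geometry-Adaptive Explainer (GAE) realigns the explainer's dictionary with the OOD-active subspace while preserving the original feature structure.
Reference score (independently computed attention score): 17.611062308867275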
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Mechanistic interpretability aims to explain a model's behavior by identifying causally responsible internal structures. Dictionary-based explainers such as sparse autoencoders and transcoders are a primary tool, but their faithfulness under out-of-distribution (OOD) shift has received little systematic attention. We show that distribution shift rotates the subspace that the model actively uses, misaligning the explainer's dictionary trained on in-distribution (ID) activations. We formalize this misalignment as the faithfulness gap, a geometric distance between the ID dictionary and the OOD-active subspace, and show that it controls OOD faithfulness degradation. To reduce this gap, we propose the Geometry-Adaptive Explainer (GAE), which realigns the explainer's dictionary with the OOD-active subspace while preserving the original feature structure. This requires only unlabeled OOD activations and no gradient updates. We prove that GAE improves over the unadapted ID explainer, with excess loss bounded quadratically by the second-moment shift. Empirically, GAE even matches or surpasses all training-based baselines in causal faithfulness across multiple models and OOD settings.
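The abstract describes the faithfulness gap as a geometric distance between the ID dictionary and the OOD-active subspace, but it does not give the formula. The sketch below is only an illustration of one standard choice, the projection distance built from principal angles; it is not the paper's definition. The function names, the use of the top-k eigenvectors of the uncentered second-moment matrix as the "active" subspace, and the toy 3-dimensional data are all assumptions made for this example.

```python
import numpy as np


def orthonormal_basis(atoms):
    """Orthonormal basis (as columns) for the span of the dictionary atoms (columns)."""
    q, _ = np.linalg.qr(atoms)
    return q


def top_k_subspace(activations, k):
    """Top-k eigenvectors of the uncentered second-moment matrix (rows are samples)."""
    second_moment = activations.T @ activations / activations.shape[0]
    _, eigvecs = np.linalg.eigh(second_moment)  # eigenvalues in ascending order
    return eigvecs[:, -k:]


def projection_distance(basis_a, basis_b):
    """sqrt(sum_i sin^2(theta_i)) over the principal angles between two subspaces."""
    cosines = np.linalg.svd(basis_a.T @ basis_b, compute_uv=False)
    cosines = np.clip(cosines, 0.0, 1.0)  # guard against round-off above 1
    return float(np.sqrt(np.sum(1.0 - cosines**2)))


# Hypothetical toy setup: ID activations live in the plane spanned by e1 and e2.
id_plane = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])

# Assumed shift: the active plane is rotated by 30 degrees about the e1 axis.
phi = np.deg2rad(30.0)
ood_plane = np.array([[1.0, 0.0], [0.0, np.cos(phi)], [0.0, np.sin(phi)]])

rng = np.random.default_rng(0)
coeffs = rng.normal(size=(500, 2))
id_acts = coeffs @ id_plane.T
ood_acts = coeffs @ ood_plane.T

# A non-orthogonal two-atom dictionary that spans the ID plane.
dictionary = np.array([[1.0, 1.0], [0.0, 1.0], [0.0, 0.0]])
dict_basis = orthonormal_basis(dictionary)

gap_id = projection_distance(dict_basis, top_k_subspace(id_acts, k=2))
gap_ood = projection_distance(dict_basis, top_k_subspace(ood_acts, k=2))
print(f"gap on ID activations:  {gap_id:.4f}")
print(f"gap on OOD activations: {gap_ood:.4f}")
```

In this toy case the two planes share one direction and differ by a single 30 degree principal angle, so the gap on the OOD activations is sin(30°) = 0.5, while the gap on the ID activations is zero up to round-off. A rotation of the active subspace is exactly what such a distance picks up, even though the dictionary itself is unchanged.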
