Fugu-MT Paper Translation (Abstract): ICA Lens: Interpreting Language Models Without Training Another Dictionary

Paper overview: ICA Lens: Interpreting Language Models Without Training Another Dictionary

arxiv url: http://arxiv.org/abs/2606.11722v1
Date: Wed, 10 Jun 2026 06:53:58 GMT
Status: Translation complete
Last updated in system: 2026-06-11 16:42:38.331623
Title: ICA Lens: Interpreting Language Models Without Training Another Dictionary
Title (reference translation): ICA Lens: Interpreting Language Models Without Training Another Dictionary
Authors: Sida Liu, Feijiang Han
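Abstract summary: Independent component analysis (ICA) is a classical method for finding non-Gaussian directions in language-model representations. This paper introduces ICALens, the first practical workflow for stable, efficient, and auditable ICA analysis of language-model representations. ICA is competitive with public SAEs in sparse probing and outperforms them in targeted probe perturbation under small-to-medium budgets.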
Reference score (independently computed attention score): 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Finding interpretable directions in language-model representations is critical for understanding and controlling model behavior. Sparse autoencoders (SAEs) have become the standard tool for this purpose, but using them as the default first lens often requires training, storing, and evaluating large overcomplete dictionaries. This bottleneck limits rapid exploration and raises a fundamental question: how much interpretable structure is already visible from activation geometry before training another neural dictionary? Our intuition is simple: many interpretable directions are selective on tokens, and these directions should look less Gaussian than random directions. We therefore revisit independent component analysis (ICA), a classical method for finding non-Gaussian directions, as a compact lens for language-model interpretability. We find that ICA has been underestimated for LLM interpretability, because prior uses often relied on off-the-shelf ICA implementations that are brittle on LLM activations and lacked systematic tools for inspecting and evaluating the recovered directions. To bridge these gaps, we introduce ICALens, the first practical workflow for stable, efficient, and auditable ICA analysis of LLM representations. It combines an optimized GPU-parallel FastICA pipeline with LLM-specific stability recipes and better fitting diagnostics, enabling efficient and reliable layer-wise analysis. Across GPT-2 Small, Gemma 2 2B, and Qwen 3.5 2B Base, ICALens efficiently recovers compact, human-interpretable directions without per-layer gradient-based dictionary training. On SAEBench, ICA is competitive with public SAEs in sparse probing and outperforms them in targeted probe perturbation under small-to-medium budgets. These results suggest that ICA should not be viewed as a weak baseline, but as an efficient and complementary first lens for exploring language-model representations.
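Abstract (reference translation): Finding interpretable directions in language-model representations is important for understanding and controlling model behavior. Sparse autoencoders (SAEs) have become the standard tool for this purpose, but using them as the default first lens often requires training, storing, and evaluating large overcomplete dictionaries. This bottleneck limits rapid exploration and raises a basic question: how much interpretable structure is already visible from activation geometry before another neural dictionary is trained? Our intuition is simple: many interpretable directions are selective on tokens, and such directions should look less Gaussian than random directions. We therefore revisit independent component analysis (ICA), a classical method for finding non-Gaussian directions, as a compact lens for language-model interpretability. ICA has been underestimated for LLM interpretability because earlier uses often relied on off-the-shelf ICA implementations that are brittle on LLM activations, and because systematic tools for inspecting and evaluating the recovered directions were lacking. To bridge these gaps, we introduce ICALens, the first practical workflow for stable, efficient, and auditable ICA analysis of LLM representations. It combines an optimized GPU-parallel FastICA pipeline with LLM-specific stability recipes and better fitting diagnostics, enabling efficient and reliable layer-wise analysis. Across GPT-2 Small, Gemma 2 2B, and Qwen 3.5 2B Base, ICALens efficiently recovers compact, human-interpretable directions without per-layer gradient-based dictionary training. On SAEBench, ICA is competitive with public SAEs in sparse probing and outperforms them in targeted probe perturbation under small-to-medium budgets. These results indicate that ICA should be viewed not as a weak baseline but as an efficient and complementary first lens for exploring language-model representations.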
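The intuition stated in the abstract, that token-selective directions look less Gaussian than random directions, can be illustrated with a small toy sketch. This is not ICALens or FastICA; it only measures non-Gaussianity (excess kurtosis) on synthetic data. All names here (`excess_kurtosis`, `make_toy_activations`, `random_unit_vector`, `project`) and all parameter values are hypothetical, and the "activations" are simulated rather than taken from any model.

```python
import math
import random


def excess_kurtosis(xs):
    """Excess kurtosis: near 0 for Gaussian data, large for heavy-tailed data."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    return m4 / (m2 * m2) - 3.0


def make_toy_activations(n_tokens=5000, dim=32, fire_prob=0.02, seed=0):
    """Synthetic 'activations' (assumption: purely illustrative toy data).

    Coordinate 0 is a token-selective feature that fires on a small
    fraction of tokens; every other coordinate is Gaussian noise.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n_tokens):
        row = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        row[0] = rng.gauss(5.0, 1.0) if rng.random() < fire_prob else 0.0
        rows.append(row)
    return rows


def random_unit_vector(dim, seed=1):
    """A random direction, drawn uniformly on the unit sphere."""
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def project(rows, direction):
    """Project each activation row onto the given direction."""
    return [sum(a * d for a, d in zip(row, direction)) for row in rows]


acts = make_toy_activations()
selective_direction = [1.0] + [0.0] * 31  # axis of the selective feature
random_direction = random_unit_vector(32)

print("selective:", excess_kurtosis(project(acts, selective_direction)))
print("random:   ", excess_kurtosis(project(acts, random_direction)))
```

In this toy setting the selective direction should score far higher than the random one, because a random projection mixes many Gaussian coordinates and so looks close to Gaussian. ICA methods such as FastICA search for directions that maximize a non-Gaussianity measure of this general kind.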

Related papers