Fugu-MT Paper Translation (Abstract): Stream: Scaling up Mechanistic Interpretability to Long Context in LLMs via Sparse Attention

Paper overview: Stream: Scaling up Mechanistic Interpretability to Long Context in LLMs via Sparse Attention

arxiv url: http://arxiv.org/abs/2510.19875v1
Date: Wed, 22 Oct 2025 09:42:29 GMT
Status: Translation complete
Last updated in system: 2025-10-25 03:08:16.432531
Title: Stream: Scaling up Mechanistic Interpretability to Long Context in LLMs via Sparse Attention
Title (reference translation): Stream: Scaling Up Mechanistic Interpretability to Long Contexts in LLMs via Sparse Attention
Authors: J Rosser, José Luis Redondo García, Gustavo Penha, Konstantina Palla, Hugues Bouchard
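Abstract summary: Sparse Tracing is a technique that uses dynamic sparse attention to analyze long-context attention patterns efficiently. The paper presents Stream, a compilable hierarchical pruning algorithm that estimates per-head sparse attention masks in near-linear time. The method offers a practical drop-in tool for analyzing attention patterns and tracing information flow without terabytes of caches.
Reference score (independently computed attention measure): 1.5866317687968634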
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: As Large Language Models (LLMs) scale to million-token contexts, traditional Mechanistic Interpretability techniques for analyzing attention scale quadratically with context length, demanding terabytes of memory beyond 100,000 tokens. We introduce Sparse Tracing, a novel technique that leverages dynamic sparse attention to efficiently analyze long context attention patterns. We present Stream, a compilable hierarchical pruning algorithm that estimates per-head sparse attention masks in near-linear time $O(T \log T)$ and linear space $O(T)$, enabling one-pass interpretability at scale. Stream performs a binary-search-style refinement to retain only the top-$k$ key blocks per query while preserving the model's next-token behavior. We apply Stream to long chain-of-thought reasoning traces and identify thought anchors while pruning 97-99\% of token interactions. On the RULER benchmark, Stream preserves critical retrieval paths while discarding 90-96\% of interactions and exposes layer-wise routes from the needle to output. Our method offers a practical drop-in tool for analyzing attention patterns and tracing information flow without terabytes of caches. By making long context interpretability feasible on consumer GPUs, Sparse Tracing helps democratize chain-of-thought monitoring. Code is available at https://anonymous.4open.science/r/stream-03B8/.
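The abstract's claim that dense attention analysis needs terabytes of memory beyond 100,000 tokens can be checked with simple arithmetic. The sketch below is illustrative only: the model shape (32 layers, 32 heads) and the 2-byte fp16 entry size are assumptions chosen for the example, not figures from the paper, and `dense_attention_bytes` is a hypothetical helper name.

```python
def dense_attention_bytes(tokens, layers, heads, bytes_per_entry=2):
    """Bytes needed to store one full T x T attention matrix per head.

    Assumes every layer/head pattern is cached at once and that no
    causal-mask savings are applied (both are simplifying assumptions).
    """
    return tokens * tokens * layers * heads * bytes_per_entry


# Hypothetical model shape: 32 layers x 32 heads, fp16 entries.
for tokens in (10_000, 100_000, 1_000_000):
    terabytes = dense_attention_bytes(tokens, layers=32, heads=32) / 1e12
    print(f"{tokens:>9,} tokens: {terabytes:,.2f} TB")
```

Under these assumptions, 100,000 tokens come to roughly 20 TB, and every tenfold increase in context length multiplies the total by 100, which is the quadratic growth the abstract refers to.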
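Abstract (reference translation): As Large Language Models (LLMs) scale to million-token contexts, traditional Mechanistic Interpretability techniques for analyzing attention scale quadratically with context length and require terabytes of memory beyond 100,000 tokens. Sparse Tracing is a new technique that uses dynamic sparse attention to analyze long-context attention patterns efficiently. The paper presents Stream, a compilable hierarchical pruning algorithm that estimates per-head sparse attention masks in near-linear time $O(T \log T)$ and linear space $O(T)$, enabling one-pass interpretability at scale. Stream performs a binary-search-style refinement that keeps only the top-$k$ key blocks per query while preserving the model's next-token behavior. Applied to long chain-of-thought reasoning traces, Stream identifies thought anchors while pruning 97-99% of token interactions. On the RULER benchmark, Stream preserves critical retrieval paths while discarding 90-96% of interactions and exposes layer-wise routes from the needle to the output. The method offers a practical drop-in tool for analyzing attention patterns and tracing information flow without terabytes of caches. By making long-context interpretability feasible on consumer GPUs, Sparse Tracing helps democratize chain-of-thought monitoring. Code is available at https://anonymous.4open.science/r/stream-03B8/.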
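The idea of a binary-search-style refinement that keeps only the top-k key blocks per query can be illustrated with a toy example. The sketch below is not the paper's Stream algorithm: the scoring rule (dot product against mean-pooled keys), the function names, and the omission of causal masking are all assumptions made for illustration. It only shows why a coarse-to-fine search costs about k log T score evaluations per query, using a pooled-key pyramid that takes O(T) space.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean_pool_pyramid(keys):
    """Level 0 holds the keys; each higher level averages adjacent pairs.

    Assumes len(keys) is a power of two (simplification for this sketch).
    """
    levels = [keys]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([
            [(a + b) / 2 for a, b in zip(prev[i], prev[i + 1])]
            for i in range(0, len(prev), 2)
        ])
    return levels


def select_top_k_blocks(query, keys, k, block_size):
    """Return indices of up to k key blocks chosen by coarse-to-fine search.

    Hypothetical scoring: query dotted with the mean key of each node.
    block_size must be a power of two; causal masking is ignored.
    """
    levels = mean_pool_pyramid(keys)
    target_level = block_size.bit_length() - 1  # node at level l spans 2**l keys
    candidates = [0]  # start from the single root node
    for level in range(len(levels) - 1, target_level, -1):
        # Split every surviving node into its two halves one level down.
        children = [c for node in candidates for c in (2 * node, 2 * node + 1)]
        children.sort(key=lambda c: dot(query, levels[level - 1][c]), reverse=True)
        candidates = children[:k]  # prune: keep only the k best halves
    return sorted(candidates)


# Eight 2-d keys; only keys 4 and 5 align with the query.
keys = [[0.0, 0.0]] * 4 + [[5.0, 0.0]] * 2 + [[0.0, 0.0]] * 2
print(select_top_k_blocks([1.0, 0.0], keys, k=1, block_size=2))
```

In this toy input the search descends to block index 2, which covers keys 4 and 5. A greedy search over pooled keys can miss a strongly matching key whose neighbors pull the average down, so a practical method needs a more careful scoring or verification step than this sketch has.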

Related papers