Fugu-MT Paper Translation (Abstract): Query Circuits: Explaining How Language Models Answer User Prompts

Paper abstract: Query Circuits: Explaining How Language Models Answer User Prompts

arxiv url: http://arxiv.org/abs/2509.24808v1
Date: Mon, 29 Sep 2025 13:59:02 GMT
Status: Translation complete
Last updated in system: 2025-09-30 22:32:20.031942
Title: Query Circuits: Explaining How Language Models Answer User Prompts
Title (reference translation): Query Circuits: Explaining How Language Models Answer User Prompts
Authors: Tung-Yu Wu, Fazl Barez
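Abstract summary: We introduce query circuits, which trace the information flow inside a model that maps a specific input to the output. NDF is a metric that evaluates how well a discovered circuit recovers the model's decision for a specific input. We find that extremely sparse query circuits exist within the model and can recover much of its performance on single queries.
Reference score (independently computed attention score): 13.16677655895186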
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Explaining why a language model produces a particular output requires local, input-level explanations. Existing methods uncover global capability circuits (e.g., indirect object identification), but not why the model answers a specific input query in a particular way. We introduce query circuits, which directly trace the information flow inside a model that maps a specific input to the output. Unlike surrogate-based approaches (e.g., sparse autoencoders), query circuits are identified within the model itself, resulting in more faithful and computationally accessible explanations. To make query circuits practical, we address two challenges. First, we introduce Normalized Deviation Faithfulness (NDF), a robust metric to evaluate how well a discovered circuit recovers the model's decision for a specific input, which is broadly applicable to circuit discovery beyond our setting. Second, we develop sampling-based methods to efficiently identify circuits that are sparse yet faithfully describe the model's behavior. Across benchmarks (IOI, arithmetic, MMLU, and ARC), we find that there exist extremely sparse query circuits within the model that can recover much of its performance on single queries. For example, a circuit covering only 1.3% of model connections can recover about 60% of performance on MMLU questions. Overall, query circuits provide a step towards faithful, scalable explanations of how language models process individual inputs.
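Abstract (reference translation): Explaining why a language model produces a particular output requires local, input-level explanations. Existing methods reveal global capability circuits (e.g., indirect object identification), but not why the model answers a specific input query in a particular way. We introduce query circuits, which directly trace the information flow inside the model that maps a specific input to the output. Unlike surrogate-based approaches (e.g., sparse autoencoders), query circuits are identified within the model itself, yielding more faithful and computationally accessible explanations. To make query circuits practical, we address two challenges. First, we introduce Normalized Deviation Faithfulness (NDF), a robust metric for evaluating how well a discovered circuit recovers the model's decision for a specific input, which is also broadly applicable to circuit discovery beyond our setting. Second, we develop sampling-based methods that efficiently identify circuits that are sparse yet faithfully describe the model's behavior. Across benchmarks (IOI, arithmetic, MMLU, and ARC), we find that extremely sparse query circuits exist within the model and can recover much of its performance on single queries. For example, a circuit covering only 1.3% of model connections can recover about 60% of performance on MMLU questions. Overall, query circuits provide a step toward faithful, scalable explanations of how language models process individual inputs.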
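
Note: the abstract's example (a circuit covering 1.3% of connections recovering about 60% of performance) combines two quantities, a sparsity ratio and a recovery ratio. The sketch below shows how such ratios are commonly computed in circuit-discovery work. It is not the paper's NDF metric, whose definition the abstract does not give; the function names, the choice of a fully ablated baseline, and all numbers are hypothetical and chosen only to reproduce the two percentages.

```python
def normalized_faithfulness(circuit_score, full_score, baseline_score):
    """Fraction of the full-model-vs-baseline gap that the circuit recovers.

    This is a generic normalized recovery ratio, NOT the paper's NDF.
    All three scores are assumed to be the same task metric
    (for example, a logit difference) on a single query.
    """
    gap = full_score - baseline_score
    if gap == 0:
        raise ValueError("full and baseline scores coincide; ratio is undefined")
    return (circuit_score - baseline_score) / gap


def edge_sparsity(kept_edges, total_edges):
    """Share of model connections retained in the circuit."""
    if total_edges <= 0:
        raise ValueError("total_edges must be positive")
    return kept_edges / total_edges


if __name__ == "__main__":
    # Hypothetical values for one query, for illustration only.
    full = 4.0       # metric with the full model
    baseline = 0.0   # metric with every connection ablated (assumed baseline)
    circuit = 2.4    # metric with only the circuit's connections active

    print(f"recovered: {normalized_faithfulness(circuit, full, baseline):.0%}")
    print(f"sparsity:  {edge_sparsity(13, 1000):.1%}")
```

With these made-up inputs the script reports a recovery of 60% at a sparsity of 1.3%, mirroring the shape of the claim in the abstract rather than any measured result.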

Related papers