Fugu-MT paper translation (abstract): Weight-sparse transformers have interpretable circuits

Paper summary: Weight-sparse transformers have interpretable circuits

arxiv url: http://arxiv.org/abs/2511.13653v1
Date: Mon, 17 Nov 2025 18:02:06 GMT
Status: Translation complete
Last updated in system: 2025-11-18 18:52:09.658883
Title: Weight-sparse transformers have interpretable circuits
Title (reference translation): Weight-sparse transformers have interpretable circuits
Authors: Leo Gao, Achyuta Rajaram, Jacob Coxon, Soham V. Govande, Bowen Baker, Dan Mossing
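Abstract summary: We train models to have more understandable circuits by constraining most of their weights to be zero. We recover the fine-grained circuits underlying several hand-crafted tasks. Our work produces circuits that achieve an unprecedented level of human understandability.
Reference score (independently computed attention score): 4.237686583992518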
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Finding human-understandable circuits in language models is a central goal of the field of mechanistic interpretability. We train models to have more understandable circuits by constraining most of their weights to be zeros, so that each neuron only has a few connections. To recover fine-grained circuits underlying each of several hand-crafted tasks, we prune the models to isolate the part responsible for the task. These circuits often contain neurons and residual channels that correspond to natural concepts, with a small number of straightforwardly interpretable connections between them. We study how these models scale and find that making weights sparser trades off capability for interpretability, and scaling model size improves the capability-interpretability frontier. However, scaling sparse models beyond tens of millions of nonzero parameters while preserving interpretability remains a challenge. In addition to training weight-sparse models de novo, we show preliminary results suggesting our method can also be adapted to explain existing dense models. Our work produces circuits that achieve an unprecedented level of human understandability and validates them with considerable rigor.
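Abstract (reference translation): Finding human-understandable circuits in language models is a central goal of mechanistic interpretability. We train models to have more understandable circuits by constraining most of their weights to zero, so that each neuron has only a few connections. To recover the fine-grained circuits underlying several hand-crafted tasks, we prune the models to isolate the part responsible for each task. These circuits often contain neurons and residual channels that correspond to natural concepts, with only a small number of directly interpretable connections between them. We examine how these models scale and find that sparser weights trade capability for interpretability, while larger model size improves the capability-interpretability frontier. However, scaling sparse models beyond tens of millions of nonzero parameters while preserving interpretability remains a challenge. Besides training weight-sparse models from scratch, we present preliminary results suggesting that the method can also be adapted to explain existing dense models. Our work produces circuits that reach an unprecedented level of human understandability and validates them with considerable rigor.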
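
The abstract says only that most weights are constrained to be zero so that each neuron has few connections; it does not say how the constraint is enforced. As a rough illustration, the sketch below assumes one common approach, keeping the k largest-magnitude entries of a weight matrix and zeroing the rest. The function names `sparsify_top_k` and `count_nonzero`, the example matrix, and the choice of k are all hypothetical and not taken from the paper.

```python
def sparsify_top_k(weights, k):
    """Return a copy of `weights` (a list of rows) that keeps only the k
    largest-magnitude entries; every other entry is set to 0.0.

    Assumption: top-k magnitude selection is used here for illustration;
    the abstract does not specify the paper's sparsification procedure.
    """
    # Collect (magnitude, row, col) for every entry.
    entries = [
        (abs(w), i, j)
        for i, row in enumerate(weights)
        for j, w in enumerate(row)
    ]
    # Largest magnitude first; ties broken by position so the result is deterministic.
    entries.sort(key=lambda t: (-t[0], t[1], t[2]))
    keep = {(i, j) for _, i, j in entries[:k]}
    return [
        [w if (i, j) in keep else 0.0 for j, w in enumerate(row)]
        for i, row in enumerate(weights)
    ]


def count_nonzero(weights):
    """Count entries that are not exactly zero."""
    return sum(1 for row in weights for w in row if w != 0.0)


# Hypothetical 2x3 weight matrix; only the two strongest connections survive.
W = [[0.9, -0.05, 0.02],
     [0.01, -1.2, 0.3]]
W_sparse = sparsify_top_k(W, k=2)
```

In a training loop, a projection of this kind would typically be reapplied after each optimizer step so that the nonzero budget is never exceeded; that detail is likewise an assumption rather than something the abstract states.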


Related papers