Fugu-MT paper translation (abstract): Sparsely gated tiny linear experts

Paper overview: Sparsely gated tiny linear experts

arxiv url: http://arxiv.org/abs/2606.07414v1
Date: Fri, 05 Jun 2026 16:06:47 GMT
Status: Translation complete
Last updated in system: 2026-06-08 14:33:29.839029
Title: Sparsely gated tiny linear experts
Authors: Simon Schug
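Abstract summary: Sparsity allows scaling model parameters without proportionally increasing computational cost. The paper shows that shrinking each expert to a single neuron increases sparsity further and can improve both compute efficiency and interpretability. The key to achieving both is removing the nonlinearity typically applied to the experts, which yields a network of sparsely gated linear neurons.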
Reference score (independently computed attention metric): 4.080473990569987
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Sparsity allows scaling model parameters without proportionally increasing computational cost. While mixture of experts (MoE) models are made increasingly sparse, individual experts typically remain large and dense. Here, we demonstrate that further increasing sparsity by shrinking each expert to consist of a single neuron and selecting a tiny fraction of many available neurons can improve compute efficiency and interpretability. Counterintuitively, the key to achieving both is removing the nonlinearity typically applied to the experts, resulting in a network of sparsely gated linear neurons (sgatlin). In an isoflop comparison, we find that replacing all transformer feedforward layers with sgatlin improves perplexity in language models across different compute budgets. At the same time, the sparsity and linearity of the resulting feedforward circuits present new opportunities for model interpretability. In a small-scale case study, we demonstrate that feedforward circuits in sgatlin can be interpreted without having to train additional replacement models. We find that they form semantically structured clusters and are causally implicated in factual recall. Our findings paint a possible path towards compute-efficient and interpretable transformer feedforward layers.
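
The layer described in the abstract (many single-neuron experts, a tiny fraction selected per input, no nonlinearity on the experts) can be illustrated with a minimal NumPy sketch. This is not the author's implementation. The abstract does not specify the gating mechanism, so the linear router, the top-k selection, the softmax over the selected scores, and every name below (sgatlin_layer, w_gate, w_in, w_out, k) are assumptions made for illustration only.

```python
import numpy as np


def sgatlin_layer(x, w_gate, w_in, w_out, k):
    """Hypothetical sparsely gated layer of single-neuron linear experts.

    x:      (d,)   input vector
    w_gate: (n, d) router weights, one row per expert (assumed linear router)
    w_in:   (n, d) input weights, one row per single-neuron expert
    w_out:  (n, d) output weights, one row per single-neuron expert
    k:      number of experts kept active, with 1 <= k <= n
    Returns the output (d,), the selected expert indices (k,), and gates (k,).
    """
    # Naive router: scores all n experts. The source does not say how
    # routing is made cheap, so no attempt is made to do so here.
    scores = w_gate @ x
    # Indices of the k highest-scoring experts (order within top-k is arbitrary).
    top = np.argpartition(scores, -k)[-k:]
    # Assumed gating: softmax restricted to the selected experts.
    gates = np.exp(scores[top] - scores[top].max())
    gates /= gates.sum()
    # Each selected expert is one linear neuron: a dot product, no activation.
    hidden = w_in[top] @ x
    # Gate-weighted sum of the selected experts' output vectors.
    y = (gates * hidden) @ w_out[top]
    return y, top, gates


rng = np.random.default_rng(0)
d, n, k = 8, 64, 4
x = rng.normal(size=d)
w_gate, w_in, w_out = (rng.normal(size=(n, d)) for _ in range(3))
y, top, gates = sgatlin_layer(x, w_gate, w_in, w_out, k)

# With the selection and gates held fixed, the active circuit is a single
# d-by-d linear map, which can be inspected directly.
effective_map = w_out[top].T @ (gates[:, None] * w_in[top])
```

Under these assumptions, only k of the n neurons contribute to the output, and for a fixed selection the active circuit reduces to one matrix (effective_map). That is one way to read the abstract's remark that sparsity and linearity open opportunities for interpretability. The gates themselves still depend on the input, so the layer as a whole is only piecewise linear in this sketch.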

Related papers