Fugu-MT Paper Translation (Abstract): Attention Retrieves, MLP Memorizes: Disentangling Trainable Components in the Transformer

Paper summary: Attention Retrieves, MLP Memorizes: Disentangling Trainable Components in the Transformer

arxiv url: http://arxiv.org/abs/2506.01115v1
Date: Sun, 01 Jun 2025 18:42:39 GMT
Status: Translation complete
Last updated in system: 2025-06-04 21:47:33.93953
Title: Attention Retrieves, MLP Memorizes: Disentangling Trainable Components in the Transformer
Title (reference translation): Attention Retrieves, MLP Memorizes: Separating the Trainable Components of the Transformer
Authors: Yihe Dong, Lorenzo Noci, Mikhail Khodak, Mufan Li
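Abstract summary: The Transformer architecture is central to the success of modern large language models. While the core component of a Transformer is the self-attention mechanism, we question how much, and which aspects, of the performance gains can be attributed to it.
Reference score (independently computed interest level): 19.36946128510059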
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The Transformer architecture is central to the success of modern Large Language Models (LLMs), in part due to its surprising ability to perform a wide range of algorithmic tasks -- including mathematical reasoning, memorization, and retrieval -- using only gradient-based training on next-token prediction. While the core component of a Transformer is the self-attention mechanism, we question how much, and which aspects, of the performance gains can be attributed to it. To this end, we compare standard Transformers to variants in which either the multi-layer perceptron (MLP) layers or the attention projectors (queries and keys) are frozen at initialization. To further isolate the contribution of attention, we introduce MixiT -- the Mixing Transformer -- a simplified, principled model in which the attention coefficients are entirely random and fixed at initialization, eliminating any input-dependent computation or learning in attention. Surprisingly, we find that MixiT matches the performance of fully trained Transformers on various algorithmic tasks, especially those involving basic arithmetic or focusing heavily on memorization. For retrieval-based tasks, we observe that having input-dependent attention coefficients is consistently beneficial, while MixiT underperforms. We attribute this failure to its inability to form specialized circuits such as induction heads -- a specific circuit known to be crucial for learning and exploiting repeating patterns in input sequences. Even more interestingly, we find that attention with frozen key and query projectors is not only able to form induction heads, but can also perform competitively on language modeling. Our results underscore the importance of architectural heterogeneity, where distinct components contribute complementary inductive biases crucial for solving different classes of tasks.
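Abstract (reference translation): The Transformer architecture is central to the success of recent Large Language Models (LLMs), in part because of its surprising ability to perform a wide range of algorithmic tasks, including mathematical reasoning, memorization, and retrieval. While the core component of the Transformer is the self-attention mechanism, we ask how much, and which aspects, of the performance gains can be attributed to it. To this end, we compare standard Transformers with variants in which either the multi-layer perceptron (MLP) layers or the attention projectors (queries and keys) are frozen at initialization. To further isolate the contribution of attention, we introduce MixiT (the Mixing Transformer), a simplified, principled model in which the attention coefficients are entirely random and fixed at initialization, so that attention involves no input-dependent computation or learning. Surprisingly, MixiT matches the performance of fully trained Transformers on various algorithmic tasks, especially those involving basic arithmetic or relying heavily on memorization. On retrieval-based tasks, input-dependent attention coefficients are consistently beneficial, and MixiT underperforms. We attribute this failure to its inability to form specialized circuits such as induction heads, a specific circuit known to be crucial for learning and exploiting repeating patterns in input sequences. Even more interestingly, attention with frozen key and query projectors can not only form induction heads but also perform competitively on language modeling. Our results underscore the importance of architectural heterogeneity, in which distinct components contribute complementary inductive biases that are crucial for solving different classes of tasks.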
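
To make the contrast in the abstract concrete, the following is a minimal pure-Python sketch, not the authors' implementation. It compares a fixed random mixing matrix, in the spirit of MixiT's "entirely random and fixed" attention coefficients, with ordinary input-dependent dot-product attention weights. The function names, the Gaussian sampling, the softmax normalization, and the causal mask are all assumptions made for illustration; the abstract does not specify how MixiT samples or normalizes its coefficients.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def random_causal_mixing(seq_len, seed=0):
    """Hypothetical MixiT-style coefficients: sampled once, never trained,
    and independent of the input. Softmax + causal mask are assumptions."""
    rng = random.Random(seed)
    rows = []
    for i in range(seq_len):
        scores = [rng.gauss(0.0, 1.0) for _ in range(i + 1)]
        # Position i only mixes positions 0..i; future positions get weight 0.
        rows.append(softmax(scores) + [0.0] * (seq_len - i - 1))
    return rows


def dot_product_attention_weights(queries, keys):
    """Standard causal attention weights: they change whenever the input does."""
    rows = []
    for i, q in enumerate(queries):
        scores = [sum(a * b for a, b in zip(q, keys[j])) for j in range(i + 1)]
        rows.append(softmax(scores) + [0.0] * (len(keys) - i - 1))
    return rows


def mix(weights, values):
    """Apply a (seq_len x seq_len) weight matrix to a list of value vectors."""
    dim = len(values[0])
    return [
        [sum(w * v[d] for w, v in zip(row, values)) for d in range(dim)]
        for row in weights
    ]


if __name__ == "__main__":
    x1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
    x2 = [[2.0, 0.0], [0.0, -1.0], [3.0, 1.0], [0.5, 0.5]]
    fixed = random_causal_mixing(len(x1), seed=0)
    # The same fixed matrix is reused for every input sequence.
    print(mix(fixed, x1))
    print(mix(fixed, x2))
    # Input-dependent weights differ between the two sequences.
    print(dot_product_attention_weights(x1, x1) != dot_product_attention_weights(x2, x2))
```

In this sketch the fixed matrix can average token representations but cannot decide, based on content, which earlier token to attend to. This is one plausible reading of why the abstract reports that MixiT underperforms on retrieval-style tasks while remaining competitive on memorization-heavy ones.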
