Fugu-MT Paper Translation (Abstract): Learning to Recall with Transformers Beyond Orthogonal Embeddings

Paper abstract: Learning to Recall with Transformers Beyond Orthogonal Embeddings

arxiv url: http://arxiv.org/abs/2603.15923v1
Date: Mon, 16 Mar 2026 21:17:01 GMT
Status: Translation complete
Last updated in system: 2026-03-18 17:42:06.993475
Title: Learning to Recall with Transformers Beyond Orthogonal Embeddings
Title (reference translation): Learning to Recall with Transformers Beyond Orthogonal Embeddings
Authors: Nuri Mert Vural, Alberto Bietti, Mahdi Soltanolkotabi, Denny Wu,
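Abstract summary: We analyze a transformer with random embeddings, trained by gradient descent, on a simple token-retrieval task. Our analysis tracks the "early phase" of gradient descent and derives explicit formulas for the model's storage capacity.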
Reference score (independently computed attention score): 42.18876773867171
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Modern large language models (LLMs) excel at tasks that require storing and retrieving knowledge, such as factual recall and question answering. Transformers are central to this capability because they can encode information during training and retrieve it at inference. Existing theoretical analyses typically study transformers under idealized assumptions such as infinite data or orthogonal embeddings. In realistic settings, however, models are trained on finite datasets with non-orthogonal (random) embeddings. We address this gap by analyzing a single-layer transformer with random embeddings trained with (empirical) gradient descent on a simple token-retrieval task, where the model must identify an informative token within a length-$L$ sequence and learn a one-to-one mapping from tokens to labels. Our analysis tracks the ``early phase'' of gradient descent and yields explicit formulas for the model's storage capacity -- revealing a multiplicative dependence between sample size $N$, embedding dimension $d$, and sequence length $L$. We validate these scalings numerically and further complement them with a lower bound for the underlying statistical problem, demonstrating that this multiplicative scaling is intrinsic under non-orthogonal embeddings.
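Abstract (reference translation): Modern large language models (LLMs) excel at tasks that require storing and retrieving knowledge, such as factual recall and question answering. Transformers are central to this capability because they can encode information during training and retrieve it at inference time. Existing theoretical analyses typically study transformers under idealized assumptions such as infinite data or orthogonal embeddings. In realistic settings, however, models are trained on finite datasets with non-orthogonal (random) embeddings. To address this gap, we analyze a single-layer transformer with random embeddings, trained with (empirical) gradient descent on a simple token-retrieval task in which the model must identify an informative token within a length-$L$ sequence and learn a one-to-one mapping from tokens to labels. Our analysis tracks the "early phase" of gradient descent and yields explicit formulas for the model's storage capacity, revealing a multiplicative dependence between sample size $N$, embedding dimension $d$, and sequence length $L$. We validate these scalings numerically and complement them with a lower bound for the underlying statistical problem, showing that this multiplicative scaling is intrinsic under non-orthogonal embeddings.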

Related papers