Fugu-MT Paper Translation (Abstract): Memory Grafting: Scaling Language Model Pre-training via Offline Conditional Memory

Paper overview: Memory Grafting: Scaling Language Model Pre-training via Offline Conditional Memory

arxiv url: http://arxiv.org/abs/2605.20948v1
Date: Wed, 20 May 2026 09:35:04 GMT
Status: Translation complete
Last updated in system: 2026-05-21 19:19:56.600947
Title: Memory Grafting: Scaling Language Model Pre-training via Offline Conditional Memory
Title (reference translation): Memory Grafting: Pre-training Language Models via Offline Conditional Memory
Authors: Runxi Cheng, Yuchen Guan, Yongxian Wei, Qianpu Sun, Qixiu Li, Sinan Du, Feng Xiong, Chun Yuan, Yan Lu, Yeyun Gong,
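Abstract summary: Scaling conditional memory is a promising way to increase language-model capacity. Existing methods such as Engram learn large memory tables from scratch during pre-training. This work proposes Memory Grafting, a conditional memory scaling method that uses frozen hidden states from a grafting model as conditional n-gram memory.
Reference score (independently computed attention metric): 65.39827296429527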
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Scaling conditional memory offers a promising way to increase language-model capacity, but existing methods such as Engram learn large memory tables from scratch during pre-training, making memory scaling expensive and sometimes ineffective. We propose Memory Grafting, a conditional memory scaling method that utilizes frozen hidden states from a grafting model as conditional n-gram memory. Given frequent local n-grams, we run the grafting model offline, store final-token hidden representations as memory values, and let the recipient model retrieve them through exact longest-match suffix lookup. Retrieved memories are adapted by lightweight projections and gates, while a hash-based Engram fallback preserves coverage for unmatched contexts. Since the grafting model is only run offline and exact lookup has expected O(1) complexity with respect to memory-bank size, Memory Grafting expands external latent capacity with limited training and inference overhead. Experiments under matched recipient architectures and pre-training budgets show that Memory Grafting improves over both MoE and vanilla Engram baselines. In the 2.8B-scale setting, it improves the average benchmark score from 51.95 for MoE and 52.43 for vanilla Engram to 53.86. In the 0.92B-scale setting, all grafting-model variants improve over the baselines, with Qwen3.5-35B-A3B giving the strongest gains. These results suggest that pretrained models can serve as reusable constructors of external latent memory, providing a practical step toward scaling future language models beyond trainable parameters alone.
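Abstract (reference translation): Scaling conditional memory is a promising way to increase language-model capacity, but existing methods such as Engram learn large memory tables from scratch during pre-training, which makes memory scaling expensive and sometimes ineffective. This work proposes Memory Grafting, a conditional memory scaling method that uses frozen hidden states from a grafting model as conditional n-gram memory. Given frequent local n-grams, the grafting model is run offline, the final-token hidden representations are stored as memory values, and the recipient model retrieves them through exact longest-match suffix lookup. The retrieved memories are adapted by lightweight projections and gates, and a hash-based Engram fallback preserves coverage for unmatched contexts. Because the grafting model runs only offline and exact lookup has expected O(1) complexity with respect to memory-bank size, Memory Grafting expands external latent capacity with limited training and inference overhead. In experiments with matched recipient architectures and pre-training budgets, Memory Grafting improves over both the MoE and vanilla Engram baselines. At the 2.8B scale, the average benchmark score rises from 51.95 for MoE and 52.43 for vanilla Engram to 53.86. At the 0.92B scale, all grafting-model variants improve over the baselines, with Qwen3.5-35B-A3B giving the strongest gains. These results suggest that pretrained models can serve as reusable constructors of external latent memory, a practical step toward scaling future language models beyond trainable parameters alone.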
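The retrieval path described in the abstract (exact longest-match suffix lookup, a hash-based fallback for unmatched contexts, then a projection and gate) can be illustrated with a small sketch. This is not the authors' implementation: the names `longest_suffix_match`, `hashed_fallback`, `retrieve`, and `graft` are hypothetical, the memory bank is a plain dictionary standing in for values precomputed offline by the grafting model, and the scalar sigmoid gate with a residual add is an assumed form, since the abstract only says "lightweight projections and gates".

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple

Vector = List[float]
NGram = Tuple[int, ...]


def longest_suffix_match(
    context: Sequence[int], bank: Dict[NGram, Vector], max_n: int
) -> Optional[NGram]:
    """Return the longest suffix of `context` (at most max_n tokens) found in `bank`."""
    for n in range(min(max_n, len(context)), 0, -1):
        key = tuple(context[-n:])
        if key in bank:  # dict lookup: expected O(1) in the size of the bank
            return key
    return None


def hashed_fallback(context: Sequence[int], table: List[Vector], n: int) -> Vector:
    """Hash the last n tokens into a fixed-size table (stand-in for the Engram fallback)."""
    key = tuple(context[-n:])
    return table[hash(key) % len(table)]


def retrieve(
    context: Sequence[int],
    bank: Dict[NGram, Vector],
    table: List[Vector],
    max_n: int = 3,
) -> Tuple[Vector, bool]:
    """Return (memory vector, matched flag); fall back to hashing when no suffix matches."""
    key = longest_suffix_match(context, bank, max_n)
    if key is not None:
        return bank[key], True
    return hashed_fallback(context, table, n=2), False


def graft(
    hidden: Vector, memory: Vector, proj: List[Vector], gate_w: Vector
) -> Vector:
    """Project the memory to the recipient width and add it through a scalar gate.

    Assumed form: gate = sigmoid(gate_w . hidden); output = hidden + gate * (proj @ memory).
    """
    adapted = [sum(w * m for w, m in zip(row, memory)) for row in proj]
    logit = sum(w * h for w, h in zip(gate_w, hidden))
    gate = 1.0 / (1.0 + math.exp(-logit))
    return [h + gate * a for h, a in zip(hidden, adapted)]


if __name__ == "__main__":
    # Toy bank: in the paper's setting these values would be frozen hidden states
    # computed offline by the grafting model for frequent n-grams.
    bank = {(6,): [0.0, 0.0, 1.0], (5, 6): [1.0, 0.0, 0.0], (4, 5, 6): [0.0, 1.0, 0.0]}
    table = [[0.1, 0.1, 0.1], [0.2, 0.2, 0.2]]
    memory, matched = retrieve([1, 4, 5, 6], bank, table)
    proj = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # maps 3-dim memory to 2-dim hidden
    print(matched, graft([0.0, 0.0], memory, proj, gate_w=[0.0, 0.0]))
```

Trying suffixes from longest to shortest keeps the cost bounded by `max_n` dictionary lookups per position, independent of how many n-grams the bank holds, which is the property the abstract's O(1) claim relies on.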

Related papers