Fugu-MT Paper Translation (Abstract): Diagonal Batching Unlocks Parallelism in Recurrent Memory Transformers for Long Contexts

Paper overview: Diagonal Batching Unlocks Parallelism in Recurrent Memory Transformers for Long Contexts

arxiv url: http://arxiv.org/abs/2506.05229v1
Date: Thu, 05 Jun 2025 16:43:48 GMT
Status: Translation complete
Last updated in system: 2025-06-06 21:53:49.829507
Title: Diagonal Batching Unlocks Parallelism in Recurrent Memory Transformers for Long Contexts
Title (reference translation): Parallelization via Diagonal Batching in Recurrent Memory Transformers for Long Contexts
Authors: Danil Sivtsov, Ivan Rodkin, Gleb Kuzmin, Yuri Kuratov, Ivan Oseledets
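Abstract summary: Transformer models struggle with long-context inference because of their quadratic time and linear memory complexity. Recurrent Memory Transformers (RMTs) offer a solution by reducing the cost to linear time and constant memory usage. However, their memory update mechanism forces sequential execution, which creates a performance bottleneck. The paper introduces Diagonal Batching, a scheduling scheme that unlocks parallelism across segments in RMTs while preserving exact recurrence.
Reference score (independently computed attention metric): 5.585952216289788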
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Transformer models struggle with long-context inference due to their quadratic time and linear memory complexity. Recurrent Memory Transformers (RMTs) offer a solution by reducing the asymptotic cost to linear time and constant memory usage. However, their memory update mechanism leads to sequential execution, causing a performance bottleneck. We introduce Diagonal Batching, a scheduling scheme that unlocks parallelism across segments in RMTs while preserving exact recurrence. This approach eliminates the sequential constraint, enabling efficient GPU inference even for single long-context inputs without complex batching and pipelining techniques. Because the technique is purely a run-time computation reordering, existing RMT models adopt it with no retraining. Applied to a LLaMA-1B ARMT model, Diagonal Batching yields a 3.3x speedup over standard full-attention LLaMA-1B and a 1.8x speedup over the sequential RMT implementation on 131,072-token sequences. By removing the sequential bottleneck, Diagonal Batching reduces inference cost and latency, thereby strengthening RMTs as a practical solution for real-world, long-context applications.
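Abstract (reference translation): Transformer models struggle with long-context inference because of their quadratic time and linear memory complexity. Recurrent Memory Transformers (RMTs) offer a solution by reducing the asymptotic cost to linear time and constant memory usage. However, the memory update mechanism leads to sequential execution, which causes a performance bottleneck. The paper introduces Diagonal Batching, a scheduling scheme that unlocks parallelism across segments in RMTs while preserving exact recurrence. This approach removes the sequential constraint and enables efficient GPU inference even for a single long-context input, without complex batching or pipelining techniques. Because the technique is purely a run-time reordering of computation, existing RMT models can adopt it without retraining. Applied to a LLaMA-1B ARMT model, Diagonal Batching yields a 3.3x speedup over standard full-attention LLaMA-1B and a 1.8x speedup over the sequential RMT implementation on 131,072-token sequences. By removing the sequential bottleneck, Diagonal Batching reduces inference cost and latency, strengthening RMTs as a practical solution for real-world, long-context applications.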
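
The abstract describes Diagonal Batching only as a run-time reordering that keeps the recurrence exact. The following is a minimal sketch of what such a schedule could look like, not the authors' implementation. It assumes a dependency pattern that the abstract does not spell out: the computation for segment s at layer l needs only segment s at layer l-1 and segment s-1 at layer l. The function names `diagonal_schedule` and `check_dependencies` are hypothetical. Under that assumption, all cells on the same anti-diagonal (s + l constant) are independent of each other and could be batched together.

```python
from typing import List, Tuple

Cell = Tuple[int, int]  # (segment index, layer index)


def diagonal_schedule(num_segments: int, num_layers: int) -> List[List[Cell]]:
    """Group (segment, layer) cells into anti-diagonals.

    Assumed dependency pattern (not stated in the abstract):
    cell (s, l) needs (s, l - 1) and (s - 1, l).
    """
    diagonals: List[List[Cell]] = []
    for d in range(num_segments + num_layers - 1):
        first_segment = max(0, d - num_layers + 1)
        last_segment = min(num_segments, d + 1)  # exclusive bound
        diagonals.append([(s, d - s) for s in range(first_segment, last_segment)])
    return diagonals


def check_dependencies(diagonals: List[List[Cell]]) -> bool:
    """Return True if every cell's assumed inputs were computed in an earlier step."""
    done = set()
    for cells in diagonals:
        for s, l in cells:
            if l > 0 and (s, l - 1) not in done:
                return False
            if s > 0 and (s - 1, l) not in done:
                return False
        # Cells of one diagonal run together, so mark them done only afterwards.
        done.update(cells)
    return True


schedule = diagonal_schedule(num_segments=4, num_layers=3)
for step, cells in enumerate(schedule):
    print(step, cells)
```

With 4 segments and 3 layers, the sketch produces 6 grouped steps instead of 12 one-at-a-time cell computations. In general the number of steps drops from segments × layers to segments + layers - 1, which is where the parallelism would come from under the stated assumption.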

Related papers