Fugu-MT Paper Translation (Abstract): Efficient Long-context Language Model Training by Core Attention Disaggregation

Paper summary: Efficient Long-context Language Model Training by Core Attention Disaggregation

arxiv url: http://arxiv.org/abs/2510.18121v1
Date: Mon, 20 Oct 2025 21:40:51 GMT
Status: Translation complete
Last updated in system: 2025-10-25 03:08:12.640037
Title: Efficient Long-context Language Model Training by Core Attention Disaggregation
Authors: Yonghao Zhuang, Junda Chen, Bo Pang, Yi Gu, Yibo Zhu, Yimin Jiang, Ion Stoica, Eric Xing, Hao Zhang
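Abstract summary: This paper proposes a technique that improves long-context large language model training by decoupling the core attention computation, softmax(QK^T)V, from the rest of the model. The authors implement CAD in a system called DistCA, which uses a ping-pong execution scheme to overlap communication with computation and in-place execution on attention servers to reduce memory use.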
Reference score (attention metric computed by this site): 40.14172357304901
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We present core attention disaggregation (CAD), a technique that improves long-context large language model training by decoupling the core attention computation, softmax(QK^T)V, from the rest of the model and executing it on a separate pool of devices. In existing systems, core attention is colocated with other layers; at long context lengths, its quadratic compute growth compared to the near-linear growth of other components causes load imbalance and stragglers across data and pipeline parallel groups. CAD is enabled by two observations. First, core attention is stateless: it has no trainable parameters and only minimal transient data, so balancing reduces to scheduling compute-bound tasks. Second, it is composable: modern attention kernels retain high efficiency when processing fused batches of token-level shards with arbitrary lengths. CAD partitions core attention into token-level tasks and dispatches them to dedicated attention servers, which dynamically rebatch tasks to equalize compute without sacrificing kernel efficiency. We implement CAD in a system called DistCA, which uses a ping-pong execution scheme to fully overlap communication with computation and in-place execution on attention servers to reduce memory use. On 512 H200 GPUs and context lengths up to 512k tokens, DistCA improves end-to-end training throughput by up to 1.35x, eliminates data and pipeline parallel stragglers, and achieves near-perfect compute and memory balance.
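The abstract's core idea (split core attention into token-level tasks, then spread them over attention servers so compute is equal) can be illustrated with a small sketch. This is not DistCA's scheduler, which the abstract does not describe in detail. It assumes a causal mask, counts query-key pairs as a stand-in for compute cost, and uses a generic greedy longest-task-first heuristic. The names `AttentionTask`, `shard_document`, `balance`, and `server_loads` are hypothetical and exist only for this illustration.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class AttentionTask:
    """One token-level shard of a document's core attention (hypothetical type)."""
    doc_id: int
    q_start: int  # first query token in the shard (inclusive)
    q_end: int    # end of the shard (exclusive)

    @property
    def cost(self) -> int:
        # Assumption: causal attention, so query token i attends to i + 1 keys.
        # Summing i + 1 over i in [q_start, q_end) counts the query-key pairs.
        n = self.q_end - self.q_start
        return n * self.q_start + n * (n + 1) // 2


def shard_document(doc_id: int, length: int, shard_len: int) -> list[AttentionTask]:
    """Split one document's queries into contiguous shards of at most shard_len tokens."""
    return [
        AttentionTask(doc_id, start, min(start + shard_len, length))
        for start in range(0, length, shard_len)
    ]


def balance(tasks: list[AttentionTask], num_servers: int) -> list[list[AttentionTask]]:
    """Greedy heuristic: give the next-largest task to the least-loaded server."""
    heap = [(0, server) for server in range(num_servers)]  # (current load, server id)
    assignment: list[list[AttentionTask]] = [[] for _ in range(num_servers)]
    for task in sorted(tasks, key=lambda t: t.cost, reverse=True):
        load, server = heapq.heappop(heap)
        assignment[server].append(task)
        heapq.heappush(heap, (load + task.cost, server))
    return assignment


def server_loads(assignment: list[list[AttentionTask]]) -> list[int]:
    """Total query-key pairs assigned to each server."""
    return [sum(task.cost for task in tasks) for tasks in assignment]


if __name__ == "__main__":
    doc_lengths = [8192, 1024, 1024, 512]  # illustrative lengths, not from the paper
    # Colocated baseline: each document's attention stays on its own device.
    print("colocated:", [n * (n + 1) // 2 for n in doc_lengths])
    tasks = [t for d, n in enumerate(doc_lengths) for t in shard_document(d, n, 256)]
    print("balanced: ", server_loads(balance(tasks, num_servers=4)))
```

Because the cost of a document grows quadratically with its length, the colocated baseline is dominated by the single long document, while the sharded assignment keeps every server close to the mean load. The sketch relies on the two properties the abstract names: tasks carry no parameters (stateless), and shards of different documents can share one server's batch (composable).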


Related papers