Fugu-MT paper translation (abstract): Out of the Memory Barrier: A Highly Memory Efficient Training System for LLMs with Million-Token Contexts

Paper abstract: Out of the Memory Barrier: A Highly Memory Efficient Training System for LLMs with Million-Token Contexts

arxiv url: http://arxiv.org/abs/2602.02108v1
Date: Mon, 02 Feb 2026 13:52:40 GMT
Status: Translation complete
Last updated in system: 2026-02-03 19:28:34.18197
Title: Out of the Memory Barrier: A Highly Memory Efficient Training System for LLMs with Million-Token Contexts
Authors: Wenhao Li, Daohai Yu, Gen Luo, Yuxin Zhang, Fei Chao, Rongrong Ji, Yifan Wu, Jiaxin Liu, Ziyang Gong, Zimu Liao,
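Abstract summary: Training Large Language Models (LLMs) on long contexts is severely constrained by prohibitive GPU memory overhead, not training time. We introduce OOMB, a memory-efficient training system that confronts this barrier. The approach uses a chunk-recurrent training framework with on-the-fly activation recomputation to maintain a constant activation memory footprint.
Reference score (independently computed attention measure): 68.79341332280062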
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Training Large Language Models (LLMs) on long contexts is severely constrained by prohibitive GPU memory overhead, not training time. The primary culprits are the activations, whose memory footprints scale linearly with sequence length. We introduce OOMB, a highly memory-efficient training system that directly confronts this barrier. Our approach employs a chunk-recurrent training framework with on-the-fly activation recomputation, which maintains a constant activation memory footprint (O(1)) and shifts the primary bottleneck to the growing KV cache. To manage the KV cache, OOMB integrates a suite of synergistic optimizations: a paged memory manager for both the KV cache and its gradients to eliminate fragmentation, asynchronous CPU offloading to hide data transfer latency, and page-level sparse attention to reduce both computational complexity and communication overhead. The synergy of these techniques yields exceptional efficiency. Our empirical results show that for every additional 10K tokens of context, the end-to-end training memory overhead increases by a mere 10MB for Qwen2.5-7B. This allows training Qwen2.5-7B with a 4M-token context on a single H200 GPU, a feat that would otherwise require a large cluster using context parallelism. This work represents a substantial advance in resource efficiency for long-context LLM training. The source code is available at https://github.com/wenhaoli-xmu/OOMB.
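As a rough check on the scale implied by the abstract, the quoted marginal rate (about 10MB of additional end-to-end training memory per additional 10K tokens for Qwen2.5-7B) can be extrapolated linearly to a 4M-token context. The sketch below is illustrative only: the function name and parameters are hypothetical, it assumes the rate stays constant over the whole range and that 10K and 4M mean 10,000 and 4,000,000 tokens, and it covers only the context-dependent growth, not the baseline memory for weights, gradients, or optimizer state.

```python
def extra_memory_mb(context_tokens: int,
                    mb_per_step: float = 10.0,
                    tokens_per_step: int = 10_000) -> float:
    """Linearly extrapolate the marginal memory overhead quoted in the abstract.

    Assumption: overhead grows by `mb_per_step` MB for every `tokens_per_step`
    additional tokens, uniformly across the whole context range.
    """
    return context_tokens / tokens_per_step * mb_per_step


if __name__ == "__main__":
    # Print the extrapolated context-dependent overhead for a few lengths.
    for tokens in (10_000, 1_000_000, 4_000_000):
        mb = extra_memory_mb(tokens)
        print(f"{tokens:>9,} tokens -> ~{mb:,.0f} MB ({mb / 1000:.2f} GB)")
```

Under these assumptions, a 4M-token context corresponds to 400 increments of 10K tokens, or roughly 4 GB of context-dependent overhead, which is consistent with the abstract's claim that such a run fits on a single H200 GPU.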

Related papers