Fugu-MT Paper Translation (Abstract): FORGE: Fused On-Register Gradient Elimination for Memory-Efficient LLM Training

Paper overview: FORGE: Fused On-Register Gradient Elimination for Memory-Efficient LLM Training

arxiv url: http://arxiv.org/abs/2606.22932v1
Date: Mon, 22 Jun 2026 07:08:31 GMT
Status: Translation complete
Last updated in system: 2026-06-25 03:27:38.327478
Title: FORGE: Fused On-Register Gradient Elimination for Memory-Efficient LLM Training
Authors: Dikshant Kukreja, Kritarth Prasad, Avinash Anand, Zhengkui Wang, Erik Cambria, Timothy Liu, Aik Beng Ng, Simon See, Bapi Chatterjee
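Abstract summary: FORGE folds the optimizer step into the backward pass and applies it one tile at a time, entirely in registers, so each gradient tile is consumed the instant it is produced and never becomes a tensor. In full precision the fused step is exact for every element-wise update rule, and that exactness survives tensor- and sequence-parallel sharding.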
Reference score (independently computed attention score): 43.28268986528884
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reverse-mode differentiation computes every weight gradient, writes it to memory, and only then lets the optimizer read it back. This two-phase schedule sets the memory ceiling of modern training: at the seam between the phases, every layer's gradient is live at once. We argue that this materialized gradient is an artifact of how differentiation is staged, not a quantity that learning requires -- and we eliminate it. FORGE folds the optimizer step into the backward pass and applies it one tile at a time, entirely in registers, so each gradient tile is consumed the instant it is produced and never becomes a tensor. The fusion changes only when the update happens, not what it computes: in full precision the fused step is provably exact -- the identical optimizer update, for every element-wise rule -- and that exactness survives tensor- and sequence-parallel sharding; in the bf16 and 8-bit regimes used in practice it is faithful rather than bit-identical, its deviation bounded and, for the weight store, rendered unbiased by stochastic rounding. Because each gradient tile is born and consumed in the same registers, it is never converted down to bf16 to be stored and read back; FORGE thus preserves the full-precision fidelity that both bf16 and 8-bit optimizers lose to that conversion. Nor is the method tied to one architecture or one optimizer: linear layers are ubiquitous, and FORGE reclaims the gradient memory of any of them under any element-wise rule. Empirically FORGE more than halves the memory of an optimizer step and, at the small batch sizes typical of fine-tuning and continued pretraining, runs about 1.5x faster; integrated into tensor-parallel Megatron-LM it fits 8B training at four times the micro-batch a standard optimizer allows on the same GPUs.
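The scheduling claim in the abstract (the fusion changes when the update happens, not what it computes) can be illustrated with a small sketch. This is not the authors' kernel: it is plain Python standing in for a GPU implementation, the tile buffer is an ordinary dictionary standing in for registers, and every name (`grad_element`, `adam_element`, `two_phase_step`, `fused_step`, `compare`) is hypothetical. It assumes a single linear layer y = x W with Adam as the element-wise rule, takes `x` and `dy` as given inputs, and ignores how the update is ordered against the input-gradient computation.

```python
import math
import random


def grad_element(x, dy, i, j):
    """dL/dW[i][j] for y = x @ W: sum over the batch of x[b][i] * dy[b][j]."""
    return sum(x[b][i] * dy[b][j] for b in range(len(x)))


def adam_element(w, m, v, g, step, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One element-wise Adam update; returns the new (w, m, v)."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** step)
    v_hat = v / (1 - b2 ** step)
    return w - lr * m_hat / (math.sqrt(v_hat) + eps), m, v


def two_phase_step(W, M, V, x, dy, step):
    """Baseline: materialize the full gradient, then run the optimizer."""
    rows, cols = len(W), len(W[0])
    G = [[grad_element(x, dy, i, j) for j in range(cols)] for i in range(rows)]
    for i in range(rows):
        for j in range(cols):
            W[i][j], M[i][j], V[i][j] = adam_element(
                W[i][j], M[i][j], V[i][j], G[i][j], step)
    return rows * cols  # gradient values live at the same time


def fused_step(W, M, V, x, dy, step, tile=2):
    """Fused: produce one gradient tile, apply the update, discard the tile."""
    rows, cols = len(W), len(W[0])
    peak = 0
    for i0 in range(0, rows, tile):
        for j0 in range(0, cols, tile):
            # Small local buffer standing in for on-register storage.
            g_tile = {
                (i, j): grad_element(x, dy, i, j)
                for i in range(i0, min(i0 + tile, rows))
                for j in range(j0, min(j0 + tile, cols))
            }
            peak = max(peak, len(g_tile))
            for (i, j), g in g_tile.items():
                W[i][j], M[i][j], V[i][j] = adam_element(
                    W[i][j], M[i][j], V[i][j], g, step)
    return peak  # gradient values live at the same time


def compare(seed=0, batch=3, rows=4, cols=6, steps=3):
    """Run both schedules on identical data; return weights and peak counts."""
    rng = random.Random(seed)
    W0 = [[rng.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

    def zeros():
        return [[0.0] * cols for _ in range(rows)]

    W_ref, M_ref, V_ref = [r[:] for r in W0], zeros(), zeros()
    W_fus, M_fus, V_fus = [r[:] for r in W0], zeros(), zeros()
    peak_ref = peak_fus = 0
    for step in range(1, steps + 1):
        x = [[rng.gauss(0, 1) for _ in range(rows)] for _ in range(batch)]
        dy = [[rng.gauss(0, 1) for _ in range(cols)] for _ in range(batch)]
        peak_ref = two_phase_step(W_ref, M_ref, V_ref, x, dy, step)
        peak_fus = fused_step(W_fus, M_fus, V_fus, x, dy, step)
    return W_ref, W_fus, peak_ref, peak_fus


W_ref, W_fus, peak_ref, peak_fus = compare()
print(W_ref == W_fus, peak_ref, peak_fus)  # prints: True 24 4
```

Both schedules perform the same floating-point operations per element in the same order, so the weights match exactly, while the fused schedule holds only one 2x2 tile of gradient values at a time instead of all 24. The sketch does not model the bf16 or 8-bit regimes, where the abstract describes the result as faithful rather than bit-identical.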
