Fugu-MT Paper Translation (Abstract): Backward-Friendly Optimization: Training Large Language Models with Approximate Gradients under Memory Constraints

Paper overview: Backward-Friendly Optimization: Training Large Language Models with Approximate Gradients under Memory Constraints

arxiv url: http://arxiv.org/abs/2510.22467v1
Date: Sun, 26 Oct 2025 00:50:12 GMT
Status: Translation complete
Last updated in system: 2025-10-28 15:28:15.212466
Title: Backward-Friendly Optimization: Training Large Language Models with Approximate Gradients under Memory Constraints
Title (reference translation): Backward-Friendly Optimization: Training Large Language Models with Approximate Gradients under Memory Constraints
Authors: Jing Yang, Kaitong Cai, Yijia Fan, Yufeng Yang, Keze Wang
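Abstract summary: Full fine-tuning of Large Language Models (LLMs) is notoriously memory-intensive. GradLite is a backward-friendly solution that relaxes the requirement of exact gradients. The authors show that GradLite maintains unbiased estimates with bounded variance, ensuring convergence rates comparable to Adam.
Reference score (independently computed attention measure): 14.20716202034732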
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Full fine-tuning of Large Language Models (LLMs) is notoriously memory-intensive, primarily because conventional optimizers such as SGD or Adam assume access to exact gradients derived from cached activations. Existing solutions either alter the model architecture (e.g., reversible networks) or trade memory for computation (e.g., activation checkpointing), but the optimizer itself remains untouched. In this work, we introduce GradLite, a backward-friendly optimizer that relaxes the requirement of exact gradients, enabling efficient training even when intermediate activations are aggressively discarded or approximated. GradLite leverages two key techniques: (i) low-rank Jacobian approximation, which reduces the dimensionality of backpropagated error signals, and (ii) error-feedback correction, which accumulates and compensates approximation errors across iterations to preserve convergence guarantees. We provide a theoretical analysis showing that GradLite maintains unbiased gradient estimates with bounded variance, ensuring convergence rates comparable to Adam. Empirically, GradLite reduces optimizer-state and activation memory consumption by up to 50% without architectural changes, and achieves on-par or superior downstream performance on reasoning (MMLU, GSM8K), multilingual, and dialogue benchmarks compared to checkpointing and optimizer-centric baselines (LoMo, GaLore).
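Abstract (reference translation): Full fine-tuning of Large Language Models (LLMs) is known to be memory-intensive, mainly because conventional optimizers such as SGD and Adam assume access to exact gradients derived from cached activations. Existing solutions either change the model architecture (e.g., reversible networks) or trade memory for computation (e.g., activation checkpointing), but leave the optimizer itself unchanged. This work introduces GradLite, a backward-friendly optimizer that relaxes the requirement of exact gradients and enables efficient training even when intermediate activations are aggressively discarded or approximated. GradLite uses two key techniques: (i) low-rank Jacobian approximation, which reduces the dimensionality of backpropagated error signals, and (ii) error-feedback correction, which accumulates and compensates approximation errors across iterations to preserve convergence guarantees. A theoretical analysis shows that GradLite maintains unbiased gradient estimates with bounded variance, ensuring convergence rates comparable to Adam. Empirically, GradLite reduces optimizer-state and activation memory consumption by up to 50% without architectural changes, and achieves on-par or better downstream performance on reasoning (MMLU, GSM8K), multilingual, and dialogue benchmarks compared to checkpointing and optimizer-centric baselines (LoMo, GaLore).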
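
The abstract names error-feedback correction on top of a low-rank approximation but gives no algorithmic detail, so the following is only a minimal sketch of the generic idea, not GradLite itself. It assumes a toy quadratic loss, plain gradient descent instead of an Adam-style update, and a truncated SVD applied to the weight gradient rather than to a Jacobian. Truncated SVD is a biased compressor, so the sketch does not reproduce the unbiasedness property claimed in the abstract. The names `low_rank`, `ef_step`, and `run_demo` are hypothetical.

```python
import numpy as np


def low_rank(M, rank):
    """Best rank-`rank` approximation of M (truncated SVD)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank]


def ef_step(W, E, G, rank, lr):
    """One hypothetical error-feedback update (an assumption, not the paper's rule).

    The residual E from earlier steps is added back to the exact gradient G
    before compression; whatever the rank-`rank` approximation drops is
    carried forward in the new residual instead of being lost.
    """
    corrected = G + E
    G_hat = low_rank(corrected, rank)
    return W - lr * G_hat, corrected - G_hat, G_hat


def run_demo(shape=(12, 8), rank=4, lr=0.1, steps=500, seed=0):
    """Minimise the toy loss 0.5 * ||W - W_true||_F^2, whose gradient is W - W_true."""
    rng = np.random.default_rng(seed)
    W_true = rng.standard_normal(shape)
    W = np.zeros(shape)
    E = np.zeros(shape)            # accumulated approximation error
    sum_exact = np.zeros(shape)    # running sum of exact gradients
    sum_applied = np.zeros(shape)  # running sum of compressed gradients actually applied
    initial_loss = 0.5 * np.sum((W - W_true) ** 2)
    for _ in range(steps):
        G = W - W_true
        W, E, G_hat = ef_step(W, E, G, rank, lr)
        sum_exact += G
        sum_applied += G_hat
    final_loss = 0.5 * np.sum((W - W_true) ** 2)
    return initial_loss, final_loss, sum_exact, sum_applied, E


if __name__ == "__main__":
    initial_loss, final_loss, sum_exact, sum_applied, E = run_demo()
    print(f"loss: {initial_loss:.3e} -> {final_loss:.3e}")
    # Applied updates plus the leftover residual should equal the summed exact gradients.
    print(np.allclose(sum_applied + E, sum_exact))
```

The bookkeeping in `run_demo` makes the compensation visible: because each residual is `corrected - G_hat`, the applied updates and the final residual telescope to the sum of the exact gradients, so no gradient information is permanently discarded by the compression.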


Related papers