Fugu-MT Paper Translation (Abstract): Gefen: Optimized Stochastic Optimizer

Paper abstract: Gefen: Optimized Stochastic Optimizer

arxiv url: http://arxiv.org/abs/2606.13894v1
Date: Thu, 11 Jun 2026 20:38:09 GMT
Status: Translation complete
Last updated in system: 2026-06-15 16:00:42.641962
Title: Gefen: Optimized Stochastic Optimizer
Title (reference translation): Gefen: Optimized Stochastic Optimizer
Authors: Nadav Benedek, Tomer Koren, Ohad Fried
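Abstract summary: AdamW is the default optimizer for modern deep learning, but its first- and second-moment states add roughly two parameter-sized buffers to training memory. We propose Gefen, a memory-efficient optimizer that shares second-moment estimates across parameter blocks and quantizes the first moment using a learned codebook. We show that Gefen reduces AdamW's memory footprint by about 8x while maintaining the same performance.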
Reference score (independently calculated attention score): 32.771995350910125
License: http://creativecommons.org/licenses/by/4.0/
Abstract: AdamW is a default optimizer for modern deep learning, but its first and second moment states add roughly two parameter-sized buffers to training memory. We propose Gefen, a memory-efficient optimizer that automatically shares second-moment estimates across parameter blocks and quantizes the first moment using a learned codebook, thereby reducing AdamW's memory footprint by ~8x while maintaining the same performance, corresponding to a reduction of 6.5 GiB per billion parameters. The method is motivated by a theoretical result showing that large mixed Hessian entries constrain the ratio of squared gradients toward one, suggesting that Hessian-aligned parameters are natural candidates for sharing second-moment statistics. Since computing Hessians is impractical at scale, Gefen infers block structure from the initial squared gradients, requiring no architecture-specific metadata or hyperparameters beyond AdamW defaults. Gefen learns an exact histogram-based dynamic-programming quantization codebook and reuses the same blocks for first-moment scaling. Across diverse experiments, Gefen achieves the lowest peak optimizer memory among the compared AdamW-like methods while maintaining AdamW-level performance. In FSDP and DDP training, the reduced memory footprint enables larger microbatches and improves throughput significantly over AdamW, providing a practical drop-in replacement with lower memory usage that can increase throughput and enable training larger models or using larger batch sizes. We provide the complete Python implementation, including fused CUDA kernels at https://github.com/ndvbd/Gefen
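The memory figure in the abstract can be checked with simple arithmetic. The sketch below assumes (this is not stated in the abstract) that AdamW keeps both moment buffers as 32-bit floats, i.e. 4 bytes per value, and that the 8x reduction applies uniformly to that state. The function name and its parameters are illustrative only.

```python
GIB = 2**30  # bytes per GiB


def adamw_state_savings_gib(num_params, bytes_per_value=4, reduction=8.0):
    """Return (baseline, reduced, saved) optimizer-state sizes in GiB.

    Assumption: two moment buffers of `bytes_per_value` bytes per parameter
    (fp32 AdamW state) and a uniform `reduction` factor on that state.
    """
    baseline = 2 * bytes_per_value * num_params / GIB
    reduced = baseline / reduction
    return baseline, reduced, baseline - reduced


baseline, reduced, saved = adamw_state_savings_gib(1_000_000_000)
print(f"baseline {baseline:.2f} GiB, reduced {reduced:.2f} GiB, saved {saved:.2f} GiB")
# prints: baseline 7.45 GiB, reduced 0.93 GiB, saved 6.52 GiB
```

Under these assumptions the saving comes out at about 6.5 GiB per billion parameters, in line with the figure quoted in the abstract.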
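Abstract (reference translation): AdamW is the default optimizer for modern deep learning, but its first- and second-moment states add roughly two parameter-sized buffers to training memory. We propose Gefen, a memory-efficient optimizer that automatically shares second-moment estimates across parameter blocks and quantizes the first moment using a learned codebook, reducing AdamW's memory footprint by about 8x at the same performance, a saving of 6.5 GiB per billion parameters. The method is motivated by a theoretical result showing that large mixed Hessian entries constrain the ratio of squared gradients toward one, which suggests that Hessian-aligned parameters are natural candidates for sharing second-moment statistics. Because computing Hessians is impractical at scale, Gefen infers the block structure from the initial squared gradients and requires no architecture-specific metadata or hyperparameters beyond the AdamW defaults. Gefen learns an exact histogram-based dynamic-programming quantization codebook and reuses the same blocks for first-moment scaling. Across diverse experiments, Gefen achieves the lowest peak optimizer memory among the compared AdamW-like methods while maintaining AdamW-level performance. In FSDP and DDP training, the reduced memory footprint allows larger microbatches and significantly higher throughput than AdamW. We provide the complete Python implementation, including fused CUDA kernels, at https://github.com/ndvbd/Gefen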
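To make the idea of sharing second-moment statistics across a block concrete, here is a minimal sketch of an AdamW-style step that keeps one second-moment scalar per block instead of one per parameter. It is not the Gefen implementation: the blocks here are fixed contiguous chunks (the abstract says Gefen infers them from the initial squared gradients), the first moment is kept in full precision rather than quantized with a codebook, and every name below is hypothetical.

```python
import math


def blockwise_adamw_step(params, grads, m, v_blocks, step, block_size,
                         lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8,
                         weight_decay=0.01):
    """One AdamW-style update with a single second-moment scalar per block.

    params, grads, m: lists of floats of equal length (updated in place).
    v_blocks: one float per contiguous block of `block_size` parameters.
    step: 1-based step count, used for bias correction.
    """
    bias1 = 1.0 - beta1 ** step
    bias2 = 1.0 - beta2 ** step
    for b in range(len(v_blocks)):
        lo = b * block_size
        hi = min(lo + block_size, len(params))
        # Shared statistic: EMA of the block's mean squared gradient.
        mean_sq = sum(g * g for g in grads[lo:hi]) / (hi - lo)
        v_blocks[b] = beta2 * v_blocks[b] + (1.0 - beta2) * mean_sq
        denom = math.sqrt(v_blocks[b] / bias2) + eps
        for i in range(lo, hi):
            # Per-parameter first moment (full precision in this sketch).
            m[i] = beta1 * m[i] + (1.0 - beta1) * grads[i]
            params[i] -= lr * weight_decay * params[i]  # decoupled weight decay
            params[i] -= lr * (m[i] / bias1) / denom
    return params


params = [1.0, -2.0, 0.5, 3.0]
grads = [0.1, -0.1, 0.2, 0.4]
m = [0.0] * len(params)
v_blocks = [0.0] * 2  # 2 shared scalars instead of 4 per-parameter entries
blockwise_adamw_step(params, grads, m, v_blocks, step=1, block_size=2)
```

With a block size of B, the second-moment buffer shrinks from one value per parameter to one value per B parameters. How the blocks are chosen is what the abstract attributes to the initial squared gradients, and that part is not modeled here.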
