Fugu-MT Paper Translation (Abstract): MERIT: Maximum-normalized Element-wise Ratio for Language Model Large-batch Training

Paper overview: MERIT: Maximum-normalized Element-wise Ratio for Language Model Large-batch Training

arxiv url: http://arxiv.org/abs/2508.20577v1
Date: Thu, 28 Aug 2025 09:14:23 GMT
Status: Translation complete
Last updated in system: 2025-08-29 18:12:02.26254
Title: MERIT: Maximum-normalized Element-wise Ratio for Language Model Large-batch Training
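Title (reference translation): MERIT: Maximum-normalized element-wise ratio for large-batch training of language models
Authors: Yang Luo, Zangwei Zheng, Ziheng Qin, Zirui Zhu, Yong Liu, Yang You
Abstract summary: Large-batch training has become a cornerstone in accelerating the training of deep neural networks. This work highlights the importance of considering the max attention logit and finer-granularity trust ratios in large-batch training. It improves training stability and paves the way for larger batch usage, enabling faster development and iteration of large language models.
Reference score (attention score computed independently by the site): 30.4584028979212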
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large-batch training has become a cornerstone in accelerating the training of deep neural networks, yet it poses challenges in optimization and generalization. Existing optimizers like AdamW present performance degradation during language models' large-batch training, due to the information bottleneck in attention layers caused by the sharp increase of max attention logit. While the LAMB optimizer partially addresses this issue, some attention layers still face this issue. The reason is that $l_2$-norm-based trust ratios in LAMB are less effective in directly influencing the max value of query/key weights. Furthermore, the weight-wise trust ratio in LAMB is error-prone as it overlooks relationships of weight values within rows or columns. Building on these observations, we propose a novel optimizer, MERIT, which leverages the max-norm to calculate the trust ratio to constrain the max attention logit more effectively. Moreover, we further construct element-wise trust ratios to provide more robust update scaling by focusing on local weight structures. Extensive experiments of large-batch training across various sizes of GPT-2 models demonstrate the superior performance of MERIT. Notably, during the training of GPT-2 Medium, MERIT enables a 6k batch size without any performance degradation compared to the standard batch size (480) with 48B training tokens. This work highlights the importance of considering the max attention logit and finer-granularity trust ratio in large-batch training. It successfully improves the training stability and paves the way for larger batch usage, enabling faster development and iteration of large language models. Code is available at https://github.com/NUS-HPC-AI-Lab/MERIT.
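Abstract (reference translation): Large-batch training has become a cornerstone for accelerating the training of deep neural networks, but it raises challenges in optimization and generalization. Existing optimizers such as AdamW show degraded performance in large-batch training of language models, because the sharp increase in the max attention logit creates an information bottleneck in attention layers. The LAMB optimizer partially addresses this issue, but some attention layers still suffer from it. The reason is that the $l_2$-norm-based trust ratios in LAMB are less effective at directly influencing the maximum value of query/key weights. In addition, the weight-wise trust ratio in LAMB is error-prone because it overlooks the relationships among weight values within rows or columns. Building on these observations, the authors propose MERIT, a new optimizer that uses the max-norm to compute the trust ratio and thereby constrains the max attention logit more effectively. They further construct element-wise trust ratios that focus on local weight structures to provide more robust update scaling. Extensive large-batch training experiments across GPT-2 models of various sizes demonstrate the superior performance of MERIT. Notably, when training GPT-2 Medium, MERIT enables a 6k batch size without any performance degradation compared to the standard batch size (480) with 48B training tokens. This work highlights the importance of considering the max attention logit and finer-granularity trust ratios in large-batch training. It improves training stability and paves the way for larger batch usage, enabling faster development and iteration of large language models. Code is available at https://github.com/NUS-HPC-AI-Lab/MERIT.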
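
The abstract contrasts LAMB's $l_2$-norm trust ratio with a max-norm trust ratio and with finer-grained, element-wise ratios. The sketch below illustrates that contrast on a small weight matrix in plain Python. It is not the MERIT update rule: the abstract gives no formulas, so the function names, the `eps` term, and in particular the choice to combine row-wise and column-wise ratios with `min` are assumptions made only for illustration. The published code at the repository above is the authoritative reference.

```python
import math


def l2_trust_ratio(w, u, eps=1e-12):
    """LAMB-style ratio: ||w||_2 / ||u||_2 over the whole weight matrix."""
    w_norm = math.sqrt(sum(x * x for row in w for x in row))
    u_norm = math.sqrt(sum(x * x for row in u for x in row))
    return w_norm / (u_norm + eps)


def max_trust_ratio(w, u, eps=1e-12):
    """Max-norm ratio: max|w| / max|u| over the whole weight matrix."""
    w_max = max(abs(x) for row in w for x in row)
    u_max = max(abs(x) for row in u for x in row)
    return w_max / (u_max + eps)


def elementwise_max_ratios(w, u, eps=1e-12):
    """Hypothetical element-wise ratios built from row and column max-norms.

    Assumption: each element takes the smaller of its row ratio and its
    column ratio. The abstract does not say how MERIT combines them.
    """
    rows, cols = len(w), len(w[0])
    row_ratio = [
        max(abs(x) for x in w[i]) / (max(abs(x) for x in u[i]) + eps)
        for i in range(rows)
    ]
    col_ratio = [
        max(abs(w[i][j]) for i in range(rows))
        / (max(abs(u[i][j]) for i in range(rows)) + eps)
        for j in range(cols)
    ]
    return [
        [min(row_ratio[i], col_ratio[j]) for j in range(cols)]
        for i in range(rows)
    ]


# Toy weight matrix and proposed update (illustrative values only).
weights = [[4.0, 0.0], [0.0, 3.0]]
update = [[1.0, 0.0], [0.0, 2.0]]

layer_l2 = l2_trust_ratio(weights, update)
layer_max = max_trust_ratio(weights, update)
per_element = elementwise_max_ratios(weights, update)
```

On this toy input the $l_2$ ratio and the max-norm ratio differ, and the element-wise variant assigns a larger scale to the element whose row and column both contain a large weight relative to its update. This is the kind of local structure a single per-matrix ratio cannot express.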

Related papers