Fugu-MT Paper Translation (Abstract): Learnable Multipliers: Freeing the Scale of Language Model Matrix Layers

Paper overview: Learnable Multipliers: Freeing the Scale of Language Model Matrix Layers

arxiv url: http://arxiv.org/abs/2601.04890v1
Date: Thu, 08 Jan 2026 12:41:49 GMT
Status: Translation complete
Last updated in system: 2026-01-09 17:01:53.204144
Title: Learnable Multipliers: Freeing the Scale of Language Model Matrix Layers
Authors: Maksim Velikanov, Ilyas Chahed, Jingwei Zuo, Dhia Eddine Rhaiem, Younes Belkada, Hakim Hacid
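Abstract summary: The paper introduces learnable multipliers that learn the optimal scale of matrix layers trained with weight decay. The method can be viewed as a learnable, more expressive generalization of muP multipliers. It outperforms a well-tuned muP baseline, reduces the computational overhead of multiplier tuning, and surfaces practical questions such as forward-pass symmetries and the width-scaling of the learned multipliers.
Reference score (independently computed attention score): 11.445970271488095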
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Applying weight decay (WD) to matrix layers is standard practice in large-language-model pretraining. Prior work suggests that stochastic gradient noise induces a Brownian-like expansion of the weight matrices W, whose growth is counteracted by WD, leading to a WD-noise equilibrium with a certain weight norm ||W||. In this work, we view the equilibrium norm as a harmful artifact of the training procedure, and address it by introducing learnable multipliers to learn the optimal scale. First, we attach a learnable scalar multiplier to W and confirm that the WD-noise equilibrium norm is suboptimal: the learned scale adapts to data and improves performance. We then argue that individual row and column norms are similarly constrained, and free their scale by introducing learnable per-row and per-column multipliers. Our method can be viewed as a learnable, more expressive generalization of muP multipliers. It outperforms a well-tuned muP baseline, reduces the computational overhead of multiplier tuning, and surfaces practical questions such as forward-pass symmetries and the width-scaling of the learned multipliers. Finally, we validate learnable multipliers with both the Adam and Muon optimizers, where they show an improvement in downstream evaluations that matches the improvement from switching from Adam to Muon.
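
The abstract describes attaching a scalar multiplier, and then per-row and per-column multipliers, to a weight matrix W. A minimal sketch of what such a forward pass could look like is given below, in plain Python. The function name `scaled_linear` and the symbols `s`, `r`, and `c` are hypothetical and do not come from the paper, and the exact parameterization the authors use may differ. The last lines illustrate one rescaling that leaves the output unchanged, which is assumed here to be the kind of forward-pass symmetry the abstract alludes to.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def scaled_linear(W: Matrix, x: Vector, s: float, r: Vector, c: Vector) -> Vector:
    """Compute y_i = s * r_i * sum_j W[i][j] * c_j * x_j.

    Hypothetical names (not from the paper): s is a scalar multiplier,
    r holds one multiplier per output row of W, and c holds one
    multiplier per input column of W.
    """
    if len(r) != len(W) or len(x) != len(c) or any(len(row) != len(c) for row in W):
        raise ValueError("shape mismatch between W, x, r and c")
    return [
        s * r_i * sum(w_ij * c_j * x_j for w_ij, c_j, x_j in zip(row, c, x))
        for row, r_i in zip(W, r)
    ]


W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # 3 outputs, 2 inputs
x = [1.0, -1.0]

# With all multipliers equal to 1 this reduces to the plain product W x.
y_plain = scaled_linear(W, x, s=1.0, r=[1.0, 1.0, 1.0], c=[1.0, 1.0])

# Scalar and per-row multipliers rescale the layer output.
y_scaled = scaled_linear(W, x, s=0.5, r=[1.0, 2.0, 4.0], c=[1.0, 1.0])

# Multiplying s by 4 and dividing every r_i by 4 gives the same output:
# the scale can move between multipliers without changing the function.
y_sym = scaled_linear(W, x, s=2.0, r=[0.25, 0.5, 1.0], c=[1.0, 1.0])
```

In a real training setup `s`, `r`, and `c` would be trainable parameters alongside W. The sketch covers only the forward computation, not how the multipliers are optimized or regularized.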
