Related papers: Learnable Multipliers: Freeing the Scale of Language Model Matrix Layers

URL: http://arxiv.org/abs/2601.04890v1
Date: Thu, 08 Jan 2026 12:41:49 GMT
Title: Learnable Multipliers: Freeing the Scale of Language Model Matrix Layers
Authors: Maksim Velikanov, Ilyas Chahed, Jingwei Zuo, Dhia Eddine Rhaiem, Younes Belkada, Hakim Hacid
Abstract summary: We introduce learnable multipliers to learn the optimal scale for applying weight decay to matrix layers. Our method can be viewed as a learnable, more expressive generalization of muP multipliers. It outperforms a well-tuned muP baseline, reduces the computational overhead of tuning, and surfaces practical questions such as forward-pass symmetries and the width-scaling of the learned multipliers.
Score: 11.445970271488095
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Applying weight decay (WD) to matrix layers is standard practice in large-language-model pretraining. Prior work suggests that stochastic gradient noise induces a Brownian-like expansion of the weight matrices W, whose growth is counteracted by WD, leading to a WD-noise equilibrium with a certain weight norm ||W||. In this work, we view the equilibrium norm as a harmful artifact of the training procedure, and address it by introducing learnable multipliers to learn the optimal scale. First, we attach a learnable scalar multiplier to W and confirm that the WD-noise equilibrium norm is suboptimal: the learned scale adapts to data and improves performance. We then argue that individual row and column norms are similarly constrained, and free their scale by introducing learnable per-row and per-column multipliers. Our method can be viewed as a learnable, more expressive generalization of muP multipliers. It outperforms a well-tuned muP baseline, reduces the computational overhead of multiplier tuning, and surfaces practical questions such as forward-pass symmetries and the width-scaling of the learned multipliers. Finally, we validate learnable multipliers with both the Adam and Muon optimizers, where they show improvements in downstream evaluations that match the improvement from switching from Adam to Muon.
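The abstract's scalar, per-row, and per-column multipliers can be illustrated with a minimal sketch. This is not the authors' code: the function names and the parameterization W_eff[i][j] = scalar * row_mult[i] * W[i][j] * col_mult[j] are assumptions made for illustration. Under that assumed form, one possible forward-pass symmetry is visible: multiplying every row multiplier by a constant and dividing every column multiplier by the same constant leaves the output unchanged.

```python
# Hypothetical sketch; names and parameterization are assumptions, not the paper's implementation.

def effective_weight(W, row_mult, col_mult, scalar=1.0):
    """Build W_eff[i][j] = scalar * row_mult[i] * W[i][j] * col_mult[j]."""
    return [
        [scalar * row_mult[i] * W[i][j] * col_mult[j] for j in range(len(W[i]))]
        for i in range(len(W))
    ]

def matvec(M, x):
    """Plain matrix-vector product on nested lists."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

def forward(W, row_mult, col_mult, x, scalar=1.0):
    """Forward pass of a matrix layer whose scale is set by the multipliers."""
    return matvec(effective_weight(W, row_mult, col_mult, scalar), x)

W = [[1.0, 2.0], [3.0, 4.0]]   # base weight matrix (the part subject to weight decay)
r = [2.0, 0.5]                 # per-row multipliers (would be learned in training)
c = [1.0, 4.0]                 # per-column multipliers (would be learned in training)
x = [1.0, 1.0]

y = forward(W, r, c, x)

# Rescaling symmetry of the assumed form: (a * r, c / a) gives the same output.
a = 4.0
y_rescaled = forward(W, [ri * a for ri in r], [cj / a for cj in c], x)

print(y, y_rescaled)
```

In a real training setup the multipliers would be trainable parameters updated by the optimizer; here they are fixed numbers so the arithmetic is easy to follow.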

Related papers

Stabilizing Native Low-Rank LLM Pretraining [24.2079184778031]
Low-rank factorization offers a promising route to reduce training and inference costs. We demonstrate that Large Language Models (LLMs) can be trained from scratch using exclusively low-rank factorized weights. Our method enables stable, end-to-end factorized training with negligible overhead.
arXiv Detail & Related papers (2026-02-12T21:33:14Z)
The Optimal Token Baseline: Variance Reduction for Long-Horizon LLM-RL [39.23942538769713]
Reinforcement Learning for Large Language Models (LLMs) often suffers from training collapse in long-horizon tasks due to exploding gradient variance. We derive the Optimal Token Baseline (OTB) from first principles, proving that gradient updates should be weighted inversely to their cumulative gradient norm. Our method achieves training stability and matches the performance of large group sizes with only $N=32$, reducing token consumption by over 65% across single-turn and tool-integrated reasoning tasks.
arXiv Detail & Related papers (2026-02-06T03:16:04Z)
REG: A Regularization Optimizer for Robust Training Dynamics [24.850151895583494]
The Row-and-Column-Scaling (RACS) operator regularizes the update steps in a less drastic manner, making it simpler to implement and more compatible with established training dynamics. We demonstrate that REG not only achieves superior performance and stability over AdamW, but also maintains consistency with the AdamW training paradigm.
arXiv Detail & Related papers (2025-10-04T06:05:57Z)
Hyperspherical Normalization for Scalable Deep Reinforcement Learning [57.016639036237315]
SimbaV2 is a novel reinforcement learning architecture designed to stabilize optimization. It scales up effectively with larger models and greater compute, achieving state-of-the-art performance on 57 continuous control tasks.
arXiv Detail & Related papers (2025-02-21T08:17:24Z)
RoSTE: An Efficient Quantization-Aware Supervised Fine-Tuning Approach for Large Language Models [53.571195477043496]
We propose an algorithm named Rotated Straight-Through-Estimator (RoSTE). RoSTE combines quantization-aware supervised fine-tuning (QA-SFT) with an adaptive rotation strategy to reduce activation outliers. Our findings reveal that the prediction error is directly proportional to the quantization error of the converged weights, which can be effectively managed through an optimized rotation configuration.
arXiv Detail & Related papers (2025-02-13T06:44:33Z)
Training Deep Learning Models with Norm-Constrained LMOs [56.00317694850397]
We propose a new family of algorithms that uses the linear minimization oracle (LMO) to adapt to the geometry of the problem. We demonstrate significant speedups on nanoGPT training using our algorithm, Scion, without any reliance on Adam.
arXiv Detail & Related papers (2025-02-11T13:10:34Z)
Large Continual Instruction Assistant [59.585544987096974]
Continual Instruction Tuning (CIT) is adopted to instruct Large Models to follow human intent dataset by dataset. During the CIT process, existing gradient updates heavily degrade performance on previous datasets. We propose a general continual instruction tuning framework to address this challenge.
arXiv Detail & Related papers (2024-10-08T11:24:59Z)
Big Learning Expectation Maximization [13.709094150105566]
We present the Big Learning EM (BigLearn-EM), an EM upgrade that simultaneously performs joint, marginal, and orthogonally transformed marginal matchings. We empirically show that BigLearn-EM is capable of delivering the optimum with high probability.
arXiv Detail & Related papers (2023-12-19T08:07:41Z)
Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators called WTA-CRS, for matrix production with reduced variance. Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
arXiv Detail & Related papers (2023-05-24T15:52:08Z)
Long-Tailed Recognition via Weight Balancing [66.03068252811993]
Naive training produces models that are biased toward common classes, with higher accuracy on those classes. We investigate three techniques to balance weights: L2-normalization, weight decay, and MaxNorm. Our approach achieves state-of-the-art accuracy on five standard benchmarks.
arXiv Detail & Related papers (2022-03-27T03:26:31Z)
MaxVA: Fast Adaptation of Step Sizes by Maximizing Observed Variance of Gradients [112.00379151834242]
We propose an adaptive learning rate principle, in which the running mean of the squared gradient in Adam is replaced by a weighted mean, with weights chosen to maximize the estimated variance of each coordinate. This results in faster adaptation, which leads to more desirable empirical convergence behaviors.
arXiv Detail & Related papers (2020-06-21T21:47:43Z)

This list is automatically generated from the titles and abstracts of the papers on this site.