Related papers: MARS-M: When Variance Reduction Meets Matrices

URL: http://arxiv.org/abs/2510.21800v2
Date: Tue, 28 Oct 2025 09:27:41 GMT
Title: MARS-M: When Variance Reduction Meets Matrices
Authors: Yifeng Liu, Angela Yuan, Quanquan Gu
Abstract summary: Matrix-based preconditioned optimizers have been shown to be more efficient than scalar-based optimizers for training large-scale neural networks. We introduce MARS-M, a new optimizer that integrates the variance reduction technique in MARS with Muon. Our empirical results on language modeling and computer vision tasks demonstrate that MARS-M consistently yields lower losses and improved performance across various downstream benchmarks.
Score: 47.405031764674014
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Matrix-based preconditioned optimizers, such as Muon, have recently been shown to be more efficient than scalar-based optimizers for training large-scale neural networks, including large language models (LLMs). On the other hand, recent benchmarks on optimizers for LLM pre-training have demonstrated that variance-reduction techniques such as MARS can achieve substantial speedups over standard optimizers that do not employ variance reduction. In this paper, to achieve the best of both worlds, we introduce MARS-M, a new optimizer that integrates the variance reduction technique in MARS with Muon. Under standard regularity conditions, we prove that MARS-M converges to a first-order stationary point at a rate of $\tilde{\mathcal{O}}(T^{-1/3})$, which improves upon the $\tilde{\mathcal{O}}(T^{-1/4})$ rate attained by Muon. Our empirical results on language modeling and computer vision tasks demonstrate that MARS-M consistently yields lower losses and improved performance across various downstream benchmarks. The implementation of MARS-M is available at https://github.com/AGI-Arena/MARS/tree/main/MARS_M.
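Note on the rates: as a rough illustration of what the two rates in the abstract imply, the following back-of-the-envelope calculation ignores constants and the logarithmic factors hidden in $\tilde{\mathcal{O}}$. The target accuracy $\epsilon$ and the iteration counts $T_{\text{MARS-M}}$ and $T_{\text{Muon}}$ are notation introduced here for illustration, not taken from the paper, and the abstract does not say in which norm stationarity is measured.

$$
T^{-1/3} \le \epsilon \iff T \ge \epsilon^{-3},
\qquad
T^{-1/4} \le \epsilon \iff T \ge \epsilon^{-4},
\qquad
\frac{T_{\text{Muon}}}{T_{\text{MARS-M}}} \approx \frac{\epsilon^{-4}}{\epsilon^{-3}} = \epsilon^{-1}.
$$

For example, with $\epsilon = 10^{-2}$ the bounds suggest on the order of $10^{6}$ iterations versus $10^{8}$. These are worst-case guarantees, so the gap observed in practice may be very different.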

Related papers

MARS: Harmonizing Multimodal Convergence via Adaptive Rank Search [12.345218777941108]
Fine-tuning Multimodal Large Language Models (MLLMs) with parameter-efficient methods like Low-Rank Adaptation (LoRA) is crucial for task adaptation. We introduce MARS (Multimodal Adaptive Rank Search), an approach to discover optimal rank pairs that balance training dynamics while maximizing performance. Our key innovation, a proposed framework of dual scaling laws, enables this search: one law models module-specific convergence time to prune the search space to candidates with aligned dynamics, while the other predicts final task performance to select the optimal pair from the pruned set.
arXiv Detail & Related papers (2026-02-28T15:58:28Z)
Variance-Adaptive Muon: Accelerating LLM Pretraining with NSR-Modulated and Variance-Scaled Momentum [19.385264518362472]
Large Language Models (LLMs) achieve competitive performance across diverse natural language processing (NLP) tasks. We propose two variants of Muon, Muon-NSR and Muon-VS, which apply variance-adaptive normalization to momentum. Experiments on GPT-2 and LLaMA pretraining demonstrate that our proposed methods accelerate convergence and consistently achieve lower validation loss than both competitive, well-tuned AdamW and Muon baselines.
arXiv Detail & Related papers (2026-01-21T02:41:56Z)
Muon is Provably Faster with Momentum Variance Reduction [55.388203260208485]
Recent empirical research has demonstrated that deep learning optimizers based on the linear minimization oracle (LMO) over specifically chosen non-Euclidean norm balls outperform Adam-type methods in the training of large language models.
arXiv Detail & Related papers (2025-12-18T14:38:39Z)
REG: A Regularization Optimizer for Robust Training Dynamics [24.850151895583494]
The Row-and-Column-Scaling (RACS) operator regularizes the update steps in a less drastic manner, making it simpler to implement and more compatible with established training dynamics. We demonstrate that REG not only achieves superior performance and stability over AdamW but also maintains consistency with the AdamW training paradigm.
arXiv Detail & Related papers (2025-10-04T06:05:57Z)
Leveraging Coordinate Momentum in SignSGD and Muon: Memory-Optimized Zero-Order [39.25335214877435]
Fine-tuning Large Language Models (LLMs) is essential for adapting pre-trained models to downstream tasks. Traditional first-order algorithms incur prohibitive memory and computational costs that scale poorly with model size. We propose zero-order (ZO) optimization methods as a memory- and compute-efficient alternative.
arXiv Detail & Related papers (2025-06-04T20:27:17Z)
Reinforced Model Merging [53.84354455400038]
We present an innovative framework termed Reinforced Model Merging (RMM), which encompasses an environment and agent tailored for merging tasks. By utilizing data subsets during the evaluation process, we addressed the bottleneck in the reward feedback phase, thereby accelerating RMM by up to 100 times.
arXiv Detail & Related papers (2025-03-27T08:52:41Z)
MARS: Unleashing the Power of Variance Reduction for Training Large Models [56.67982828148859]
We propose a unified training framework for deep neural networks. We introduce three instances of MARS that leverage preconditioned gradient optimization. Results indicate that the implementation of MARS consistently outperforms Adam.
arXiv Detail & Related papers (2024-11-15T18:57:39Z)
Zeroth-Order Fine-Tuning of LLMs in Random Subspaces [63.10833446782114]
As language models grow in size, memory demands for backpropagation increase. Zeroth-order (ZO) optimization methods offer a memory-efficient alternative. In this paper, we propose Subspace Zero-order optimization to address the challenges posed by high-dimensionality perturbations.
arXiv Detail & Related papers (2024-10-11T17:01:43Z)
Model-Agnostic Multitask Fine-tuning for Few-shot Vision-Language Transfer Learning [59.38343286807997]
We propose Model-Agnostic Multitask Fine-tuning (MAMF) for vision-language models on unseen tasks. Compared with model-agnostic meta-learning (MAML), MAMF discards the bi-level optimization and uses only first-order gradients. We show that MAMF consistently outperforms the classical fine-tuning method for few-shot transfer learning on five benchmark datasets.
arXiv Detail & Related papers (2022-03-09T17:26:53Z)

This list is automatically generated from the titles and abstracts of the papers on this site.