Turbo-Muon: Accelerating Orthogonality-Based Optimization with Pre-Conditioning
- URL: http://arxiv.org/abs/2512.04632v1
- Date: Thu, 04 Dec 2025 10:06:22 GMT
- Title: Turbo-Muon: Accelerating Orthogonality-Based Optimization with Pre-Conditioning
- Authors: Thibaut Boissin, Thomas Massena, Franck Mamalet, Mathieu Serrurier
- Abstract summary: We introduce a preconditioning procedure that accelerates Newton-Schulz convergence and reduces its computational cost. Our publicly available implementation achieves up to a 2.8x speedup in the Newton-Schulz approximation.
- Score: 7.966927192439667
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Orthogonality-based optimizers, such as Muon, have recently shown strong performance across large-scale training and community-driven efficiency challenges. However, these methods rely on a costly gradient orthogonalization step. Even efficient iterative approximations such as Newton-Schulz remain expensive, typically requiring dozens of matrix multiplications to converge. We introduce a preconditioning procedure that accelerates Newton-Schulz convergence and reduces its computational cost. We evaluate its impact and show that the overhead of our preconditioning can be made negligible. Furthermore, the faster convergence it enables allows us to remove one iteration out of the usual five without degrading approximation quality. Our publicly available implementation achieves up to a 2.8x speedup in the Newton-Schulz approximation. We also show that this has a direct impact on end-to-end training runtime, with a 5-10% improvement in realistic training scenarios across two efficiency-focused tasks. On challenging language and vision tasks, we validate that our method maintains equal or superior model performance while improving runtime. Crucially, these improvements require no hyperparameter tuning and can be adopted as a simple drop-in replacement. Our code is publicly available on GitHub.
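To make the orthogonalization step concrete, the following is a minimal sketch of the plain cubic Newton-Schulz iteration that Muon-style optimizers apply to each gradient matrix. It is illustrative only: the paper's preconditioning procedure is not reproduced here, and practical Muon implementations use a tuned quintic iteration rather than this textbook cubic one.

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=10, eps=1e-7):
    """Approximate the nearest (semi-)orthogonal factor of G with the
    cubic Newton-Schulz iteration X <- 1.5*X - 0.5*(X X^T) X.

    Illustrative sketch only: the paper's preconditioning and the tuned
    quintic coefficients used by Muon in practice are not shown here.
    """
    # Normalize so all singular values lie in (0, 1]; the iteration then
    # pushes each singular value toward 1 while preserving singular vectors.
    X = G / (np.linalg.norm(G) + eps)  # Frobenius-norm normalization
    for _ in range(steps):
        X = 1.5 * X - 0.5 * (X @ X.T) @ X
    return X
```

Each step maps every singular value s to 1.5s - 0.5s^3, a fixed-point iteration converging to 1, so the result approaches an orthogonal matrix using only matrix multiplications. Reducing the number of these multiplications is precisely where the speedups reported above come from.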
Related papers
- Gradient-Free Training of Quantized Neural Networks [9.348959582516438]
Training neural networks requires significant computational resources and energy. Mixed-precision and quantization-aware training reduce bit usage, yet they still depend heavily on computationally expensive gradient-based optimization. We propose a paradigm shift: eliminate gradients altogether.
arXiv Detail & Related papers (2024-10-13T05:38:39Z) - Gradient descent with generalized Newton's method [8.885727065823156]
We propose a Hessian-informed approach that applies to popular optimizers such as SGD and Adam. Our method automatically and dynamically selects a learning rate that accelerates convergence. In practice, our method is easily implementable, since it only requires additional forward passes with almost zero computational overhead.
arXiv Detail & Related papers (2024-07-03T03:01:43Z) - Time-, Memory- and Parameter-Efficient Visual Adaptation [75.28557015773217]
We propose an adaptation method which does not backpropagate gradients through the backbone.
We achieve this by designing a lightweight network in parallel that operates on features from the frozen, pretrained backbone.
arXiv Detail & Related papers (2024-02-05T10:55:47Z) - A Computationally Efficient Sparsified Online Newton Method [48.78646010774149]
Sparsified Online Newton (SONew) is a memory-efficient second-order algorithm that yields a sparsified yet effective preconditioner.
We achieve up to 30% faster convergence, 3.4% relative improvement in validation, and 80% relative improvement in training loss.
arXiv Detail & Related papers (2023-11-16T18:44:22Z) - Towards Compute-Optimal Transfer Learning [82.88829463290041]
We argue that zero-shot structured pruning of pretrained models allows them to increase compute efficiency with minimal reduction in performance.
Our results show that pruning convolutional filters of pretrained models can lead to more than 20% performance improvement in low computational regimes.
arXiv Detail & Related papers (2023-04-25T21:49:09Z) - Efficient Few-Shot Object Detection via Knowledge Inheritance [62.36414544915032]
Few-shot object detection (FSOD) aims at learning a generic detector that can adapt to unseen tasks with scarce training samples.
We present an efficient pretrain-transfer framework (PTF) baseline with no computational increment.
We also propose an adaptive length re-scaling (ALR) strategy to alleviate the vector length inconsistency between the predicted novel weights and the pretrained base weights.
arXiv Detail & Related papers (2022-03-23T06:24:31Z) - Domain Adversarial Training: A Game Perspective [80.3821370633883]
This paper defines optimal solutions in domain-adversarial training from a game theoretical perspective.
We show that gradient descent in domain-adversarial training can violate the optimizer's convergence guarantees, oftentimes hindering transfer performance.
Our proposed solutions are easy to implement, free of additional parameters, and can be plugged into any domain-adversarial framework.
arXiv Detail & Related papers (2022-02-10T22:17:30Z) - Joint inference and input optimization in equilibrium networks [68.63726855991052]
A deep equilibrium model (DEQ) is a class of models that forgoes traditional network depth and instead computes the output of a network by finding the fixed point of a single nonlinear layer.
We show that there is a natural synergy between these two settings.
We demonstrate this strategy on various tasks such as training generative models while optimizing over latent codes, training models for inverse problems like denoising and inpainting, adversarial training and gradient based meta-learning.
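The fixed-point computation described above can be sketched as follows. This is a hedged illustration, not the paper's method: the layer shape, weights, and plain damped iteration are assumptions chosen so the map is contractive (real DEQs use learned weights and faster root finders such as Anderson acceleration).

```python
import numpy as np

def deq_forward(x, W, U, tol=1e-8, max_iter=500):
    """Compute a deep-equilibrium-style output: the fixed point z* of a
    single nonlinear layer z = tanh(W z + U x).

    Illustrative only: plain fixed-point iteration suffices here because
    W is scaled to be contractive.
    """
    z = np.zeros(W.shape[0])
    for _ in range(max_iter):
        z_next = np.tanh(W @ z + U @ x)
        if np.linalg.norm(z_next - z) < tol:
            return z_next
        z = z_next
    return z

# Hypothetical small example: W scaled so its spectral norm stays below 1,
# which makes tanh(W z + U x) a contraction in z.
rng = np.random.default_rng(0)
W = 0.1 * rng.standard_normal((4, 4))
U = rng.standard_normal((4, 3))
x = rng.standard_normal(3)
z_star = deq_forward(x, W, U)
```

Because the output is defined implicitly by z* = tanh(W z* + U x), inputs and latent codes can be optimized jointly with the equilibrium solve, which is the synergy the entry above refers to.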
arXiv Detail & Related papers (2021-11-25T19:59:33Z) - Optimizer Fusion: Efficient Training with Better Locality and Parallelism [11.656318345362804]
Experimental results show that we can achieve an up to 20% training time reduction on various configurations.
Since our methods do not alter the algorithm, they can be used as a general "plug-in" technique to the training process.
arXiv Detail & Related papers (2021-04-01T03:44:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.