Related papers: The Polar Express: Optimal Matrix Sign Methods and Their Application to the Muon Algorithm

The Polar Express: Optimal Matrix Sign Methods and Their Application to the Muon Algorithm

URL: http://arxiv.org/abs/2505.16932v2
Date: Tue, 03 Jun 2025 16:46:45 GMT
Title: The Polar Express: Optimal Matrix Sign Methods and Their Application to the Muon Algorithm
Authors: Noah Amsel, David Persson, Christopher Musco, Robert M. Gower,
Abstract summary: Polar decomposition is a well-studied problem in numerical analysis for decades.<n>It has emerged as an important subroutine in deep learning, particularly within the Muon optimization framework.<n>We introduce Polar Express, a GPU-friendly algorithm for computing the polar decomposition.
Score: 13.161119762690618
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Computing the polar decomposition and the related matrix sign function, has been a well-studied problem in numerical analysis for decades. More recently, it has emerged as an important subroutine in deep learning, particularly within the Muon optimization framework. However, the requirements in this setting differ significantly from those of traditional numerical analysis. In deep learning, methods must be highly efficient and GPU-compatible, but high accuracy is often unnecessary. As a result, classical algorithms like Newton-Schulz (which suffers from slow initial convergence) and methods based on rational functions (which rely on QR decompositions or matrix inverses) are poorly suited to this context. In this work, we introduce Polar Express, a GPU-friendly algorithm for computing the polar decomposition. Like classical polynomial methods such as Newton-Schulz, our approach uses only matrix-matrix multiplications, making it GPU-compatible. Motivated by earlier work of Chen & Chow and Nakatsukasa & Freund, Polar Express adapts the polynomial update rule at each iteration by solving a minimax optimization problem, and we prove that it enjoys a strong worst-case optimality guarantee. This property ensures both rapid early convergence and fast asymptotic convergence. We also address finite-precision issues, making it stable in bfloat16 in practice. We apply Polar Express within the Muon optimization framework and show consistent improvements in validation loss on large-scale models such as GPT-2, outperforming recent alternatives across a range of learning rates.

Related papers

Inertial Quadratic Majorization Minimization with Application to Kernel Regularized Learning [1.0282274843007797]
We introduce the Quadratic Majorization Minimization with Extrapolation (QMME) framework and establish its sequential convergence properties.<n>To demonstrate practical advantages, we apply QMME to large-scale kernel regularized learning problems.
arXiv Detail & Related papers (2025-07-06T05:17:28Z)
WENDy for Nonlinear-in-Parameters ODEs [1.9573380763700712]
The Weak-form Estimation of Non-linear Dynamics (WEN) is extended to accommodate systems of ordinary differential equations that are nonlinear-ins.<n>We present results on a suite of benchmark systems to demonstrate the practical benefits of our approach.
arXiv Detail & Related papers (2025-02-13T01:40:21Z)
Stochastic Optimization for Non-convex Problem with Inexact Hessian Matrix, Gradient, and Function [99.31457740916815]
Trust-region (TR) and adaptive regularization using cubics have proven to have some very appealing theoretical properties. We show that TR and ARC methods can simultaneously provide inexact computations of the Hessian, gradient, and function values.
arXiv Detail & Related papers (2023-10-18T10:29:58Z)
Efficient Graph Laplacian Estimation by Proximal Newton [12.05527862797306]
A graph learning problem can be formulated as a maximum likelihood estimation (MLE) of the precision matrix. We develop a second-order approach to obtain an efficient solver utilizing several algorithmic features.
arXiv Detail & Related papers (2023-02-13T15:13:22Z)
The Curse of Unrolling: Rate of Differentiating Through Optimization [35.327233435055305]
Un differentiation approximates the solution using an iterative solver and differentiates it through the computational path. We show that we can either 1) choose a large learning rate leading to a fast convergence but accept that the algorithm may have an arbitrarily long burn-in phase or 2) choose a smaller learning rate leading to an immediate but slower convergence.
arXiv Detail & Related papers (2022-09-27T09:27:29Z)
Accelerated Proximal Alternating Gradient-Descent-Ascent for Nonconvex Minimax Machine Learning [12.069630105460766]
An Alternating Table-descentascent (AltGDA) is an computation optimization algorithm that has been widely used for training in various machine learning applications. In this paper, we develop a single-loop fast computation-of-the-loop gradient-of-the-loop algorithm to solve non minimax optimization problems.
arXiv Detail & Related papers (2021-12-22T04:33:27Z)
Newton-LESS: Sparsification without Trade-offs for the Sketched Newton Update [88.73437209862891]
In second-order optimization, a potential bottleneck can be computing the Hessian matrix of the optimized function at every iteration. We show that the Gaussian sketching matrix can be drastically sparsified, significantly reducing the computational cost of sketching. We prove that Newton-LESS enjoys nearly the same problem-independent local convergence rate as Gaussian embeddings.
arXiv Detail & Related papers (2021-07-15T17:33:05Z)
Minimax Optimization with Smooth Algorithmic Adversaries [59.47122537182611]
We propose a new algorithm for the min-player against smooth algorithms deployed by an adversary. Our algorithm is guaranteed to make monotonic progress having no limit cycles, and to find an appropriate number of gradient ascents.
arXiv Detail & Related papers (2021-06-02T22:03:36Z)
Recent Theoretical Advances in Non-Convex Optimization [56.88981258425256]
Motivated by recent increased interest in analysis of optimization algorithms for non- optimization in deep networks and other problems in data, we give an overview of recent results of theoretical optimization algorithms for non- optimization.
arXiv Detail & Related papers (2020-12-11T08:28:51Z)
Gradient Free Minimax Optimization: Variance Reduction and Faster Convergence [120.9336529957224]
In this paper, we denote the non-strongly setting on the magnitude of a gradient-free minimax optimization problem. We show that a novel zeroth-order variance reduced descent algorithm achieves the best known query complexity.
arXiv Detail & Related papers (2020-06-16T17:55:46Z)
Hardness of Random Optimization Problems for Boolean Circuits, Low-Degree Polynomials, and Langevin Dynamics [78.46689176407936]
We show that families of algorithms fail to produce nearly optimal solutions with high probability. For the case of Boolean circuits, our results improve the state-of-the-art bounds known in circuit complexity theory.
arXiv Detail & Related papers (2020-04-25T05:45:59Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.