PolarGrad: A Class of Matrix-Gradient Optimizers from a Unifying Preconditioning Perspective
- URL: http://arxiv.org/abs/2505.21799v1
- Date: Tue, 27 May 2025 22:11:21 GMT
- Title: PolarGrad: A Class of Matrix-Gradient Optimizers from a Unifying Preconditioning Perspective
- Authors: Tim Tsz-Kit Lau, Qi Long, Weijie Su
- Abstract summary: We introduce a unifying framework for analyzing "matrix-aware" preconditioned methods. We introduce PolarGrad, a new class of preconditioned optimization methods based on the polar decomposition of matrix-valued gradients.
- Score: 6.497756166630786
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The ever-growing scale of deep learning models and datasets underscores the critical importance of efficient optimization methods. While preconditioned gradient methods such as Adam and AdamW are the de facto optimizers for training neural networks and large language models, structure-aware preconditioned optimizers like Shampoo and Muon, which utilize the matrix structure of gradients, have demonstrated promising evidence of faster convergence. In this paper, we introduce a unifying framework for analyzing "matrix-aware" preconditioned methods, which not only sheds light on the effectiveness of Muon and related optimizers but also leads to a class of new structure-aware preconditioned methods. A key contribution of this framework is its precise distinction between preconditioning strategies that treat neural network weights as vectors (addressing curvature anisotropy) versus those that consider their matrix structure (addressing gradient anisotropy). This perspective provides new insights into several empirical phenomena in language model pre-training, including Adam's training instabilities, Muon's accelerated convergence, and the necessity of learning rate warmup for Adam. Building upon this framework, we introduce PolarGrad, a new class of preconditioned optimization methods based on the polar decomposition of matrix-valued gradients. As a special instance, PolarGrad includes Muon with updates scaled by the nuclear norm of the gradients. We provide numerical implementations of these methods, leveraging efficient numerical polar decomposition algorithms for enhanced convergence. Our extensive evaluations across diverse matrix optimization problems and language model pre-training tasks demonstrate that PolarGrad outperforms both Adam and Muon.
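To make the update concrete, here is a minimal NumPy sketch of a PolarGrad-style step, written from the abstract alone rather than from the authors' code: it assumes the step direction is the orthogonal polar factor of a momentum-averaged matrix gradient, scaled by its nuclear norm as in the Muon special case, and it uses a full SVD where the paper advocates efficient numerical polar-decomposition algorithms. The function names, momentum scheme, and exact normalization are illustrative assumptions.
```python
# Hypothetical sketch of a PolarGrad-style update (not the authors' reference code).
# Assumption: the step direction is the orthogonal polar factor of the momentum-averaged
# matrix gradient, rescaled by its nuclear norm; the paper's exact scaling may differ.
import numpy as np

def polar_decomposition(G):
    """Polar decomposition G = Q H via SVD: Q = U V^T is orthogonal,
    and the nuclear norm ||G||_* is the sum of singular values."""
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    Q = U @ Vt              # orthogonal polar factor (the Muon-style direction)
    nuclear_norm = s.sum()  # ||G||_*, used here to scale the step
    return Q, nuclear_norm

def polargrad_step(W, G, M, lr=1e-3, beta=0.9):
    """One illustrative PolarGrad-style step on a weight matrix W with momentum buffer M."""
    M = beta * M + (1.0 - beta) * G  # heavy-ball momentum on the raw matrix gradient
    Q, nuc = polar_decomposition(M)
    W = W - lr * nuc * Q             # nuclear-norm-scaled polar factor; the paper's
                                     # normalization may differ from this plain product
    return W, M

# Toy usage: one step on a random weight matrix and gradient.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 32))
G = rng.standard_normal((64, 32))
M = np.zeros_like(W)
W, M = polargrad_step(W, G, M)
```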
Related papers
- Improving Adaptive Moment Optimization via Preconditioner Diagonalization [11.01832755213396]
We show that our approach can substantially enhance the convergence speed of modern adaptive optimizers.
For large language models like LLaMA, we can achieve a speedup of 2x compared to the baseline Adam.
arXiv Detail & Related papers (2025-02-11T11:48:04Z) - On The Concurrence of Layer-wise Preconditioning Methods and Provable Feature Learning [22.486361028522374]
We show that layer-wise preconditioning methods are provably necessary from a statistical perspective.
We show that SGD is a suboptimal feature learner when extending beyond ideal isotropic inputs.
We show that standard tools like Adam preconditioning and batch-norm only mildly mitigate these issues.
arXiv Detail & Related papers (2025-02-03T19:08:32Z) - Stochastic Gradient Descent for Gaussian Processes Done Right [86.83678041846971]
We show that when done right -- by which we mean using specific insights from the optimisation and kernel communities -- gradient descent is highly effective.
We introduce a stochastic dual descent algorithm, explain its design in an intuitive manner and illustrate the design choices.
Our method places Gaussian process regression on par with state-of-the-art graph neural networks for molecular binding affinity prediction.
arXiv Detail & Related papers (2023-10-31T16:15:13Z) - Neural incomplete factorization: learning preconditioners for the conjugate gradient method [2.899792823251184]
We develop a data-driven approach to accelerate the generation of effective preconditioners.
We replace the typically hand-engineered preconditioners by the output of graph neural networks.
Our method generates an incomplete factorization of the matrix and is, therefore, referred to as neural incomplete factorization (NeuralIF).
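For context on where such a learned factorization enters, below is a standard preconditioned conjugate gradient loop in NumPy (textbook material, not the NeuralIF code); the `apply_preconditioner` hook is where an incomplete factor, whether hand-engineered or predicted by a graph neural network, would supply its approximate solve. The hook name and the toy Jacobi preconditioner are illustrative assumptions.
```python
# Textbook preconditioned conjugate gradient (PCG), shown only to illustrate where a
# learned incomplete factorization would plug in; names are illustrative, not the paper's API.
import numpy as np

def pcg(A, b, apply_preconditioner, tol=1e-8, max_iter=500):
    """Solve A x = b for symmetric positive definite A.
    apply_preconditioner(r) should approximate A^{-1} r, e.g. via two triangular
    solves with an incomplete factor L satisfying L L^T ~ A."""
    x = np.zeros_like(b)
    r = b - A @ x
    z = apply_preconditioner(r)
    p = z.copy()
    rz = r @ z
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = apply_preconditioner(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

# Toy usage with a Jacobi (diagonal) preconditioner standing in for a learned factor.
n = 100
A = np.diag(np.linspace(1.0, 100.0, n)) + 0.01 * np.ones((n, n))
b = np.ones(n)
x = pcg(A, b, apply_preconditioner=lambda r: r / np.diag(A))
```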
arXiv Detail & Related papers (2023-05-25T11:45:46Z) - Learning Large-scale Neural Fields via Context Pruned Meta-Learning [60.93679437452872]
We introduce an efficient optimization-based meta-learning technique for large-scale neural field training.
We show how gradient re-scaling at meta-test time allows the learning of extremely high-quality neural fields.
Our framework is model-agnostic, intuitive, straightforward to implement, and shows significant reconstruction improvements for a wide range of signals.
arXiv Detail & Related papers (2023-02-01T17:32:16Z) - Towards Theoretically Inspired Neural Initialization Optimization [66.04735385415427]
We propose a differentiable quantity, named GradCosine, with theoretical insights to evaluate the initial state of a neural network.
We show that both the training and test performance of a network can be improved by maximizing GradCosine under norm constraint.
Generalized from the sample-wise analysis to the real batch setting, NIO (Neural Initialization Optimization) automatically searches for a better initialization at negligible cost.
arXiv Detail & Related papers (2022-10-12T06:49:16Z) - A Control Theoretic Framework for Adaptive Gradient Optimizers in Machine Learning [0.6526824510982802]
Adaptive gradient methods have become popular in optimizing deep neural networks.
Recent examples include AdaGrad and Adam.
We develop a generic framework for adaptive gradient methods.
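For orientation, the diagonal (vector-wise) preconditioning that such adaptive methods share is the textbook Adam update, sketched below in NumPy; it is included only to contrast with the matrix-aware preconditioning discussed in the main paper above and is not code from the cited work.
```python
# Textbook Adam update (diagonal / vector-wise preconditioning), included only to
# contrast with matrix-aware preconditioning; not code from the cited papers.
import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step on a flattened parameter vector w with gradient g.
    t is the 1-based step count used for bias correction."""
    m = beta1 * m + (1 - beta1) * g              # first-moment (momentum) estimate
    v = beta2 * v + (1 - beta2) * g**2           # second-moment estimate (diagonal preconditioner)
    m_hat = m / (1 - beta1**t)                   # bias correction
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # element-wise preconditioned step
    return w, m, v
```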
arXiv Detail & Related papers (2022-06-04T17:55:33Z) - Gradient Descent, Stochastic Optimization, and Other Tales [8.034728173797953]
This tutorial doesn't shy away from addressing both the formal and informal aspects of gradient descent and optimization methods.
Gradient descent is one of the most popular algorithms to perform optimization and by far the most common way to optimize machine learning tasks.
In deep neural networks, the gradient followed by a single sample or a batch of samples is employed to save computational resources and escape from saddle points.
arXiv Detail & Related papers (2022-05-02T12:06:53Z) - Adapting Stepsizes by Momentumized Gradients Improves Optimization and Generalization [89.66571637204012]
AdaMomentum performs well on vision tasks and achieves state-of-the-art results consistently on other tasks, including language processing.
arXiv Detail & Related papers (2021-06-22T03:13:23Z) - Cogradient Descent for Dependable Learning [64.02052988844301]
We propose a dependable learning framework based on the Cogradient Descent (CoGD) algorithm to address the bilinear optimization problem.
CoGD is introduced to solve bilinear problems in which one variable is subject to a sparsity constraint.
It can also be used to decompose the association of features and weights, which further generalizes our method to better train convolutional neural networks (CNNs).
arXiv Detail & Related papers (2021-06-20T04:28:20Z) - Cogradient Descent for Bilinear Optimization [124.45816011848096]
We introduce a Cogradient Descent algorithm (CoGD) to address the bilinear problem.
We solve one variable by considering its coupling relationship with the other, leading to a synchronous gradient descent.
Our algorithm is applied to solve problems with one variable under a sparsity constraint.
arXiv Detail & Related papers (2020-06-16T13:41:54Z) - Scalable Second Order Optimization for Deep Learning [34.12384996822749]
We present a scalable implementation of a second-order preconditioned method (concretely, a variant of full-matrix Adagrad).
Our novel design effectively utilizes the prevalent heterogeneous hardware architecture for training deep models, consisting of a multicore CPU coupled with multiple accelerator units.
We demonstrate superior performance compared to state-of-the-art on very large learning tasks such as machine translation with Transformers, language modeling with BERT, click-through rate prediction on Criteo, and image classification on ImageNet with ResNet-50.
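The per-step rule being scaled up here is the textbook full-matrix Adagrad update, sketched below in NumPy for orientation; the cited paper's distributed CPU/accelerator implementation is far more involved, and this sketch is not its code.
```python
# Textbook full-matrix Adagrad update -- the per-step rule the cited work scales up.
# Illustrative only; the paper's distributed implementation is far more involved.
import numpy as np

def full_matrix_adagrad_step(w, g, H, lr=1e-2, eps=1e-6):
    """One full-matrix Adagrad step: precondition g by H_t^{-1/2}, where H_t is the
    running sum of gradient outer products (plus eps * I for numerical stability)."""
    H = H + np.outer(g, g)                                 # accumulate gradient outer products
    evals, evecs = np.linalg.eigh(H + eps * np.eye(len(w)))
    H_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T  # matrix inverse square root
    w = w - lr * H_inv_sqrt @ g
    return w, H
```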
arXiv Detail & Related papers (2020-02-20T20:51:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences.