Purifying Shampoo: Investigating Shampoo's Heuristics by Decomposing its Preconditioner
- URL: http://arxiv.org/abs/2506.03595v1
- Date: Wed, 04 Jun 2025 05:55:41 GMT
- Title: Purifying Shampoo: Investigating Shampoo's Heuristics by Decomposing its Preconditioner
- Authors: Runa Eschenhagen, Aaron Defazio, Tsung-Hsien Lee, Richard E. Turner, Hao-Jun Michael Shi
- Abstract summary: The recent success of Shampoo in the AlgoPerf contest has sparked renewed interest in Kronecker-factorization-based optimization algorithms for training neural networks. We show that grafting from Adam mitigates the staleness and mis-scaling of the preconditioner's eigenvalues, and that correcting the eigenvalues directly can eliminate the need for learning rate grafting.
- Score: 22.81536065294916
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The recent success of Shampoo in the AlgoPerf contest has sparked renewed interest in Kronecker-factorization-based optimization algorithms for training neural networks. Despite its success, Shampoo relies heavily on several heuristics such as learning rate grafting and stale preconditioning to achieve performance at scale. These heuristics increase algorithmic complexity, necessitate further hyperparameter tuning, and lack theoretical justification. This paper investigates these heuristics from the angle of Frobenius norm approximation to full-matrix Adam and decouples the preconditioner's eigenvalues and eigenbasis updates. We show that grafting from Adam mitigates the staleness and mis-scaling of the preconditioner's eigenvalues and how correcting the eigenvalues directly can eliminate the need for learning rate grafting. To manage the error induced by infrequent eigenbasis computations, we propose an adaptive criterion for determining the eigenbasis computation frequency motivated by terminating a warm-started QR algorithm. This criterion decouples the update frequency of different preconditioner matrices and enables us to investigate the impact of approximation error on convergence. These practical techniques offer a principled angle towards removing Shampoo's heuristics and developing improved Kronecker-factorization-based training algorithms.
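To make the decoupling of eigenvalues and eigenbasis concrete, the sketch below implements an eigenvalue-corrected Kronecker-factored update for a single matrix-shaped parameter: the eigenbases come from the usual Shampoo factor statistics, the eigenvalues are replaced by an Adam-style second moment accumulated in the rotated basis (removing the need for grafting), and a simple off-diagonal-mass test stands in for the paper's warm-started-QR criterion for deciding when to refresh a basis. Everything here (function names, the `tol` threshold, the staleness proxy) is an illustrative assumption, not the authors' implementation.

```python
import numpy as np

def basis_is_stale(Q, A, tol=0.1):
    """Hypothetical staleness proxy (not the paper's exact criterion):
    if the cached eigenbasis Q no longer approximately diagonalizes the
    current factor statistic A, the off-diagonal mass of Q^T A Q grows.
    Refresh the basis once it exceeds `tol` of the total Frobenius mass."""
    M = Q.T @ A @ Q
    off_diag = M - np.diag(np.diag(M))
    return np.linalg.norm(off_diag) > tol * np.linalg.norm(M)

def eigenvalue_corrected_step(G, state, lr=1e-3, beta2=0.999, eps=1e-8):
    """One illustrative update for a matrix-shaped parameter with gradient G.

    The eigenbases come from Shampoo-style factor statistics
    L = sum_t G_t G_t^T and R = sum_t G_t^T G_t, while the eigenvalues are
    replaced by an Adam-style second moment accumulated in the rotated basis."""
    # Accumulate Kronecker factor statistics.
    state["L"] += G @ G.T
    state["R"] += G.T @ G

    # Recompute the (expensive) eigenbases only when they look stale.
    for basis_key, stat_key in (("QL", "L"), ("QR", "R")):
        if basis_key not in state or basis_is_stale(state[basis_key], state[stat_key]):
            _, state[basis_key] = np.linalg.eigh(state[stat_key])

    QL, QR = state["QL"], state["QR"]

    # Rotate the gradient into the (possibly stale) eigenbasis.
    G_rot = QL.T @ G @ QR

    # Adam-style second moment in rotated coordinates replaces the
    # stale Kronecker-factored eigenvalues.
    state["V"] = beta2 * state["V"] + (1 - beta2) * G_rot**2

    # Precondition in the rotated basis, then rotate back.
    return -lr * (QL @ (G_rot / (np.sqrt(state["V"]) + eps)) @ QR.T)

# Toy usage on a random 4x3 weight matrix with random stand-in "gradients".
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))
state = {"L": np.zeros((4, 4)), "R": np.zeros((3, 3)), "V": np.zeros((4, 3))}
for _ in range(10):
    G = rng.standard_normal((4, 3))
    W += eigenvalue_corrected_step(G, state)
```

Because each factor's basis is tested independently, the update frequency of the different preconditioner matrices decouples naturally, mirroring the adaptive criterion described in the abstract.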
Related papers
- Zero-Variance Gradients for Variational Autoencoders [32.818968022327866]
Training deep generative models like Variational Autoencoders (VAEs) is often hindered by the need to backpropagate gradients through the sampling of their latent variables. In this paper, we propose a new perspective that sidesteps this problem, which we call Silent Gradients. Instead of improving estimators, we leverage specific decoder architectures to analytically compute the expected ELBO, yielding a gradient with zero variance.
arXiv Detail & Related papers (2025-08-05T15:54:21Z) - Deep Equilibrium models for Poisson Imaging Inverse problems via Mirror Descent [7.248102801711294]
Deep Equilibrium Models (DEQs) are implicit neural networks with fixed points. We introduce a novel DEQ formulation based on Mirror Descent, defined in terms of a tailored non-Euclidean geometry. We propose computational strategies that enable both efficient training and fully parameter-free inference.
arXiv Detail & Related papers (2025-07-15T16:33:01Z) - An Accelerated Alternating Partial Bregman Algorithm for ReLU-based Matrix Decomposition [0.0]
In this paper, we aim to investigate the sparse low-rank characteristics rectified on non-negative matrices. We propose a novel regularization term incorporating useful structures in clustering and compression tasks. We derive the corresponding closed-form solutions while ensuring that the $L$-smooth property holds for any $L \ge 1$.
arXiv Detail & Related papers (2025-03-04T08:20:34Z) - Error Feedback under $(L_0,L_1)$-Smoothness: Normalization and Momentum [56.37522020675243]
We provide the first proof of convergence for normalized error feedback algorithms across a wide range of machine learning problems.
We show that due to their larger allowable stepsizes, our new normalized error feedback algorithms outperform their non-normalized counterparts on various tasks.
arXiv Detail & Related papers (2024-10-22T10:19:27Z) - Learning incomplete factorization preconditioners for GMRES [1.1519724914285523]
We train a graph neural network to approximate the matrix factorization directly. Applying a graph neural network architecture allows us to ensure that the output itself is sparse. We show their effectiveness in decreasing the number of GMRES iterations and improving the spectral properties on synthetic data.
arXiv Detail & Related papers (2024-09-12T17:55:44Z) - Neural Network-Based Score Estimation in Diffusion Models: Optimization and Generalization [12.812942188697326]
Diffusion models have emerged as a powerful tool rivaling GANs in generating high-quality samples with improved fidelity, flexibility, and robustness.
A key component of these models is to learn the score function through score matching.
Despite empirical success on various tasks, it remains unclear whether gradient-based algorithms can learn the score function with provable accuracy.
arXiv Detail & Related papers (2024-01-28T08:13:56Z) - Stable Nonconvex-Nonconcave Training via Linear Interpolation [51.668052890249726]
This paper presents a theoretical analysis of linear interpolation as a principled method for stabilizing (large-scale) neural network training.
We argue that instabilities in the optimization process are often caused by the nonmonotonicity of the loss landscape and show how linear interpolation can help by leveraging the theory of nonexpansive operators.
arXiv Detail & Related papers (2023-10-20T12:45:12Z) - Stochastic Optimization for Non-convex Problem with Inexact Hessian Matrix, Gradient, and Function [99.31457740916815]
Trust-region (TR) methods and adaptive regularization using cubics (ARC) have proven to have some very appealing theoretical properties.
We show that TR and ARC methods can simultaneously allow for inexact computations of the Hessian, gradient, and function values.
arXiv Detail & Related papers (2023-10-18T10:29:58Z) - A Deep Unrolling Model with Hybrid Optimization Structure for Hyperspectral Image Deconvolution [50.13564338607482]
We propose a novel optimization framework for the hyperspectral deconvolution problem, called DeepMix. It consists of three distinct modules, namely, a data consistency module, a module that enforces the effect of the handcrafted regularizers, and a denoising module. This work proposes a context-aware denoising module designed to sustain the advancements achieved by the cooperative efforts of the other modules.
arXiv Detail & Related papers (2023-06-10T08:25:16Z) - Efficient Bound of Lipschitz Constant for Convolutional Layers by Gram Iteration [122.51142131506639]
We introduce a precise, fast, and differentiable upper bound for the spectral norm of convolutional layers using circulant matrix theory.
We show through a comprehensive set of experiments that our approach outperforms other state-of-the-art methods in terms of precision, computational cost, and scalability.
It proves highly effective for the Lipschitz regularization of convolutional neural networks, with competitive results against concurrent approaches.
arXiv Detail & Related papers (2023-05-25T15:32:21Z) - Neural incomplete factorization: learning preconditioners for the conjugate gradient method [2.899792823251184]
We develop a data-driven approach to accelerate the generation of effective preconditioners.
We replace the typically hand-engineered preconditioners by the output of graph neural networks.
Our method generates an incomplete factorization of the matrix and is, therefore, referred to as neural incomplete factorization (NeuralIF).
arXiv Detail & Related papers (2023-05-25T11:45:46Z) - A Stable, Fast, and Fully Automatic Learning Algorithm for Predictive Coding Networks [65.34977803841007]
Predictive coding networks are neuroscience-inspired models with roots in both Bayesian statistics and neuroscience.
We show how simply changing the temporal scheduling of the update rule for the synaptic weights leads to an algorithm that is much more efficient and stable than the original one.
arXiv Detail & Related papers (2022-11-16T00:11:04Z) - A Scalable, Adaptive and Sound Nonconvex Regularizer for Low-rank Matrix Completion [60.52730146391456]
We propose a new nonconvex low-rank regularizer, called the "nuclear Frobenius norm" regularizer, which is adaptive and sound.
It bypasses the computation of singular values and allows fast optimization.
It obtains state-of-the-art recovery performance while being the fastest among existing matrix learning methods.
arXiv Detail & Related papers (2020-08-14T18:47:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.