What Really Matters in Matrix-Whitening Optimizers?
- URL: http://arxiv.org/abs/2510.25000v1
- Date: Tue, 28 Oct 2025 21:59:49 GMT
- Title: What Really Matters in Matrix-Whitening Optimizers?
- Authors: Kevin Frans, Pieter Abbeel, Sergey Levine
- Abstract summary: We show that matrix-whitening methods reliably outperform elementwise counterparts. Variance-adapted versions consistently outperform their sign-descent counterparts. Low-rank variance estimators can effectively reduce memory costs without a performance loss.
- Score: 99.7641280234926
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A range of recent optimizers have emerged that approximate the same "matrix-whitening" transformation in various ways. In this work, we systematically deconstruct such optimizers, aiming to disentangle the key components that explain performance. With hyperparameters tuned across the board, all flavors of matrix-whitening methods reliably outperform elementwise counterparts such as Adam. Matrix-whitening is often related to spectral descent -- however, experiments reveal that performance gains are *not explained solely by accurate spectral normalization*: SOAP displays the largest per-step gain, even though Muon more accurately descends along the steepest spectral descent direction. Instead, we argue that matrix-whitening serves two purposes, and that the variance adaptation component of matrix-whitening is the overlooked ingredient explaining this performance gap. Experiments show that variance-adapted versions of optimizers, including an adaptive version of Muon, consistently outperform their sign-descent counterparts. We further ablate variance adaptation strategies, finding that while lookahead-style approximations are not as effective, low-rank variance estimators can effectively reduce memory costs without a performance loss.
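As an illustration of the two ingredients the abstract separates, below is a minimal NumPy sketch: an SVD-based whitening/orthogonalization step (the transformation Muon approximates with Newton-Schulz iterations) combined with an Adam-style elementwise second-moment rescaling. The function names and the exact way the two pieces are combined are illustrative assumptions, not the paper's algorithms.

```python
import numpy as np

def orthogonalize(G):
    """Whiten a gradient matrix: replace all of its singular values with 1.

    This is the "spectral" part of matrix-whitening (Muon-style updates
    approximate it with Newton-Schulz iterations instead of an SVD).
    """
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def variance_adapted_step(G, v, beta2=0.999, eps=1e-8):
    """Add the second ingredient: an elementwise second-moment rescaling.

    `v` is a running elementwise estimate of E[G**2]; the whitened direction
    is divided by sqrt(v), which is the "variance adaptation" the abstract
    argues is the overlooked ingredient. Illustrative sketch only.
    """
    v = beta2 * v + (1 - beta2) * G**2
    direction = orthogonalize(G) / (np.sqrt(v) + eps)
    return direction, v

# Toy usage: one step on a random gradient matrix.
rng = np.random.default_rng(0)
G = rng.normal(size=(64, 32))
v = np.zeros_like(G)
step, v = variance_adapted_step(G, v)
print(step.shape)  # (64, 32)
```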
Related papers
- Powering Up Zeroth-Order Training via Subspace Gradient Orthogonalization [40.95701844244596]
We show that ZO optimization can be substantially improved by unifying two complementary principles. We instantiate these in a new method, ZO-Muon, which admits a natural interpretation as a low-rank Muon in the ZO setting.
arXiv Detail & Related papers (2026-02-19T08:08:33Z) - Symmetry Breaking in Transformers for Efficient and Interpretable Training [5.624886369964602]
We introduce a simple symmetry-breaking protocol that inserts a preferred direction into a rotational space through batchwise-sampled, unlearned query and value biases. First, it can substantially improve the performance of simple, memory-efficient optimizers. Second, it enables an interpretable use of otherwise redundant rotational degrees of freedom.
arXiv Detail & Related papers (2026-01-29T19:29:09Z) - Controllable Feature Whitening for Hyperparameter-Free Bias Mitigation [26.926297904648393]
Deep neural networks are susceptible to learning spurious correlations present in datasets. We quantify the linear correlation between the target and bias features via the covariance matrix and eliminate it through a whitening module. We show that our method outperforms existing approaches on four benchmark datasets.
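A rough sketch of the decorrelation idea in the summary above, assuming ZCA-style whitening over stacked target and bias features; the paper's actual whitening module and training setup are not reproduced here.

```python
import numpy as np

def zca_whiten(X, eps=1e-5):
    """ZCA-whiten features so their covariance becomes (near) the identity.

    If X stacks target and bias features column-wise, whitening also drives
    the target/bias cross-covariance block to zero, which is the
    decorrelation effect the summary describes (hypothetical sketch).
    """
    Xc = X - X.mean(axis=0, keepdims=True)
    cov = Xc.T @ Xc / (len(Xc) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)
    W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return Xc @ W

rng = np.random.default_rng(0)
target_feats = rng.normal(size=(512, 8))
bias_feats = 0.7 * target_feats[:, :4] + rng.normal(size=(512, 4))  # correlated with target
joint = np.concatenate([target_feats, bias_feats], axis=1)
white = zca_whiten(joint)
cross_cov = np.cov(white.T)[:8, 8:]   # target x bias block
print(np.abs(cross_cov).max())        # ~0 after whitening
```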
arXiv Detail & Related papers (2025-07-27T14:01:30Z) - Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy [57.54306942529943]
We propose an Approximately Orthogonal Fine-Tuning (AOFT) strategy for representing the low-rank weight matrices. Our method achieves competitive performance across a range of downstream image classification tasks.
arXiv Detail & Related papers (2025-07-17T16:09:05Z) - DiffoRA: Enabling Parameter-Efficient Fine-Tuning via Differential Module Selection [32.369133126167085]
Low-Rank Adaptation (LoRA) has gained popularity for its streamlined design by incorporating low-rank matrices into existing pre-trained models. We propose DiffoRA, which enables adaptive adoption of the low-rank decomposition matrices.
arXiv Detail & Related papers (2025-02-13T02:41:34Z) - Expanding Sparse Tuning for Low Memory Usage [103.43560327427647]
We propose a method named SNELL (Sparse tuning with kerNELized LoRA) for sparse tuning with low memory usage.
To achieve low memory usage, SNELL decomposes the tunable matrix for sparsification into two learnable low-rank matrices.
A competition-based sparsification mechanism is further proposed to avoid the storage of tunable weight indexes.
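A hedged sketch of the mechanism described above: the tunable update is stored only as two low-rank factors, and a top-k magnitude rule stands in for the competition-based sparsification (the names, rank, and density below are illustrative assumptions, not SNELL's exact formulation).

```python
import numpy as np

def sparse_lowrank_delta(A, B, density=0.05):
    """Rebuild a sparse weight update from two small low-rank factors.

    The dense update is never stored: it is reconstructed from A (d_out x r)
    and B (r x d_in), then sparsified on the fly by keeping only the
    largest-magnitude entries ("competition"), so no index tensor is kept.
    """
    delta = A @ B                                     # low-rank reconstruction
    k = max(1, int(density * delta.size))
    threshold = np.partition(np.abs(delta).ravel(), -k)[-k]
    return np.where(np.abs(delta) >= threshold, delta, 0.0)

rng = np.random.default_rng(0)
A = rng.normal(size=(256, 8)) * 0.02   # only these two factors are trained
B = rng.normal(size=(8, 256)) * 0.02
W = rng.normal(size=(256, 256))        # frozen pre-trained weight
delta = sparse_lowrank_delta(A, B)
W_tuned = W + delta
print((delta != 0).mean())             # roughly the requested density (~0.05)
```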
arXiv Detail & Related papers (2024-11-04T04:58:20Z) - Efficient Adaptation of Pre-trained Vision Transformer via Householder Transformation [53.88562288388169]
A common strategy for Parameter-Efficient Fine-Tuning (PEFT) of pre-trained Vision Transformers (ViTs) involves adapting the model to downstream tasks.
We propose a novel PEFT approach inspired by Singular Value Decomposition (SVD) for representing the adaptation matrix.
SVD decomposes a matrix into the product of a left unitary matrix, a diagonal matrix of scaling values, and a right unitary matrix.
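For reference, the factorization described above can be checked directly, and a Householder reflection (the building block named in the title) is itself an orthogonal matrix; this is a NumPy illustration only, not the paper's adaptation scheme.

```python
import numpy as np

# The decomposition the summary describes: W = U @ diag(s) @ Vh, with U and Vh
# (approximately) orthogonal/unitary factors and s the singular values.
rng = np.random.default_rng(0)
W = rng.normal(size=(6, 4))
U, s, Vh = np.linalg.svd(W, full_matrices=False)
print(np.allclose(W, U @ np.diag(s) @ Vh))   # True: exact reconstruction
print(np.allclose(U.T @ U, np.eye(4)))       # True: columns of U are orthonormal

# A Householder reflection H = I - 2 v v^T (with unit-norm v) is orthogonal;
# products of such reflections can parameterize the U and Vh factors cheaply.
v = rng.normal(size=(4, 1)); v /= np.linalg.norm(v)
H = np.eye(4) - 2 * v @ v.T
print(np.allclose(H @ H.T, np.eye(4)))       # True
```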
arXiv Detail & Related papers (2024-10-30T12:08:30Z) - Spectrum-Aware Parameter Efficient Fine-Tuning for Diffusion Models [73.88009808326387]
We propose a novel spectrum-aware adaptation framework for generative models.
Our method adjusts both singular values and their basis vectors of pretrained weights.
We introduce Spectral Ortho Decomposition Adaptation (SODA), which balances computational efficiency and representation capacity.
arXiv Detail & Related papers (2024-05-31T17:43:35Z) - AGD: an Auto-switchable Optimizer using Stepwise Gradient Difference for Preconditioning Matrix [8.975415409709575]
We propose a novel approach to designing the preconditioning matrix by utilizing the gradient difference between two successive steps as the diagonal elements. We evaluate AGD on public datasets from Natural Language Processing (NLP), Computer Vision (CV), and Recommendation Systems (RecSys).
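A minimal sketch of the idea in the summary, assuming an Adam-like accumulator applied to gradient differences; AGD's auto-switching rule and exact update are not modeled here.

```python
import numpy as np

def agd_like_step(w, grad, prev_grad, v, lr=1e-3, beta2=0.999, eps=1e-8):
    """Diagonal preconditioning driven by the difference of successive gradients.

    Per the summary, the diagonal of the preconditioner comes from
    grad_t - grad_{t-1} (a cheap curvature proxy) rather than the raw
    gradient; the accumulator below is Adam-style. Illustrative sketch only.
    """
    diff = grad - prev_grad
    v = beta2 * v + (1 - beta2) * diff**2     # second moment of the gradient difference
    w = w - lr * grad / (np.sqrt(v) + eps)    # diagonally preconditioned update
    return w, v

# One toy step on the quadratic loss 0.5 * ||w||^2 (whose gradient is w).
rng = np.random.default_rng(0)
w = rng.normal(size=10)
v, prev_grad = np.zeros(10), np.zeros(10)
w, v = agd_like_step(w, w.copy(), prev_grad, v)
print(w.shape, v.shape)
```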
arXiv Detail & Related papers (2023-12-04T06:20:14Z) - Improving Generalization of Batch Whitening by Convolutional Unit Optimization [24.102442375834084]
Batch Whitening is a technique that accelerates and stabilizes training by transforming input features to have a zero mean (Centering) and a unit variance (Scaling).
In commonly used structures, which are empirically optimized with Batch Normalization, the normalization layer appears between convolution and activation function.
We propose a new Convolutional Unit that is in line with the theory, and our method generally improves the performance of Batch Whitening.
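A minimal NumPy sketch of batch whitening as described above (centering plus decorrelation of channels via the inverse square-root of their covariance); the new convolutional unit proposed in the paper is not reproduced here.

```python
import numpy as np

def batch_whiten(x, eps=1e-5):
    """Batch-whiten conv features of shape (N, C, H, W).

    Plain Batch Normalization performs only the Centering and per-channel
    Scaling mentioned in the summary; whitening additionally removes
    cross-channel correlation. In the commonly used unit this sits between
    the convolution and the activation. (Sketch; not the paper's unit.)
    """
    n, c, h, w = x.shape
    flat = x.transpose(1, 0, 2, 3).reshape(c, -1)    # C x (N*H*W)
    flat = flat - flat.mean(axis=1, keepdims=True)   # Centering
    cov = flat @ flat.T / flat.shape[1]
    eigvals, eigvecs = np.linalg.eigh(cov)
    whitener = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    white = whitener @ flat                          # Scaling + decorrelation
    return white.reshape(c, n, h, w).transpose(1, 0, 2, 3)

x = np.random.default_rng(0).normal(size=(8, 16, 4, 4))
out = batch_whiten(x)
# Channel covariance of the output is close to the identity:
print(np.round(np.cov(out.transpose(1, 0, 2, 3).reshape(16, -1)), 2)[:2, :2])
```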
arXiv Detail & Related papers (2021-08-24T10:27:57Z) - Understanding Implicit Regularization in Over-Parameterized Single Index Model [55.41685740015095]
We design regularization-free algorithms for the high-dimensional single index model.
We provide theoretical guarantees for the induced implicit regularization phenomenon.
arXiv Detail & Related papers (2020-07-16T13:27:47Z)