Error Feedback for Muon and Friends
- URL: http://arxiv.org/abs/2510.00643v1
- Date: Wed, 01 Oct 2025 08:20:08 GMT
- Title: Error Feedback for Muon and Friends
- Authors: Kaja Gruntkowska, Alexander Gaponov, Zhirayr Tovmasyan, Peter Richtárik
- Abstract summary: We introduce EF21-Muon, the first communication-efficient, non-Euclidean LMO-based optimizer with rigorous convergence guarantees. Our theory covers the non-Euclidean smooth and the more general $(L^0, L^1)$-smooth settings, matching best-known Euclidean rates and enabling faster convergence under suitable norm choices.
- Score: 80.90330715662961
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Recent optimizers like Muon, Scion, and Gluon have pushed the frontier of large-scale deep learning by exploiting layer-wise linear minimization oracles (LMOs) over non-Euclidean norm balls, capturing neural network structure in ways traditional algorithms cannot. Yet, no principled distributed framework exists for these methods, and communication bottlenecks remain unaddressed. The very few distributed variants are heuristic, with no convergence guarantees in sight. We introduce EF21-Muon, the first communication-efficient, non-Euclidean LMO-based optimizer with rigorous convergence guarantees. EF21-Muon supports stochastic gradients, momentum, and bidirectional compression with error feedback, marking the first extension of error feedback beyond the Euclidean setting. It recovers Muon/Scion/Gluon when compression is off and specific norms are chosen, providing the first efficient distributed implementation of this powerful family. Our theory covers the non-Euclidean smooth and the more general $(L^0, L^1)$-smooth setting, matching best-known Euclidean rates and enabling faster convergence under suitable norm choices. We further extend the analysis to layer-wise (generalized) smoothness regimes, capturing the anisotropic structure of deep networks. Experiments on NanoGPT benchmarking EF21-Muon against uncompressed Muon/Scion/Gluon demonstrate up to $7\times$ communication savings with no accuracy degradation.
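To make the mechanism concrete, here is a minimal single-worker sketch of EF21-style error feedback wrapped around a Muon-style LMO step, assuming a Top-K compressor and the spectral-norm LMO. The helper names (`topk_compress`, `spectral_lmo`) and the toy quadratic objective are illustrative choices, not the paper's API.

```python
import numpy as np

def topk_compress(M, k):
    """Keep the k largest-magnitude entries of M (a common contractive compressor)."""
    thresh = np.partition(np.abs(M).ravel(), -k)[-k]
    return np.where(np.abs(M) >= thresh, M, 0.0)

def spectral_lmo(M, radius=1.0):
    """LMO over the spectral-norm ball: argmin_{||X||_2 <= r} <M, X> = -r * U V^T."""
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return -radius * U @ Vt

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 32))   # one weight matrix
g_hat = np.zeros_like(X)            # EF21 gradient estimate, kept in sync on both ends
m = np.zeros_like(X)                # momentum buffer
lr, beta, k = 0.1, 0.9, 200

def grad(X):                        # toy objective: 0.5 * ||X||_F^2
    return X

for t in range(100):
    # EF21: communicate only the compressed *change* in the gradient estimate
    g_hat = g_hat + topk_compress(grad(X) - g_hat, k)
    # momentum on the compressed estimate, then a Muon-style LMO step
    m = beta * m + (1 - beta) * g_hat
    X = X + lr * spectral_lmo(m)
```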
Related papers
- Regularized Online RLHF with Generalized Bilinear Preferences [68.44113000390544]
We consider the problem of contextual online RLHF with general preferences. We adopt the Generalized Bilinear Preference Model to capture preferences via low-rank, skew-symmetric matrices. We prove that the dual gap of the greedy policy is bounded by the square of the estimation error.
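As an illustration of the model class (with an assumed sigmoid link, which may not be the paper's exact choice): any matrix $A = BC^\top - CB^\top$ is skew-symmetric with rank at most $2r$, and skew-symmetry makes pairwise preference probabilities self-consistent.

```python
import numpy as np

rng = np.random.default_rng(1)
n_actions, r = 10, 2
B = rng.standard_normal((n_actions, r))
C = rng.standard_normal((n_actions, r))
A = B @ C.T - C @ B.T                 # skew-symmetric: A.T == -A, rank <= 2r

def pref_prob(a, b):
    """P(a preferred over b) via a sigmoid link (illustrative choice)."""
    return 1.0 / (1.0 + np.exp(-A[a, b]))

# skew-symmetry gives the consistency P(a > b) + P(b > a) = 1
assert np.isclose(pref_prob(3, 7) + pref_prob(7, 3), 1.0)
```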
arXiv Detail & Related papers (2026-02-26T15:27:53Z) - Unregularized Linear Convergence in Zero-Sum Game from Preference Feedback [50.89125374999765]
We provide the first convergence guarantee for Optimistic Multiplicative Weights Update ($\mathtt{OMWU}$) in NLHF. Our analysis identifies a novel marginal convergence behavior, where the probability of rarely played actions grows exponentially from exponentially small values.
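For reference, the classic OMWU update on a two-player zero-sum matrix game, which the paper analyzes in the NLHF setting; the game and stepsize below are illustrative. Each player reweights by twice the current loss minus the previous one.

```python
import numpy as np

rng = np.random.default_rng(2)
G = rng.standard_normal((5, 5))              # payoff matrix: row player minimizes x^T G y
x = np.full(5, 0.2)
y = np.full(5, 0.2)
gx_prev = np.zeros(5)
gy_prev = np.zeros(5)
eta = 0.1

for t in range(1000):
    gx, gy = G @ y, -G.T @ x                 # simultaneous losses for both players
    # optimistic MWU: weight by 2 * (current loss) - (previous loss)
    x = x * np.exp(-eta * (2 * gx - gx_prev)); x /= x.sum()
    y = y * np.exp(-eta * (2 * gy - gy_prev)); y /= y.sum()
    gx_prev, gy_prev = gx, gy
```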
arXiv Detail & Related papers (2025-12-31T12:08:29Z) - Muon is Provably Faster with Momentum Variance Reduction [55.388203260208485]
Recent empirical research has demonstrated that deep learning optimizers based on the linear minimization oracle (LMO) over specifically chosen non-Euclidean norm balls outperform Adam-type methods in the training of large language models.
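The estimator behind the title is the momentum variance reduction (MVR/STORM-style) update; below is a minimal sketch on a toy stochastic quadratic with a normalized step. The crucial detail is that both gradients in the correction term are evaluated with the same fresh sample.

```python
import numpy as np

rng = np.random.default_rng(3)

def stoch_grad(x, noise):
    return x + noise                           # toy: gradient of 0.5*||x||^2 plus noise

x = rng.standard_normal(10)
d = stoch_grad(x, rng.standard_normal(10))     # initialize with a plain stochastic gradient
lr, beta = 0.1, 0.1

for t in range(100):
    x_new = x - lr * d / (np.linalg.norm(d) + 1e-12)   # normalized step
    noise = rng.standard_normal(10)                    # ONE fresh sample, used at BOTH points
    d = stoch_grad(x_new, noise) + (1 - beta) * (d - stoch_grad(x, noise))
    x = x_new
```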
arXiv Detail & Related papers (2025-12-18T14:38:39Z) - Better LMO-based Momentum Methods with Second-Order Information [48.580700968416444]
Hessian-Corrected Momentum (HCM) aims to improve momentum convergence rates. HCM can adapt to the geometry of the problem and achieve a faster rate than traditional momentum. We extend the Linear Minimization Oracle framework by integrating HCM, and provide convergence guarantees under relaxed smoothness and arbitrary norm settings.
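One plausible form of a Hessian-corrected momentum update, sketched on a quadratic where the Hessian-vector product is exact; the paper's precise update may differ, and in practice the HVP would come from autodiff or a finite difference rather than an explicit matrix.

```python
import numpy as np

rng = np.random.default_rng(4)
H = np.diag(np.linspace(0.1, 1.0, 10))         # quadratic f(x) = 0.5 * x^T H x

def grad(x, noise):
    return H @ x + noise

x = rng.standard_normal(10)
d = grad(x, rng.standard_normal(10))
lr, beta = 0.1, 0.1

for t in range(100):
    x_new = x - lr * d / (np.linalg.norm(d) + 1e-12)
    # Hessian-corrected momentum (sketch): transport the old estimate along
    # the step via an HVP instead of re-evaluating the old gradient.
    d = beta * grad(x_new, rng.standard_normal(10)) + (1 - beta) * (d + H @ (x_new - x))
    x = x_new
```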
arXiv Detail & Related papers (2025-12-15T11:43:09Z) - Deeper with Riemannian Geometry: Overcoming Oversmoothing and Oversquashing for Graph Foundation Models [47.23316001059971]
Message Passing Neural Networks (MPNNs) are the building blocks of graph foundation models. MPNNs suffer from oversmoothing and oversquashing. We propose a local approach that adjusts message passing based on local structures.
arXiv Detail & Related papers (2025-10-20T11:41:45Z) - On the Convergence of Muon and Beyond [31.900178928104648]
The Muon optimizer has demonstrated remarkable success in training the matrix-structured parameters of neural networks. A significant understanding gap persists between its theoretical rates and its practical variants. This work provides the first proof of optimality for a Muon-style method, and experiments corroborate our findings on per-iteration convergence.
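In practice, Muon-style methods orthogonalize the momentum matrix without an SVD, using a few Newton-Schulz iterations. A sketch follows; the quintic coefficients are quoted from the widely circulated open-source Muon implementation and should be treated as an assumption here.

```python
import numpy as np

def newton_schulz_orth(M, steps=5):
    """Approximately map M to U V^T (its 'orthogonalization') without an SVD."""
    a, b, c = 3.4445, -4.7750, 2.0315      # quintic coefficients from the public Muon code
    X = M / (np.linalg.norm(M) + 1e-7)     # Frobenius scaling bounds the spectral norm by 1
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X

M = np.random.default_rng(5).standard_normal((8, 4))
# singular values of the result are pushed into a band around 1 (not exactly 1)
print(np.linalg.svd(newton_schulz_orth(M), compute_uv=False))
```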
arXiv Detail & Related papers (2025-09-19T09:43:37Z) - Greedy Low-Rank Gradient Compression for Distributed Learning with Convergence Guarantees [10.828702910680692]
We propose the first greedy low-rank gradient compression algorithm for distributed learning with rigorous convergence guarantees. We prove that GreedyLore achieves a convergence rate of $\mathcal{O}(\sigma/\sqrt{NT} + 1/T)$ under standard optimizers such as MSGD and Adam, marking the first linear-speedup convergence rate for low-rank gradient compression.
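For context, here is the generic truncated-SVD compressor that low-rank gradient compression builds on; GreedyLore's greedy selection rule is more refined than this sketch, and the synthetic gradient with a decaying spectrum is an illustrative assumption.

```python
import numpy as np

def lowrank_compress(G, r):
    """Rank-r compression: send two thin factors instead of the full matrix G."""
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    return U[:, :r] * s[:r], Vt[:r, :]        # m*r + r*n numbers instead of m*n

rng = np.random.default_rng(6)
U0, _ = np.linalg.qr(rng.standard_normal((512, 64)))
V0, _ = np.linalg.qr(rng.standard_normal((256, 64)))
G = (U0 * np.exp(-0.2 * np.arange(64))) @ V0.T   # gradient-like matrix, decaying spectrum

P, Q = lowrank_compress(G, r=8)
ratio = G.size / (P.size + Q.size)
err = np.linalg.norm(G - P @ Q) / np.linalg.norm(G)
print(f"compression ratio: {ratio:.1f}x, relative error: {err:.2f}")
```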
arXiv Detail & Related papers (2025-07-11T17:46:12Z) - Muon Optimizes Under Spectral Norm Constraints [12.29696026957078]
We show that Muon implicitly solves an optimization problem that enforces a constraint on the spectral norm of weight matrices. This perspective allows for the exploration of a broader class of implicitly regularized and constrained optimization algorithms.
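The constrained problem in question has a closed-form linear minimization oracle: minimizing $\langle G, X\rangle$ over the spectral-norm ball $\{X : \|X\|_2 \le 1\}$ is attained at $X = -UV^\top$ for $G = U\Sigma V^\top$, with optimal value equal to minus the nuclear norm of $G$. A small numerical check:

```python
import numpy as np

rng = np.random.default_rng(7)
G = rng.standard_normal((6, 4))
U, s, Vt = np.linalg.svd(G, full_matrices=False)
X_star = -U @ Vt                              # LMO over {X : ||X||_2 <= 1}

# <G, X_star> equals minus the nuclear norm of G (sum of singular values)
assert np.isclose(np.sum(G * X_star), -s.sum())

# no random feasible point does better
for _ in range(100):
    Y = rng.standard_normal(G.shape)
    Y /= max(np.linalg.norm(Y, 2), 1.0)       # scale into the spectral-norm ball
    assert np.sum(G * Y) >= np.sum(G * X_star) - 1e-9
```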
arXiv Detail & Related papers (2025-06-18T01:32:39Z) - Smoothed Normalization for Efficient Distributed Private Optimization [54.197255548244705]
Federated learning enables training machine learning models while preserving the privacy of participants. However, there is no differentially private distributed method with error feedback for training nonconvex problems. We introduce a new distributed algorithm $\alpha$-NormEC with provable convergence guarantees.
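Reading the name literally, "smoothed normalization" plausibly refers to the clipping-free operator $g \mapsto g/(\alpha + \|g\|)$ seen elsewhere in differentially private optimization; the sketch below illustrates that reading, not the paper's exact algorithm.

```python
import numpy as np

def smoothed_normalize(g, alpha):
    """Smooth alternative to clipping: output norm stays below 1, no hard threshold."""
    return g / (alpha + np.linalg.norm(g))

rng = np.random.default_rng(8)
for scale in (0.01, 1.0, 100.0):
    g = scale * rng.standard_normal(10)
    print(scale, np.linalg.norm(smoothed_normalize(g, alpha=1.0)))  # always < 1
```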
arXiv Detail & Related papers (2025-02-19T07:10:32Z) - Mirror Descent Under Generalized Smoothness [23.5387392871236]
We introduce a new $\ell^*$-smoothness concept that measures the norm of the Hessian in terms of a general norm and its dual. We establish convergence for mirror-descent-type algorithms, matching the known rates under classic smoothness.
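A standard instance such analyses cover is mirror descent with the negative-entropy mirror map on the simplex (exponentiated gradient); the generalized-smoothness condition changes the stepsize theory, not the update itself. A minimal sketch on a toy projection problem:

```python
import numpy as np

def mirror_descent_step(x, g, eta):
    """Negative-entropy mirror map: grad psi(x) = log x, so the update is
    multiplicative, followed by renormalization onto the simplex."""
    y = x * np.exp(-eta * g)
    return y / y.sum()

# minimize f(x) = 0.5 * ||x - c||^2 over the probability simplex
c = np.array([0.7, 0.2, 0.1, 0.0])
x = np.full(4, 0.25)
for t in range(500):
    x = mirror_descent_step(x, x - c, eta=0.5)
print(x)   # approaches c, which already lies on the simplex
```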
arXiv Detail & Related papers (2025-02-02T11:23:10Z) - MARINA-P: Superior Performance in Non-smooth Federated Optimization with Adaptive Stepsizes [57.24311218570012]
We extend the theory of EF21-P (Anonymous, 2024) and MARINA-P (arXiv:2402.06412) to the non-smooth convex setting. We provide theoretical guarantees under constant, decreasing, and adaptive (Polyak-type) stepsizes.
arXiv Detail & Related papers (2024-12-22T16:18:34Z) - DFedADMM: Dual Constraints Controlled Model Inconsistency for Decentralized Federated Learning [52.83811558753284]
Decentralized federated learning (DFL) discards the central server and establishes a decentralized communication network.
Existing DFL methods still suffer from two major challenges: local inconsistency and local overfitting.
arXiv Detail & Related papers (2023-08-16T11:22:36Z) - Gradient-Free Methods for Deterministic and Stochastic Nonsmooth Nonconvex Optimization [94.19177623349947]
Nonsmooth nonconvex optimization problems emerge in machine learning and business decision making.
Two core challenges impede the development of efficient methods with finite-time convergence guarantees.
Two-phase versions of GFM and SGFM are also proposed and proven to achieve improved large-deviation results.
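Gradient-free methods of this kind are built on two-point randomized-smoothing gradient estimators; below is a minimal sketch of that building block (the GFM/SGFM algorithms themselves add further structure). The test function and stepsizes are illustrative.

```python
import numpy as np

def two_point_grad_estimate(f, x, delta, rng):
    """Randomized-smoothing gradient estimator: needs only two function values."""
    u = rng.standard_normal(x.shape)
    u /= np.linalg.norm(u)                       # uniform direction on the unit sphere
    return (x.size / (2 * delta)) * (f(x + delta * u) - f(x - delta * u)) * u

f = lambda x: np.abs(x).sum()                    # nonsmooth test function
rng = np.random.default_rng(9)
x = rng.standard_normal(20)
for t in range(2000):
    x -= 0.01 * two_point_grad_estimate(f, x, delta=0.05, rng=rng)
print(f(x))   # noisily driven well below its starting value
```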
arXiv Detail & Related papers (2022-09-12T06:53:24Z) - Log-based Sparse Nonnegative Matrix Factorization for Data Representation [55.72494900138061]
Nonnegative matrix factorization (NMF) has been widely studied in recent years due to its effectiveness in representing nonnegative data with parts-based representations.
We propose a new NMF method with log-norm imposed on the factor matrices to enhance the sparseness.
A novel column-wise sparse norm, named the $\ell_{2,\log}$-(pseudo) norm, is proposed to enhance the robustness of the proposed method.
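For context, the classic Lee-Seung multiplicative updates that sparse NMF variants start from; the paper's $\ell_{2,\log}$ penalty would modify these updates, which this sketch does not attempt to reproduce.

```python
import numpy as np

rng = np.random.default_rng(10)
X = np.abs(rng.standard_normal((50, 40)))     # nonnegative data matrix
r = 5
W = np.abs(rng.standard_normal((50, r)))
H = np.abs(rng.standard_normal((r, 40)))
eps = 1e-9                                    # guards against division by zero

for t in range(200):
    # Lee-Seung multiplicative updates for min ||X - WH||_F^2 with W, H >= 0
    H *= (W.T @ X) / (W.T @ W @ H + eps)
    W *= (X @ H.T) / (W @ H @ H.T + eps)

print(np.linalg.norm(X - W @ H) / np.linalg.norm(X))   # relative reconstruction error
```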
arXiv Detail & Related papers (2022-04-22T11:38:10Z)