FISMO: Fisher-Structured Momentum-Orthogonalized Optimizer
- URL: http://arxiv.org/abs/2601.21750v1
- Date: Thu, 29 Jan 2026 14:05:04 GMT
- Title: FISMO: Fisher-Structured Momentum-Orthogonalized Optimizer
- Authors: Chenrui Xu, Wenjing Yan, Ying-Jun Angela Zhang
- Abstract summary: We introduce FISMO, which incorporates anisotropic curvature information through Fisher information geometry. FISMO achieves superior training efficiency and final performance compared to established baselines.
- Score: 30.184978506988767
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Training large-scale neural networks requires solving nonconvex optimization where the choice of optimizer fundamentally determines both convergence behavior and computational efficiency. While adaptive methods like Adam have long dominated practice, the recently proposed Muon optimizer achieves superior performance through orthogonalized momentum updates that enforce isotropic geometry with uniform singular values. However, this strict isotropy discards potentially valuable curvature information encoded in gradient spectra, motivating optimization methods that balance geometric structure with adaptivity. We introduce FISMO (Fisher-Structured Momentum-Orthogonalized) optimizer, which generalizes isotropic updates to incorporate anisotropic curvature information through Fisher information geometry. By reformulating the optimizer update as a trust-region problem constrained by a Kronecker-factored Fisher metric, FISMO achieves structured preconditioning that adapts to local loss landscape geometry while maintaining computational tractability. We establish convergence guarantees for FISMO in stochastic nonconvex settings, proving an $\mathcal{O}(1/\sqrt{T})$ rate for the expected squared gradient norm with explicit characterization of variance reduction through mini-batching. Empirical evaluation on image classification and language modeling benchmarks demonstrates that FISMO achieves superior training efficiency and final performance compared to established baselines.
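The abstract leaves the concrete update rule unstated. As a hedged illustration only, one whiten-orthogonalize-pull-back instantiation of a trust-region step under a Kronecker-factored metric might look as follows; the factor names `a`, `b` and all hyperparameters are hypothetical, not taken from the paper.

```python
import numpy as np

def orthogonalize(m):
    # Muon-style orthogonalization: set every singular value of m to 1.
    u, _, vt = np.linalg.svd(m, full_matrices=False)
    return u @ vt

def fismo_like_step(w, mom, grad, a, b, lr=0.02, beta=0.95, damping=1e-4):
    # w, mom, grad: (out, in) matrices for one layer.
    # a: (out, out), b: (in, in) -- Kronecker Fisher factors, e.g.
    # K-FAC-style running averages (an assumption, not the paper's recipe).
    mom = beta * mom + grad
    la = np.linalg.cholesky(a + damping * np.eye(a.shape[0]))
    lb = np.linalg.cholesky(b + damping * np.eye(b.shape[0]))
    lb_inv = np.linalg.inv(lb)
    # Whiten the momentum under the metric F ~ a (kron) b ...
    m_white = np.linalg.solve(la, mom) @ lb_inv.T
    # ... take the isotropic, orthogonalized step in whitened coordinates ...
    step = orthogonalize(m_white)
    # ... and pull the step back to parameter space, making it anisotropic.
    update = np.linalg.solve(la.T, step) @ lb_inv
    return w - lr * update, mom
```

When `a` and `b` are identity matrices, the step reduces (up to damping) to plain orthogonalized momentum, i.e. the Muon-style isotropic update the abstract says FISMO generalizes.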
Related papers
- ODELoRA: Training Low-Rank Adaptation by Solving Ordinary Differential Equations [54.886931928255564]
Low-rank adaptation (LoRA) has emerged as a widely adopted parameter-efficient fine-tuning method in deep transfer learning. We propose a novel continuous-time optimization dynamic for LoRA factor matrices in the form of an ordinary differential equation (ODE). We show that ODELoRA achieves stable feature learning, a property that is crucial for training deep neural networks at different scales of problem dimensionality.
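As a minimal sketch of the continuous-time view, assuming a plain gradient-flow ODE for the two LoRA factors and an explicit second-order integrator (the paper's actual dynamic and solver are not specified above):

```python
import numpy as np

def heun_step(a, b, grad_fn, dt=0.05):
    """One Heun (RK2) step of the gradient flow dA/dt = -dL/dA,
    dB/dt = -dL/dB for LoRA factors A, B; grad_fn returns both gradients."""
    ga, gb = grad_fn(a, b)
    a_pred, b_pred = a - dt * ga, b - dt * gb                # Euler predictor
    ga2, gb2 = grad_fn(a_pred, b_pred)
    return a - 0.5 * dt * (ga + ga2), b - 0.5 * dt * (gb + gb2)  # corrector
```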
arXiv Detail & Related papers (2026-02-07T10:19:36Z) - Variational Entropic Optimal Transport [67.76725267984578]
We propose Variational Entropic Optimal Transport (VarEOT) for domain translation problems. VarEOT is based on an exact variational reformulation of the log-partition $\log \mathbb{E}[\exp(\cdot)]$ as a tractable minimization over an auxiliary positive normalizer. Experiments on synthetic data and unpaired image-to-image translation demonstrate competitive or improved translation quality.
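The "auxiliary positive normalizer" is consistent with the elementary variational identity below (our reading, not confirmed by the abstract): for $x > 0$,

$$\log x=\min_{c>0}\Bigl(\frac{x}{c}+\log c-1\Bigr),\qquad\text{so}\qquad \log\mathbb{E}\bigl[e^{f}\bigr]=\min_{c>0}\Bigl(\frac{\mathbb{E}[e^{f}]}{c}+\log c-1\Bigr),$$

with the minimum attained at $c=\mathbb{E}[e^{f}]$; this trades the intractable log-partition for a joint minimization over $c$ and the other variational objects.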
arXiv Detail & Related papers (2026-02-02T15:48:44Z) - Majorization-Minimization Networks for Inverse Problems: An Application to EEG Imaging [4.063392865490957]
Inverse problems are often ill-posed and require optimization schemes with strong stability and convergence guarantees. We propose a learned Majorization-Minimization (MM) framework for inverse problems within a bilevel optimization setting. We learn a structured curvature majorant that governs each MM step while preserving classical MM descent guarantees.
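For reference, the classical MM descent guarantee the summary appeals to: with a majorant $Q$ satisfying

$$Q(x;x_k)\ \ge\ f(x)\ \ \forall x,\qquad Q(x_k;x_k)=f(x_k),\qquad x_{k+1}=\arg\min_x Q(x;x_k),$$

monotone descent follows from $f(x_{k+1})\le Q(x_{k+1};x_k)\le Q(x_k;x_k)=f(x_k)$; the learned component here is the structured curvature of $Q$.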
arXiv Detail & Related papers (2026-01-23T10:33:45Z) - NOVAK: Unified adaptive optimizer for deep neural networks [0.0]
NOVAK is a gradient-based optimization algorithm that integrates adaptive moment estimation, rectified learning-rate scheduling, decoupled weight regularization, multiple variants of Nesterov momentum, and lookahead synchronization into a unified, performance-oriented framework.
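A hedged sketch of how those ingredients compose (NAdam-style Nesterov moments, AdamW-style decoupled decay, Lookahead-style synchronization); NOVAK's exact rules, including its rectified learning-rate schedule, are not given above and are omitted here:

```python
import numpy as np

def novak_like_step(w, m, v, slow_w, grad, t, lr=1e-3, b1=0.9, b2=0.999,
                    eps=1e-8, weight_decay=1e-2, k=5, alpha=0.5):
    # t is the 1-based step count (bias correction needs t >= 1).
    # Adaptive moment estimation with bias correction.
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad**2
    m_hat = m / (1 - b1**t)
    v_hat = v / (1 - b2**t)
    # Nesterov-style correction: blend momentum with the current gradient.
    m_nes = b1 * m_hat + (1 - b1) * grad / (1 - b1**t)
    # Decoupled weight regularization (AdamW-style) plus adaptive step.
    w = w - lr * (m_nes / (np.sqrt(v_hat) + eps) + weight_decay * w)
    # Lookahead synchronization every k steps.
    if t % k == 0:
        slow_w = slow_w + alpha * (w - slow_w)
        w = slow_w.copy()
    return w, m, v, slow_w
```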
arXiv Detail & Related papers (2026-01-11T13:03:57Z) - Data-Driven Adaptive Gradient Recovery for Unstructured Finite Volume Computations [0.0]
We present a novel data-driven approach for enhancing gradient reconstruction in unstructured finite volume methods for hyperbolic conservation laws. Our approach extends previous structured-grid methodologies to unstructured meshes through a modified DeepONet architecture. The proposed algorithm is faster and more accurate than the traditional second-order finite volume solver.
arXiv Detail & Related papers (2025-07-22T13:23:57Z) - Decentralized Nonconvex Composite Federated Learning with Gradient Tracking and Momentum [78.27945336558987]
Decentralized federated learning (DFL) eliminates reliance on the client-server architecture. Non-smooth regularization is often incorporated into machine learning tasks. We propose a novel DNCFL algorithm to solve these problems.
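For context, the generic gradient-tracking recursion such decentralized methods build on (DNCFL additionally uses momentum and handles the non-smooth regularizer, details not given above): with a doubly stochastic mixing matrix $W=[w_{ij}]$,

$$x_i^{t+1}=\sum_j w_{ij}\bigl(x_j^{t}-\eta\,y_j^{t}\bigr),\qquad y_i^{t+1}=\sum_j w_{ij}\,y_j^{t}+\nabla f_i\bigl(x_i^{t+1}\bigr)-\nabla f_i\bigl(x_i^{t}\bigr),$$

where each $y_i$ tracks the network-wide average gradient, removing the bias that plain decentralized SGD suffers under heterogeneous data.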
arXiv Detail & Related papers (2025-04-17T08:32:25Z) - AYLA: Amplifying Gradient Sensitivity via Loss Transformation in Non-Convex Optimization [0.0]
Stochastic Gradient Descent (SGD) and its variants, such as Adam, are foundational to deep learning optimization. This paper introduces AYLA, a novel framework that enhances training dynamics through loss transformation.
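The mechanism the title points to fits in one line (the paper's specific transform is not stated above): training on a transformed loss $g(L)$ rescales the gradient by the chain rule,

$$\nabla_\theta\,g\bigl(L(\theta)\bigr)=g'\bigl(L(\theta)\bigr)\,\nabla_\theta L(\theta),$$

so any strictly increasing $g$ preserves the minimizers while regions with $g'>1$ receive amplified gradients.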
arXiv Detail & Related papers (2025-04-02T16:31:39Z) - Gradient Correction in Federated Learning with Adaptive Optimization [19.93709245766609]
We propose FAdamGC, the first algorithm to integrate client-drift compensation into adaptive optimization. We show that FAdamGC consistently outperforms existing methods in total communication and computation cost across varying levels of data heterogeneity.
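As an illustration of what client-drift compensation inside an adaptive optimizer can look like, here is a SCAFFOLD-style control-variate correction fed into an Adam step; FAdamGC's actual correction and its control-variate updates may differ:

```python
import numpy as np

def drift_corrected_adam_step(w, m, v, grad, c_local, c_global, t,
                              lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    # SCAFFOLD-style correction: debias the local gradient before it
    # enters the adaptive moments. t is the 1-based step count.
    g = grad - c_local + c_global
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g**2
    m_hat = m / (1 - b1**t)
    v_hat = v / (1 - b2**t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```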
arXiv Detail & Related papers (2025-02-04T21:21:30Z) - Adaptive Federated Learning Over the Air [108.62635460744109]
We propose a federated version of adaptive gradient methods, particularly AdaGrad and Adam, within the framework of over-the-air model training.
Our analysis shows that the AdaGrad-based training algorithm converges to a stationary point at the rate of $\mathcal{O}\bigl(\ln(T) / T^{1 - \frac{1}{\alpha}}\bigr)$.
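For reference, the centralized AdaGrad template such a federated, over-the-air variant adapts (the aggregation noise model behind $\alpha$ is not given above):

$$v_t=v_{t-1}+g_t^{\odot 2},\qquad x_{t+1}=x_t-\frac{\eta}{\sqrt{v_t}+\epsilon}\odot g_t,$$

with $g_t$ the noisily over-the-air-aggregated client gradient in the federated setting.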
arXiv Detail & Related papers (2024-03-11T09:10:37Z) - Stable Nonconvex-Nonconcave Training via Linear Interpolation [51.668052890249726]
This paper presents a theoretical analysis of linear interpolation as a principled method for stabilizing (large-scale) neural network training.
We argue that instabilities in the optimization process are often caused by the nonmonotonicity of the loss landscape and show how linear interpolation can help by leveraging the theory of nonexpansive operators.
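The operator-theoretic lens here is the Krasnosel'skii-Mann averaged iteration (our framing of the summary, not a quotation of the paper):

$$x_{t+1}=(1-\lambda)\,x_t+\lambda\,T(x_t),\qquad\lambda\in(0,1),$$

which converges to a fixed point whenever the update operator $T$ is nonexpansive, $\|T(x)-T(y)\|\le\|x-y\|$; linear interpolation between iterates plays exactly the role of the averaging weight $\lambda$.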
arXiv Detail & Related papers (2023-10-20T12:45:12Z) - LQF: Linear Quadratic Fine-Tuning [114.3840147070712]
We present the first method for linearizing a pre-trained model that achieves comparable performance to non-linear fine-tuning.
LQF consists of simple modifications to the architecture, loss function and optimization typically used for classification.
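The core construction behind linearized fine-tuning is a first-order Taylor expansion of the network in its parameters; a generic sketch follows (LQF's specific architectural and loss modifications are not reproduced here):

```python
import numpy as np

def linearized_model(f, jac, theta0):
    """First-order Taylor linearization of a pre-trained model around theta0:
    f_lin(x; theta) = f(x; theta0) + J(x) @ (theta - theta0),
    where jac(x, theta0) returns the Jacobian of f w.r.t. the parameters.
    Fine-tuning f_lin with a quadratic loss is then a linear-quadratic
    problem (convex in theta)."""
    def f_lin(x, theta):
        return f(x, theta0) + jac(x, theta0) @ (theta - theta0)
    return f_lin
```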
arXiv Detail & Related papers (2020-12-21T06:40:20Z)