Isotropic Curvature Model for Understanding Deep Learning Optimization: Is Gradient Orthogonalization Optimal?
- URL: http://arxiv.org/abs/2511.00674v1
- Date: Sat, 01 Nov 2025 19:37:29 GMT
- Title: Isotropic Curvature Model for Understanding Deep Learning Optimization: Is Gradient Orthogonalization Optimal?
- Authors: Weijie Su,
- Abstract summary: We introduce a model for analyzing deep learning optimization over a single iteration by leveraging the matrix structure of the weights. We show that the optimal update matrix is obtained by making the spectrum of the original gradient matrix more homogeneous. We discuss future research on how to leverage the isotropic curvature model for designing new optimization methods for training deep learning and language models.
- Score: 2.5336146096572745
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we introduce a model for analyzing deep learning optimization over a single iteration by leveraging the matrix structure of the weights. We derive the model by assuming isotropy of curvature, including the second-order Hessian and higher-order terms, of the loss function across all perturbation directions; hence, we call it the isotropic curvature model. This model is a convex optimization program amenable to analysis, which allows us to understand how an update on the weights in the form of a matrix relates to the change in the total loss function. As an application, we use the isotropic curvature model to analyze the recently introduced Muon optimizer and other matrix-gradient methods for training language models. First, we show that under a general growth condition on the curvature, the optimal update matrix is obtained by making the spectrum of the original gradient matrix more homogeneous -- that is, making its singular values closer in ratio -- which in particular improves the conditioning of the update matrix. Next, we show that the orthogonalized gradient becomes optimal for the isotropic curvature model when the curvature exhibits a phase transition in growth. Taken together, these results suggest that the gradient orthogonalization employed in Muon and other related methods is directionally correct but may not be strictly optimal. Finally, we discuss future research on how to leverage the isotropic curvature model for designing new optimization methods for training deep learning and language models.
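To make the spectrum-shaping idea concrete, the following is a minimal numpy sketch (not taken from the paper) contrasting full gradient orthogonalization, which Muon approximates via Newton-Schulz iterations and which sets all singular values to a common value, with a hypothetical partial homogenization that only shrinks the spread of the singular values. The power-law exponent `alpha` is an illustrative knob, not the optimal update derived from the isotropic curvature model.

```python
import numpy as np

def orthogonalize(G):
    """Map G to its polar factor U V^T, i.e. set every nonzero
    singular value to 1 (the target that Muon approximates)."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def homogenize_spectrum(G, alpha=0.5):
    """Shrink the spread of the singular values: sigma -> sigma**(1 - alpha).
    alpha = 0 leaves G unchanged; alpha = 1 recovers full orthogonalization.
    The power law here is an illustrative choice, not the paper's optimum."""
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ np.diag(s ** (1.0 - alpha)) @ Vt

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    G = rng.standard_normal((256, 64))  # stand-in for a gradient matrix
    for alpha in (0.0, 0.5, 1.0):
        s = np.linalg.svd(homogenize_spectrum(G, alpha), compute_uv=False)
        print(f"alpha={alpha:.1f}  condition number = {s[0] / s[-1]:.3f}")
    # alpha = 1.0 coincides with the polar factor from orthogonalize()
    print(np.allclose(homogenize_spectrum(G, 1.0), orthogonalize(G)))
```

Running the sketch shows the condition number of the update falling from that of the raw gradient (alpha = 0) to exactly 1 under full orthogonalization (alpha = 1), mirroring the abstract's point that making the singular values closer in ratio improves the conditioning of the update matrix.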
Related papers
- Preventing Model Collapse Under Overparametrization: Optimal Mixing Ratios for Interpolation Learning and Ridge Regression [4.71547360356314]
Model collapse occurs when generative models degrade after repeatedly training on their own synthetic outputs. We derive precise error formulae for minimum-$\ell$-norm and ridge regression under this iterative scheme. Our analysis reveals intriguing properties of the mixing weight that minimizes long-term prediction error.
arXiv Detail & Related papers (2025-09-26T13:34:48Z) - Generalized Gradient Norm Clipping & Non-Euclidean $(L_0,L_1)$-Smoothness [51.302674884611335]
This work introduces a hybrid non-Euclidean optimization method which generalizes norm clipping by combining steepest descent and conditional gradient approaches. We discuss how to instantiate the algorithms for deep learning and demonstrate their properties on image classification and language modeling.
arXiv Detail & Related papers (2025-06-02T17:34:29Z) - Scaling and renormalization in high-dimensional regression [72.59731158970894]
We present a unifying perspective on recent results on ridge regression. We use the basic tools of random matrix theory and free probability, aimed at readers with backgrounds in physics and deep learning. Our results extend and provide a unifying perspective on earlier models of scaling laws.
arXiv Detail & Related papers (2024-05-01T15:59:00Z) - Stochastic Gradient Descent for Gaussian Processes Done Right [86.83678041846971]
We show that when done right -- by which we mean using specific insights from the optimisation and kernel communities -- gradient descent is highly effective.
We introduce a stochastic dual descent algorithm, explain its design in an intuitive manner, and illustrate the design choices.
Our method places Gaussian process regression on par with state-of-the-art graph neural networks for molecular binding affinity prediction.
arXiv Detail & Related papers (2023-10-31T16:15:13Z) - FineMorphs: Affine-diffeomorphic sequences for regression [1.1421942894219896]
The model states are optimally "reshaped" by diffeomorphisms generated by smooth vector fields during learning.
Affine transformations and vector fields are optimized within an optimal control setting.
The model can naturally reduce (or increase) dimensionality and adapt to large datasets via suboptimal vector fields.
arXiv Detail & Related papers (2023-05-26T20:54:18Z) - Linear Convergence of ISTA and FISTA [8.261388753972234]
We revisit the class of iterative shrinkage-thresholding algorithms (ISTA) for solving the linear inverse problem with sparse representation.
We find that the previous assumption requiring the smooth part to be convex weakens the least-squares model.
We generalize the linear convergence to composite optimization in both the objective value and the squared proximal subgradient norm.
arXiv Detail & Related papers (2022-12-13T02:02:50Z) - Learning Graphical Factor Models with Riemannian Optimization [70.13748170371889]
This paper proposes a flexible algorithmic framework for graph learning under low-rank structural constraints.
The problem is expressed as penalized maximum likelihood estimation of an elliptical distribution.
We leverage geometries of positive definite matrices and positive semi-definite matrices of fixed rank that are well suited to elliptical models.
arXiv Detail & Related papers (2022-10-21T13:19:45Z) - Graph Polynomial Convolution Models for Node Classification of Non-Homophilous Graphs [52.52570805621925]
We investigate efficient learning from higher-order graph convolution and learning directly from the adjacency matrix for node classification.
We show that the resulting model leads to new graphs and a residual scaling parameter.
We demonstrate that the proposed methods obtain improved accuracy for node classification of non-homophilous graphs.
arXiv Detail & Related papers (2022-09-12T04:46:55Z) - Towards Better Understanding of Adaptive Gradient Algorithms in Generative Adversarial Nets [71.05306664267832]
Adaptive algorithms perform gradient updates using the history of gradients and are ubiquitous in training deep neural networks.
In this paper we analyze a variant of the Optimistic Adagrad algorithm for nonconcave minimax problems.
Our experiments show that the advantage of adaptive gradient algorithms over non-adaptive ones in GAN training can be observed empirically.
arXiv Detail & Related papers (2019-12-26T22:10:10Z) - Geometry, Computation, and Optimality in Stochastic Optimization [24.154336772159745]
We study the computational and statistical consequences of problem geometry in stochastic and online optimization. By focusing on constraint set and gradient geometry, we characterize the problem families for which stochastic- and adaptive-gradient methods are (minimax) optimal.
arXiv Detail & Related papers (2019-09-23T16:14:26Z)