Related papers: The Optimiser Hidden in Plain Sight: Training with the Loss Landscape's Induced Metric

URL: http://arxiv.org/abs/2509.03594v1
Date: Wed, 03 Sep 2025 18:00:33 GMT
Authors: Thomas R. Harvey
Abstract summary: We present a class of novel optimisers for training neural networks. The new optimiser has a computational complexity comparable to that of Adam. One variant of these optimisers can also be viewed as inducing an effective scheduled learning rate.
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We present a class of novel optimisers for training neural networks that makes use of the Riemannian metric naturally induced when the loss landscape is embedded in a higher-dimensional space. This is the same metric that underlies common visualisations of loss landscapes. By taking this geometric perspective literally and using the induced metric, we develop a new optimiser and compare it to existing methods, namely SGD, Adam, AdamW, and Muon, across a range of tasks and architectures. Empirically, we conclude that this new class of optimisers is highly effective in low-dimensional examples and provides a slight improvement over state-of-the-art methods for training neural networks. These new optimisers have theoretically desirable properties. In particular, the effective learning rate is automatically decreased in regions of high curvature, acting as a smoothed-out form of gradient clipping. Similarly, one variant of these optimisers can also be viewed as inducing an effective scheduled learning rate, and decoupled weight decay is the natural choice from our geometric perspective. The basic method can be used to modify any existing preconditioning method. The new optimiser has a computational complexity comparable to that of Adam.
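Illustrative sketch (not the paper's algorithm): the geometric idea in the abstract can be made concrete under one simplifying assumption, namely that the loss landscape is embedded as the graph (θ, L(θ)) in Euclidean space of one extra dimension. The induced metric is then g = I + ∇L ∇Lᵀ, and by the Sherman-Morrison formula g⁻¹∇L = ∇L / (1 + |∇L|²), so the preconditioned step costs no more than a dot product. The function names, the toy quadratic loss, and the learning rate below are hypothetical; the actual optimisers in the paper may add scaling factors, momentum, or other preconditioning.

```python
import numpy as np


def induced_metric_step(theta, grad, lr=0.5):
    """One descent step preconditioned by the graph metric g = I + grad grad^T.

    Assumption: the landscape is the graph (theta, L(theta)) in R^(n+1).
    Sherman-Morrison gives g^{-1} grad = grad / (1 + |grad|^2).
    """
    scale = 1.0 / (1.0 + np.dot(grad, grad))
    return theta - lr * scale * grad


def toy_loss(theta):
    # Hypothetical quadratic bowl, used only for illustration.
    return 0.5 * np.dot(theta, theta)


def toy_grad(theta):
    return theta


def run(theta0, steps=100, lr=0.5):
    """Iterate the preconditioned step and record the loss at every iterate."""
    theta = np.asarray(theta0, dtype=float)
    losses = [toy_loss(theta)]
    for _ in range(steps):
        theta = induced_metric_step(theta, toy_grad(theta), lr=lr)
        losses.append(toy_loss(theta))
    return theta, losses


if __name__ == "__main__":
    theta_final, losses = run([3.0, -2.0])
    print("initial loss:", losses[0])
    print("final loss:", losses[-1])
```

In this sketch the step length is lr·|∇L| / (1 + |∇L|²), which never exceeds lr/2: large gradients are damped smoothly rather than cut off at a hard threshold, which is one way to read the abstract's "smoothed-out form of gradient clipping".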

Related papers

A geometric framework for momentum-based optimizers for low-rank training [2.389598109913754]
Low-rank pre-training and fine-tuning have emerged as promising techniques for reducing the computational and storage costs of large neural networks. We show that classical momentum methods can struggle to converge to a local optimum due to the geometry of the underlying optimization landscape. We introduce novel training strategies derived from dynamical low-rank approximation, which explicitly account for the underlying geometric structure.
arXiv Detail & Related papers (2025-06-20T20:46:01Z)
Rolling Ball Optimizer: Learning by ironing out loss landscape wrinkles [19.667068548957143]
Training large neural networks (NNs) requires optimizing high-dimensional data-dependent loss functions. These functions are often highly complex and textured, even fractal-like. Noise in the training data can propagate forward and give rise to unrepresentative small-scale geometry.
arXiv Detail & Related papers (2025-05-26T05:26:21Z)
Deep Learning Optimization Using Self-Adaptive Weighted Auxiliary Variables [20.09691024284159]
In this paper, we develop a new framework for learning via neural networks or physics-informed networks. The robustness of our framework guarantees that the new loss helps optimize the original problem.
arXiv Detail & Related papers (2025-04-30T10:43:13Z)
Hallmarks of Optimization Trajectories in Neural Networks: Directional Exploration and Redundancy [75.15685966213832]
We analyze the rich directional structure of optimization trajectories represented by their pointwise parameters. We show that training only scalar batchnorm parameters some while into training matches the performance of training the entire network.
arXiv Detail & Related papers (2024-03-12T07:32:47Z)
Efficient and Flexible Neural Network Training through Layer-wise Feedback Propagation [49.44309457870649]
Layer-wise Feedback Propagation (LFP) is a novel training principle for neural network-like predictors. LFP decomposes a reward to individual neurons based on their respective contributions. Our method then implements a greedy approach, reinforcing helpful parts of the network and weakening harmful ones.
arXiv Detail & Related papers (2023-08-23T10:48:28Z)
No Wrong Turns: The Simple Geometry Of Neural Networks Optimization Paths [12.068608358926317]
First-order optimization algorithms are known to efficiently locate favorable minima in deep neural networks. We focus on the fundamental geometric properties of sampled quantities of optimization on two key paths. Our findings suggest that not only do optimization trajectories never encounter significant obstacles, but they also maintain stable dynamics during the majority of training.
arXiv Detail & Related papers (2023-06-20T22:10:40Z)
Towards Theoretically Inspired Neural Initialization Optimization [66.04735385415427]
We propose a differentiable quantity, named GradCosine, with theoretical insights to evaluate the initial state of a neural network. We show that both the training and test performance of a network can be improved by maximizing GradCosine under norm constraint. Generalized from the sample-wise analysis into the real batch setting, NIO is able to automatically look for a better initialization with negligible cost.
arXiv Detail & Related papers (2022-10-12T06:49:16Z)
Learning to Optimize Quasi-Newton Methods [22.504971951262004]
This paper introduces a novel machine learning optimizer called LODO, which tries to online meta-learn the best preconditioner during optimization. Unlike other L2O methods, LODO does not require any meta-training on a training task distribution. We show that our gradient approximates the inverse Hessian in noisy loss landscapes and is capable of representing a wide range of inverse Hessians.
arXiv Detail & Related papers (2022-10-11T03:47:14Z)
Scaling Forward Gradient With Local Losses [117.22685584919756]
Forward learning is a biologically plausible alternative to backprop for learning deep neural networks. We show that it is possible to substantially reduce the variance of the forward gradient by applying perturbations to activations rather than weights. Our approach matches backprop on MNIST and CIFAR-10 and significantly outperforms previously proposed backprop-free algorithms on ImageNet.
arXiv Detail & Related papers (2022-10-07T03:52:27Z)
Powerpropagation: A sparsity inducing weight reparameterisation [65.85142037667065]
We introduce Powerpropagation, a new weight-parameterisation for neural networks that leads to inherently sparse models. Models trained in this manner exhibit similar performance, but have a distribution with markedly higher density at zero, allowing more parameters to be pruned safely. Here, we combine Powerpropagation with a traditional weight-pruning technique as well as recent state-of-the-art sparse-to-sparse algorithms, showing superior performance on the ImageNet benchmark.
arXiv Detail & Related papers (2021-10-01T10:03:57Z)
Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose. We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
