Rolling Ball Optimizer: Learning by ironing out loss landscape wrinkles
- URL: http://arxiv.org/abs/2505.19527v3
- Date: Fri, 24 Oct 2025 04:55:44 GMT
- Title: Rolling Ball Optimizer: Learning by ironing out loss landscape wrinkles
- Authors: Mohammed Djameleddine Belgoumri, Mohamed Reda Bouadjenek, Hakim Hacid, Imran Razzak, Sunil Aryal
- Abstract summary: Training large neural networks (NNs) requires optimizing high-dimensional data-dependent loss functions. These functions are often highly complex and textured, even fractal-like. Noise in the training data can propagate forward and give rise to unrepresentative small-scale geometry.
- Score: 19.667068548957143
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Training large neural networks (NNs) requires optimizing high-dimensional data-dependent loss functions. The optimization landscape of these functions is often highly complex and textured, even fractal-like, with many spurious local minima, ill-conditioned valleys, degenerate points, and saddle points. Complicating things further is the fact that these landscape characteristics are a function of the data, meaning that noise in the training data can propagate forward and give rise to unrepresentative small-scale geometry. This poses a difficulty for gradient-based optimization methods, which rely on local geometry to compute updates and are, therefore, vulnerable to being derailed by noisy data. In practice, this translates to a strong dependence of the optimization dynamics on the noise in the data, i.e., poor generalization performance. To remedy this problem, we propose a new optimization procedure: the Rolling Ball Optimizer (RBO), which breaks this spatial locality by incorporating information from a larger region of the loss landscape in its updates. We achieve this by simulating the motion of a rigid sphere of finite radius rolling on the loss landscape, a straightforward generalization of Gradient Descent (GD) that reduces to it in the infinitesimal limit. The radius serves as a hyperparameter that determines the scale at which RBO sees the loss landscape, allowing control over the granularity of its interaction therewith. We are motivated by the intuition that the large-scale geometry of the loss landscape is less data-specific than its fine-grained structure, and that it is easier to optimize. We support this intuition by proving that our algorithm has a smoothing effect on the loss function. Evaluation against SGD, SAM, and Entropy-SGD, on MNIST and CIFAR-10/100 demonstrates promising results in terms of convergence speed, training accuracy, and generalization performance.
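The abstract describes RBO as a finite-radius generalization of GD that reduces to plain GD in the infinitesimal limit and provably smooths the loss. The paper's exact rigid-sphere dynamics are not reproduced here; the following is a minimal sketch of the stated behavior that approximates the smoothing by averaging gradients over a sphere of the chosen radius. The names `rbo_like_step`, `grad_fn`, and `n_samples` are illustrative, not taken from the paper.

```python
import numpy as np

def rbo_like_step(params, grad_fn, radius, lr, n_samples=16, rng=None):
    # Hypothetical sketch: RBO is described as having a smoothing effect
    # controlled by the sphere radius.  Here we approximate that effect by
    # averaging gradients sampled on a sphere of that radius (randomized
    # smoothing); this is NOT the authors' rigid-body simulation.
    rng = np.random.default_rng() if rng is None else rng
    if radius == 0.0:
        # Infinitesimal limit: the update reduces to plain gradient descent.
        return params - lr * grad_fn(params)
    g = np.zeros_like(params)
    for _ in range(n_samples):
        u = rng.normal(size=params.shape)
        u *= radius / np.linalg.norm(u)  # random direction, fixed radius
        g += grad_fn(params + u)
    return params - lr * g / n_samples
```

A larger `radius` makes the update depend on a wider region of the landscape, mirroring the abstract's description of the radius as the scale at which the optimizer "sees" the loss.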
Related papers
- Neural network optimization strategies and the topography of the loss landscape [45.88028371034407]
We investigate neural network learning by stochastic gradient descent (SGD). We use several computational tools to investigate the neural network parameters obtained by these optimization methods.
arXiv Detail & Related papers (2026-02-24T17:49:13Z) - GeoRA: Geometry-Aware Low-Rank Adaptation for RLVR [10.820638016337869]
We propose GeoRA, which exploits the anisotropic and compressible nature of RL update subspaces. GeoRA mitigates optimization bottlenecks caused by geometric misalignment. It consistently outperforms established low-rank baselines on key mathematical benchmarks.
arXiv Detail & Related papers (2026-01-14T10:41:34Z) - The Optimiser Hidden in Plain Sight: Training with the Loss Landscape's Induced Metric [0.0]
We present a class of novel optimisers for training neural networks. The new optimiser has a computational complexity comparable to that of Adam. One variant of these optimisers can also be viewed as inducing an effective scheduled learning rate.
arXiv Detail & Related papers (2025-09-03T18:00:33Z) - Scaling Collapse Reveals Universal Dynamics in Compute-Optimally Trained Neural Networks [59.552873049024775]
We show that compute-optimally trained models exhibit a remarkably precise universality. With learning rate decay, the collapse becomes so tight that differences in the normalized curves across models fall below the noise floor. We explain these phenomena by connecting collapse to the power-law structure in typical neural scaling laws.
arXiv Detail & Related papers (2025-07-02T20:03:34Z) - QuickSplat: Fast 3D Surface Reconstruction via Learned Gaussian Initialization [69.50126552763157]
Surface reconstruction is fundamental to computer vision and graphics, enabling applications in 3D modeling, mixed reality, robotics, and more. Existing approaches based on rendering obtain promising results, but optimize on a per-scene basis, resulting in a slow optimization that can struggle to model textureless regions. We introduce QuickSplat, which learns data-driven priors to generate dense initializations for 2D Gaussian splatting optimization of large-scale indoor scenes.
arXiv Detail & Related papers (2025-05-08T18:43:26Z) - Decentralized Nonconvex Composite Federated Learning with Gradient Tracking and Momentum [78.27945336558987]
Decentralized federated learning (DFL) eliminates reliance on a central server. Non-smooth regularization is often incorporated into machine learning tasks. We propose a novel DNCFL algorithm to solve these problems.
arXiv Detail & Related papers (2025-04-17T08:32:25Z) - Can Stability be Detrimental? Better Generalization through Gradient Descent Instabilities [14.741581246137404]
We show that instabilities induced by large learning rates move model parameters toward flatter regions of the loss landscape. We find these instabilities lead to excellent generalization performance on modern benchmark datasets.
arXiv Detail & Related papers (2024-12-23T14:32:53Z) - Deep Loss Convexification for Learning Iterative Models [11.36644967267829]
Iterative methods such as iterative closest point (ICP) for point cloud registration often suffer from bad local optimality.
We propose learning to form a convex landscape around each ground truth.
arXiv Detail & Related papers (2024-11-16T01:13:04Z) - CityGaussianV2: Efficient and Geometrically Accurate Reconstruction for Large-Scale Scenes [53.107474952492396]
CityGaussianV2 is a novel approach for large-scale scene reconstruction. We implement a decomposed-gradient-based densification and depth regression technique to eliminate blurry artifacts and accelerate convergence. Our method strikes a promising balance between visual quality, geometric accuracy, as well as storage and training costs.
arXiv Detail & Related papers (2024-11-01T17:59:31Z) - Dynamical loss functions shape landscape topography and improve learning in artificial neural networks [0.9208007322096533]
We show how to transform cross-entropy and mean squared error into dynamical loss functions.
We show how they significantly improve validation accuracy for networks of varying sizes.
arXiv Detail & Related papers (2024-10-14T16:27:03Z) - On the Dynamics Under the Unhinged Loss and Beyond [104.49565602940699]
We introduce the unhinged loss, a concise loss function that offers more mathematical opportunities to analyze closed-form dynamics.
The unhinged loss allows for considering more practical techniques, such as time-varying learning rates and feature normalization.
arXiv Detail & Related papers (2023-12-13T02:11:07Z) - Gradient constrained sharpness-aware prompt learning for vision-language models [99.74832984957025]
This paper targets a novel trade-off problem in generalizable prompt learning for vision-language models (VLMs).
By analyzing the loss landscapes of the state-of-the-art method and vanilla Sharpness-aware Minimization (SAM) based method, we conclude that the trade-off performance correlates to both loss value and loss sharpness.
We propose a novel SAM-based method for prompt learning, denoted as Gradient Constrained Sharpness-aware Context Optimization (GCSCoOp).
arXiv Detail & Related papers (2023-09-14T17:13:54Z) - Stabilizing Transformer Training by Preventing Attention Entropy Collapse [56.45313891694746]
We investigate the training dynamics of Transformers by examining the evolution of the attention layers.
We show that $\sigma$Reparam successfully prevents entropy collapse in the attention layers, promoting more stable training.
We conduct experiments with $\sigma$Reparam on image classification, image self-supervised learning, machine translation, speech recognition, and language modeling tasks.
arXiv Detail & Related papers (2023-03-11T03:30:47Z) - RAGO: Recurrent Graph Optimizer For Multiple Rotation Averaging [62.315673415889314]
This paper proposes a deep recurrent Rotation Averaging Graph Optimizer (RAGO) for Multiple Rotation Averaging (MRA).
Our framework is a real-time learning-to-optimize rotation averaging graph with a tiny model size, deployable in real-world applications.
arXiv Detail & Related papers (2022-12-14T13:19:40Z) - Understanding and Combating Robust Overfitting via Input Loss Landscape Analysis and Regularization [5.1024659285813785]
Adversarial training is prone to overfitting, and the cause is far from clear.
We find that robust overfitting results from standard training, specifically the minimization of the clean loss.
We propose a new regularizer to smooth the loss landscape by penalizing the weighted logits variation along the adversarial direction.
arXiv Detail & Related papers (2022-12-09T16:55:30Z) - Adaptive Self-supervision Algorithms for Physics-informed Neural Networks [59.822151945132525]
Physics-informed neural networks (PINNs) incorporate physical knowledge from the problem domain as a soft constraint on the loss function.
We study the impact of the location of the collocation points on the trainability of these models.
We propose a novel adaptive collocation scheme which progressively allocates more collocation points to areas where the model is making higher errors.
arXiv Detail & Related papers (2022-07-08T18:17:06Z) - Adapting Stepsizes by Momentumized Gradients Improves Optimization and Generalization [89.66571637204012]
AdaMomentum performs well on vision tasks, and achieves state-of-the-art results consistently on other tasks including language processing.
arXiv Detail & Related papers (2021-06-22T03:13:23Z) - Tilting the playing field: Dynamical loss functions for machine learning [18.831125493827766]
We show that learning can be improved by using loss functions that evolve cyclically during training to emphasize one class at a time.
Improvement arises from the interplay of the changing loss landscape with the dynamics of the system as it evolves to minimize the loss.
arXiv Detail & Related papers (2021-02-07T13:15:08Z) - Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z) - The Break-Even Point on Optimization Trajectories of Deep Neural Networks [64.7563588124004]
We argue for the existence of the "break-even" point on this trajectory.
We show that using a large learning rate in the initial phase of training reduces the variance of the gradient.
We also show that using a low learning rate results in bad conditioning of the loss surface even for a neural network with batch normalization layers.
arXiv Detail & Related papers (2020-02-21T22:55:51Z)
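Sharpness-aware minimization recurs throughout this list: SAM is a baseline in the headline paper and the foundation of the GCSCoOp prompt-learning method above. For reference, a minimal single-step sketch of vanilla SAM in its standard two-pass formulation (the function and variable names are illustrative):

```python
import numpy as np

def sam_step(params, grad_fn, rho=0.05, lr=0.1):
    # Vanilla SAM: first ascend to an approximate worst-case point within
    # an L2 ball of radius rho, then apply the gradient computed at that
    # perturbed point to the original parameters.
    g = grad_fn(params)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # normalized ascent step
    g_sharp = grad_fn(params + eps)              # gradient at perturbed point
    return params - lr * g_sharp
```

Unlike the radius in RBO, which sets the spatial scale at which the landscape is smoothed, SAM's `rho` sets the radius of the worst-case neighborhood whose sharpness is penalized.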
This list is automatically generated from the titles and abstracts of the papers in this site.