Curvature in the Looking-Glass: Optimal Methods to Exploit Curvature of Expectation in the Loss Landscape
- URL: http://arxiv.org/abs/2411.16914v1
- Date: Mon, 25 Nov 2024 20:32:57 GMT
- Title: Curvature in the Looking-Glass: Optimal Methods to Exploit Curvature of Expectation in the Loss Landscape
- Authors: Jed A. Duersch, Tommie A. Catanach, Alexander Safonov, Jeremy Wendt
- Abstract summary: We present a new conceptual framework to understand how curvature of expected changes in loss emerges in architectures with many rectified linear units.
Our derivations show how these discontinuities combine to form a glass-like structure, similar to amorphous solids that contain microscopic domains of strong, but random, atomic alignment.
We derive the optimal modification to quasi-Newton steps that incorporate both glass and Hessian terms, as well as certain exactness properties that are possible with Nesterov-accelerated gradient updates.
- Score: 41.94295877935867
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Harnessing the local topography of the loss landscape is a central challenge in advanced optimization tasks. By accounting for the effect of potential parameter changes, we can alter the model more efficiently. Contrary to standard assumptions, we find that the Hessian does not always approximate loss curvature well, particularly near gradient discontinuities, which commonly arise in deep learning architectures. We present a new conceptual framework to understand how curvature of expected changes in loss emerges in architectures with many rectified linear units. Each ReLU creates a parameter boundary that, when crossed, induces a pseudorandom gradient perturbation. Our derivations show how these discontinuities combine to form a glass-like structure, similar to amorphous solids that contain microscopic domains of strong, but random, atomic alignment. By estimating the density of the resulting gradient variations, we can bound how the loss may change with parameter movement. Our analysis includes the optimal kernel and sample distribution for approximating glass density from ordinary gradient evaluations. We also derive the optimal modification to quasi-Newton steps that incorporate both glass and Hessian terms, as well as certain exactness properties that are possible with Nesterov-accelerated gradient updates. Our algorithm, Alice, tests these techniques to determine which curvature terms are most impactful for training a given architecture and dataset. Additional safeguards enforce stable exploitation through step bounds that expand on the functionality of Adam. These theoretical and experimental tools lay groundwork to improve future efforts (e.g., pruning and quantization) by providing new insight into the loss landscape.
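The abstract's central observation — that each ReLU creates a parameter boundary whose crossing induces a gradient jump — can be illustrated on a one-parameter toy model. This is a minimal sketch of the discontinuity mechanism only (not the paper's glass-density estimator or the Alice algorithm); the loss L(w) = (relu(w*x) - y)^2 and its values are arbitrary illustrative choices.

```python
def relu(z):
    return max(z, 0.0)

def loss_grad(w, x=1.0, y=1.0):
    """Gradient of L(w) = (relu(w*x) - y)^2 with respect to w.

    The derivative of relu(pre) is 0 for pre < 0 and 1 for pre > 0,
    so the gradient jumps when the parameter boundary w*x = 0 is crossed."""
    pre = w * x
    act_grad = 1.0 if pre > 0 else 0.0
    return 2.0 * (relu(pre) - y) * act_grad * x

eps = 1e-6
g_below = loss_grad(-eps)   # ReLU inactive: gradient is exactly 0
g_above = loss_grad(+eps)   # ReLU active: gradient jumps to roughly -2
print(g_below, g_above)
```

An arbitrarily small parameter movement across the boundary changes the gradient by a finite amount, which is why a Hessian-based quadratic model (smooth by construction) cannot capture this curvature of expectation.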
Related papers
- Soft-Radial Projection for Constrained End-to-End Learning [2.3367876359631645]
We introduce Soft-Radial Projection, a differentiable reparameterization layer that circumvents gradient saturation. This construction guarantees strict feasibility while preserving a full-rank Jacobian almost everywhere. We empirically show improved convergence behavior and solution quality over state-of-the-art optimization- and projection-based baselines.
arXiv Detail & Related papers (2026-02-03T12:33:44Z) - Predictable Gradient Manifolds in Deep Learning: Temporal Path-Length and Intrinsic Rank as a Complexity Regime [0.0]
Empirically, gradients along training trajectories are often temporally predictable and evolve within a low-dimensional subspace. We formalize this observation through a measurable framework for predictable, low-dimensional gradients. We introduce new directions for adaptive gradients, rank-aware tracking, and prediction-based design grounded in measurable properties of real training runs.
arXiv Detail & Related papers (2026-01-07T11:23:55Z) - The Optimiser Hidden in Plain Sight: Training with the Loss Landscape's Induced Metric [0.0]
We present a class of novel optimisers for training neural networks. The new optimiser has a computational complexity comparable to that of Adam. One variant of these optimisers can also be viewed as inducing an effective scheduled learning rate.
arXiv Detail & Related papers (2025-09-03T18:00:33Z) - Curvature Learning for Generalization of Hyperbolic Neural Networks [51.888534247573894]
Hyperbolic neural networks (HNNs) have demonstrated notable efficacy in representing real-world data with hierarchical structures. Inappropriate curvatures may cause HNNs to converge to suboptimal parameters, degrading overall performance. We propose a sharpness-aware curvature learning method to smooth the loss landscape, thereby improving the generalization of HNNs.
arXiv Detail & Related papers (2025-08-24T07:14:30Z) - Data-Driven Adaptive Gradient Recovery for Unstructured Finite Volume Computations [0.0]
We present a novel data-driven approach for enhancing gradient reconstruction in unstructured finite volume methods for hyperbolic conservation laws. Our approach extends previous structured-grid methodologies to unstructured meshes through a modified DeepONet architecture. The proposed algorithm is faster and more accurate than the traditional second-order finite volume solver.
arXiv Detail & Related papers (2025-07-22T13:23:57Z) - Navigating loss manifolds via rigid body dynamics: A promising avenue for robustness and generalisation [11.729464930866483]
Training large neural networks through gradient-based optimization requires navigating high-dimensional loss landscapes. We propose an alternative that simultaneously reduces this dependence and avoids sharp minima.
arXiv Detail & Related papers (2025-05-26T05:26:21Z) - Rao-Blackwell Gradient Estimators for Equivariant Denoising Diffusion [55.95767828747407]
In domains such as molecular and protein generation, physical systems exhibit inherent symmetries that are critical to model. We present a framework that reduces training variance and provides a provably lower-variance gradient estimator. We also present a practical implementation of this estimator incorporating the loss and sampling procedure through a method we call Orbit Diffusion.
arXiv Detail & Related papers (2025-02-14T03:26:57Z) - Directional Smoothness and Gradient Methods: Convergence and Adaptivity [16.779513676120096]
We develop new sub-optimality bounds for gradient descent that depend on the conditioning of the objective along the path of optimization.
Key to our proofs is directional smoothness, a measure of gradient variation that we use to develop upper-bounds on the objective.
We prove that the Polyak step-size and normalized GD obtain fast, path-dependent rates despite using no knowledge of the directional smoothness.
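The Polyak step size mentioned above is simple to state: it scales each gradient step by the current suboptimality gap divided by the squared gradient norm, requiring knowledge of the optimal value f* but no smoothness constant. A minimal sketch on an assumed toy quadratic (the objective and starting point below are illustrative choices, not from the paper):

```python
import numpy as np

def polyak_gd(f, grad_f, f_star, x0, iters=60):
    """Gradient descent with the Polyak step size
    eta_t = (f(x_t) - f*) / ||grad f(x_t)||^2."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        g = grad_f(x)
        gnorm2 = float(g @ g)
        if gnorm2 == 0.0:
            break  # stationary point reached
        x = x - (f(x) - f_star) / gnorm2 * g
    return x

# Toy objective: f(x) = 0.5 ||x||^2 with minimum value f* = 0 at the origin.
f = lambda x: 0.5 * float(x @ x)
grad_f = lambda x: x
x_final = polyak_gd(f, grad_f, f_star=0.0, x0=[4.0, -2.0])
print(x_final)
```

On this quadratic the Polyak rule yields a constant step of 0.5, halving the distance to the optimum at every iteration without any tuned learning rate.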
arXiv Detail & Related papers (2024-03-06T22:24:05Z) - On the Dynamics Under the Unhinged Loss and Beyond [104.49565602940699]
We introduce the unhinged loss, a concise loss function, that offers more mathematical opportunities to analyze closed-form dynamics.
The unhinged loss allows for considering more practical techniques, such as time-varying learning rates and feature normalization.
arXiv Detail & Related papers (2023-12-13T02:11:07Z) - Stochastic Marginal Likelihood Gradients using Neural Tangent Kernels [78.6096486885658]
We introduce lower bounds to the linearized Laplace approximation of the marginal likelihood.
These bounds are amenable to gradient-based optimization and allow trading off estimation accuracy against computational complexity.
arXiv Detail & Related papers (2023-06-06T19:02:57Z) - Charting the Topography of the Neural Network Landscape with Thermal-Like Noise [0.0]
Training neural networks is a complex, high-dimensional, non-quadratic and noisy optimization problem.
We use Langevin dynamics methods to study a neural network trained on a random-data classification task.
We find that a low effective dimension of the landscape can be readily obtained from the fluctuations.
We explain this behavior by a simplified loss model which is analytically tractable and reproduces the observed fluctuation statistics.
arXiv Detail & Related papers (2023-04-03T20:01:52Z) - Are Gradients on Graph Structure Reliable in Gray-box Attacks? [56.346504691615934]
Previous gray-box attackers employ gradients from the surrogate model to locate the vulnerable edges to perturb the graph structure.
In this paper, we discuss and analyze the errors caused by the unreliability of the structural gradients.
We propose a novel attack model with methods to reduce the errors inside the structural gradients.
arXiv Detail & Related papers (2022-08-07T06:43:32Z) - Error-Correcting Neural Networks for Two-Dimensional Curvature
Computation in the Level-Set Method [0.0]
We present an error-neural-modeling-based strategy for approximating two-dimensional curvature in the level-set method.
Our main contribution is a redesigned hybrid solver that relies on numerical schemes to enable machine-learning operations on demand.
arXiv Detail & Related papers (2022-01-22T05:14:40Z) - Learning High-Precision Bounding Box for Rotated Object Detection via
Kullback-Leibler Divergence [100.6913091147422]
Existing rotated object detectors are mostly inherited from the horizontal detection paradigm.
In this paper, we are motivated to change the design of rotation regression loss from induction paradigm to deduction methodology.
arXiv Detail & Related papers (2021-06-03T14:29:19Z) - Improved Analysis of Clipping Algorithms for Non-convex Optimization [19.507750439784605]
Recently, Zhang et al. (2019) showed that clipped (stochastic) Gradient Descent (GD) converges faster than vanilla GD/SGD.
Experiments confirm the superiority of clipping-based methods in deep learning tasks.
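The clipping operation these methods analyze is straightforward: rescale the gradient whenever its norm exceeds a threshold, preserving its direction. A minimal sketch (the threshold and example vector are arbitrary illustrative choices):

```python
import numpy as np

def clip_by_norm(g, max_norm):
    """Rescale gradient g so its l2 norm does not exceed max_norm,
    preserving its direction."""
    norm = float(np.linalg.norm(g))
    if norm > max_norm:
        return g * (max_norm / norm)
    return g

g = np.array([3.0, 4.0])        # norm 5
clipped = clip_by_norm(g, 1.0)  # rescaled to unit norm, same direction
print(clipped)
```

Because the direction is unchanged, clipping acts as an adaptive step size that shrinks only when the gradient is large — the mechanism behind the faster convergence guarantees under relaxed smoothness assumptions.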
arXiv Detail & Related papers (2020-10-05T14:36:59Z) - Expectigrad: Fast Stochastic Optimization with Robust Convergence Properties [18.973116252065278]
We propose a novel method called Expectigrad, which adjusts step sizes according to a per-component unweighted mean of all historical gradients, with a momentum term computed jointly between the numerator and denominator.
We prove that Expectigrad cannot diverge on any instance of a gradient optimization problem known to cause Adam to diverge.
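The key departure from Adam described above — an unweighted arithmetic mean of all historical squared gradients in the denominator, instead of an exponential moving average — can be sketched as follows. This is an illustration of that one idea only, not the paper's full method (which also couples a momentum term between numerator and denominator); the learning rate and toy objective are assumptions.

```python
import numpy as np

def expectigrad_like_step(x, g, state, lr=0.1, eps=1e-8):
    """One Expectigrad-style update: the denominator uses the per-component
    *unweighted* mean of all historical squared gradients, rather than
    Adam's exponentially weighted average."""
    state["t"] += 1
    state["sum_sq"] += g * g                  # running sum of squared gradients
    mean_sq = state["sum_sq"] / state["t"]    # unweighted historical mean
    return x - lr * g / (np.sqrt(mean_sq) + eps)

# Minimize f(x) = 0.5 ||x||^2, whose gradient at x is x itself.
state = {"t": 0, "sum_sq": np.zeros(2)}
x = np.array([1.0, -1.0])
for _ in range(100):
    x = expectigrad_like_step(x, x.copy(), state)
print(x)
```

Because every past gradient keeps equal weight, the denominator cannot "forget" a large early gradient — the property that rules out the divergent oscillations Adam exhibits on its known counterexamples.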
arXiv Detail & Related papers (2020-10-03T13:34:27Z) - Cogradient Descent for Bilinear Optimization [124.45816011848096]
We introduce a Cogradient Descent algorithm (CoGD) to address the bilinear problem.
We solve one variable by considering its coupling relationship with the other, leading to a synchronous gradient descent.
Our algorithm is applied to solve problems with one variable under the sparsity constraint.
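The coupled-variable setting the entry above refers to can be seen in the simplest bilinear problem, where both factors are updated simultaneously from the shared residual. A minimal sketch of plain synchronous gradient descent in this setting (the scalar objective, target, and learning rate below are arbitrary toy choices, not CoGD itself):

```python
def bilinear_sync_gd(y=2.0, lr=0.05, iters=500):
    """Synchronous gradient descent on the scalar bilinear problem
    f(a, b) = (a*b - y)^2, updating both coupled variables at once."""
    a, b = 1.0, 0.5
    for _ in range(iters):
        r = a * b - y                     # shared bilinear residual
        ga, gb = 2 * r * b, 2 * r * a     # each gradient depends on the *other* variable
        a, b = a - lr * ga, b - lr * gb   # simultaneous update of the coupled pair
    return a, b

a, b = bilinear_sync_gd()
print(a * b)
```

Each variable's gradient is scaled by the other variable's current value, which is exactly the coupling relationship CoGD exploits to synchronize the two descent directions.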
arXiv Detail & Related papers (2020-06-16T13:41:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.