Training Diagonal Linear Networks with Stochastic Sharpness-Aware Minimization
- URL: http://arxiv.org/abs/2503.11891v1
- Date: Fri, 14 Mar 2025 21:45:12 GMT
- Title: Training Diagonal Linear Networks with Stochastic Sharpness-Aware Minimization
- Authors: Gabriel Clara, Sophie Langer, Johannes Schmidt-Hieber
- Abstract summary: We analyze the landscape and training dynamics of diagonal linear networks in a linear regression task. We prove several results that relate the action of the added parameter noise on the underlying landscape and training dynamics to the sharpness of the loss.
- Score: 7.032245866317619
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We analyze the landscape and training dynamics of diagonal linear networks in a linear regression task, with the network parameters being perturbed by small isotropic normal noise. The addition of such noise may be interpreted as a stochastic form of sharpness-aware minimization (SAM) and we prove several results that relate its action on the underlying landscape and training dynamics to the sharpness of the loss. In particular, the noise changes the expected gradient to force balancing of the weight matrices at a fast rate along the descent trajectory. In the diagonal linear model, we show that this equates to minimizing the average sharpness, as well as the trace of the Hessian matrix, among all possible factorizations of the same matrix. Further, the noise forces the gradient descent iterates towards a shrinkage-thresholding of the underlying true parameter, with the noise level explicitly regulating both the shrinkage factor and the threshold.
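The setting of the abstract can be sketched in a few lines of NumPy: a diagonal linear network with effective weights w = u * v, trained on a linear regression loss with the gradient evaluated at parameters perturbed by small isotropic normal noise. The step size, noise level, initialization, and data below are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear regression task: y = X @ w_true (noiseless, sparse truth).
n, d = 200, 20
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:3] = [2.0, -1.5, 1.0]
y = X @ w_true

# Diagonal linear network: effective regression weights w = u * v (elementwise),
# deliberately initialized in an unbalanced way (u_i^2 != v_i^2).
u = np.full(d, 0.5)
v = np.full(d, 0.02)

def grads(u, v):
    """Gradients of 0.5/n * ||X (u*v) - y||^2 with respect to u and v."""
    g_w = X.T @ (X @ (u * v) - y) / n
    return g_w * v, g_w * u

# Step size, noise level, and iteration count are illustrative choices.
eta, sigma, steps = 0.05, 0.1, 5000
print("max |u^2 - v^2| at init:", round(np.abs(u**2 - v**2).max(), 4))
for _ in range(steps):
    # Stochastic SAM as described in the abstract: evaluate the gradient at
    # parameters perturbed by small isotropic normal noise.
    g_u, g_v = grads(u + sigma * rng.normal(size=d), v + sigma * rng.normal(size=d))
    u -= eta * g_u
    v -= eta * g_v

# Per the paper, the noise drives the factors toward balance (u_i^2 - v_i^2 -> 0)
# and w = u * v toward a shrinkage-thresholded version of w_true.
print("max |u^2 - v^2| at end :", round(np.abs(u**2 - v**2).max(), 4))
print("recovered w:", np.round(u * v, 2))
```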
Related papers
- Verification of Geometric Robustness of Neural Networks via Piecewise Linear Approximation and Lipschitz Optimisation [57.10353686244835]
We address the problem of verifying neural networks against geometric transformations of the input image, including rotation, scaling, shearing, and translation.
The proposed method computes provably sound piecewise linear constraints for the pixel values by using sampling and linear approximations in combination with branch-and-bound Lipschitz optimisation.
We show that our proposed implementation resolves up to 32% more verification cases than present approaches.
arXiv Detail & Related papers (2024-08-23T15:02:09Z) - On the Convergence of Gradient Descent for Large Learning Rates [55.33626480243135]
We show that convergence is impossible when a fixed, large step size is used. We provide a proof of this in the case of linear neural networks with a squared loss. We also prove the impossibility of convergence for more general losses without requiring strong assumptions such as Lipschitz continuity of the gradient.
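A standard worked example (textbook material, not taken from the paper) shows the mechanism behind the fixed-step-size obstruction: on the quadratic f(x) = (L/2)x^2, gradient descent with step size eta > 2/L multiplies the iterate by |1 - eta*L| > 1 at every step, so it cannot converge.

```python
# Gradient descent on f(x) = (L/2) * x^2 with a fixed step size eta > 2/L:
# x_{t+1} = x_t - eta * L * x_t = (1 - eta * L) * x_t, and |1 - eta * L| > 1,
# so the iterates oscillate in sign and grow without bound.
L, eta, x = 1.0, 2.5, 1.0   # eta = 2.5 > 2/L = 2.0
for t in range(5):
    x -= eta * L * x        # gradient step; f'(x) = L * x
    print(t, x)             # -1.5, 2.25, -3.375, 5.0625, -7.59375
```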
arXiv Detail & Related papers (2024-02-20T16:01:42Z) - Parameter Symmetry and Noise Equilibrium of Stochastic Gradient Descent [8.347295051171525]
We show that gradient noise creates a systematic interplay of parameters $\theta$ along the degenerate direction to a unique initialization-independent fixed point $\theta^*$.
These points are referred to as noise equilibria because, at these points, noise contributions from different directions are balanced and aligned.
We show that the balance and alignment of gradient noise can serve as a novel alternative mechanism for explaining important phenomena such as progressive sharpening/flattening and representation formation within neural networks.
arXiv Detail & Related papers (2024-02-11T13:00:04Z) - Robust Stochastically-Descending Unrolled Networks [85.6993263983062]
Deep unrolling is an emerging learning-to-optimize method that unrolls a truncated iterative algorithm into the layers of a trainable neural network.
Convergence guarantees and generalizability of the unrolled networks, however, remain open theoretical problems.
We numerically assess unrolled architectures trained under the proposed constraints in two different applications.
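As a concrete picture of deep unrolling (a minimal sketch, not the architectures or constraints studied in the paper): a few iterations of ISTA for sparse recovery are written as the layers of a forward pass, with per-layer step sizes and thresholds playing the role of the trainable parameters.

```python
import numpy as np

def soft_threshold(z, tau):
    """Proximal operator of tau * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def unrolled_ista(A, y, step_sizes, thresholds):
    """Forward pass of an unrolled ISTA 'network'.

    Each layer is one ISTA iteration; step_sizes and thresholds are the
    per-layer parameters that deep unrolling would train (kept fixed here).
    """
    x = np.zeros(A.shape[1])
    for eta, tau in zip(step_sizes, thresholds):
        x = soft_threshold(x - eta * A.T @ (A @ x - y), eta * tau)
    return x

rng = np.random.default_rng(0)
A = rng.normal(size=(30, 100))
A /= np.linalg.norm(A, axis=0)                 # unit-norm columns
x_true = np.zeros(100)
x_true[[3, 17, 42]] = [1.0, -2.0, 0.5]
y = A @ x_true

n_layers = 15
eta = 1.0 / np.linalg.norm(A, 2) ** 2          # classical safe ISTA step size
x_hat = unrolled_ista(A, y, [eta] * n_layers, [0.1] * n_layers)
print("relative error after", n_layers, "layers:",
      round(float(np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true)), 3))
```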
arXiv Detail & Related papers (2023-12-25T18:51:23Z) - A Theoretical Analysis of Noise Geometry in Stochastic Gradient Descent [9.064667124987068]
Minibatch stochastic gradient descent (SGD) exhibits a noise-geometry phenomenon in which the gradient noise aligns favorably with the geometry of the local loss landscape.
We propose two metrics, derived from analyzing how noise influences the loss and subspace projection dynamics, to quantify the alignment strength.
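The flavor of such an alignment measure can be illustrated on linear regression (an illustrative score, not necessarily either of the two metrics proposed in the paper): compare the covariance Sigma of the per-example gradient noise with the Hessian H of the loss, normalized against an isotropic baseline.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 10
X = rng.normal(size=(n, d)) * np.linspace(0.2, 3.0, d)   # anisotropic features
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

w = np.zeros(d)                                  # evaluate away from the optimum
per_example_grads = (X @ w - y)[:, None] * X     # gradients of 0.5*(x_i @ w - y_i)^2
Sigma = np.cov(per_example_grads.T)              # gradient-noise covariance (batch size 1, up to scaling)
H = X.T @ X / n                                  # Hessian of the average squared loss

# One plausible alignment score: tr(Sigma @ H) relative to an isotropic noise
# covariance with the same total variance; values above 1 indicate that the
# noise is preferentially aligned with high-curvature directions.
score = np.trace(Sigma @ H) / (np.trace(Sigma) * np.trace(H) / d)
print("alignment score:", round(float(score), 2))
```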
arXiv Detail & Related papers (2023-10-01T14:58:20Z) - Implicit Regularization for Group Sparsity [33.487964460794764]
We show that gradient descent over the squared regression loss, without any explicit regularization, biases towards solutions with a group sparsity structure.
We analyze the gradient dynamics of the corresponding regression problem in the general noise setting and obtain minimax-optimal error rates.
In the degenerate case of size-one groups, our approach gives rise to a new algorithm for sparse linear regression.
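The size-one-group case mentioned above is the classical Hadamard-product reparametrization of sparse regression. For groups, the sketch below uses one parametrization with the same flavor, a shared scalar magnitude per group times a direction vector, trained by plain gradient descent from small initialization; this parametrization is an illustrative assumption, not necessarily the one analyzed in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Underdetermined regression: 20 groups of 5 features, only group 0 active.
n, n_groups, gsize = 30, 20, 5
d = n_groups * gsize
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:gsize] = [1.0, -0.5, 0.8, 0.3, -0.2]
y = X @ w_true                                   # noiseless; many interpolants exist

# Assumed group reparametrization: w_g = a_g * b_g with a scalar magnitude a_g
# and a direction vector b_g per group, both started near zero.
a = np.full(n_groups, 0.01)
b = np.full((n_groups, gsize), 0.01)

eta = 0.02
for _ in range(20000):
    w = (a[:, None] * b).ravel()
    g_w = (X.T @ (X @ w - y) / n).reshape(n_groups, gsize)
    grad_a = np.sum(g_w * b, axis=1)             # dL/da_g = <dL/dw_g, b_g>
    grad_b = g_w * a[:, None]                    # dL/db_g = a_g * dL/dw_g
    a -= eta * grad_a
    b -= eta * grad_b

# With small initialization and no explicit penalty, the weight norm should
# concentrate on the active group, illustrating the implicit group-sparsity bias.
print("group norms:", np.round(np.linalg.norm(a[:, None] * b, axis=1), 3))
```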
arXiv Detail & Related papers (2023-01-29T20:54:03Z) - The Dynamics of Sharpness-Aware Minimization: Bouncing Across Ravines
and Drifting Towards Wide Minima [41.961056785108845]
We consider Sharpness-Aware Minimization, a gradient-based optimization method for deep networks.
We show that when SAM is applied with a convex quadratic objective, it converges to a cycle that oscillates between either side of the minimum in the direction with the largest curvature.
In the non-quadratic case, we show that such oscillations effectively perform gradient descent, with a smaller step-size, on the spectral norm of the Hessian.
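The oscillatory behavior on a convex quadratic is easy to reproduce with the usual full-batch SAM update, x <- x - eta * grad f(x + rho * grad f(x) / ||grad f(x)||); a minimal sketch with illustrative parameters:

```python
import numpy as np

lam = np.array([10.0, 1.0])                  # curvatures of f(x) = 0.5 * sum(lam * x**2)
grad = lambda x: lam * x

eta, rho = 0.05, 0.1                         # illustrative step size and SAM radius
x = np.array([1.0, 1.0])
for t in range(500):
    g = grad(x)
    x_adv = x + rho * g / (np.linalg.norm(g) + 1e-12)   # normalized ascent step
    x = x - eta * grad(x_adv)                           # descent at the perturbed point
    if t >= 498:
        # Late iterates flip sign each step along the high-curvature coordinate,
        # forming a cycle that straddles the minimum; the low-curvature
        # coordinate has essentially converged to zero.
        print(t, np.round(x, 4))
```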
arXiv Detail & Related papers (2022-10-04T10:34:37Z) - Memory-Efficient Backpropagation through Large Linear Layers [107.20037639738433]
In modern neural networks like Transformers, linear layers require significant memory to store activations for the backward pass.
This study proposes a memory reduction approach to perform backpropagation through linear layers.
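One generic way to trade activation memory for gradient noise in a linear layer Y = XW (a hedged sketch of the idea, not necessarily the estimator proposed in the paper): since the weight gradient is X^T dL/dY, store only a random sketch SX of the input during the forward pass and use (SX)^T (S dL/dY) in the backward pass, which is unbiased whenever E[S^T S] = I.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, d_in, d_out, k = 8192, 64, 32, 1024     # keep only k << batch sketched rows

X = rng.normal(size=(batch, d_in))             # layer input (activation to be stored)
M = rng.normal(size=(d_in, d_out))
G = X @ M                                      # stand-in for the upstream gradient dL/dY

# Gaussian sketch with E[S^T S] = I; in practice only a random seed would be
# stored so that S can be regenerated during the backward pass.
S = rng.normal(size=(k, batch)) / np.sqrt(k)

grad_W_exact = X.T @ G                         # what standard backprop computes
X_sketch = S @ X                               # (k x d_in): the only stored activation
grad_W_approx = X_sketch.T @ (S @ G)           # unbiased but noisy estimate of X^T G

rel_err = np.linalg.norm(grad_W_approx - grad_W_exact) / np.linalg.norm(grad_W_exact)
print(f"stored rows: {k}/{batch}, relative error of gradient estimate: {rel_err:.3f}")
# The estimate is unbiased; its variance shrinks as the sketch size k grows,
# and stochastic training can tolerate the remaining noise.
```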
arXiv Detail & Related papers (2022-01-31T13:02:41Z) - Differentiable Annealed Importance Sampling and the Perils of Gradient
Noise [68.44523807580438]
Annealed importance sampling (AIS) and related algorithms are highly effective tools for marginal likelihood estimation.
Differentiability is a desirable property as it would admit the possibility of optimizing marginal likelihood as an objective.
We propose a differentiable algorithm by abandoning Metropolis-Hastings steps, which further unlocks mini-batch computation.
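A one-dimensional sketch of this idea (illustrative, not the paper's algorithm): standard AIS weight updates combined with unadjusted Langevin transitions in place of Metropolis-Hastings corrected kernels, so every step is differentiable with respect to parameters of the target, at the cost of a small discretization bias.

```python
import numpy as np

rng = np.random.default_rng(0)

# Initial distribution: standard normal (normalized). Unnormalized target:
# Gaussian with mean 2 and std 0.5, so the true log normalizer is known.
log_p0 = lambda x: -0.5 * x**2 - 0.5 * np.log(2 * np.pi)
log_gamma1 = lambda x: -0.5 * ((x - 2.0) / 0.5) ** 2
true_logZ = np.log(np.sqrt(2 * np.pi) * 0.5)

K, N, step = 300, 2000, 0.02                   # temperatures, particles, Langevin step
betas = np.linspace(0.0, 1.0, K + 1)
log_gamma = lambda x, b: (1 - b) * log_p0(x) + b * log_gamma1(x)
grad_log_gamma = lambda x, b: (1 - b) * (-x) + b * (-(x - 2.0) / 0.25)

x = rng.normal(size=N)                         # exact samples from p0
log_w = np.zeros(N)
for k in range(1, K + 1):
    log_w += log_gamma(x, betas[k]) - log_gamma(x, betas[k - 1])   # AIS weight update
    # Unadjusted Langevin step targeting the k-th annealed density; no
    # Metropolis-Hastings accept/reject, so the whole procedure stays
    # differentiable with respect to parameters of the target.
    x = x + 0.5 * step * grad_log_gamma(x, betas[k]) + np.sqrt(step) * rng.normal(size=N)

m = log_w.max()
print("AIS estimate of log Z:", round(float(m + np.log(np.mean(np.exp(log_w - m)))), 3))
print("analytic log Z       :", round(float(true_logZ), 3))
```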
arXiv Detail & Related papers (2021-07-21T17:10:14Z) - On regularization of gradient descent, layer imbalance and flat minima [9.08659783613403]
We analyze the training dynamics for deep linear networks using a new metric - imbalance - which defines the flatness of a solution.
We demonstrate that different regularization methods, such as weight decay or noise data augmentation, behave in a similar way.
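A minimal sketch of an imbalance metric for a deep linear network f(x) = W_L ... W_1 x (one common definition; the paper's exact metric may differ): the quantities W_{i+1}^T W_{i+1} - W_i W_i^T are conserved by plain gradient flow, and regularizers such as weight decay or noise push them toward zero, i.e. toward balanced factorizations.

```python
import numpy as np

def layer_imbalance(weights):
    """||W_{i+1}^T W_{i+1} - W_i W_i^T||_F for consecutive layers of a deep
    linear network (assumed definition; conserved under plain gradient flow)."""
    return [
        float(np.linalg.norm(W2.T @ W2 - W1 @ W1.T))
        for W1, W2 in zip(weights[:-1], weights[1:])
    ]

rng = np.random.default_rng(0)
dims = [8, 16, 16, 4]                          # input, two hidden, output widths
Ws = [rng.normal(size=(dims[i + 1], dims[i])) for i in range(len(dims) - 1)]
print("imbalance of a random factorization:", np.round(layer_imbalance(Ws), 1))

# A balanced factorization of the same end-to-end map has zero imbalance:
# for two layers, take M = U S V^T and set W1 = sqrt(S) V^T, W2 = U sqrt(S).
M = Ws[1] @ Ws[0]
U, s, Vt = np.linalg.svd(M, full_matrices=False)
W1b, W2b = np.diag(np.sqrt(s)) @ Vt, U @ np.diag(np.sqrt(s))
print("imbalance of the balanced pair:", np.round(layer_imbalance([W1b, W2b]), 8))
```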
arXiv Detail & Related papers (2020-07-18T00:09:14Z) - Path Sample-Analytic Gradient Estimators for Stochastic Binary Networks [78.76880041670904]
In neural networks with binary activations and/or binary weights, training by gradient descent is complicated.
We propose a new method for this estimation problem combining sampling and analytic approximation steps.
We experimentally show higher accuracy in gradient estimation and demonstrate a more stable and better performing training in deep convolutional models.
arXiv Detail & Related papers (2020-06-04T21:51:21Z) - On the Convex Behavior of Deep Neural Networks in Relation to the
Layers' Width [99.24399270311069]
We observe that for wider networks, gradient descent on the loss maneuvers through surfaces of positive curvature at the start and end of training, and through surfaces of near-zero curvature in between.
In other words, it seems that during crucial parts of the training process, the Hessian in wide networks is dominated by the component G.
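For reference, "the component G" presumably denotes the Gauss-Newton term in the standard decomposition of the Hessian of a composed loss l(f(x; theta), y); the decomposition below is textbook material rather than something taken from the paper.

```latex
% Hessian of L(\theta) = \tfrac{1}{n}\sum_i \ell\bigl(f(x_i;\theta), y_i\bigr),
% with network Jacobians J_i = \partial f(x_i;\theta)/\partial\theta:
\nabla^2_\theta L(\theta)
  = \underbrace{\frac{1}{n}\sum_i J_i^\top \bigl(\nabla^2_f \ell\bigr)\, J_i}_{G:\ \text{Gauss--Newton term, PSD for convex } \ell}
  \;+\; \frac{1}{n}\sum_i \sum_k \frac{\partial \ell}{\partial f_k}\,\nabla^2_\theta f_k(x_i;\theta)
```

The second term is weighted by the residual derivatives of the loss and shrinks as the fit improves, which is consistent with the observation above that the Hessian is dominated by G during much of training.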
arXiv Detail & Related papers (2020-01-14T16:30:01Z)