Related papers: A Minimalist Example of Edge-of-Stability and Progressive Sharpening

A Minimalist Example of Edge-of-Stability and Progressive Sharpening

URL: http://arxiv.org/abs/2503.02809v1
Date: Tue, 04 Mar 2025 17:35:13 GMT
Title: A Minimalist Example of Edge-of-Stability and Progressive Sharpening
Authors: Liming Liu, Zixuan Zhang, Simon Du, Tuo Zhao,
Abstract summary: Edge of Stability (EoS) and Progressive Sharpening (PS) are challenging classical Gradient Descent analyses.<n>This paper introduces a two-layer network with a two-dimensional input, where one dimension is relevant to the response and the other is irrelevant.<n>We prove the existence of progressive sharpening and self-stabilization under large learning rates, and establish non-asymptotic analysis of the training dynamics and sharpness.
Score: 40.35175786562617
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent advances in deep learning optimization have unveiled two intriguing phenomena under large learning rates: Edge of Stability (EoS) and Progressive Sharpening (PS), challenging classical Gradient Descent (GD) analyses. Current research approaches, using either generalist frameworks or minimalist examples, face significant limitations in explaining these phenomena. This paper advances the minimalist approach by introducing a two-layer network with a two-dimensional input, where one dimension is relevant to the response and the other is irrelevant. Through this model, we rigorously prove the existence of progressive sharpening and self-stabilization under large learning rates, and establish non-asymptotic analysis of the training dynamics and sharpness along the entire GD trajectory. Besides, we connect our minimalist example to existing works by reconciling the existence of a well-behaved ``stable set" between minimalist and generalist analyses, and extending the analysis of Gradient Flow Solution sharpness to our two-dimensional input scenario. These findings provide new insights into the EoS phenomenon from both parameter and input data distribution perspectives, potentially informing more effective optimization strategies in deep learning practice.

Related papers

Understanding Sharpness Dynamics in NN Training with a Minimalist Example: The Effects of Dataset Difficulty, Depth, Stochasticity, and More [10.65078014704416]
When training deep neural networks with sharpness often increases, before saturating at the edge of stability.<n>In this work, we study this phenomenon using a minimalist model: a deep linear network with a single neuron per layer.<n>We show that this simple model effectively captures the sharpness dynamics observed in recent empirical studies, offering a simple testbed to better understand neural network training.
arXiv Detail & Related papers (2025-06-07T22:35:13Z)
Deep Learning Through A Telescoping Lens: A Simple Model Provides Empirical Insights On Grokking, Gradient Boosting & Beyond [61.18736646013446]
In pursuit of a deeper understanding of its surprising behaviors, we investigate the utility of a simple yet accurate model of a trained neural network. Across three case studies, we illustrate how it can be applied to derive new empirical insights on a diverse range of prominent phenomena.
arXiv Detail & Related papers (2024-10-31T22:54:34Z)
Outliers with Opposing Signals Have an Outsized Effect on Neural Network Optimization [36.72245290832128]
We identify a new phenomenon in neural network optimization which arises from the interaction of depth and a heavytailed structure in natural data. In particular, it implies a conceptually new cause for progressive sharpening and the edge of stability. We demonstrate the significant influence of paired groups of outliers in the training data with strong opposing signals.
arXiv Detail & Related papers (2023-11-07T17:43:50Z)
Universal Sharpness Dynamics in Neural Network Training: Fixed Point Analysis, Edge of Stability, and Route to Chaos [5.854190253899593]
In gradient descent dynamics of neural networks, the top eigenvalue of the loss Hessian (sharpness) displays a variety of robust phenomena throughout training.<n>We demonstrate that a simple $2$-layer linear network (UV model) trained on a single training example exhibits all of the essential sharpness phenomenology observed in real-world scenarios.
arXiv Detail & Related papers (2023-11-03T17:59:40Z)
Gradient constrained sharpness-aware prompt learning for vision-language models [99.74832984957025]
This paper targets a novel trade-off problem in generalizable prompt learning for vision-language models (VLM) By analyzing the loss landscapes of the state-of-the-art method and vanilla Sharpness-aware Minimization (SAM) based method, we conclude that the trade-off performance correlates to both loss value and loss sharpness. We propose a novel SAM-based method for prompt learning, denoted as Gradient Constrained Sharpness-aware Context Optimization (GCSCoOp)
arXiv Detail & Related papers (2023-09-14T17:13:54Z)
Trajectory Alignment: Understanding the Edge of Stability Phenomenon via Bifurcation Theory [14.141453107129403]
We study the evolution of the largest eigenvalue of the loss Hessian, also known as sharpness, along the gradient descent trajectory. The sharpness increases at the early phase of training, and eventually saturates close to the threshold of $2 / text(step size)$.
arXiv Detail & Related papers (2023-07-09T15:16:45Z)
Beyond the Edge of Stability via Two-step Gradient Updates [49.03389279816152]
Gradient Descent (GD) is a powerful workhorse of modern machine learning. GD's ability to find local minimisers is only guaranteed for losses with Lipschitz gradients. This work focuses on simple, yet representative, learning problems via analysis of two-step gradient updates.
arXiv Detail & Related papers (2022-06-08T21:32:50Z)
Reparameterized Variational Divergence Minimization for Stable Imitation [57.06909373038396]
We study the extent to which variations in the choice of probabilistic divergence may yield more performant ILO algorithms. We contribute a re parameterization trick for adversarial imitation learning to alleviate the challenges of the promising $f$-divergence minimization framework. Empirically, we demonstrate that our design choices allow for ILO algorithms that outperform baseline approaches and more closely match expert performance in low-dimensional continuous-control tasks.
arXiv Detail & Related papers (2020-06-18T19:04:09Z)
Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose. We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)

This list is automatically generated from the titles and abstracts of the papers in this site.