A Unified Noise-Curvature View of Loss of Trainability
- URL: http://arxiv.org/abs/2509.19698v1
- Date: Wed, 24 Sep 2025 02:11:13 GMT
- Title: A Unified Noise-Curvature View of Loss of Trainability
- Authors: Gunbir Singh Baveja, Mark Schmidt
- Abstract summary: Loss of trainability (LoT) in continual learning occurs when gradient steps no longer yield improvement as tasks evolve. We introduce two complementary criteria: a batch-size-aware gradient-noise bound and a curvature-volatility-controlled bound. Using this threshold, we build a simple per-layer scheduler that keeps each layer's effective step below a safe limit.
- Score: 8.602734307457387
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Loss of trainability (LoT) in continual learning occurs when gradient steps no longer yield improvement as tasks evolve, so accuracy stalls or degrades despite adequate capacity and supervision. We analyze LoT incurred with Adam through an optimization lens and find that single indicators such as Hessian rank, sharpness level, weight or gradient norms, gradient-to-parameter ratios, and unit-sign entropy are not reliable predictors. Instead we introduce two complementary criteria: a batch-size-aware gradient-noise bound and a curvature-volatility-controlled bound that combine into a per-layer predictive threshold that anticipates trainability behavior. Using this threshold, we build a simple per-layer scheduler that keeps each layer's effective step below a safe limit, stabilizing training and improving accuracy across concatenated ReLU (CReLU), Wasserstein regularization, and L2 weight decay, with learned learning-rate trajectories that mirror canonical decay.
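The abstract does not give the scheduler's exact formula; as a rough sketch of the idea (all function and variable names here are hypothetical, not the authors' method), a per-layer cap might combine a batch-size-scaled gradient-noise estimate with an inverse-curvature bound:

```python
import numpy as np

def capped_layer_lr(base_lr, grad, grad_var, batch_size, curv_est, eps=1e-8):
    """Hypothetical per-layer cap: shrink the effective step when gradient
    noise (scaled by batch size) or curvature is large. Illustrative only."""
    # relative noise level of the minibatch gradient
    noise_bound = np.sqrt(grad_var / batch_size) / (np.linalg.norm(grad) + eps)
    # smaller steps where curvature is high (1/curvature heuristic)
    curv_bound = 1.0 / (curv_est + eps)
    return min(base_lr, curv_bound / (1.0 + noise_bound))

# toy usage: a noisy, high-curvature layer gets a step below the base rate
lr = capped_layer_lr(base_lr=1e-3, grad=np.ones(10), grad_var=4.0,
                     batch_size=64, curv_est=5000.0)
```

The point of the sketch is only the shape of the rule: the safe step shrinks monotonically in both the noise estimate and the curvature estimate, which is what lets a per-layer schedule mimic canonical learning-rate decay.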
Related papers
- Gradients Must Earn Their Influence: Unifying SFT with Generalized Entropic Objectives [22.29000001610794]
Standard negative log-likelihood for Supervised Fine-Tuning (SFT) applies uniform token-level weighting. This rigidity creates a two-fold failure mode: (i) overemphasizing low-probability targets can amplify gradients on noisy supervision and disrupt robust priors, and (ii) uniform weighting provides weak sharpening when the model is already confident. Existing methods fail to resolve the resulting plasticity-stability dilemma, often suppressing necessary learning signals alongside harmful ones. We introduce Dynamic Entropy Fine-Tuning (DEFT), a parameter-free objective that modulates the
arXiv Detail & Related papers (2026-02-11T22:56:43Z) - Plug-and-Play Homeostatic Spark: Zero-Cost Acceleration for SNN Training Across Paradigms [40.57310813106791]
Spiking neural networks offer event-driven computation, sparse activation, and hardware efficiency, yet training often converges slowly and lacks stability. We present Adaptive Homeostatic Spiking Activity Regulation (AHSAR), an extremely simple plug-in method that works across training paradigms. AHSAR stabilizes optimization and accelerates convergence without changing the model architecture, loss, or gradients.
arXiv Detail & Related papers (2025-12-04T17:26:46Z) - Convergence and Generalization of Anti-Regularization for Parametric Models [0.0]
Anti-regularization introduces a reward term with a reversed sign into the loss function. We formalize spectral safety conditions and trust-region constraints. We design a lightweight safeguard that combines a projection operator with gradient clipping to guarantee stable intervention.
arXiv Detail & Related papers (2025-08-24T15:34:17Z) - Feature Learning Beyond the Edge of Stability [8.430481660019451]
We propose a homogeneous multilayer perceptron parameterization with a hidden-layer width pattern and analyze its training dynamics under gradient descent. We obtain formulas for the first three Taylor coefficients of the minibatch loss during training that illuminate the connection between sharpness and feature learning.
arXiv Detail & Related papers (2025-02-18T18:23:33Z) - Gradient Normalization Provably Benefits Nonconvex SGD under Heavy-Tailed Noise [60.92029979853314]
We investigate the roles of gradient normalization and clipping in ensuring the convergence of Stochastic Gradient Descent (SGD) under heavy-tailed noise.
Our work provides the first theoretical evidence demonstrating the benefits of gradient normalization in SGD under heavy-tailed noise.
We introduce an accelerated SGD variant incorporating gradient normalization and clipping, further enhancing convergence rates under heavy-tailed noise.
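As a hedged illustration of the two mechanisms this abstract combines (not the paper's exact accelerated variant), clipping bounds rare heavy-tailed gradient spikes while normalization fixes the step length outright:

```python
import numpy as np

def sgd_step(x, g, lr=0.1, clip=None, normalize=False):
    """Sketch of the update rules compared under heavy-tailed noise:
    plain SGD, clipped SGD, and normalized SGD."""
    if clip is not None:
        norm = np.linalg.norm(g)
        if norm > clip:
            g = g * (clip / norm)  # clipping bounds heavy-tailed spikes
    if normalize:
        g = g / (np.linalg.norm(g) + 1e-12)  # normalization fixes the step length
    return x - lr * g

# a heavy-tailed gradient spike moves plain SGD far; normalized SGD moves only lr
x = np.zeros(2)
spike = np.array([1e6, 0.0])
x_plain = sgd_step(x, spike)
x_norm = sgd_step(x, spike, normalize=True)
```

The contrast makes the motivation concrete: with unbounded gradient variance, a single spike can undo many good steps unless its magnitude is controlled.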
arXiv Detail & Related papers (2024-10-21T22:40:42Z) - Large Continual Instruction Assistant [59.585544987096974]
Continual Instruction Tuning (CIT) is adopted to instruct Large Models to follow human intent, data by data. Existing gradient updates can heavily degrade performance on previous datasets during the CIT process. We propose a general continual instruction tuning framework to address this challenge.
arXiv Detail & Related papers (2024-10-08T11:24:59Z) - Adaptive Federated Learning Over the Air [108.62635460744109]
We propose a federated version of adaptive gradient methods, particularly AdaGrad and Adam, within the framework of over-the-air model training.
Our analysis shows that the AdaGrad-based training algorithm converges to a stationary point at the rate of $\mathcal{O}(\ln(T) / T^{1 - \frac{1}{\alpha}})$.
arXiv Detail & Related papers (2024-03-11T09:10:37Z) - On the Convergence of Gradient Descent for Large Learning Rates [55.33626480243135]
We show that convergence is impossible when a fixed step size is used. We provide a proof of this in the case of linear neural networks with a squared loss. We also prove the impossibility of convergence for more general losses without requiring strong assumptions such as Lipschitz continuity of the gradient.
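The instability threshold is easiest to see on a scalar quadratic: for f(x) = (L/2)x², gradient descent contracts only when the fixed step size is below 2/L. A minimal numeric check (this toy is not the paper's linear-network setting, only the classical intuition behind it):

```python
# gradient descent on f(x) = 0.5 * L * x^2, whose gradient is L * x;
# the update is x <- (1 - lr * L) * x, so it diverges once |1 - lr*L| > 1,
# i.e. whenever the fixed step size exceeds 2/L
L = 10.0

def gd(x0, lr, steps=50):
    x = x0
    for _ in range(steps):
        x = x - lr * (L * x)
    return x

small = gd(1.0, lr=0.15)  # 0.15 < 2/L = 0.2 -> contracts toward 0
big = gd(1.0, lr=0.25)    # 0.25 > 2/L -> factor |1 - 2.5| = 1.5 -> blows up
```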
arXiv Detail & Related papers (2024-02-20T16:01:42Z) - Estimator Meets Equilibrium Perspective: A Rectified Straight Through Estimator for Binary Neural Networks Training [35.090598013305275]
Binarization of neural networks is a dominant paradigm in neural networks compression.
We propose Rectified Straight Through Estimator (ReSTE) to balance the estimating error and the gradient stability.
ReSTE has excellent performance and surpasses the state-of-the-art methods without any auxiliary modules or losses.
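The abstract does not spell out ReSTE's backward rule; as a sketch of the baseline it refines, a vanilla straight-through estimator passes gradients through the non-differentiable sign function as if it were the identity, zeroed outside [-1, 1]:

```python
import numpy as np

def binarize_forward(w):
    """Forward pass: hard sign binarization (zero gradient almost everywhere)."""
    return np.sign(w)

def ste_backward(w, grad_out, clip=1.0):
    """Vanilla straight-through estimator: treat binarization as the identity
    on the backward pass, masking gradients where |w| exceeds the clip range.
    ReSTE replaces this identity with a rectified estimator to balance
    estimating error against gradient stability (this is a sketch, not the
    paper's rule)."""
    return grad_out * (np.abs(w) <= clip)

w = np.array([-2.0, -0.3, 0.5, 1.7])
b = binarize_forward(w)                    # binary weights used in the forward pass
g = ste_backward(w, grad_out=np.ones(4))   # surrogate gradient w.r.t. real weights
```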
arXiv Detail & Related papers (2023-08-13T05:38:47Z) - Beyond the Edge of Stability via Two-step Gradient Updates [49.03389279816152]
Gradient Descent (GD) is a powerful workhorse of modern machine learning.
GD's ability to find local minimisers is only guaranteed for losses with Lipschitz gradients.
This work focuses on simple, yet representative, learning problems via analysis of two-step gradient updates.
arXiv Detail & Related papers (2022-06-08T21:32:50Z) - Differentiable Annealed Importance Sampling and the Perils of Gradient Noise [68.44523807580438]
Annealed importance sampling (AIS) and related algorithms are highly effective tools for marginal likelihood estimation.
Differentiability is a desirable property as it would admit the possibility of optimizing marginal likelihood as an objective.
We propose a differentiable algorithm by abandoning Metropolis-Hastings steps, which further unlocks mini-batch computation.
arXiv Detail & Related papers (2021-07-21T17:10:14Z) - High-probability Bounds for Non-Convex Stochastic Optimization with Heavy Tails [55.561406656549686]
We consider non-convex stochastic optimization using first-order algorithms for which the gradient estimates may have heavy tails.
We show that a combination of gradient clipping, momentum, and normalized gradient descent converges to critical points with high probability, with the best-known iteration complexity for smooth losses.
arXiv Detail & Related papers (2021-06-28T00:17:01Z) - Improper Learning with Gradient-based Policy Optimization [62.50997487685586]
We consider an improper reinforcement learning setting where the learner is given M base controllers for an unknown Markov Decision Process.
We propose a gradient-based approach that operates over a class of improper mixtures of the controllers.
arXiv Detail & Related papers (2021-02-16T14:53:55Z) - Cost Function Unrolling in Unsupervised Optical Flow [6.656273171776146]
This work focuses on the derivation of the Total Variation semi-norm commonly used in unsupervised cost functions.
We derive a differentiable proxy to the hard L1 smoothness constraint in a novel iterative scheme which we refer to as Cost Unrolling.
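The iterative Cost Unrolling scheme itself is not given in the abstract; a common one-shot differentiable proxy for the hard L1/TV penalty, shown here only to illustrate the smoothing idea (not the paper's scheme), is the Charbonnier function:

```python
import numpy as np

def smooth_l1(x, eps=1e-3):
    """Charbonnier proxy sqrt(x^2 + eps^2) for the non-smooth |x| penalty:
    differentiable everywhere, and approaching |x| as eps -> 0."""
    return np.sqrt(x * x + eps * eps)

# near zero the proxy is smooth where |x| has a kink; away from zero it
# tracks |x| to within O(eps)
vals = smooth_l1(np.array([-1.0, 0.0, 1.0]))
```

The trade-off this proxy exposes (smoothness vs. fidelity to the L1 kink, controlled by eps) is exactly what an unrolled iterative scheme seeks to avoid.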
arXiv Detail & Related papers (2020-11-30T14:10:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.