Related papers: Stepping on the Edge: Curvature Aware Learning Rate Tuners

Stepping on the Edge: Curvature Aware Learning Rate Tuners

URL: http://arxiv.org/abs/2407.06183v1
Date: Mon, 8 Jul 2024 17:56:00 GMT
Title: Stepping on the Edge: Curvature Aware Learning Rate Tuners
Authors: Vincent Roulet, Atish Agarwala, Jean-Bastien Grill, Grzegorz Swirszcz, Mathieu Blondel, Fabian Pedregosa,
Abstract summary: Curvature information is the largest eigenvalue of the loss Hessian, known as the sharpness. Recent work has shown that curvature information undergoes complex dynamics during training. We analyze the closed-loop feedback effect between learning rate tuning and curvature.
Score: 24.95412499942206
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Curvature information -- particularly, the largest eigenvalue of the loss Hessian, known as the sharpness -- often forms the basis for learning rate tuners. However, recent work has shown that the curvature information undergoes complex dynamics during training, going from a phase of increasing sharpness to eventual stabilization. We analyze the closed-loop feedback effect between learning rate tuning and curvature. We find that classical learning rate tuners may yield greater one-step loss reduction, yet they ultimately underperform in the long term when compared to constant learning rates in the full batch regime. These models break the stabilization of the sharpness, which we explain using a simplified model of the joint dynamics of the learning rate and the curvature. To further investigate these effects, we introduce a new learning rate tuning method, Curvature Dynamics Aware Tuning (CDAT), which prioritizes long term curvature stabilization over instantaneous progress on the objective. In the full batch regime, CDAT shows behavior akin to prefixed warm-up schedules on deep learning objectives, outperforming tuned constant learning rates. In the mini batch regime, we observe that stochasticity introduces confounding effects that explain the previous success of some learning rate tuners at appropriate batch sizes. Our findings highlight the critical role of understanding the joint dynamics of the learning rate and curvature, beyond greedy minimization, to diagnose failures and design effective adaptive learning rate tuners.

Related papers

A Multi-Power Law for Loss Curve Prediction Across Learning Rate Schedules [67.87680482844884]
We present an empirical law that describes how the pretraining loss of large language models evolves under different learning rate schedules. Our proposed law takes a multi-power form, combining a power law based on the sum of learning rates and additional power laws to account for a loss reduction effect induced by learning rate decay.
arXiv Detail & Related papers (2025-03-17T04:36:45Z)
SAFE: Slow and Fast Parameter-Efficient Tuning for Continual Learning with Pre-Trained Models [26.484208658326857]
Continual learning aims to incrementally acquire new concepts in data streams while resisting forgetting previous knowledge. With the rise of powerful pre-trained models (PTMs), there is a growing interest in training incremental learning systems.
arXiv Detail & Related papers (2024-11-04T15:34:30Z)
Temporal-Difference Variational Continual Learning [89.32940051152782]
A crucial capability of Machine Learning models in real-world applications is the ability to continuously learn new tasks. In Continual Learning settings, models often struggle to balance learning new tasks with retaining previous knowledge. We propose new learning objectives that integrate the regularization effects of multiple previous posterior estimations.
arXiv Detail & Related papers (2024-10-10T10:58:41Z)
Enhancing Robustness of Vision-Language Models through Orthogonality Learning and Self-Regularization [77.62516752323207]
We introduce an orthogonal fine-tuning method for efficiently fine-tuning pretrained weights and enabling enhanced robustness and generalization. A self-regularization strategy is further exploited to maintain the stability in terms of zero-shot generalization of VLMs, dubbed OrthSR. For the first time, we revisit the CLIP and CoOp with our method to effectively improve the model on few-shot image classficiation scenario.
arXiv Detail & Related papers (2024-07-11T10:35:53Z)
Normalization and effective learning rates in reinforcement learning [52.59508428613934]
Normalization layers have recently experienced a renaissance in the deep reinforcement learning and continual learning literature. We show that normalization brings with it a subtle but important side effect: an equivalence between growth in the norm of the network parameters and decay in the effective learning rate. We propose to make the learning rate schedule explicit with a simple re- parameterization which we call Normalize-and-Project.
arXiv Detail & Related papers (2024-07-01T20:58:01Z)
The Marginal Value of Momentum for Small Learning Rate SGD [20.606430391298815]
Momentum is known to accelerate the convergence of gradient descent in strongly convex settings without gradient noise regimes. Experiments show that momentum indeed has limited benefits for both optimization and generalization in practical training where the optimal learning rate is not very large.
arXiv Detail & Related papers (2023-07-27T21:01:26Z)
A Loss Curvature Perspective on Training Instability in Deep Learning [28.70491071044542]
We study the evolution of the loss Hessian across many classification tasks in order to understand the effect curvature of the loss has on the training dynamics. Inspired by the conditioning perspective, we show that learning rate warmup can improve training stability just as much as batch normalization.
arXiv Detail & Related papers (2021-10-08T20:25:48Z)
Critical Parameters for Scalable Distributed Learning with Large Batches and Asynchronous Updates [67.19481956584465]
It has been experimentally observed that the efficiency of distributed training with saturation (SGD) depends decisively on the batch size and -- in implementations -- on the staleness. We show that our results are tight and illustrate key findings in numerical experiments.
arXiv Detail & Related papers (2021-03-03T12:08:23Z)
Acceleration via Fractal Learning Rate Schedules [37.878672787331105]
We show that the learning rate schedule remains notoriously difficult to understand and expensive to tune. We reinterpret an iterative algorithm from the numerical analysis literature as what we call the Chebyshev learning rate schedule for accelerating vanilla gradient descent. We provide some experiments and discussion to challenge current understandings of the "edge of stability" in deep learning.
arXiv Detail & Related papers (2021-03-01T22:52:13Z)
Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose. We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
On Learning Rates and Schr\"odinger Operators [105.32118775014015]
We present a general theoretical analysis of the effect of the learning rate. We find that the learning rate tends to zero for a broad non- neural class functions.
arXiv Detail & Related papers (2020-04-15T09:52:37Z)

This list is automatically generated from the titles and abstracts of the papers in this site.