AutoDrop: Training Deep Learning Models with Automatic Learning Rate
Drop
- URL: http://arxiv.org/abs/2111.15317v1
- Date: Tue, 30 Nov 2021 11:55:21 GMT
- Title: AutoDrop: Training Deep Learning Models with Automatic Learning Rate
Drop
- Authors: Yunfei Teng, Jing Wang, Anna Choromanska
- Abstract summary: We develop an algorithm that realizes the learning rate drop $\textit{automatically}$.
We show that our method improves over SOTA training approaches.
- Score: 16.396327849817464
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modern deep learning (DL) architectures are trained using variants of the SGD
algorithm that is run with a $\textit{manually}$ defined learning rate
schedule, i.e., the learning rate is dropped at the pre-defined epochs,
typically when the training loss is expected to saturate. In this paper we
develop an algorithm that realizes the learning rate drop
$\textit{automatically}$. The proposed method, that we refer to as AutoDrop, is
motivated by the observation that the angular velocity of the model parameters,
i.e., the velocity of the changes of the convergence direction, for a fixed
learning rate initially increases rapidly and then progresses towards soft
saturation. At saturation the optimizer slows down, so angular velocity
saturation is a good indicator for dropping the learning rate. After the drop,
the angular velocity "resets" and follows the previously described pattern - it
increases again until saturation. We show that our method improves over SOTA
training approaches: it accelerates the training of DL models and leads to
better generalization. We also show that our method does not require any extra
hyperparameter tuning. AutoDrop is furthermore extremely simple to implement
and computationally cheap. Finally, we develop a theoretical framework for
analyzing our algorithm and provide convergence guarantees.
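The saturation rule described in the abstract can be sketched in a few lines. The following is an illustrative toy loop, not the authors' implementation: the function name, the drop factor, and the tolerance `tol` are assumptions, and the angle between successive update directions stands in for the angular velocity.

```python
import numpy as np

def train_autodrop_sketch(grad, w0, lr=0.5, drop=0.1, tol=1e-4, steps=200):
    """Minimal AutoDrop-flavoured loop (illustrative sketch).

    Tracks the angle between successive update directions as a proxy for
    the angular velocity and multiplies the learning rate by `drop` once
    that angle stops increasing (soft saturation)."""
    w = w0.astype(float)
    prev_dir = None
    prev_angle = 0.0
    for _ in range(steps):
        step = -lr * grad(w)
        w = w + step
        direction = step / (np.linalg.norm(step) + 1e-12)
        if prev_dir is not None:
            angle = np.arccos(np.clip(prev_dir @ direction, -1.0, 1.0))
            if angle <= prev_angle + tol and lr > 1e-3:
                lr *= drop          # angular velocity saturated: drop the LR
                prev_angle = 0.0    # the angular velocity "resets" after a drop
            else:
                prev_angle = angle
        prev_dir = direction
    return w, lr
```

On an ill-conditioned quadratic this loop drops the rate a few times as the update direction stabilizes, mimicking the increase-then-saturate pattern the abstract describes.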
Related papers
- Learning Rate Scaling across LoRA Ranks and Transfer to Full Finetuning [24.03926595342341]
Low-Rank Adaptation (LoRA) is a tool for parameter-efficient finetuning of large models. It is unclear how the optimal learning rate scales with adapter rank. We introduce Maximal-Update Adaptation ($\mu$A), a theoretical framework that characterizes how the "optimal" learning rate should scale.
arXiv Detail & Related papers (2026-02-05T21:28:59Z) - AutoGD: Automatic Learning Rate Selection for Gradient Descent [54.195493042469]
We introduce AutoGD: a gradient descent method that automatically determines whether to increase or decrease the learning rate at a given iteration. We show that we can recover the optimal rate of GD (up to a constant) for a broad class of functions without knowledge of smoothness constants.
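A simple increase-or-decrease rule of the kind the summary describes can be sketched as follows; this is an illustrative backtracking-style step, not the paper's exact algorithm, and the `up`/`down` factors are assumed values.

```python
def autogd_step(f, grad, w, lr, up=1.1, down=0.5):
    """One iteration of an AutoGD-style rule (illustrative sketch):
    accept the step and grow the learning rate when the objective
    decreases; otherwise reject it and shrink the rate."""
    w_new = w - lr * grad(w)
    if f(w_new) < f(w):
        return w_new, lr * up   # progress: try a larger rate next time
    return w, lr * down         # overshoot: keep w, retry with a smaller rate
```

Iterating this step on a smooth function lets the rate self-adjust without a hand-set schedule.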
arXiv Detail & Related papers (2025-10-10T23:47:22Z) - Automatic gradient descent with generalized Newton's method [8.885727065823156]
We propose a Hessian-informed approach that applies to any optimizer, such as SGD and Adam.
Our method automatically and dynamically selects the learning rate that accelerates the convergence.
In practice, our method is easily implementable, since it only requires additional forward passes with almost zero computational overhead.
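The "extra forward passes" idea can be illustrated with a one-dimensional Newton step along the gradient direction; the function name and the finite-difference scheme below are assumptions for the sketch, not the paper's formulation.

```python
import numpy as np

def hessian_informed_lr(f, w, g, eps=1e-4):
    """Sketch of a Hessian-informed step size (illustrative): two extra
    forward passes estimate the curvature along the gradient direction,
    and the 1-D Newton step 1/curvature gives the learning rate."""
    d = g / (np.linalg.norm(g) + 1e-12)
    # central finite difference of the loss along d: f''(w; d) up to O(eps^2)
    curv = (f(w + eps * d) - 2.0 * f(w) + f(w - eps * d)) / eps ** 2
    return 1.0 / max(curv, 1e-8)
```

On a quadratic this recovers the exact Newton step size along the gradient, which is why only forward passes (loss evaluations) are needed.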
arXiv Detail & Related papers (2024-07-03T03:01:43Z) - Highway Graph to Accelerate Reinforcement Learning [18.849312069946993]
We propose a novel graph structure, named highway graph, to model the state transition.
By integrating the highway graph into RL, the RL training can be remarkably accelerated in the early stages.
A deep neural network-based agent is then trained using the highway graph, resulting in better generalization and lower storage costs.
arXiv Detail & Related papers (2024-05-20T02:09:07Z) - Adaptive Federated Learning Over the Air [108.62635460744109]
We propose a federated version of adaptive gradient methods, particularly AdaGrad and Adam, within the framework of over-the-air model training.
Our analysis shows that the AdaGrad-based training algorithm converges to a stationary point at the rate of $\mathcal{O}(\ln(T) / T^{1 - \frac{1}{\alpha}})$.
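For reference, the centralized AdaGrad rule that the paper lifts to the over-the-air federated setting is the standard per-coordinate update; the sketch below shows only that base rule, not the federated aggregation.

```python
import numpy as np

def adagrad_step(w, g, acc, lr=0.1, eps=1e-8):
    """Standard AdaGrad update: each coordinate's effective rate shrinks
    with its accumulated squared gradients."""
    acc = acc + g * g
    return w - lr * g / (np.sqrt(acc) + eps), acc
```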
arXiv Detail & Related papers (2024-03-11T09:10:37Z) - Partial End-to-end Reinforcement Learning for Robustness Against Modelling Error in Autonomous Racing [0.0]
This paper addresses the issue of increasing the performance of reinforcement learning (RL) solutions for autonomous racing cars.
We propose a partial end-to-end algorithm that decouples the planning and control tasks.
By leveraging the robustness of a classical controller, our partial end-to-end driving algorithm exhibits better robustness towards model mismatches than standard end-to-end algorithms.
arXiv Detail & Related papers (2023-12-11T14:27:10Z) - Mechanic: A Learning Rate Tuner [52.4242550204696]
We introduce a technique for tuning the learning rate scale factor of any base optimization algorithm and schedule automatically, which we call $\textsc{mechanic}$.
We rigorously evaluate $\textsc{mechanic}$ on a range of large-scale deep learning tasks with varying batch sizes, schedules, and base optimization algorithms.
arXiv Detail & Related papers (2023-05-31T19:32:43Z) - Flatter, faster: scaling momentum for optimal speedup of SGD [0.0]
We study training dynamics arising from the interplay between stochastic gradient descent (SGD), label noise, and momentum in the training of neural networks.
We find that scaling the momentum hyperparameter $1-\beta$ with the learning rate to the power of $2/3$ maximally accelerates training, without sacrificing generalization.
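The scaling rule in the summary amounts to a one-line formula; the proportionality constant `c` below is an illustrative assumption, since only the exponent is stated.

```python
def scaled_momentum(lr, c=1.0):
    """Momentum coefficient under the 1 - beta proportional to lr**(2/3)
    scaling from the summary; c is an illustrative constant."""
    return 1.0 - c * lr ** (2.0 / 3.0)
```

For example, a learning rate of 1e-3 gives a momentum coefficient of 0.99 under this rule with c = 1.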
arXiv Detail & Related papers (2022-10-28T20:41:48Z) - Adaptive Learning Rate and Momentum for Training Deep Neural Networks [0.0]
We develop a fast training method motivated by the nonlinear Conjugate Gradient (CG) framework.
Experiments in image classification datasets show that our method yields faster convergence than other local solvers.
arXiv Detail & Related papers (2021-06-22T05:06:56Z) - Correcting Momentum in Temporal Difference Learning [95.62766731469671]
We argue that momentum in Temporal Difference (TD) learning accumulates gradients that become doubly stale.
We show that this phenomenon exists, and then propose a first-order correction term to momentum.
An important insight of this work is that deep RL methods are not always best served by directly importing techniques from the supervised setting.
arXiv Detail & Related papers (2021-06-07T20:41:15Z) - Deep Feedback Inverse Problem Solver [141.26041463617963]
We present an efficient, effective, and generic approach towards solving inverse problems.
We leverage the feedback signal provided by the forward process and learn an iterative update model.
Our approach does not have any restrictions on the forward process; it does not require any prior knowledge either.
arXiv Detail & Related papers (2021-01-19T16:49:06Z) - Adaptive Gradient Method with Resilience and Momentum [120.83046824742455]
We propose an Adaptive Gradient Method with Resilience and Momentum (AdaRem)
AdaRem adjusts the parameter-wise learning rate according to whether a parameter's past update direction is aligned with the direction of the current gradient.
Our method outperforms previous adaptive learning rate-based algorithms in terms of the training speed and the test error.
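A sign-agreement rescaling of the kind the summary describes can be sketched as follows; this is an illustrative rule with an assumed `strength` parameter, not the authors' exact update.

```python
import numpy as np

def adarem_scale(avg_update, g, base_lr, strength=0.5):
    """Illustrative AdaRem-style rescaling: the per-parameter rate grows
    where the running update direction and the current gradient agree in
    sign, and shrinks where they disagree."""
    agreement = np.sign(avg_update) * np.sign(g)  # +1 aligned, -1 opposed
    return base_lr * (1.0 + strength * agreement)
```

Coordinates whose history and gradient point the same way get a larger step, damping oscillation in the others.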
arXiv Detail & Related papers (2020-10-21T14:49:00Z) - Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.