Interpreting Adaptive Gradient Methods by Parameter Scaling for
Learning-Rate-Free Optimization
- URL: http://arxiv.org/abs/2401.03240v1
- Date: Sat, 6 Jan 2024 15:45:29 GMT
- Title: Interpreting Adaptive Gradient Methods by Parameter Scaling for
Learning-Rate-Free Optimization
- Authors: Min-Kook Suh and Seung-Woo Seo
- Abstract summary: We address the challenge of estimating the learning rate for adaptive gradient methods used in training deep neural networks.
While several learning-rate-free approaches have been proposed, they are typically tailored for steepest descent.
In this paper, we interpret adaptive gradient methods as steepest descent applied on parameter-scaled networks.
- Score: 14.009179786857802
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We address the challenge of estimating the learning rate for adaptive
gradient methods used in training deep neural networks. While several
learning-rate-free approaches have been proposed, they are typically tailored
for steepest descent. However, although steepest descent methods offer an
intuitive approach to finding minima, many deep learning applications require
adaptive gradient methods to achieve faster convergence. In this paper, we
interpret adaptive gradient methods as steepest descent applied on
parameter-scaled networks, proposing learning-rate-free adaptive gradient
methods. Experimental results verify the effectiveness of this approach,
demonstrating comparable performance to hand-tuned learning rates across
various scenarios. This work extends the applicability of learning-rate-free
methods, enhancing training with adaptive gradient methods.
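To make the parameter-scaling interpretation concrete, the following minimal sketch (not the authors' code) checks that an adaptive-style update with a diagonal preconditioner D**2 coincides with plain steepest descent carried out in rescaled coordinates u = theta / D. The scaling vector D is an arbitrary positive stand-in for the statistics an adaptive method would actually accumulate.

```python
import torch

torch.manual_seed(0)

# theta: parameters, grad: a stand-in for dL/dtheta, D: a positive per-parameter
# scale (in an adaptive method this would come from gradient statistics).
theta = torch.randn(5)
grad = torch.randn(5)
D = torch.rand(5) + 0.5
lr = 0.1

# (1) Adaptive-style update in the original coordinates with diagonal
#     preconditioner P = D**2: theta <- theta - lr * D**2 * grad
theta_adaptive = theta - lr * D**2 * grad

# (2) Plain steepest descent in the scaled coordinates u = theta / D.
#     By the chain rule dL/du = D * dL/dtheta, so one SGD step on u is
#     u <- u - lr * D * grad; mapping back through theta = D * u lands on
#     exactly the same point as (1).
u = theta / D
u = u - lr * D * grad
theta_scaled_sgd = D * u

print(torch.allclose(theta_adaptive, theta_scaled_sgd))  # True
```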
Related papers
- Gradient-Variation Online Learning under Generalized Smoothness [56.38427425920781]
Gradient-variation online learning aims to achieve regret guarantees that scale with the variation in the gradients of the online functions.
Recent efforts in neural network optimization suggest a generalized smoothness condition that allows the smoothness to correlate with gradient norms (sketched below).
Applications to fast-rate convergence in games and to extended adversarial optimization are provided.
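For reference, the generalized smoothness condition mentioned above is commonly written as the (L_0, L_1)-smoothness assumption below; the paper's exact formulation may differ in its details.

```latex
% (L_0, L_1)-smoothness: the local Lipschitz constant of the gradient is
% allowed to grow with the gradient norm; L_1 = 0 recovers standard L-smoothness.
\[
  \|\nabla f(x) - \nabla f(y)\| \;\le\; \bigl(L_0 + L_1 \|\nabla f(x)\|\bigr)\,\|x - y\|
  \qquad \text{for all } x, y .
\]
```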
arXiv Detail & Related papers (2024-08-17T02:22:08Z)
- Gradient Alignment Improves Test-Time Adaptation for Medical Image Segmentation [15.791041311313448]
A gradient-alignment-based test-time adaptation (GraTa) method is proposed to improve both the gradient direction and the learning rate.
The GraTa method incorporates an auxiliary gradient alongside the pseudo gradient to facilitate gradient alignment.
A dynamic learning rate is derived from the cosine similarity between the pseudo and auxiliary gradients (see the sketch below).
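A minimal sketch of that dynamic-learning-rate idea, assuming the rate is simply scaled by the cosine similarity between the two gradients; the function name and the exact functional form are illustrative, not GraTa's actual rule.

```python
import torch
import torch.nn.functional as F

def cosine_scaled_lr(pseudo_grad, aux_grad, base_lr=1e-3):
    """Hypothetical rule in the spirit of GraTa: use a larger learning rate
    when the pseudo gradient agrees with the auxiliary gradient, and a
    smaller one when the two point in opposite directions."""
    cos = F.cosine_similarity(pseudo_grad.flatten(), aux_grad.flatten(), dim=0)
    return base_lr * torch.clamp(1.0 + cos, min=0.0)  # ranges over [0, 2 * base_lr]
```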
arXiv Detail & Related papers (2024-08-14T07:37:07Z)
- Angle based dynamic learning rate for gradient descent [2.5077510176642805]
We propose a novel yet simple approach to obtain an adaptive learning rate for gradient-based descent methods on classification tasks.
Instead of the traditional approach of selecting adaptive learning rates via the expectation of gradient-based terms, we use the angle between the current gradient and the new gradient (see the sketch below).
We find that our method leads to the highest accuracy on most of the datasets.
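A similarly minimal sketch of the angle-based idea, assuming the step size shrinks linearly with the angle between consecutive gradients; this illustrates the idea only and is not the paper's exact schedule.

```python
import math
import torch

def angle_scaled_lr(prev_grad, curr_grad, base_lr=0.1, eps=1e-12):
    """Hypothetical angle-based rule: a small angle (consecutive gradients
    agree) keeps the full base learning rate, while an angle near pi
    (gradients oppose each other) drives the rate toward zero."""
    cos = torch.dot(prev_grad.flatten(), curr_grad.flatten()) / (
        prev_grad.norm() * curr_grad.norm() + eps
    )
    angle = torch.arccos(torch.clamp(cos, -1.0, 1.0))  # in [0, pi]
    return base_lr * (1.0 - angle / math.pi)
```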
arXiv Detail & Related papers (2023-04-20T16:55:56Z)
- BFE and AdaBFE: A New Approach in Learning Rate Automation for Stochastic Optimization [3.541406632811038]
A gradient-based optimization approach that automatically adjusts the learning rate is proposed.
The approach could serve as an alternative way of tuning the learning rate for the stochastic gradient descent (SGD) algorithm.
arXiv Detail & Related papers (2022-07-06T15:55:53Z)
- Incorporating the Barzilai-Borwein Adaptive Step Size into Subgradient Methods for Deep Network Training [3.8762085568003406]
We adapt the learning rate using a two-point approximation to the secant equation on which quasi-Newton methods are based (see the sketch below).
We evaluate our method using standard example network architectures on widely available datasets and compare against alternatives elsewhere in the literature.
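The classical Barzilai-Borwein step size obtained from that two-point secant approximation can be sketched as follows (the "long" BB variant is shown; how the paper stabilizes it for subgradients of deep networks is not reproduced here).

```python
import torch

def barzilai_borwein_lr(x_prev, x_curr, g_prev, g_curr, eps=1e-12):
    """Long Barzilai-Borwein step size: with s = x_k - x_{k-1} and
    y = g_k - g_{k-1}, the step alpha = (s^T s) / (s^T y) approximately
    satisfies the secant equation underlying quasi-Newton methods."""
    s = (x_curr - x_prev).flatten()
    y = (g_curr - g_prev).flatten()
    return torch.dot(s, s) / (torch.dot(s, y) + eps)
```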
arXiv Detail & Related papers (2022-05-27T02:12:59Z)
- Adaptive Gradient Methods with Local Guarantees [48.980206926987606]
We propose an adaptive gradient method that has provable adaptive regret guarantees vs. the best local preconditioner.
We demonstrate the robustness of our method in automatically choosing the optimal learning rate schedule for popular benchmarking tasks in vision and language domains.
arXiv Detail & Related papers (2022-03-02T20:45:14Z)
- Bag of Tricks for Natural Policy Gradient Reinforcement Learning [87.54231228860495]
We have implemented and compared strategies that impact performance in natural policy gradient reinforcement learning.
The proposed collection of strategies for performance optimization can improve results by 86% to 181% across the MuJoCo control benchmark.
arXiv Detail & Related papers (2022-01-22T17:44:19Z)
- Adaptive Gradient Method with Resilience and Momentum [120.83046824742455]
We propose an Adaptive Gradient Method with Resilience and Momentum (AdaRem).
AdaRem adjusts the parameter-wise learning rate according to whether the direction of a parameter's past changes is aligned with the direction of the current gradient (a sketch of this idea follows below).
Our method outperforms previous adaptive-learning-rate algorithms in terms of training speed and test error.
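A minimal sketch of that alignment idea, assuming a simple exponential moving average of past parameter changes and a fixed +/-50% per-coordinate rescaling; the helper name and the concrete scaling rule are hypothetical, not AdaRem's actual update.

```python
import torch

def alignment_scaled_step(param, grad, change_ema, lr=1e-3, beta=0.9):
    """Hypothetical AdaRem-style rule (plain tensors, outside autograd):
    enlarge a coordinate's step when the moving average of its past changes
    points the same way as the current descent direction -grad, and shrink
    it when they disagree."""
    align = torch.sign(change_ema) * torch.sign(-grad)  # +1 aligned, -1 opposed, 0 neutral
    step = -lr * (1.0 + 0.5 * align) * grad             # per-parameter rescaling of the step
    change_ema.mul_(beta).add_(step, alpha=1.0 - beta)  # track past changes in place
    param.add_(step)
    return param, change_ema
```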
arXiv Detail & Related papers (2020-10-21T14:49:00Z)
- AdaS: Adaptive Scheduling of Stochastic Gradients [50.80697760166045]
We introduce the notions of "knowledge gain" and "mapping condition" and propose a new algorithm called Adaptive Scheduling (AdaS).
Experimentation reveals that, using the derived metrics, AdaS exhibits: (a) faster convergence and superior generalization over existing adaptive learning methods; and (b) no dependence on a validation set to determine when to stop training.
arXiv Detail & Related papers (2020-06-11T16:36:31Z)
- Disentangling Adaptive Gradient Methods from Learning Rates [65.0397050979662]
We take a deeper look at how adaptive gradient methods interact with the learning rate schedule.
We introduce a "grafting" experiment which decouples an update's magnitude from its direction (see the sketch below).
We present some empirical and theoretical retrospectives on the generalization of adaptive gradient methods.
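The grafting idea can be sketched as follows: one optimizer supplies the step's magnitude and another supplies its direction, so the two can be studied independently. A single global norm is used here for brevity; the function name is illustrative.

```python
import torch

def graft_step(step_from_magnitude_optimizer, step_from_direction_optimizer, eps=1e-12):
    """Grafting sketch: combine the norm of one optimizer's proposed step
    with the unit direction of another optimizer's proposed step."""
    magnitude = step_from_magnitude_optimizer.norm()
    direction = step_from_direction_optimizer / (step_from_direction_optimizer.norm() + eps)
    return magnitude * direction
```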
arXiv Detail & Related papers (2020-02-26T21:42:49Z)