A Second look at Exponential and Cosine Step Sizes: Simplicity,
Adaptivity, and Performance
- URL: http://arxiv.org/abs/2002.05273v4
- Date: Wed, 9 Jun 2021 18:26:34 GMT
- Title: A Second look at Exponential and Cosine Step Sizes: Simplicity,
Adaptivity, and Performance
- Authors: Xiaoyu Li, Zhenxun Zhuang, Francesco Orabona
- Abstract summary: Stochastic Gradient Descent (SGD) is a popular tool for training large-scale machine learning models.
Its performance is highly variable, depending crucially on the choice of the step sizes.
A variety of strategies for tuning the step sizes have been proposed.
- Score: 23.89815527019194
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Stochastic Gradient Descent (SGD) is a popular tool in training large-scale
machine learning models. Its performance, however, is highly variable,
depending crucially on the choice of the step sizes. Accordingly, a variety of
strategies for tuning the step sizes have been proposed, ranging from
coordinate-wise approaches (a.k.a. ``adaptive'' step sizes) to sophisticated
heuristics to change the step size in each iteration. In this paper, we study
two step size schedules whose power has been repeatedly confirmed in practice:
the exponential and the cosine step sizes. For the first time, we provide
theoretical support for them proving convergence rates for smooth non-convex
functions, with and without the Polyak-\L{}ojasiewicz (PL) condition. Moreover,
we show the surprising property that these two strategies are \emph{adaptive}
to the noise level in the stochastic gradients of PL functions. That is,
contrary to polynomial step sizes, they achieve almost optimal performance
without needing to know the noise level nor tuning their hyperparameters based
on it. Finally, we conduct a fair and comprehensive empirical evaluation of
real-world datasets with deep learning architectures. Results show that, even
if only requiring at most two hyperparameters to tune, these two strategies
best or match the performance of various finely-tuned state-of-the-art
strategies.
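For reference, the two schedules studied here are usually written as $\eta_t = \eta_0 \cdot \alpha^t$ (exponential) and $\eta_t = \frac{\eta_0}{2}\left(1 + \cos\frac{\pi t}{T}\right)$ (cosine). A minimal sketch in Python with illustrative constants; how $\eta_0$ and the decay factor $\alpha$ are tied to the horizon $T$ is left to the paper's tuning:

```python
import math

def exponential_step_size(t, eta0=0.1, alpha=0.999):
    """Exponential schedule: eta_t = eta0 * alpha**t with decay factor alpha < 1.
    alpha is often tied to the horizon T, e.g. alpha = (beta / T)**(1 / T);
    the constants here are illustrative placeholders, not the paper's choices."""
    return eta0 * alpha ** t

def cosine_step_size(t, T, eta0=0.1):
    """Cosine schedule: eta_t = (eta0 / 2) * (1 + cos(pi * t / T)),
    decaying from eta0 at t = 0 to 0 at t = T."""
    return 0.5 * eta0 * (1.0 + math.cos(math.pi * t / T))
```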
Related papers
- Adaptive Federated Learning Over the Air [108.62635460744109]
We propose a federated version of adaptive gradient methods, particularly AdaGrad and Adam, within the framework of over-the-air model training.
Our analysis shows that the AdaGrad-based training algorithm converges to a stationary point at the rate of $\mathcal{O}\big(\ln(T) / T^{1 - \frac{1}{\alpha}}\big)$.
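For context, the coordinate-wise AdaGrad update that such an analysis builds on is standard; a minimal sketch of the plain (non-federated, non-over-the-air) form, with an illustrative epsilon for numerical stability:

```python
import numpy as np

def adagrad_step(x, grad, accum, lr=0.01, eps=1e-8):
    """One coordinate-wise AdaGrad step: accumulate squared gradients and
    scale the step per coordinate by 1 / (sqrt(accumulated) + eps)."""
    accum = accum + grad ** 2                   # running sum of squared gradients
    x = x - lr * grad / (np.sqrt(accum) + eps)  # per-coordinate adaptive step
    return x, accum
```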
arXiv Detail & Related papers (2024-03-11T09:10:37Z)
- Stochastic Two Points Method for Deep Model Zeroth-order Optimization [32.459322001738144]
This paper introduces an efficient Stochastic Two-Point (S2P) approach within the gradient-free regime.
We present the theoretical convergence properties of S2P under the general and relaxed smoothness assumptions.
We show that VS2P is highly effective in optimizing objectives for deep models.
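As background, gradient-free two-point methods estimate a directional derivative from two function evaluations along a random direction; a minimal sketch of the generic estimator (the paper's S2P/VS2P variants add refinements not shown here):

```python
import numpy as np

def two_point_grad_estimate(f, x, mu=1e-3, rng=None):
    """Generic two-point zeroth-order estimate:
    g_hat = (f(x + mu * u) - f(x - mu * u)) / (2 * mu) * u
    for a random Gaussian direction u and smoothing radius mu."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.standard_normal(x.shape)
    return (f(x + mu * u) - f(x - mu * u)) / (2.0 * mu) * u
```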
arXiv Detail & Related papers (2024-02-02T18:39:40Z)
- ELRA: Exponential learning rate adaption gradient descent optimization method [83.88591755871734]
We present a novel, fast (exponential rate), ab initio (hyperparameter-free) gradient-based adaptation method.
The main idea of the method is to adapt the step size $\alpha$ through situational awareness.
It can be applied to problems of any dimension $n$ and scales only linearly with $n$.
arXiv Detail & Related papers (2023-09-12T14:36:13Z)
- Adaptive Learning Rates for Faster Stochastic Gradient Methods [6.935471115003109]
We propose new adaptive step size strategies that improve several stochastic gradient methods.
Our first method is based on the classical Polyak step size (Polyak, 1987) and extends the recent stochastic development of this method.
Our second method, denoted GraDS, rescales the step size by the "diversity of gradients".
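For reference, the classical Polyak step size sets the learning rate from the current suboptimality; a minimal sketch (assuming the optimal value f_star is known, as in the classical rule; the paper's stochastic extension and the GraDS rescaling are not shown):

```python
import numpy as np

def polyak_step_size(f_x, f_star, grad, eps=1e-12):
    """Classical Polyak rule: eta = (f(x) - f*) / ||grad f(x)||^2."""
    return (f_x - f_star) / (np.dot(grad, grad) + eps)
```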
arXiv Detail & Related papers (2022-08-10T11:36:00Z)
- Formal guarantees for heuristic optimization algorithms used in machine learning [6.978625807687497]
Stochastic Gradient Descent (SGD) and its variants have become the dominant methods for large-scale machine learning (ML) optimization problems.
We provide formal guarantees for a few convex optimization methods and propose improved algorithms.
arXiv Detail & Related papers (2022-07-31T19:41:22Z)
- Embed to Control Partially Observed Systems: Representation Learning with Provable Sample Efficiency [105.17746223041954]
Reinforcement learning in partially observed Markov decision processes (POMDPs) faces two challenges.
It often takes the full history to predict the future, which induces a sample complexity that scales exponentially with the horizon.
We propose a reinforcement learning algorithm named Embed to Control (ETC), which learns the representation at two levels while optimizing the policy.
arXiv Detail & Related papers (2022-05-26T16:34:46Z)
- Balancing Rates and Variance via Adaptive Batch-Size for Stochastic Optimization Problems [120.21685755278509]
In this work, we seek to balance the fact that an attenuating step-size is required for exact convergence with the fact that a constant step-size learns faster, albeit only up to an error neighborhood of the solution.
Rather than fixing the minibatch size and the step-size at the outset, we propose to allow these parameters to evolve adaptively.
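To illustrate the general idea only (a constant step size learns quickly but stalls at a variance-induced error floor, which a growing minibatch attenuates), here is a toy loop with a hypothetical geometric batch-growth rule; the paper's actual adaptation criterion is different and more principled:

```python
import numpy as np

def sgd_with_growing_batch(grad_sample, x, steps=100, lr=0.1, batch0=8, growth=1.05):
    """Toy SGD loop: keep the step size constant and grow the minibatch
    geometrically so the gradient-noise variance shrinks over time.
    grad_sample(x) is assumed to return one stochastic gradient sample."""
    batch = float(batch0)
    for _ in range(steps):
        g = np.mean([grad_sample(x) for _ in range(int(batch))], axis=0)
        x = x - lr * g
        batch *= growth  # hypothetical growth rule, not the paper's criterion
    return x
```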
arXiv Detail & Related papers (2020-07-02T16:02:02Z)
- Adaptive Gradient Methods Converge Faster with Over-Parameterization (but you should do a line-search) [32.24244211281863]
We study a simplistic setting -- smooth, convex losses with models over-parameterized enough to interpolate the data.
We prove that AMSGrad with constant step-size and momentum converges to the minimizer at a faster $O(1/T)$ rate.
We show that these techniques improve the convergence and generalization of adaptive gradient methods across tasks.
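AMSGrad itself is standard; a minimal sketch of one update with a constant step size (bias correction omitted), showing the running max of the second moment that distinguishes it from Adam:

```python
import numpy as np

def amsgrad_step(x, g, m, v, v_hat, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One AMSGrad step: Adam-style moment estimates plus a running max of v,
    which keeps the effective per-coordinate step size non-increasing."""
    m = b1 * m + (1 - b1) * g           # first moment (momentum)
    v = b2 * v + (1 - b2) * g ** 2      # second moment
    v_hat = np.maximum(v_hat, v)        # AMSGrad max correction
    x = x - lr * m / (np.sqrt(v_hat) + eps)
    return x, m, v, v_hat
```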
arXiv Detail & Related papers (2020-06-11T21:23:30Z)
- Dynamic Scale Training for Object Detection [111.33112051962514]
We propose a Dynamic Scale Training paradigm (abbreviated as DST) to mitigate the scale variation challenge in object detection.
Experimental results demonstrate the efficacy of our proposed DST towards scale variation handling.
It does not introduce inference overhead and could serve as a free lunch for general detection configurations.
arXiv Detail & Related papers (2020-04-26T16:48:17Z)
- Stochastic Polyak Step-size for SGD: An Adaptive Learning Rate for Fast Convergence [30.393999722555154]
We propose a variant of the classical Polyak step-size (Polyak, 1987) commonly used in the subgradient method.
The proposed stochastic Polyak step-size (SPS) is an attractive choice for setting the learning rate for stochastic gradient descent.
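The stochastic Polyak step size replaces the full objective with the sampled loss; a minimal sketch of the commonly stated bounded form, where f_i_star is the minimum of the sampled loss (often taken as 0 for interpolating models) and c, eta_max are hyperparameters:

```python
import numpy as np

def sps_step_size(f_i, f_i_star, grad_i, c=0.5, eta_max=1.0, eps=1e-12):
    """Stochastic Polyak step size (bounded form):
    eta = min((f_i(x) - f_i*) / (c * ||grad f_i(x)||^2), eta_max)."""
    return min((f_i - f_i_star) / (c * np.dot(grad_i, grad_i) + eps), eta_max)
```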
arXiv Detail & Related papers (2020-02-24T20:57:23Z)
- Towards Better Understanding of Adaptive Gradient Algorithms in Generative Adversarial Nets [71.05306664267832]
Adaptive algorithms perform gradient updates using the history of gradients and are ubiquitous in training deep neural networks.
In this paper we analyze a variant of the Optimistic Adagrad (OAdagrad) algorithm for nonconvex-nonconcave min-max problems.
Our experiments show that the advantage of adaptive gradient algorithms over non-adaptive ones in GAN training can be observed empirically.
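For background, the generic optimistic gradient update extrapolates with the previous gradient; a minimal sketch of that generic step (the paper's variant is adaptive, i.e. it additionally rescales the step coordinate-wise from the gradient history, which is not shown here):

```python
import numpy as np

def optimistic_gradient_step(x, g, g_prev, lr=1e-3):
    """Generic optimistic gradient update:
    x_{t+1} = x_t - lr * (2 * g_t - g_{t-1})."""
    return x - lr * (2.0 * g - g_prev)
```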
arXiv Detail & Related papers (2019-12-26T22:10:10Z)