Automatic gradient descent with generalized Newton's method
- URL: http://arxiv.org/abs/2407.02772v1
- Date: Wed, 3 Jul 2024 03:01:43 GMT
- Title: Automatic gradient descent with generalized Newton's method
- Authors: Zhiqi Bu, Shiyun Xu
- Abstract summary: We propose a Hessian-informed approach that applies to any optimizer, such as SGD and Adam.
Our method automatically and dynamically selects the learning rate that accelerates the convergence.
In practice, our method is easily implementable, since it only requires additional forward passes with almost zero computational overhead.
- Score: 8.885727065823156
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose the generalized Newton's method (GeN) -- a Hessian-informed approach that applies to any optimizer such as SGD and Adam, and covers the Newton-Raphson method as a sub-case. Our method automatically and dynamically selects the learning rate that accelerates the convergence, without the intensive tuning of the learning rate scheduler. In practice, our method is easily implementable, since it only requires additional forward passes with almost zero computational overhead (in terms of training time and memory cost), if the overhead is amortized over many iterations. We present extensive experiments on language and vision tasks (e.g. GPT and ResNet) to showcase that GeN optimizers match the state-of-the-art performance, which was achieved with carefully tuned learning rate schedulers. Code to be released at \url{https://github.com/ShiyunXu/AutoGeN}.
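To make the recipe concrete, here is a minimal sketch of choosing a learning rate from a few extra forward passes: evaluate the loss at several trial step sizes along the optimizer's update direction, fit a quadratic in the step size, and take its minimizer. This is an illustration of the general idea under simplifying assumptions, not the authors' released implementation; all function and argument names are made up.

```python
import numpy as np
import torch

def hessian_informed_lr(model, loss_fn, batch, update_dir, etas=(0.0, 1e-3, 2e-3)):
    """Sketch of learning-rate selection via extra forward passes.

    Evaluates the loss at a few trial step sizes along `update_dir` (the
    optimizer's proposed update, e.g. the SGD or Adam direction), fits a
    quadratic L(eta) ~ a*eta^2 + b*eta + c, and returns its minimizer.
    Names are illustrative, not the paper's API.
    """
    params = [p for p in model.parameters() if p.requires_grad]
    losses = []
    with torch.no_grad():
        for eta in etas:
            # temporarily step the parameters by -eta * update_dir
            for p, d in zip(params, update_dir):
                p.add_(d, alpha=-eta)
            losses.append(loss_fn(model(batch["x"]), batch["y"]).item())
            # undo the trial step
            for p, d in zip(params, update_dir):
                p.add_(d, alpha=eta)
    # quadratic fit in eta; argmin at -b / (2a) when curvature a > 0
    a, b, c = np.polyfit(np.array(etas), np.array(losses), deg=2)
    if a <= 0:  # no positive curvature detected: fall back to the best trial eta
        return etas[int(np.argmin(losses))]
    return float(-b / (2 * a))
```

Because only a handful of extra forward passes are needed, and the selected learning rate can be reused for several iterations, the overhead is amortized as the abstract describes.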
Related papers
- Gradient Descent with Provably Tuned Learning-rate Schedules [14.391648046717073]
We develop novel analytical tools for provably tuning factors in gradient-based algorithms. Our analysis applies to neural networks with commonly used activation functions.
arXiv Detail & Related papers (2025-12-04T18:49:58Z) - Turbo-Muon: Accelerating Orthogonality-Based Optimization with Pre-Conditioning [7.966927192439667]
We introduce a preconditioning procedure that accelerates Newton-Schulz convergence and reduces its computational cost. Our publicly available implementation achieves up to a 2.8x speedup in the Newton-Schulz approximation.
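For context, the Newton-Schulz iteration mentioned above approximately orthogonalizes a matrix using only matrix multiplications. A minimal, unpreconditioned sketch follows; the step count and normalization are illustrative choices, not the paper's tuned procedure.

```python
import torch

def newton_schulz_orthogonalize(g: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize a matrix with the Newton-Schulz iteration.

    Uses only matrix multiplications, so it is GPU-friendly.  The scaling and
    step count here are illustrative; preconditioning (as in the paper above)
    aims to reach a good approximation in fewer steps.
    """
    x = g / (g.norm() + 1e-7)            # scale so singular values lie in (0, sqrt(3))
    transposed = x.shape[0] > x.shape[1]
    if transposed:                         # work in the short-and-wide orientation
        x = x.T
    for _ in range(steps):
        x = 1.5 * x - 0.5 * x @ x.T @ x    # X <- 0.5 * X (3I - X^T X)
    return x.T if transposed else x
```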
arXiv Detail & Related papers (2025-12-04T10:06:22Z) - AutoGD: Automatic Learning Rate Selection for Gradient Descent [54.195493042469]
We introduce AutoGD: a gradient descent method that automatically determines whether to increase or decrease the learning rate at a given iteration. We show that we can recover the optimal rate of GD (up to a constant) for a broad class of functions without knowledge of smoothness constants.
arXiv Detail & Related papers (2025-10-10T23:47:22Z) - AutoSGD: Automatic Learning Rate Selection for Stochastic Gradient Descent [58.05410015124021]
We introduce AutoSGD: an SGD method that automatically determines whether to increase or decrease the learning rate at a given iteration. Empirical results suggest strong performance of the method on a variety of traditional optimization problems and machine learning tasks.
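Neither summary spells out the decision rule, but the increase-or-decrease idea can be illustrated with a generic multiplicative heuristic; this stand-in is not the actual AutoGD or AutoSGD criterion.

```python
def adapt_lr(lr, prev_loss, curr_loss, grow=1.1, shrink=0.5):
    """Generic multiplicative learning-rate adaptation (illustrative only).

    If the last step decreased the loss, cautiously increase the learning
    rate; otherwise decrease it.  Methods such as AutoGD/AutoSGD use
    principled tests rather than this simple rule.
    """
    return lr * grow if curr_loss < prev_loss else lr * shrink
```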
arXiv Detail & Related papers (2025-05-27T18:25:21Z) - On Newton's Method to Unlearn Neural Networks [44.85793893441989]
We seek approximate unlearning algorithms for neural networks (NNs) that return models identical to those from the retrained oracle.
We propose CureNewton's method, a principled approach that leverages cubic regularization to handle Hessian degeneracy effectively.
Experiments across different models and datasets show that our method achieves unlearning performance competitive with the state-of-the-art algorithm in practical unlearning settings.
arXiv Detail & Related papers (2024-06-20T17:12:20Z) - Mechanic: A Learning Rate Tuner [52.4242550204696]
We introduce a technique for automatically tuning the learning rate scale factor of any base optimization algorithm and schedule, which we call Mechanic.
We rigorously evaluate Mechanic on a range of large-scale deep learning tasks with varying batch sizes, schedules, and base optimization algorithms.
arXiv Detail & Related papers (2023-05-31T19:32:43Z) - Improving Representational Continuity via Continued Pretraining [76.29171039601948]
Linear probing then fine-tuning (LP-FT), a two-stage transfer learning recipe, outperforms naive training and other continual learning methods.
LP-FT also reduces forgetting on a real-world satellite remote sensing dataset (FMoW).
A variant of LP-FT achieves state-of-the-art accuracies on an NLP continual learning benchmark.
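A minimal sketch of the LP-FT recipe itself: first train only a linear head on a frozen backbone, then unfreeze and fine-tune everything. `backbone`, `head`, and `train_fn` are hypothetical placeholders for the model parts and the usual training loop.

```python
def lp_ft(backbone, head, train_fn, lp_epochs=5, ft_epochs=5):
    """Linear probing then fine-tuning (LP-FT), sketched with illustrative names.

    `train_fn(parameters, epochs)` is assumed to run a standard training loop
    over whatever parameters it is given.
    """
    # Phase 1: linear probing -- freeze the backbone, train only the head.
    for p in backbone.parameters():
        p.requires_grad = False
    train_fn(list(head.parameters()), epochs=lp_epochs)

    # Phase 2: fine-tuning -- unfreeze the backbone and train all parameters.
    for p in backbone.parameters():
        p.requires_grad = True
    train_fn(list(backbone.parameters()) + list(head.parameters()), epochs=ft_epochs)
```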
arXiv Detail & Related papers (2023-02-26T10:39:38Z) - On the efficiency of Stochastic Quasi-Newton Methods for Deep Learning [0.0]
We study the behaviour of stochastic quasi-Newton training algorithms for deep neural networks.
We show that quasi-Newton methods are efficient and, in some instances, able to outperform the well-known first-order Adam optimizer.
arXiv Detail & Related papers (2022-05-18T20:53:58Z) - Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than
In-Context Learning [81.3514358542452]
Few-shot in-context learning (ICL) incurs substantial computational, memory, and storage costs because it involves processing all of the training examples every time a prediction is made.
Parameter-efficient fine-tuning offers an alternative paradigm in which a small set of parameters is trained to enable a model to perform the new task.
In this paper, we rigorously compare few-shot ICL and parameter-efficient fine-tuning and demonstrate that the latter offers better accuracy as well as dramatically lower computational costs.
arXiv Detail & Related papers (2022-05-11T17:10:41Z) - AutoDrop: Training Deep Learning Models with Automatic Learning Rate
Drop [16.396327849817464]
We develop an algorithm that realizes the learning rate drop automatically.
We show that our method improves over SOTA training approaches.
arXiv Detail & Related papers (2021-11-30T11:55:21Z) - Adapting Stepsizes by Momentumized Gradients Improves Optimization and
Generalization [89.66571637204012]
AdaMomentum performs well on vision tasks and consistently achieves state-of-the-art results on other tasks, including language processing.
arXiv Detail & Related papers (2021-06-22T03:13:23Z) - Faster Meta Update Strategy for Noise-Robust Deep Learning [62.08964100618873]
We introduce a novel Faster Meta Update Strategy (FaMUS) to replace the most expensive step in the meta gradient with a faster layer-wise approximation.
We show that our method saves two-thirds of the training time while maintaining comparable, or even better, generalization performance.
arXiv Detail & Related papers (2021-04-30T16:19:07Z) - RNN Training along Locally Optimal Trajectories via Frank-Wolfe
Algorithm [50.76576946099215]
We propose a novel and efficient training method for RNNs that iteratively seeks a local minimum of the loss surface within a small region.
Surprisingly, even with this additional per-step cost, the overall training cost is empirically observed to be lower than that of standard back-propagation.
arXiv Detail & Related papers (2020-10-12T01:59:18Z) - Second-order Neural Network Training Using Complex-step Directional
Derivative [41.4333906662624]
We introduce a numerical algorithm for second-order neural network training.
We tackle the practical obstacle of Hessian calculation by using the complex-step finite difference.
We believe our method will inspire a wide range of new algorithms for deep learning and numerical optimization.
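For readers unfamiliar with the trick: for a real-analytic loss, the complex-step formula Im L(theta + i*h*v) / h approximates the directional derivative grad L(theta) . v without subtractive cancellation, so h can be taken extremely small. A tiny NumPy illustration of the principle (not the paper's training code):

```python
import numpy as np

def complex_step_directional_derivative(f, theta, v, h=1e-20):
    """Directional derivative f'(theta) . v via the complex-step method.

    Requires f to accept complex inputs and be real-analytic.  Because no
    nearly-equal terms are subtracted, h can be made extremely small without
    round-off error.
    """
    return np.imag(f(theta + 1j * h * v)) / h

# Example: f(theta) = sum(sin(theta)); the true directional derivative is cos(theta) . v
theta = np.array([0.3, 1.2])
v = np.array([1.0, -2.0])
f = lambda x: np.sum(np.sin(x))
print(complex_step_directional_derivative(f, theta, v))
```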
arXiv Detail & Related papers (2020-09-15T13:46:57Z) - AdaS: Adaptive Scheduling of Stochastic Gradients [50.80697760166045]
We introduce the notions of "knowledge gain" and "mapping condition" and propose a new algorithm called Adaptive Scheduling (AdaS).
Experimentation reveals that, using the derived metrics, AdaS exhibits: (a) faster convergence and superior generalization over existing adaptive learning methods; and (b) lack of dependence on a validation set to determine when to stop training.
arXiv Detail & Related papers (2020-06-11T16:36:31Z) - Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z) - On the Promise of the Stochastic Generalized Gauss-Newton Method for
Training DNNs [37.96456928567548]
We study the stochastic generalized Gauss-Newton method (SGN) for training DNNs.
SGN is a second-order optimization method with efficient iterations that, as we demonstrate, often requires substantially fewer iterations than standard SGD to converge.
We show that SGN does not only substantially improve over SGD in terms of the number of iterations, but also in terms of runtime.
This is made possible by an efficient, easy-to-use and flexible implementation of SGN we propose in the Theano deep learning platform.
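As background, Gauss-Newton methods replace the Hessian with a positive semi-definite curvature proxy built from the model Jacobian; for a plain least-squares residual r(theta) this is J^T J. A minimal damped Gauss-Newton step in NumPy (illustrative only; SGN itself uses the generalized Gauss-Newton matrix for neural-network losses):

```python
import numpy as np

def gauss_newton_step(residual_fn, jacobian_fn, theta, damping=1e-6):
    """One Gauss-Newton step for the objective 0.5 * ||r(theta)||^2.

    Approximates the Hessian by J^T J (always positive semi-definite) and adds
    a small damping term for numerical stability (Levenberg-Marquardt style).
    """
    r = residual_fn(theta)   # residual vector, shape (m,)
    J = jacobian_fn(theta)   # Jacobian dr/dtheta, shape (m, n)
    H = J.T @ J + damping * np.eye(J.shape[1])
    g = J.T @ r
    return theta - np.linalg.solve(H, g)
```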
arXiv Detail & Related papers (2020-06-03T17:35:54Z) - Deep Neural Network Learning with Second-Order Optimizers -- a Practical
Study with a Stochastic Quasi-Gauss-Newton Method [0.0]
We introduce and study a second-order quasi-Gauss-Newton (SQGN) optimization method that combines ideas from quasi-Newton methods, Gauss-Newton methods, and variance reduction to address this problem.
We discuss the implementation of SQGN, and we compare its convergence and computational performance to selected first-order methods on benchmark problems.
arXiv Detail & Related papers (2020-04-06T23:41:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.