mL-BFGS: A Momentum-based L-BFGS for Distributed Large-Scale Neural
Network Optimization
- URL: http://arxiv.org/abs/2307.13744v1
- Date: Tue, 25 Jul 2023 18:03:29 GMT
- Title: mL-BFGS: A Momentum-based L-BFGS for Distributed Large-Scale Neural
Network Optimization
- Authors: Yue Niu, Zalan Fabian, Sunwoo Lee, Mahdi Soltanolkotabi, Salman
Avestimehr
- Abstract summary: We propose mL-BFGS, a momentum-based L-BFGS algorithm that paves the way for quasi-Newton (QN) methods in large-scale distributed deep neural network (DNN) optimization.
For model training at a large scale, mL-BFGS approximates a block-wise Hessian, thus distributing compute and memory costs across all computing nodes.
Results show that mL-BFGS achieves both noticeable iteration-wise and wall-clock speedup.
- Score: 35.08820062020787
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Quasi-Newton methods still face significant challenges in training
large-scale neural networks due to additional compute costs in the Hessian
related computations and instability issues in stochastic training. A
well-known method, L-BFGS, which efficiently approximates the Hessian using
historical parameter and gradient changes, suffers from convergence instability in
stochastic training. So far, attempts that adapt L-BFGS to large-scale
stochastic training incur considerable extra overhead, which offsets its
convergence benefits in wall-clock time. In this paper, we propose mL-BFGS, a
lightweight momentum-based L-BFGS algorithm that paves the way for quasi-Newton
(QN) methods in large-scale distributed deep neural network (DNN) optimization.
mL-BFGS introduces a nearly cost-free momentum scheme into the L-BFGS update and
greatly reduces stochastic noise in the Hessian, thereby stabilizing
convergence during stochastic optimization. For model training at a large
scale, mL-BFGS approximates a block-wise Hessian, thus distributing
compute and memory costs across all computing nodes. We provide a supporting
convergence analysis for mL-BFGS in stochastic settings. To investigate the
potential of mL-BFGS in large-scale DNN training, we train benchmark neural models using
mL-BFGS and compare performance with baselines (SGD, Adam, and other
quasi-Newton methods). Results show that mL-BFGS achieves both noticeable
iteration-wise and wall-clock speedup.
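To make the core idea concrete, here is a minimal NumPy sketch of one plausible way to pair momentum smoothing with the standard L-BFGS two-loop recursion: exponential moving averages of the parameter and gradient differences are used as the curvature pairs (s, y), damping stochastic noise before they enter the L-BFGS history. The names (`ml_bfgs_step`, `init_state`), the smoothing coefficient `beta`, and the single flat parameter vector are illustrative assumptions; this is not the authors' exact update rule, and it omits the block-wise, distributed Hessian approximation described in the abstract.

```python
import numpy as np

def two_loop_direction(grad, s_hist, y_hist):
    """Standard L-BFGS two-loop recursion: returns an approximation of -H^{-1} grad."""
    q = np.asarray(grad, dtype=float).copy()
    alphas = []
    for s, y in zip(reversed(s_hist), reversed(y_hist)):   # newest pair first
        rho = 1.0 / (y @ s)
        a = rho * (s @ q)
        alphas.append((a, rho, s, y))
        q -= a * y
    s, y = s_hist[-1], y_hist[-1]
    q *= (s @ y) / (y @ y)                                  # initial Hessian scaling
    for a, rho, s, y in reversed(alphas):                   # oldest pair first
        b = rho * (y @ q)
        q += (a - b) * s
    return -q

def init_state(dim, history=10):
    """State for the illustrative momentum-smoothed L-BFGS (hypothetical helper)."""
    return {"prev_w": None, "prev_g": None,
            "s_mom": np.zeros(dim), "y_mom": np.zeros(dim),
            "s_hist": [], "y_hist": [], "history": history}

def ml_bfgs_step(w, g, state, lr=0.1, beta=0.9):
    """One illustrative momentum-smoothed L-BFGS step (not the paper's exact rule)."""
    if state["prev_w"] is not None:
        # Momentum (EMA) smoothing of the curvature information (s, y).
        state["s_mom"] = beta * state["s_mom"] + (1 - beta) * (w - state["prev_w"])
        state["y_mom"] = beta * state["y_mom"] + (1 - beta) * (g - state["prev_g"])
        if state["s_mom"] @ state["y_mom"] > 1e-10:          # keep the update positive-definite
            state["s_hist"].append(state["s_mom"].copy())
            state["y_hist"].append(state["y_mom"].copy())
            state["s_hist"] = state["s_hist"][-state["history"]:]
            state["y_hist"] = state["y_hist"][-state["history"]:]
    state["prev_w"], state["prev_g"] = w.copy(), g.copy()
    d = two_loop_direction(g, state["s_hist"], state["y_hist"]) if state["s_hist"] else -np.asarray(g, dtype=float)
    return w + lr * d
```

In a training loop one would initialize `state = init_state(w.size)`, compute a stochastic (mini-batch) gradient `g`, and update `w = ml_bfgs_step(w, g, state)` at each iteration.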
Related papers
- LESA: Learnable LLM Layer Scaling-Up [57.0510934286449]
Training Large Language Models (LLMs) from scratch requires immense computational resources, making it prohibitively expensive.
Model scaling-up offers a promising solution by leveraging the parameters of smaller models to create larger ones.
We propose LESA, a novel learnable method for depth scaling-up.
arXiv Detail & Related papers (2025-02-19T14:58:48Z) - RoSTE: An Efficient Quantization-Aware Supervised Fine-Tuning Approach for Large Language Models [95.32315448601241]
We propose an algorithm named Rotated Straight-Through-Estimator (RoSTE).
RoSTE combines quantization-aware supervised fine-tuning (QA-SFT) with an adaptive rotation strategy to reduce activation outliers.
Our findings reveal that the prediction error is directly proportional to the quantization error of the converged weights, which can be effectively managed through an optimized rotation configuration.
arXiv Detail & Related papers (2025-02-13T06:44:33Z) - Which Optimizer Works Best for Physics-Informed Neural Networks and Kolmogorov-Arnold Networks? [1.8175282137722093]
Physics-Informed Neural Networks (PINNs) have revolutionized the computation of partial differential equations (PDEs).
These PINNs integrate PDEs into the neural network's training process as soft constraints.
arXiv Detail & Related papers (2025-01-22T21:19:42Z) - Implicit Stochastic Gradient Descent for Training Physics-informed
Neural Networks [51.92362217307946]
Physics-informed neural networks (PINNs) have effectively been demonstrated in solving forward and inverse differential equation problems.
PINNs can suffer training failures when the target functions to be approximated exhibit high-frequency or multi-scale features.
In this paper, we propose to employ the implicit stochastic gradient descent (ISGD) method to train PINNs, improving the stability of the training process.
arXiv Detail & Related papers (2023-03-03T08:17:47Z) - Variational Linearized Laplace Approximation for Bayesian Deep Learning [11.22428369342346]
We propose a new method for approximating the Linearized Laplace Approximation (LLA) using a variational sparse Gaussian Process (GP).
Our method is based on the dual RKHS formulation of GPs and retains, as the predictive mean, the output of the original DNN.
It allows for efficient optimization, which results in sub-linear training time in the size of the training dataset.
arXiv Detail & Related papers (2023-02-24T10:32:30Z) - Partitioning sparse deep neural networks for scalable training and
inference [8.282177703075453]
State-of-the-art deep neural networks (DNNs) have significant computational and data management requirements.
Sparsification and pruning methods are shown to be effective in removing a large fraction of connections in DNNs.
The resulting sparse networks present unique challenges to further improve the computational efficiency of training and inference in deep learning.
arXiv Detail & Related papers (2021-04-23T20:05:52Z) - An Adaptive Memory Multi-Batch L-BFGS Algorithm for Neural Network
Training [0.951828574518325]
A limited memory version of the BFGS algorithm has been receiving increasing attention in recent years for large neural network training problems.
We propose a multi-batch L-BFGS algorithm, namely MB-AM, that gradually increases its trust in the curvature information.
arXiv Detail & Related papers (2020-12-14T11:40:41Z) - Stochastic Damped L-BFGS with Controlled Norm of the Hessian
Approximation [3.0204520109309843]
We propose a new variance-reduced damped L-BFGS, where we leverage estimates of bounds on the largest and smallest eigenvalues of the Hessian approximation to balance its quality and conditioning.
Our method, VARCHEN, draws from previous work that proposed a novel damped L-BFGS algorithm called SdLBFGS.
We demonstrate that VARCHEN is more robust than SdLBFGSVR and SVRG on a modified DavidNet problem.
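For context on how damping keeps a quasi-Newton approximation well conditioned, the snippet below sketches the classical Powell damping rule, which blends the gradient difference y with B s so that the curvature pair retains sufficient positive curvature. This is a generic illustration (the helper `powell_damped_pair` and the diagonal Hessian estimate `B_diag` are our own assumptions), not VARCHEN's eigenvalue-based rule; it only shows the damping idea that such methods build on.

```python
import numpy as np

def powell_damped_pair(s, y, B_diag, c=0.2):
    """Classical Powell damping for a BFGS-type update (illustrative).

    Blends y with B s so that s^T y_damped >= c * s^T B s, keeping the
    quasi-Newton update well defined and better conditioned.
    `B_diag` is a diagonal approximation of the current Hessian estimate.
    """
    Bs = B_diag * s
    sBs = float(s @ Bs)
    sy = float(s @ y)
    theta = 1.0 if sy >= c * sBs else (1.0 - c) * sBs / (sBs - sy)
    y_damped = theta * y + (1.0 - theta) * Bs
    return s, y_damped
```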
arXiv Detail & Related papers (2020-12-10T16:19:02Z) - Neural networks with late-phase weights [66.72777753269658]
We show that the solutions found by SGD can be further improved by ensembling a subset of the weights in late stages of learning.
At the end of learning, we obtain a single model by taking a spatial average in weight space.
arXiv Detail & Related papers (2020-07-25T13:23:37Z) - Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z) - Communication-Efficient Distributed Stochastic AUC Maximization with
Deep Neural Networks [50.42141893913188]
We study a distributed algorithm for large-scale AUC maximization with a deep neural network as the predictive model.
In theory, our algorithm requires far fewer communication rounds than previous approaches.
Experiments on several benchmark datasets demonstrate the effectiveness of our algorithm and confirm our theory.
arXiv Detail & Related papers (2020-05-05T18:08:23Z)