mL-BFGS: A Momentum-based L-BFGS for Distributed Large-Scale Neural
Network Optimization
- URL: http://arxiv.org/abs/2307.13744v1
- Date: Tue, 25 Jul 2023 18:03:29 GMT
- Title: mL-BFGS: A Momentum-based L-BFGS for Distributed Large-Scale Neural
Network Optimization
- Authors: Yue Niu, Zalan Fabian, Sunwoo Lee, Mahdi Soltanolkotabi, Salman
Avestimehr
- Abstract summary: We propose a momentum-based L-BFGS algorithm that paves the way for quasi-Newton (QN) methods in large-scale distributed deep neural network (DNN)
For model training at a large scale, mL-BFGS approximates a block-wise Hessian, thus enabling distributing compute and memory costs across all computing.
Results show that mL-BFGS achieves both noticeable gradient-wise and wall-clock speedup.
- Score: 35.08820062020787
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Quasi-Newton methods still face significant challenges in training
large-scale neural networks due to additional compute costs in the Hessian
related computations and instability issues in stochastic training. A
well-known method, L-BFGS that efficiently approximates the Hessian using
history parameter and gradient changes, suffers convergence instability in
stochastic training. So far, attempts that adapt L-BFGS to large-scale
stochastic training incur considerable extra overhead, which offsets its
convergence benefits in wall-clock time. In this paper, we propose mL-BFGS, a
lightweight momentum-based L-BFGS algorithm that paves the way for quasi-Newton
(QN) methods in large-scale distributed deep neural network (DNN) optimization.
mL-BFGS introduces a nearly cost-free momentum scheme into L-BFGS update and
greatly reduces stochastic noise in the Hessian, therefore stabilizing
convergence during stochastic optimization. For model training at a large
scale, mL-BFGS approximates a block-wise Hessian, thus enabling distributing
compute and memory costs across all computing nodes. We provide a supporting
convergence analysis for mL-BFGS in stochastic settings. To investigate mL-BFGS
potential in large-scale DNN training, we train benchmark neural models using
mL-BFGS and compare performance with baselines (SGD, Adam, and other
quasi-Newton methods). Results show that mL-BFGS achieves both noticeable
iteration-wise and wall-clock speedup.
Related papers
- MARS: Unleashing the Power of Variance Reduction for Training Large Models [56.47014540413659]
Large gradient algorithms like Adam, Adam, and their variants have been central to the development of this type of training.
We propose a framework that reconciles preconditioned gradient optimization methods with variance reduction via a scaled momentum technique.
arXiv Detail & Related papers (2024-11-15T18:57:39Z) - Scaling Laws for Predicting Downstream Performance in LLMs [75.28559015477137]
This work focuses on the pre-training loss as a more-efficient metric for performance estimation.
We extend the power law analytical function to predict domain-specific pre-training loss based on FLOPs across data sources.
We employ a two-layer neural network to model the non-linear relationship between multiple domain-specific loss and downstream performance.
arXiv Detail & Related papers (2024-10-11T04:57:48Z) - Residual-based attention and connection to information bottleneck theory
in PINNs [0.393259574660092]
Physics-informed neural networks (PINNs) have seen a surge of interest in recent years.
We propose an efficient, gradient-less weighting scheme for PINNs, that accelerates the convergence of dynamic or static systems.
arXiv Detail & Related papers (2023-07-01T16:29:55Z) - Implicit Stochastic Gradient Descent for Training Physics-informed
Neural Networks [51.92362217307946]
Physics-informed neural networks (PINNs) have effectively been demonstrated in solving forward and inverse differential equation problems.
PINNs are trapped in training failures when the target functions to be approximated exhibit high-frequency or multi-scale features.
In this paper, we propose to employ implicit gradient descent (ISGD) method to train PINNs for improving the stability of training process.
arXiv Detail & Related papers (2023-03-03T08:17:47Z) - Variational Linearized Laplace Approximation for Bayesian Deep Learning [11.22428369342346]
We propose a new method for approximating Linearized Laplace Approximation (LLA) using a variational sparse Gaussian Process (GP)
Our method is based on the dual RKHS formulation of GPs and retains, as the predictive mean, the output of the original DNN.
It allows for efficient optimization, which results in sub-linear training time in the size of the training dataset.
arXiv Detail & Related papers (2023-02-24T10:32:30Z) - Partitioning sparse deep neural networks for scalable training and
inference [8.282177703075453]
State-of-the-art deep neural networks (DNNs) have significant computational and data management requirements.
Sparsification and pruning methods are shown to be effective in removing a large fraction of connections in DNNs.
The resulting sparse networks present unique challenges to further improve the computational efficiency of training and inference in deep learning.
arXiv Detail & Related papers (2021-04-23T20:05:52Z) - An Adaptive Memory Multi-Batch L-BFGS Algorithm for Neural Network
Training [0.951828574518325]
A limited memory version of the BFGS algorithm has been receiving increasing attention in recent years for large neural network training problems.
We propose a multi-batch L-BFGS algorithm, namely MB-AM, that gradually increases its trust in the curvature information.
arXiv Detail & Related papers (2020-12-14T11:40:41Z) - Stochastic Damped L-BFGS with Controlled Norm of the Hessian
Approximation [3.0204520109309843]
We propose a new variance- damped L-BFGS, where we leverage estimates of bounds on the largest and smallest eigen approximation to balance its quality and conditioning.
Our VARCHEN, draws from previous work that proposed a novel damped L-BFGS algorithm called SdLBFGS.
We demonstrate that VARCHEN is more robust than SdLBFGSVR and SVRG on a modified DavidNet problem.
arXiv Detail & Related papers (2020-12-10T16:19:02Z) - Neural networks with late-phase weights [66.72777753269658]
We show that the solutions found by SGD can be further improved by ensembling a subset of the weights in late stages of learning.
At the end of learning, we obtain back a single model by taking a spatial average in weight space.
arXiv Detail & Related papers (2020-07-25T13:23:37Z) - Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z) - Communication-Efficient Distributed Stochastic AUC Maximization with
Deep Neural Networks [50.42141893913188]
We study a distributed variable for large-scale AUC for a neural network as with a deep neural network.
Our model requires a much less number of communication rounds and still a number of communication rounds in theory.
Our experiments on several datasets show the effectiveness of our theory and also confirm our theory.
arXiv Detail & Related papers (2020-05-05T18:08:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.