mL-BFGS: A Momentum-based L-BFGS for Distributed Large-Scale Neural
Network Optimization
- URL: http://arxiv.org/abs/2307.13744v1
- Date: Tue, 25 Jul 2023 18:03:29 GMT
- Title: mL-BFGS: A Momentum-based L-BFGS for Distributed Large-Scale Neural
Network Optimization
- Authors: Yue Niu, Zalan Fabian, Sunwoo Lee, Mahdi Soltanolkotabi, Salman
Avestimehr
- Abstract summary: We propose a momentum-based L-BFGS algorithm that paves the way for quasi-Newton (QN) methods in large-scale distributed deep neural network (DNN) optimization.
For model training at a large scale, mL-BFGS approximates a block-wise Hessian, thus distributing compute and memory costs across all computing nodes.
Results show that mL-BFGS achieves both noticeable iteration-wise and wall-clock speedup.
- Score: 35.08820062020787
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Quasi-Newton methods still face significant challenges in training
large-scale neural networks due to additional compute costs in the Hessian
related computations and instability issues in stochastic training. A
well-known method, L-BFGS, which efficiently approximates the Hessian using a
history of parameter and gradient changes, suffers from convergence instability
in stochastic training. So far, attempts to adapt L-BFGS to large-scale
stochastic training incur considerable extra overhead, which offsets its
convergence benefits in wall-clock time. In this paper, we propose mL-BFGS, a
lightweight momentum-based L-BFGS algorithm that paves the way for quasi-Newton
(QN) methods in large-scale distributed deep neural network (DNN) optimization.
mL-BFGS introduces a nearly cost-free momentum scheme into the L-BFGS update and
greatly reduces stochastic noise in the Hessian, thereby stabilizing
convergence during stochastic optimization. For model training at a large
scale, mL-BFGS approximates a block-wise Hessian, thus enabling compute and
memory costs to be distributed across all computing nodes. We provide a
supporting convergence analysis for mL-BFGS in stochastic settings. To
investigate the potential of mL-BFGS in large-scale DNN training, we train benchmark neural models using
mL-BFGS and compare performance with baselines (SGD, Adam, and other
quasi-Newton methods). Results show that mL-BFGS achieves both noticeable
iteration-wise and wall-clock speedup.
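  The two mechanisms described in the abstract, a momentum-smoothed curvature history and a block-wise limited-memory Hessian, can be illustrated with a short sketch. The single-block NumPy snippet below is an illustrative reconstruction under stated assumptions, not the authors' implementation: the class name MomentumLBFGS, the exponential-moving-average form of the momentum, and all hyperparameters are hypothetical, and the block-wise partitioning across nodes is omitted.

  import numpy as np

  class MomentumLBFGS:
      """Hypothetical sketch of a momentum-smoothed L-BFGS step (single block)."""
      def __init__(self, dim, history=10, beta=0.9, lr=0.1):
          self.history = history      # number of curvature pairs kept
          self.beta = beta            # momentum coefficient for (s, y) smoothing (assumed form)
          self.lr = lr
          self.s_bar = np.zeros(dim)  # momentum-averaged parameter change
          self.y_bar = np.zeros(dim)  # momentum-averaged gradient change
          self.pairs = []             # stored (s_bar, y_bar) curvature pairs

      def _two_loop(self, grad):
          # Standard L-BFGS two-loop recursion, run on the smoothed pairs.
          q = grad.astype(float)
          alphas = []
          for s, y in reversed(self.pairs):        # newest to oldest
              rho = 1.0 / (y @ s)
              a = rho * (s @ q)
              alphas.append((a, rho, s, y))
              q -= a * y
          if self.pairs:                            # initial Hessian scaling
              s, y = self.pairs[-1]
              q *= (s @ y) / (y @ y)
          for a, rho, s, y in reversed(alphas):     # oldest to newest
              b = rho * (y @ q)
              q += (a - b) * s
          return q                                  # approximately H^{-1} grad

      def step(self, params, grad, prev_params, prev_grad):
          # Momentum smoothing damps stochastic noise in (s, y) before it
          # enters the Hessian approximation.
          self.s_bar = self.beta * self.s_bar + (1 - self.beta) * (params - prev_params)
          self.y_bar = self.beta * self.y_bar + (1 - self.beta) * (grad - prev_grad)
          if self.s_bar @ self.y_bar > 1e-10:       # keep curvature positive
              self.pairs.append((self.s_bar.copy(), self.y_bar.copy()))
              self.pairs = self.pairs[-self.history:]
          return params - self.lr * self._two_loop(grad)

  In a distributed setting, the abstract's block-wise approximation would amount to running one such instance per parameter block on its owning node, so that the curvature history and two-loop recursion, and hence the compute and memory costs, are spread across all nodes.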
Related papers
- Scaling Laws for Predicting Downstream Performance in LLMs [75.28559015477137]
This work focuses on the pre-training loss as a more efficient metric for performance estimation.
We extend the power law analytical function to predict domain-specific pre-training loss based on FLOPs across data sources.
We employ a two-layer neural network to model the non-linear relationship between multiple domain-specific losses and downstream performance.
arXiv Detail & Related papers (2024-10-11T04:57:48Z) - Residual-based attention and connection to information bottleneck theory
in PINNs [0.393259574660092]
Physics-informed neural networks (PINNs) have seen a surge of interest in recent years.
We propose an efficient, gradient-less weighting scheme for PINNs that accelerates the convergence of dynamic or static systems.
arXiv Detail & Related papers (2023-07-01T16:29:55Z) - Implicit Stochastic Gradient Descent for Training Physics-informed
Neural Networks [51.92362217307946]
Physics-informed neural networks (PINNs) have effectively been demonstrated in solving forward and inverse differential equation problems.
However, PINNs can become trapped in training failures when the target functions to be approximated exhibit high-frequency or multi-scale features.
In this paper, we propose to employ the implicit stochastic gradient descent (ISGD) method to train PINNs, improving the stability of the training process.
arXiv Detail & Related papers (2023-03-03T08:17:47Z) - Variational Linearized Laplace Approximation for Bayesian Deep Learning [11.22428369342346]
We propose a new method for approximating the Linearized Laplace Approximation (LLA) using a variational sparse Gaussian Process (GP).
Our method is based on the dual RKHS formulation of GPs and retains, as the predictive mean, the output of the original DNN.
It allows for efficient optimization, which results in sub-linear training time in the size of the training dataset.
arXiv Detail & Related papers (2023-02-24T10:32:30Z) - Deep Equilibrium Optical Flow Estimation [80.80992684796566]
Recent state-of-the-art (SOTA) optical flow models use finite-step recurrent update operations to emulate traditional algorithms.
These RNNs impose large computation and memory overheads, and are not directly trained to model such stable estimation.
We propose deep equilibrium (DEQ) flow estimators, an approach that directly solves for the flow as the infinite-level fixed point of an implicit layer.
arXiv Detail & Related papers (2022-04-18T17:53:44Z) - Partitioning sparse deep neural networks for scalable training and
inference [8.282177703075453]
State-of-the-art deep neural networks (DNNs) have significant computational and data management requirements.
Sparsification and pruning methods are shown to be effective in removing a large fraction of connections in DNNs.
The resulting sparse networks present unique challenges to further improve the computational efficiency of training and inference in deep learning.
arXiv Detail & Related papers (2021-04-23T20:05:52Z) - An Adaptive Memory Multi-Batch L-BFGS Algorithm for Neural Network
Training [0.951828574518325]
A limited memory version of the BFGS algorithm has been receiving increasing attention in recent years for large neural network training problems.
We propose a multi-batch L-BFGS algorithm, namely MB-AM, that gradually increases its trust in the curvature information.
arXiv Detail & Related papers (2020-12-14T11:40:41Z) - Stochastic Damped L-BFGS with Controlled Norm of the Hessian
Approximation [3.0204520109309843]
We propose a new stochastic variance-reduced damped L-BFGS algorithm, where we leverage estimates of bounds on the largest and smallest eigenvalues of the Hessian approximation to balance its quality and conditioning.
Our algorithm, VARCHEN, draws from previous work that proposed a novel stochastic damped L-BFGS algorithm called SdLBFGS.
We demonstrate that VARCHEN is more robust than SdLBFGSVR and SVRG on a modified DavidNet problem.
arXiv Detail & Related papers (2020-12-10T16:19:02Z) - Neural networks with late-phase weights [66.72777753269658]
We show that the solutions found by SGD can be further improved by ensembling a subset of the weights in late stages of learning.
At the end of learning, we obtain back a single model by taking a spatial average in weight space.
arXiv Detail & Related papers (2020-07-25T13:23:37Z) - Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z) - Communication-Efficient Distributed Stochastic AUC Maximization with
Deep Neural Networks [50.42141893913188]
We study distributed algorithms for large-scale AUC maximization with a deep neural network as the predictive model.
Our method requires far fewer communication rounds while retaining its guarantees in theory.
Our experiments on several datasets demonstrate the effectiveness of our method and also confirm our theory.
arXiv Detail & Related papers (2020-05-05T18:08:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.