mL-BFGS: A Momentum-based L-BFGS for Distributed Large-Scale Neural
  Network Optimization
- URL: http://arxiv.org/abs/2307.13744v1
- Date: Tue, 25 Jul 2023 18:03:29 GMT
- Title: mL-BFGS: A Momentum-based L-BFGS for Distributed Large-Scale Neural
  Network Optimization
- Authors: Yue Niu, Zalan Fabian, Sunwoo Lee, Mahdi Soltanolkotabi, Salman
  Avestimehr
- Abstract summary: We propose a momentum-based L-BFGS algorithm that paves the way for quasi-Newton (QN) methods in large-scale distributed deep neural network (DNN) optimization.
For model training at a large scale, mL-BFGS approximates a block-wise Hessian, which allows compute and memory costs to be distributed across all computing nodes.
Results show that mL-BFGS achieves noticeable iteration-wise and wall-clock speedups.
- Score: 35.08820062020787
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract:   Quasi-Newton methods still face significant challenges in training
large-scale neural networks due to additional compute costs in the Hessian
related computations and instability issues in stochastic training. A
well-known method, L-BFGS, which efficiently approximates the Hessian using
the history of parameter and gradient changes, suffers from convergence instability in
stochastic training. So far, attempts that adapt L-BFGS to large-scale
stochastic training incur considerable extra overhead, which offsets its
convergence benefits in wall-clock time. In this paper, we propose mL-BFGS, a
lightweight momentum-based L-BFGS algorithm that paves the way for quasi-Newton
(QN) methods in large-scale distributed deep neural network (DNN) optimization.
mL-BFGS introduces a nearly cost-free momentum scheme into L-BFGS update and
greatly reduces stochastic noise in the Hessian, therefore stabilizing
convergence during stochastic optimization. For model training at a large
scale, mL-BFGS approximates a block-wise Hessian, thus enabling distributing
compute and memory costs across all computing nodes. We provide a supporting
convergence analysis for mL-BFGS in stochastic settings. To investigate mL-BFGS
potential in large-scale DNN training, we train benchmark neural models using
mL-BFGS and compare performance with baselines (SGD, Adam, and other
quasi-Newton methods). Results show that mL-BFGS achieves both noticeable
iteration-wise and wall-clock speedup.
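
The abstract does not spell out the momentum scheme, so the following is only a minimal sketch under stated assumptions: exponential moving averages of the parameters and gradients are kept, and the differences between successive averages serve as the curvature pairs (s, y) fed to the standard L-BFGS two-loop recursion. The names `momentum_lbfgs`, `two_loop_direction`, `quad_grad`, `beta`, and `history` are illustrative and not taken from the paper. The block-wise Hessian approximation and the distribution across nodes are omitted.

```python
from collections import deque


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def two_loop_direction(grad, pairs):
    """Standard L-BFGS two-loop recursion; pairs are (s, y), oldest first."""
    q = list(grad)
    coeffs = []
    for s, y in reversed(pairs):  # newest to oldest
        rho = 1.0 / dot(y, s)
        alpha = rho * dot(s, q)
        coeffs.append((alpha, rho))
        q = [qi - alpha * yi for qi, yi in zip(q, y)]
    if pairs:
        s, y = pairs[-1]
        gamma = dot(s, y) / dot(y, y)  # scaling of the initial inverse Hessian
        q = [gamma * qi for qi in q]
    for (s, y), (alpha, rho) in zip(pairs, reversed(coeffs)):  # oldest to newest
        b = rho * dot(y, q)
        q = [qi + (alpha - b) * si for qi, si in zip(q, s)]
    return q


def momentum_lbfgs(grad_fn, theta, steps=300, lr=0.2, beta=0.9, history=5):
    """Assumed scheme: curvature pairs come from EMA-smoothed iterates."""
    pairs = deque(maxlen=history)
    m_theta, m_grad = None, None
    for _ in range(steps):
        g = grad_fn(theta)
        if m_theta is None:
            new_m_theta, new_m_grad = list(theta), list(g)
        else:
            new_m_theta = [beta * m + (1 - beta) * t for m, t in zip(m_theta, theta)]
            new_m_grad = [beta * m + (1 - beta) * gi for m, gi in zip(m_grad, g)]
            s = [a - b for a, b in zip(new_m_theta, m_theta)]
            y = [a - b for a, b in zip(new_m_grad, m_grad)]
            sy = dot(s, y)
            # Keep only pairs with positive curvature so the direction stays a descent direction.
            if sy > 0.0 and sy > 1e-8 * dot(y, y):
                pairs.append((s, y))
        m_theta, m_grad = new_m_theta, new_m_grad
        d = two_loop_direction(g, list(pairs))
        theta = [t - lr * di for t, di in zip(theta, d)]
    return theta


def quad_grad(theta):
    """Gradient of the toy objective 0.5 * (theta0**2 + 4 * theta1**2)."""
    return [1.0 * theta[0], 4.0 * theta[1]]
```

Calling `momentum_lbfgs(quad_grad, [3.0, -2.0])` drives both coordinates toward zero on this noise-free toy problem. In stochastic training, `grad_fn` would return a mini-batch gradient, and the averaging is what would damp the noise entering the (s, y) pairs.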
 
      
Related papers
- Efficient Large Language Model Inference with Neural Block Linearization [47.89931529975717]
 We introduce Neural Block Linearization (NBL), a novel framework for accelerating transformer model inference.
NBL replaces self-attention layers with linear approximations derived from Linear Minimum Mean Squared Error estimators.
In experiments, NBL achieves notable computational speed-ups while preserving competitive accuracy on multiple reasoning benchmarks.
 arXiv  Detail & Related papers  (2025-05-27T12:01:43Z)
- Utilising Gradient-Based Proposals Within Sequential Monte Carlo Samplers for Training of Partial Bayesian Neural Networks [3.2254941904559917]
 Partial Bayesian neural networks (pBNNs) have been shown to perform competitively with fully Bayesian neural networks.
We introduce a new SMC-based training method for pBNNs by utilising a guided proposal and incorporating gradient-based Markov kernels.
We show that our new method outperforms the state-of-the-art in terms of predictive performance and optimal loss.
 arXiv  Detail & Related papers  (2025-05-01T20:05:38Z)
- Decentralized Nonconvex Composite Federated Learning with Gradient Tracking and Momentum [78.27945336558987]
 Decentralized Federated Learning (DFL) eliminates reliance on the server-client architecture.
Non-smooth regularization is often incorporated into machine learning tasks.
We propose a novel DNCFL algorithm to solve these problems.
 arXiv  Detail & Related papers  (2025-04-17T08:32:25Z)
- LESA: Learnable LLM Layer Scaling-Up [57.0510934286449]
 Training Large Language Models (LLMs) from scratch requires immense computational resources, making it prohibitively expensive.
Model scaling-up offers a promising solution by leveraging the parameters of smaller models to create larger ones.
We propose LESA, a novel learnable method for depth scaling-up.
 arXiv  Detail & Related papers  (2025-02-19T14:58:48Z)
- RoSTE: An Efficient Quantization-Aware Supervised Fine-Tuning Approach for Large Language Models [53.571195477043496]
 We propose an algorithm named Rotated Straight-Through-Estimator (RoSTE).
RoSTE combines quantization-aware supervised fine-tuning (QA-SFT) with an adaptive rotation strategy to reduce activation outliers.
Our findings reveal that the prediction error is directly proportional to the quantization error of the converged weights, which can be effectively managed through an optimized rotation configuration.
 arXiv  Detail & Related papers  (2025-02-13T06:44:33Z)
- Which Optimizer Works Best for Physics-Informed Neural Networks and Kolmogorov-Arnold Networks? [1.8175282137722093]
 We compare PINNs and PIKANs on key challenging linear, stiff, multi-scale, and non-linear PDEs, including the Burgers, Allen-Cahn, Kuramoto-Sivashinsky, and Ginzburg-Landau equations.
Our results reveal improvements without the use of any other enhancements typically employed in PINNs and PIKANs.
 arXiv  Detail & Related papers  (2025-01-22T21:19:42Z)
- MARS: Unleashing the Power of Variance Reduction for Training Large Models [56.47014540413659]
 Adaptive gradient algorithms like Adam, AdamW, and their variants have been central to this type of training.
We propose a framework that reconciles preconditioned gradient optimization methods with variance reduction via a scaled momentum technique.
 arXiv  Detail & Related papers  (2024-11-15T18:57:39Z)
- Scaling Laws for Predicting Downstream Performance in LLMs [75.28559015477137]
 This work focuses on the pre-training loss as a more-efficient metric for performance estimation.
We extend the power law analytical function to predict domain-specific pre-training loss based on FLOPs across data sources.
We employ a two-layer neural network to model the non-linear relationship between multiple domain-specific loss and downstream performance.
 arXiv  Detail & Related papers  (2024-10-11T04:57:48Z)
- Residual-based attention and connection to information bottleneck theory
  in PINNs [0.393259574660092]
 Physics-informed neural networks (PINNs) have seen a surge of interest in recent years.
We propose an efficient, gradient-less weighting scheme for PINNs, that accelerates the convergence of dynamic or static systems.
 arXiv  Detail & Related papers  (2023-07-01T16:29:55Z)
- Implicit Stochastic Gradient Descent for Training Physics-informed
  Neural Networks [51.92362217307946]
 Physics-informed neural networks (PINNs) have effectively been demonstrated in solving forward and inverse differential equation problems.
PINNs are trapped in training failures when the target functions to be approximated exhibit high-frequency or multi-scale features.
In this paper, we propose employing the implicit stochastic gradient descent (ISGD) method to train PINNs in order to improve the stability of the training process.
 arXiv  Detail & Related papers  (2023-03-03T08:17:47Z)
- Variational Linearized Laplace Approximation for Bayesian Deep Learning [11.22428369342346]
 We propose a new method for approximating Linearized Laplace Approximation (LLA) using a variational sparse Gaussian Process (GP).
Our method is based on the dual RKHS formulation of GPs and retains, as the predictive mean, the output of the original DNN.
It allows for efficient optimization, which results in sub-linear training time in the size of the training dataset.
 arXiv  Detail & Related papers  (2023-02-24T10:32:30Z)
- Partitioning sparse deep neural networks for scalable training and
  inference [8.282177703075453]
 State-of-the-art deep neural networks (DNNs) have significant computational and data management requirements.
Sparsification and pruning methods are shown to be effective in removing a large fraction of connections in DNNs.
The resulting sparse networks present unique challenges to further improve the computational efficiency of training and inference in deep learning.
 arXiv  Detail & Related papers  (2021-04-23T20:05:52Z)
- An Adaptive Memory Multi-Batch L-BFGS Algorithm for Neural Network
  Training [0.951828574518325]
 A limited memory version of the BFGS algorithm has been receiving increasing attention in recent years for large neural network training problems.
We propose a multi-batch L-BFGS algorithm, namely MB-AM, that gradually increases its trust in the curvature information.
 arXiv  Detail & Related papers  (2020-12-14T11:40:41Z)
- Stochastic Damped L-BFGS with Controlled Norm of the Hessian
  Approximation [3.0204520109309843]
 We propose a new stochastic variance-reduced damped L-BFGS algorithm, where we leverage estimates of bounds on the largest and smallest eigenvalues of the Hessian approximation to balance its quality and conditioning.
Our algorithm, VARCHEN, draws from previous work that proposed a novel stochastic damped L-BFGS algorithm called SdLBFGS.
We demonstrate that VARCHEN is more robust than SdLBFGSVR and SVRG on a modified DavidNet problem.
 arXiv  Detail & Related papers  (2020-12-10T16:19:02Z)
- Neural networks with late-phase weights [66.72777753269658]
 We show that the solutions found by SGD can be further improved by ensembling a subset of the weights in late stages of learning.
At the end of learning, we obtain back a single model by taking a spatial average in weight space.
 arXiv  Detail & Related papers  (2020-07-25T13:23:37Z)
- Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
 We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
 arXiv  Detail & Related papers  (2020-06-10T08:22:41Z)
- Communication-Efficient Distributed Stochastic AUC Maximization with
  Deep Neural Networks [50.42141893913188]
 We study distributed algorithms for large-scale AUC maximization with a deep neural network as the predictive model.
Our algorithm requires far fewer communication rounds while still achieving a linear speedup in theory.
Our experiments on several datasets show the effectiveness of our algorithm and also confirm our theory.
 arXiv  Detail & Related papers  (2020-05-05T18:08:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
       
     