Scalable K-FAC Training for Deep Neural Networks with Distributed
Preconditioning
- URL: http://arxiv.org/abs/2206.15143v1
- Date: Thu, 30 Jun 2022 09:22:25 GMT
- Title: Scalable K-FAC Training for Deep Neural Networks with Distributed
Preconditioning
- Authors: Lin Zhang, Shaohuai Shi, Wei Wang, Bo Li
- Abstract summary: We propose DP-KFAC, a novel distributed preconditioning scheme for deep neural network (DNN) training.
DP-KFAC reduces computation overhead by 1.55x-1.65x, the communication cost by 2.79x-3.15x, and the memory footprint by 1.14x-1.47x in each second-order update.
- Score: 19.04755792575149
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The second-order optimization methods, notably the D-KFAC (Distributed
Kronecker Factored Approximate Curvature) algorithms, have gained traction on
accelerating deep neural network (DNN) training on GPU clusters. However,
existing D-KFAC algorithms require computing and communicating a large volume of
second-order information, i.e., Kronecker factors (KFs), before preconditioning
gradients, resulting in large computation and communication overheads as well
as a high memory footprint. In this paper, we propose DP-KFAC, a novel
distributed preconditioning scheme that distributes the KF constructing tasks
at different DNN layers to different workers. DP-KFAC not only retains the
convergence property of the existing D-KFAC algorithms but also enables three
benefits: reduced computation overhead in constructing KFs, no communication of
KFs, and low memory footprint. Extensive experiments on a 64-GPU cluster show
that DP-KFAC reduces the computation overhead by 1.55x-1.65x, the communication
cost by 2.79x-3.15x, and the memory footprint by 1.14x-1.47x in each
second-order update compared to the state-of-the-art D-KFAC methods.
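To make the scheme concrete, the following is a minimal sketch of layer-wise distributed preconditioning in PyTorch-style code: each layer is assigned to one worker, the owner builds and inverts that layer's Kronecker factors from its own local statistics, and only the preconditioned gradient is communicated. The round-robin assignment, the damping value, and the per-layer broadcast are illustrative assumptions rather than the authors' implementation, and the code assumes an initialized torch.distributed process group.

```python
# Minimal sketch of layer-wise distributed K-FAC preconditioning (NOT the
# authors' code).  Assumes torch.distributed is already initialized and that
# per-layer gradients have been averaged across workers beforehand.
import torch
import torch.distributed as dist


def owner_of(layer_idx: int, world_size: int) -> int:
    """Round-robin assignment of layers to workers (an assumed policy)."""
    return layer_idx % world_size


def kronecker_factors(activations: torch.Tensor, grad_outputs: torch.Tensor):
    """Estimate both Kronecker factors from purely local mini-batch statistics.

    activations:  (batch, in_features)  inputs to the linear layer
    grad_outputs: (batch, out_features) gradients w.r.t. the layer outputs
    """
    batch = activations.shape[0]
    A = activations.t() @ activations / batch      # input-side factor
    G = grad_outputs.t() @ grad_outputs / batch    # output-side factor
    return A, G


def precondition(grad_w: torch.Tensor, A: torch.Tensor, G: torch.Tensor,
                 damping: float = 1e-3) -> torch.Tensor:
    """Apply the K-FAC preconditioner (G + dI)^-1 grad (A + dI)^-1."""
    A_inv = torch.linalg.inv(A + damping * torch.eye(A.shape[0], device=A.device))
    G_inv = torch.linalg.inv(G + damping * torch.eye(G.shape[0], device=G.device))
    return G_inv @ grad_w @ A_inv


def distributed_precondition_step(layers, rank: int, world_size: int):
    """The owner of each layer builds and inverts that layer's factors from its
    local statistics and preconditions the (already averaged) gradient; the
    Kronecker factors themselves are never communicated."""
    for idx, layer in enumerate(layers):
        src = owner_of(idx, world_size)
        if rank == src:
            A, G = kronecker_factors(layer["activations"], layer["grad_outputs"])
            layer["grad"] = precondition(layer["grad"], A, G)
        # Share only the preconditioned gradient; real systems would batch and
        # overlap these broadcasts.
        dist.broadcast(layer["grad"], src=src)
```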
Related papers
- FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs).
We show that our system and method can achieve 1.45 - 9.39x speedup compared to baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z)
- BDC-Occ: Binarized Deep Convolution Unit For Binarized Occupancy Network [55.21288428359509]
Existing 3D occupancy networks demand significant hardware resources, hindering deployment on edge devices.
We propose a novel binarized deep convolution (BDC) unit that effectively enhances performance while increasing the number of binarized convolutional layers.
Our BDC-Occ model is created by applying the proposed BDC unit to binarize the existing 3D occupancy networks.
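The BDC unit itself is not described in this summary; purely as an illustration of the kind of primitive involved, below is a generic XNOR-Net-style sign-binarized depth-wise convolution with a per-channel scaling factor and a straight-through estimator. It is a hedged sketch, not the paper's BDC unit.

```python
# Generic sign-binarized depth-wise convolution (XNOR-Net style); a sketch of
# the kind of primitive a binarized network builds on, NOT the paper's BDC unit.
import torch
import torch.nn as nn
import torch.nn.functional as F


class BinarizedDWConv2d(nn.Module):
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        self.weight = nn.Parameter(
            torch.randn(channels, 1, kernel_size, kernel_size) * 0.01)
        self.padding = kernel_size // 2
        self.channels = channels

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Per-output-channel scaling factor alpha = mean(|W|), as in XNOR-Net.
        alpha = self.weight.abs().mean(dim=(1, 2, 3), keepdim=True)
        # Binarize weights to {-1, +1}; the straight-through estimator keeps
        # the full-precision weights trainable.
        w_bin = torch.sign(self.weight)
        w_bin = (w_bin - self.weight).detach() + self.weight
        return F.conv2d(x, alpha * w_bin, padding=self.padding,
                        groups=self.channels)


layer = BinarizedDWConv2d(channels=8)
print(layer(torch.randn(1, 8, 16, 16)).shape)   # torch.Size([1, 8, 16, 16])
```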
arXiv Detail & Related papers (2024-05-27T10:44:05Z)
- Kronecker-Factored Approximate Curvature for Physics-Informed Neural Networks [3.7308074617637588]
We propose Kronecker-factored approximate curvature (KFAC) for PINN losses that greatly reduces the computational cost and allows scaling to much larger networks.
We find that our KFAC-based gradients are competitive with expensive second-order methods on small problems, scale more favorably to higher-dimensional neural networks and PDEs, and consistently outperform first-order methods and LBFGS.
arXiv Detail & Related papers (2024-05-24T14:36:02Z)
- Kronecker-Factored Approximate Curvature for Modern Neural Network Architectures [85.76673783330334]
Two different settings of linear weight-sharing layers motivate two flavours of Kronecker-Factored Approximate Curvature (K-FAC).
We show they are exact for deep linear networks with weight-sharing in their respective setting.
We observe little difference between these two K-FAC variations when using them to train both a graph neural network and a vision transformer.
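For a linear layer whose weight is shared across a sequence or node dimension, the two flavours differ, roughly, in whether the shared dimension is treated as extra data points or is aggregated before the outer products are taken. The sketch below of the input-side factor reflects that reading of the summary; it is an interpretation, not the authors' implementation.

```python
# Sketch of the input-side Kronecker factor for a weight-shared linear layer
# under the two K-FAC flavours.  `a` has shape (batch, shared, in_features),
# where `shared` is e.g. sequence length or number of graph nodes.
import torch


def kfac_expand_factor(a: torch.Tensor) -> torch.Tensor:
    """Treat every (batch, shared) slot as an independent data point."""
    b, s, d = a.shape
    flat = a.reshape(b * s, d)
    return flat.t() @ flat / (b * s)


def kfac_reduce_factor(a: torch.Tensor) -> torch.Tensor:
    """Aggregate (here: average) over the shared dimension first, then take
    outer products per example."""
    b, s, d = a.shape
    reduced = a.mean(dim=1)                 # (batch, in_features)
    return reduced.t() @ reduced / b


a = torch.randn(4, 10, 16)
print(kfac_expand_factor(a).shape, kfac_reduce_factor(a).shape)  # (16,16) each
```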
arXiv Detail & Related papers (2023-11-01T16:37:00Z)
- Analysis and Comparison of Two-Level KFAC Methods for Training Deep Neural Networks [0.0]
We investigate the benefit of restoring low-frequency interactions between layers by means of two-level methods.
Inspired by domain decomposition, several two-level corrections to KFAC using different coarse spaces are proposed and assessed.
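Two-level preconditioners from domain decomposition typically add a coarse correction of the form R^T (R F R^T)^{-1} R to a fine-level preconditioner (here, a block-diagonal, K-FAC-like inverse). The toy sketch below shows that generic additive form; the choice of coarse space R and of the additive combination are illustrative assumptions, not the specific corrections proposed in the paper.

```python
# Generic additive two-level preconditioner: a block-diagonal (K-FAC-like)
# fine-level inverse plus a coarse correction R^T (R F R^T)^-1 R.  The coarse
# space and combination are illustrative, not the paper's constructions.
import torch


def block_diag_inverse(F_mat, blocks, damping=1e-3):
    """Invert F block-by-block (the fine-level, layer-wise preconditioner)."""
    P = torch.zeros_like(F_mat)
    for lo, hi in blocks:
        blk = F_mat[lo:hi, lo:hi] + damping * torch.eye(hi - lo)
        P[lo:hi, lo:hi] = torch.linalg.inv(blk)
    return P


def two_level_precondition(grad, F_mat, blocks, R, damping=1e-3):
    """Fine-level step plus the coarse correction R^T (R F R^T)^-1 R grad."""
    fine = block_diag_inverse(F_mat, blocks, damping) @ grad
    coarse_mat = R @ F_mat @ R.t() + damping * torch.eye(R.shape[0])
    coarse = R.t() @ torch.linalg.solve(coarse_mat, R @ grad)
    return fine + coarse


# Toy example: a 6x6 curvature with two 3x3 "layer" blocks and a 2-dimensional
# coarse space that averages the parameters of each block.
F_mat = torch.eye(6) + 0.1 * torch.ones(6, 6)
blocks = [(0, 3), (3, 6)]
R = torch.zeros(2, 6)
R[0, :3] = 1.0 / 3
R[1, 3:] = 1.0 / 3
g = torch.randn(6)
print(two_level_precondition(g, F_mat, blocks, R).shape)  # torch.Size([6])
```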
arXiv Detail & Related papers (2023-03-31T14:21:53Z)
- Efficient Dataset Distillation Using Random Feature Approximation [109.07737733329019]
We propose a novel algorithm that uses a random feature approximation (RFA) of the Neural Network Gaussian Process (NNGP) kernel.
Our algorithm provides at least a 100-fold speedup over KIP and can run on a single GPU.
Our new method, termed RFA Distillation (RFAD), performs competitively with KIP and other dataset condensation algorithms in accuracy over a range of large-scale datasets.
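The random-feature idea can be illustrated on the simplest case, the NNGP kernel of a one-hidden-layer ReLU network: draw random first-layer weights, use the hidden activations as features, and approximate the kernel by their inner products. The width and feature map below are illustrative assumptions, not RFAD's exact construction.

```python
# Monte-Carlo random-feature approximation of the NNGP kernel of a
# one-hidden-layer ReLU network: K(x, x') ~= E_w[relu(w.x) relu(w.x')].
# A toy illustration of the RFA idea, not the RFAD pipeline.
import torch


def random_relu_features(x: torch.Tensor, num_features: int = 4096,
                         seed: int = 0) -> torch.Tensor:
    g = torch.Generator().manual_seed(seed)
    d = x.shape[1]
    # Standard-normal first-layer weights with the usual 1/sqrt(d) scaling.
    W = torch.randn(d, num_features, generator=g) / d ** 0.5
    return torch.relu(x @ W) / num_features ** 0.5


def rfa_kernel(x1: torch.Tensor, x2: torch.Tensor) -> torch.Tensor:
    phi1, phi2 = random_relu_features(x1), random_relu_features(x2)
    return phi1 @ phi2.t()          # (n1, n2) approximate NNGP Gram matrix


x = torch.randn(8, 32)
print(rfa_kernel(x, x).shape)       # torch.Size([8, 8])
```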
arXiv Detail & Related papers (2022-10-21T15:56:13Z)
- Brand New K-FACs: Speeding up K-FAC with Online Decomposition Updates [0.0]
We exploit the exponential-average construction paradigm of the K-factors, and use online numerical linear algebra techniques.
We propose a K-factor inverse update which scales linearly in layer size.
We also propose an inverse application procedure which scales linearly as well.
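The exponential-average construction the summary refers to maintains each K-factor as a running average of outer products, which invites online linear-algebra updates of the stored inverse. Below is a hedged sketch using a rank-one Sherman-Morrison refresh; note this particular refresh costs O(d^2) per step, whereas the update proposed in the paper scales linearly in layer size and is not reproduced here.

```python
# Exponential-moving-average construction of a K-factor and an online
# Sherman-Morrison refresh of its inverse.  Illustrative only: this rank-one
# refresh is O(d^2) per step, not the paper's linear-scaling procedure.
import torch


def ema_kfactor_update(A, a_bar, rho=0.95):
    """A <- rho * A + (1 - rho) * a_bar a_bar^T   (rank-one illustration with a
    single averaged activation vector a_bar)."""
    return rho * A + (1.0 - rho) * torch.outer(a_bar, a_bar)


def ema_kfactor_inverse_update(A_inv, a_bar, rho=0.95):
    """Keep A^-1 consistent with ema_kfactor_update via Sherman-Morrison.

    With A_new = rho * (A + c * a a^T) and c = (1 - rho) / rho:
    A_new^-1 = (1/rho) * (A^-1 - c * A^-1 a a^T A^-1 / (1 + c * a^T A^-1 a)).
    """
    c = (1.0 - rho) / rho
    Aia = A_inv @ a_bar
    return (A_inv - c * torch.outer(Aia, Aia) / (1.0 + c * a_bar @ Aia)) / rho


# Quick consistency check on a small factor.
d = 5
A, A_inv = torch.eye(d), torch.eye(d)
a_bar = torch.randn(d)
A = ema_kfactor_update(A, a_bar)
A_inv = ema_kfactor_inverse_update(A_inv, a_bar)
print(torch.allclose(A_inv, torch.linalg.inv(A), atol=1e-5))  # True
```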
arXiv Detail & Related papers (2022-10-16T09:41:23Z)
- LKD-Net: Large Kernel Convolution Network for Single Image Dehazing [70.46392287128307]
We propose a novel Large Kernel Convolution Dehaze Block (LKD Block) consisting of the Decomposition deep-wise Large Kernel Convolution Block (DLKCB) and the Channel Enhanced Feed-forward Network (CEFN).
The designed DLKCB splits the depth-wise large kernel convolution into a smaller depth-wise convolution and a depth-wise dilated convolution without introducing massive parameters and computational overhead.
Our LKD-Net dramatically outperforms the Transformer-based method Dehamer with only 1.79% #Param and 48.9% FLOPs.
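The decomposition described above, replacing one large depth-wise kernel with a small depth-wise convolution followed by a depth-wise dilated convolution, can be sketched as follows; the kernel sizes and dilation are illustrative choices, not necessarily the ones used in LKD-Net.

```python
# Sketch of splitting a large depth-wise convolution into a small depth-wise
# convolution plus a depth-wise dilated convolution (kernel sizes and dilation
# here are illustrative, not necessarily LKD-Net's).
import torch
import torch.nn as nn


class DecomposedLargeKernelDW(nn.Module):
    """Covers a ~23x23 effective depth-wise receptive field with far fewer
    parameters: a 5x5 depth-wise conv followed by a 7x7 depth-wise conv with
    dilation 3 (effective kernel 19), giving 5 + 19 - 1 = 23."""

    def __init__(self, channels: int):
        super().__init__()
        self.dw_small = nn.Conv2d(channels, channels, kernel_size=5,
                                  padding=2, groups=channels)
        self.dw_dilated = nn.Conv2d(channels, channels, kernel_size=7,
                                    padding=9, dilation=3, groups=channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.dw_dilated(self.dw_small(x))


block = DecomposedLargeKernelDW(channels=16)
x = torch.randn(1, 16, 32, 32)
print(block(x).shape)               # torch.Size([1, 16, 32, 32])
# Parameter count vs. a dense 23x23 depth-wise convolution:
dense = nn.Conv2d(16, 16, kernel_size=23, padding=11, groups=16)
print(sum(p.numel() for p in block.parameters()),
      sum(p.numel() for p in dense.parameters()))
```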
arXiv Detail & Related papers (2022-09-05T06:56:48Z)
- Gradient Descent on Neurons and its Link to Approximate Second-Order Optimization [0.913755431537592]
We show that Kronecker-Factored, block-diagonal curvature estimates (KFAC) significantly outperform true second-order updates.
We also show that KFAC approximates a first-order gradient algorithm that performs gradient descent on neurons rather than weights.
arXiv Detail & Related papers (2022-01-28T17:06:26Z)
- Accelerating Distributed K-FAC with Smart Parallelism of Computing and Communication Tasks [13.552262050816616]
Kronecker-Factored Approximate Curvature (KFAC) is one of the most efficient approximation algorithms for training deep models.
Yet, when leveraging GPU clusters to train models with KFAC, it incurs extensive computation and introduces extra communication during each iteration.
We propose D-KFAC with smart parallelism of computing and communication tasks to reduce the iteration time.
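A common way to realize such parallelism is to launch asynchronous all-reduces for tensors as soon as they are ready and keep computing on other layers, waiting on the communication handles only when the results are needed. The sketch below shows that generic pattern with torch.distributed (an initialized process group is assumed); it is not the paper's specific scheduling scheme.

```python
# Generic overlap of computation with communication using asynchronous
# all-reduce handles; a sketch of the idea, not the paper's scheduler.
# Assumes torch.distributed has been initialized (e.g. via torchrun).
import torch
import torch.distributed as dist


def overlapped_second_order_step(layer_tensors, compute_factors):
    """Start all-reducing each layer's tensor as soon as it is produced, and
    keep computing the next layer's local Kronecker factors while the
    communication of earlier tensors is still in flight.

    layer_tensors:   list of tensors to synchronize (e.g. per-layer gradients)
    compute_factors: callable doing the per-layer local factor computation
    """
    handles = []
    for idx, tensor in enumerate(layer_tensors):
        # Non-blocking collective: returns immediately with a work handle.
        handles.append(dist.all_reduce(tensor, async_op=True))
        # Overlap: local computation proceeds while communication runs in the
        # background.
        compute_factors(idx)
    # Block only once the overlapping work is done.
    for handle in handles:
        handle.wait()
```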
arXiv Detail & Related papers (2021-07-14T08:01:07Z)
- Communication-Efficient Distributed Stochastic AUC Maximization with Deep Neural Networks [50.42141893913188]
We study distributed stochastic AUC maximization for large-scale problems with a deep neural network as the predictive model.
Our method requires far fewer communication rounds than naive parallelization while retaining theoretical guarantees.
Experiments on several datasets demonstrate the effectiveness of our method and corroborate the theory.
arXiv Detail & Related papers (2020-05-05T18:08:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences of its use.