Layer-Wise Partitioning and Merging for Efficient and Scalable Deep
Learning
- URL: http://arxiv.org/abs/2207.11019v1
- Date: Fri, 22 Jul 2022 11:47:34 GMT
- Title: Layer-Wise Partitioning and Merging for Efficient and Scalable Deep
Learning
- Authors: Samson B. Akintoye, Liangxiu Han, Huw Lloyd, Xin Zhang, Darren Dancey,
Haoming Chen, and Daoqiang Zhang
- Abstract summary: We have proposed a novel layer-wise partitioning and merging, forward and backward pass parallel framework to provide better training performance.
The experimental evaluation on real use cases shows that the proposed method outperforms the state-of-the-art approaches in terms of training speed.
- Score: 16.38731019298993
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deep Neural Network (DNN) models are usually trained sequentially from one
layer to another, which causes forward, backward, and update locking problems,
leading to poor training-time performance. The existing parallel
strategies to mitigate these problems provide suboptimal runtime performance.
In this work, we have proposed a novel layer-wise partitioning and merging,
forward and backward pass parallel framework to provide better training
performance. The novelty of the proposed work consists of 1) a layer-wise
partitioning and merging model that can minimise communication overhead between
devices without the memory cost of existing strategies during training;
and 2) forward-pass and backward-pass parallelisation and optimisation
to address the update locking problem and minimise the total training cost. The
experimental evaluation on real use cases shows that the proposed method
outperforms the state-of-the-art approaches in terms of training speed and
achieves an almost linear speedup without compromising the accuracy of
the non-parallel approach.
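To make the first contribution more concrete, below is a minimal sketch (an illustration under assumptions, not the authors' implementation) of layer-wise partitioning and merging in PyTorch: consecutive layers are greedily merged into one roughly parameter-balanced block per device, so that activations cross a device boundary only between merged blocks. The helper names partition_and_merge and PartitionedModel are hypothetical.
```python
# Minimal sketch (assumption, not the paper's code): merge consecutive layers
# into one block per device, balancing parameter counts, so activations are
# transferred between devices only at block boundaries.
import torch
import torch.nn as nn


def partition_and_merge(layers, devices):
    """Greedily merge consecutive layers into len(devices) blocks of roughly
    equal parameter count and place each merged block on its own device."""
    total = sum(p.numel() for layer in layers for p in layer.parameters())
    target = total / len(devices)
    blocks, current, acc = [], [], 0
    for layer in layers:
        current.append(layer)
        acc += sum(p.numel() for p in layer.parameters())
        # Close the current block once it reaches the per-device budget,
        # keeping the last block open for all remaining layers.
        if acc >= target and len(blocks) < len(devices) - 1:
            blocks.append(current)
            current, acc = [], 0
    blocks.append(current)
    return [nn.Sequential(*block).to(device)
            for block, device in zip(blocks, devices)]


class PartitionedModel(nn.Module):
    """Runs the merged blocks in sequence, moving activations only when
    control passes from one device to the next."""

    def __init__(self, blocks, devices):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)
        self.devices = list(devices)

    def forward(self, x):
        for block, device in zip(self.blocks, self.devices):
            x = block(x.to(device))
        return x


if __name__ == "__main__":
    layers = [nn.Linear(512, 512), nn.ReLU(),
              nn.Linear(512, 512), nn.ReLU(),
              nn.Linear(512, 10)]
    devices = ["cpu", "cpu"]  # e.g. ["cuda:0", "cuda:1"] on a multi-GPU node
    model = PartitionedModel(partition_and_merge(layers, devices), devices)
    print(model(torch.randn(4, 512)).shape)  # torch.Size([4, 10])
```
This sketch only illustrates the placement step addressed by the partition-and-merge model; the paper's second contribution, parallelising and optimising the forward and backward passes to relieve update locking, is not shown here.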
Related papers
- LiNeS: Post-training Layer Scaling Prevents Forgetting and Enhances Model Merging [80.17238673443127]
LiNeS is a post-training editing technique designed to preserve pre-trained generalization while enhancing fine-tuned task performance.
LiNeS demonstrates significant improvements in both single-task and multi-task settings across various benchmarks in vision and natural language processing.
arXiv Detail & Related papers (2024-10-22T16:26:05Z)
- SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning [63.93193829913252]
We propose an innovative METL strategy called SHERL for resource-limited scenarios.
In the early route, intermediate outputs are consolidated via an anti-redundancy operation.
In the late route, utilizing minimal late pre-trained layers could alleviate the peak demand on memory overhead.
arXiv Detail & Related papers (2024-07-10T10:22:35Z)
- Accelerating Inference in Large Language Models with a Unified Layer Skipping Strategy [67.45518210171024]
Dynamic computation methods have shown notable acceleration for Large Language Models (LLMs) by skipping several layers of computations.
We propose a Unified Layer Skipping strategy, which selects the number of layers to skip computation based solely on the target speedup ratio.
Experimental results on two common tasks, i.e., machine translation and text summarization, indicate that given a target speedup ratio, the Unified Layer Skipping strategy significantly enhances both the inference performance and the actual model throughput.
arXiv Detail & Related papers (2024-04-10T12:12:07Z)
- Efficient Stagewise Pretraining via Progressive Subnetworks [53.00045381931778]
The prevailing view suggests that stagewise dropping strategies, such as layer dropping, are ineffective when compared to stacking-based approaches.
This paper challenges this notion by demonstrating that, with proper design, dropping strategies can be competitive with, if not better than, stacking methods.
We propose an instantiation of this framework - Random Part Training (RAPTR) - that selects and trains only a random subnetwork at each step, progressively increasing the size in stages.
arXiv Detail & Related papers (2024-02-08T18:49:09Z)
- Joint inference and input optimization in equilibrium networks [68.63726855991052]
A deep equilibrium model is a class of model that forgoes traditional network depth and instead computes the output of a network by finding the fixed point of a single nonlinear layer.
We show that there is a natural synergy between these two settings.
We demonstrate this strategy on various tasks such as training generative models while optimizing over latent codes, training models for inverse problems like denoising and inpainting, adversarial training and gradient based meta-learning.
arXiv Detail & Related papers (2021-11-25T19:59:33Z)
- Practical Convex Formulation of Robust One-hidden-layer Neural Network Training [12.71266194474117]
We show that the training of a one-hidden-layer, scalar-output fully-connected ReLU neural network can be reformulated as a finite-dimensional convex program.
We derive a convex optimization approach to efficiently solve the "adversarial training" problem.
Our method can be applied to binary classification and regression, and provides an alternative to the current adversarial training methods.
arXiv Detail & Related papers (2021-05-25T22:06:27Z)
- Optimizer Fusion: Efficient Training with Better Locality and Parallelism [11.656318345362804]
Experimental results show that we can achieve an up to 20% training time reduction on various configurations.
Since our methods do not alter the algorithm, they can be used as a general "plug-in" technique to the training process.
arXiv Detail & Related papers (2021-04-01T03:44:13Z)
- A Practical Layer-Parallel Training Algorithm for Residual Networks [41.267919563145604]
Gradient-based algorithms for training ResNets typically require a forward pass of the input data, followed by back-propagating the objective gradient to update parameters.
We propose a novel serial-parallel hybrid training strategy to enable the use of data augmentation, together with downsampling filters to reduce the communication cost.
arXiv Detail & Related papers (2020-09-03T06:03:30Z)
- Scaling Distributed Deep Learning Workloads beyond the Memory Capacity with KARMA [58.040931661693925]
We propose a strategy that combines redundant recomputing and out-of-core methods.
We achieve an average of 1.52x speedup in six different models over the state-of-the-art out-of-core methods.
Our data-parallel out-of-core solution can outperform complex hybrid model parallelism in training large models, e.g. Megatron-LM and Turing-NLG.
arXiv Detail & Related papers (2020-08-26T07:24:34Z)
- Restructuring, Pruning, and Adjustment of Deep Models for Parallel Distributed Inference [15.720414948573753]
We consider the parallel implementation of an already-trained deep model on multiple processing nodes (a.k.a. workers).
We propose RePurpose, a layer-wise model restructuring and pruning technique that guarantees the performance of the overall parallelized model.
We show that, compared to the existing methods, RePurpose significantly improves the efficiency of the distributed inference via parallel implementation.
arXiv Detail & Related papers (2020-08-19T06:44:41Z)