Layer-Wise Partitioning and Merging for Efficient and Scalable Deep
Learning
- URL: http://arxiv.org/abs/2207.11019v1
- Date: Fri, 22 Jul 2022 11:47:34 GMT
- Title: Layer-Wise Partitioning and Merging for Efficient and Scalable Deep
Learning
- Authors: Samson B. Akintoye, Liangxiu Han, Huw Lloyd, Xin Zhang, Darren Dancey,
Haoming Chen, and Daoqiang Zhang
- Abstract summary: We have proposed a novel layer-wise partitioning and merging, forward and backward pass parallel framework to provide better training performance.
The experimental evaluation on real use cases shows that the proposed method outperforms the state-of-the-art approaches in terms of training speed.
- Score: 16.38731019298993
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deep Neural Network (DNN) models are usually trained sequentially from one
layer to another, which causes forward, backward, and update locking problems,
leading to poor training-time performance. The existing parallel
strategies to mitigate these problems provide suboptimal runtime performance.
In this work, we have proposed a novel layer-wise partitioning and merging,
forward and backward pass parallel framework to provide better training
performance. The novelty of the proposed work consists of 1) a layer-wise
partitioning and merging model that can minimise communication overhead between
devices without the memory cost of existing strategies during training;
and 2) forward-pass and backward-pass parallelisation and optimisation
to address the update locking problem and minimise the total training cost. The
experimental evaluation on real use cases shows that the proposed method
outperforms the state-of-the-art approaches in terms of training speed and
achieves an almost linear speedup without compromising the accuracy of
the non-parallel approach.
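To make the first contribution more concrete, below is a minimal sketch (an illustration under assumptions, not the authors' implementation) of layer-wise partitioning and merging in PyTorch: consecutive layers are greedily merged into one roughly parameter-balanced block per device, so that activations cross a device boundary only between merged blocks. The helper names partition_and_merge and PartitionedModel are hypothetical.
```python
# Minimal sketch (assumption, not the paper's code): merge consecutive layers
# into one block per device, balancing parameter counts, so activations are
# transferred between devices only at block boundaries.
import torch
import torch.nn as nn


def partition_and_merge(layers, devices):
    """Greedily merge consecutive layers into len(devices) blocks of roughly
    equal parameter count and place each merged block on its own device."""
    total = sum(p.numel() for layer in layers for p in layer.parameters())
    target = total / len(devices)
    blocks, current, acc = [], [], 0
    for layer in layers:
        current.append(layer)
        acc += sum(p.numel() for p in layer.parameters())
        # Close the current block once it reaches the per-device budget,
        # keeping the last block open for all remaining layers.
        if acc >= target and len(blocks) < len(devices) - 1:
            blocks.append(current)
            current, acc = [], 0
    blocks.append(current)
    return [nn.Sequential(*block).to(device)
            for block, device in zip(blocks, devices)]


class PartitionedModel(nn.Module):
    """Runs the merged blocks in sequence, moving activations only when
    control passes from one device to the next."""

    def __init__(self, blocks, devices):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)
        self.devices = list(devices)

    def forward(self, x):
        for block, device in zip(self.blocks, self.devices):
            x = block(x.to(device))
        return x


if __name__ == "__main__":
    layers = [nn.Linear(512, 512), nn.ReLU(),
              nn.Linear(512, 512), nn.ReLU(),
              nn.Linear(512, 10)]
    devices = ["cpu", "cpu"]  # e.g. ["cuda:0", "cuda:1"] on a multi-GPU node
    model = PartitionedModel(partition_and_merge(layers, devices), devices)
    print(model(torch.randn(4, 512)).shape)  # torch.Size([4, 10])
```
This sketch only illustrates the placement step addressed by the partition-and-merge model; the paper's second contribution, parallelising and optimising the forward and backward passes to relieve update locking, is not shown here.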
Related papers
- LiNeS: Post-training Layer Scaling Prevents Forgetting and Enhances Model Merging [80.17238673443127]
LiNeS is a post-training editing technique designed to preserve pre-trained generalization while enhancing fine-tuned task performance.
LiNeS demonstrates significant improvements in both single-task and multi-task settings across various benchmarks in vision and natural language processing.
arXiv Detail & Related papers (2024-10-22T16:26:05Z)
- SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning [63.93193829913252]
We propose an innovative METL strategy called SHERL for resource-limited scenarios.
In the early route, intermediate outputs are consolidated via an anti-redundancy operation.
In the late route, utilizing minimal late pre-trained layers could alleviate the peak demand on memory overhead.
arXiv Detail & Related papers (2024-07-10T10:22:35Z)
- Accelerating Inference in Large Language Models with a Unified Layer Skipping Strategy [67.45518210171024]
Dynamic computation methods have shown notable acceleration for Large Language Models (LLMs) by skipping several layers of computations.
We propose a Unified Layer Skipping strategy, which selects the number of layers to skip computation based solely on the target speedup ratio.
Experimental results on two common tasks, i.e., machine translation and text summarization, indicate that given a target speedup ratio, the Unified Layer Skipping strategy significantly enhances both the inference performance and the actual model throughput.
arXiv Detail & Related papers (2024-04-10T12:12:07Z)
- Efficient Stagewise Pretraining via Progressive Subnetworks [53.00045381931778]
The prevailing view suggests that stagewise dropping strategies, such as layer dropping, are ineffective when compared to stacking-based approaches.
This paper challenges this notion by demonstrating that, with proper design, dropping strategies can be competitive with, if not better than, stacking methods.
We propose an instantiation of this framework - Random Part Training (RAPTR) - that selects and trains only a random subnetwork at each step, progressively increasing the size in stages.
arXiv Detail & Related papers (2024-02-08T18:49:09Z)
- Joint inference and input optimization in equilibrium networks [68.63726855991052]
A deep equilibrium model is a class of model that forgoes traditional network depth and instead computes the output of a network by finding the fixed point of a single nonlinear layer.
We show that there is a natural synergy between these two settings.
We demonstrate this strategy on various tasks such as training generative models while optimizing over latent codes, training models for inverse problems like denoising and inpainting, adversarial training and gradient based meta-learning.
arXiv Detail & Related papers (2021-11-25T19:59:33Z)
- Practical Convex Formulation of Robust One-hidden-layer Neural Network Training [12.71266194474117]
We show that the training of a one-hidden-layer, scalar-output fully-connected ReLU neural network can be reformulated as a finite-dimensional convex program.
We derive a convex optimization approach to efficiently solve the "adversarial training" problem.
Our method can be applied to binary classification and regression, and provides an alternative to the current adversarial training methods.
arXiv Detail & Related papers (2021-05-25T22:06:27Z)
- Optimizer Fusion: Efficient Training with Better Locality and Parallelism [11.656318345362804]
Experimental results show that we can achieve an up to 20% training time reduction on various configurations.
Since our methods do not alter the algorithm, they can be used as a general "plug-in" technique to the training process.
arXiv Detail & Related papers (2021-04-01T03:44:13Z)
- A Practical Layer-Parallel Training Algorithm for Residual Networks [41.267919563145604]
Gradient-based algorithms for training ResNets typically require a forward pass of the input data, followed by back-propagating the objective gradient to update parameters.
We propose a novel serial-parallel hybrid training strategy to enable the use of data augmentation, together with downsampling filters to reduce the communication cost.
arXiv Detail & Related papers (2020-09-03T06:03:30Z)
- Scaling Distributed Deep Learning Workloads beyond the Memory Capacity with KARMA [58.040931661693925]
We propose a strategy that combines redundant recomputing and out-of-core methods.
We achieve an average of 1.52x speedup in six different models over the state-of-the-art out-of-core methods.
Our data-parallel out-of-core solution can outperform complex hybrid model parallelism in training large models, e.g. Megatron-LM and Turing-NLG.
arXiv Detail & Related papers (2020-08-26T07:24:34Z)
- Restructuring, Pruning, and Adjustment of Deep Models for Parallel Distributed Inference [15.720414948573753]
We consider the parallel implementation of an already-trained deep model on multiple processing nodes (a.k.a. workers).
We propose RePurpose, a layer-wise model restructuring and pruning technique that guarantees the performance of the overall parallelized model.
We show that, compared to the existing methods, RePurpose significantly improves the efficiency of the distributed inference via parallel implementation.
arXiv Detail & Related papers (2020-08-19T06:44:41Z)