BaPipe: Exploration of Balanced Pipeline Parallelism for DNN Training
- URL: http://arxiv.org/abs/2012.12544v2
- Date: Thu, 14 Jan 2021 05:58:54 GMT
- Title: BaPipe: Exploration of Balanced Pipeline Parallelism for DNN Training
- Authors: Letian Zhao, Rui Xu, Tianqi Wang, Teng Tian, Xiaotian Wang, Wei Wu,
Chio-in Ieong, Xi Jin
- Abstract summary: BaPipe is a pipeline parallelism training framework for distributed deep learning.
It automatically explores pipeline parallelism training methods and balanced partition strategies for distributed training.
BaPipe provides up to 3.2x speedup and 4x memory reduction on various platforms.
- Score: 9.551339069298011
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The size of deep neural networks (DNNs) grows rapidly as the complexity of
the machine learning algorithm increases. To satisfy the computation and memory
requirements of DNN training, distributed deep learning based on model
parallelism has been widely adopted. We propose a new pipeline
parallelism training framework, BaPipe, which can automatically explore
pipeline parallelism training methods and balanced partition strategies for DNN
distributed training. In BaPipe, each accelerator calculates the forward
propagation and backward propagation of different parts of the network to
implement the intra-batch pipeline parallelism strategy. BaPipe uses a new load
balancing automatic exploration strategy that considers the parameters of DNN
models and the computation, memory, and communication resources of accelerator
clusters. We have trained different DNNs such as VGG-16, ResNet-50, and GNMT on
GPU clusters and simulated the performance of different FPGA clusters. Compared
with state-of-the-art data parallelism and pipeline parallelism frameworks,
BaPipe provides up to 3.2x speedup and 4x memory reduction on various platforms.
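BaPipe itself is not released as public code in this listing, so the following is a minimal, hypothetical sketch of the two ideas the abstract describes: (a) splitting a network's layers into pipeline stages so that the estimated per-stage cost is roughly balanced, and (b) pushing micro-batches through those stages to realize intra-batch pipeline parallelism. The function names (`balanced_partition`, `build_stages`, `pipeline_step`), the greedy cost heuristic, and the use of plain PyTorch modules are illustrative assumptions, not BaPipe's actual partition-exploration algorithm.

```python
"""Illustrative sketch (not BaPipe's code): balanced layer partitioning plus
a micro-batch pipeline schedule on a single process. All names and the
greedy cost heuristic are assumptions made for exposition."""
import torch
import torch.nn as nn


def balanced_partition(layer_costs, n_stages):
    """Greedily split layers into n_stages contiguous groups with roughly
    equal total cost. layer_costs holds per-layer cost estimates (e.g.
    profiled FLOPs or measured latency), standing in for the resources
    BaPipe's load-balancing exploration is said to consider."""
    target = sum(layer_costs) / n_stages
    boundaries, acc = [], 0.0
    for i, cost in enumerate(layer_costs):
        acc += cost
        if acc >= target and len(boundaries) < n_stages - 1:
            boundaries.append(i + 1)  # current stage ends after layer i
            acc = 0.0
    return boundaries  # e.g. [3] -> stages layers[0:3] and layers[3:]


def build_stages(layers, boundaries):
    """Wrap each contiguous layer group as one pipeline stage
    (one stage per accelerator in a real deployment)."""
    cuts = [0] + boundaries + [len(layers)]
    return [nn.Sequential(*layers[a:b]) for a, b in zip(cuts[:-1], cuts[1:])]


def pipeline_step(stages, x, y, loss_fn, n_micro):
    """One training step with intra-batch pipelining: the mini-batch is split
    into micro-batches that flow through the stages, and gradients from all
    micro-batches are accumulated before the optimizer update. On a single
    device this only emulates the schedule, not true stage-level overlap."""
    total = 0.0
    for xm, ym in zip(x.chunk(n_micro), y.chunk(n_micro)):
        act = xm
        for stage in stages:          # forward through each pipeline stage
            act = stage(act)
        loss = loss_fn(act, ym) / n_micro
        loss.backward()               # accumulate grads across micro-batches
        total += loss.item()
    return total


if __name__ == "__main__":
    layers = [nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64), nn.ReLU(),
              nn.Linear(64, 10)]
    costs = [4.0, 0.1, 4.0, 0.1, 1.0]            # assumed profiled costs
    stages = build_stages(layers, balanced_partition(costs, n_stages=2))
    params = [p for s in stages for p in s.parameters()]
    opt = torch.optim.SGD(params, lr=0.1)
    x, y = torch.randn(32, 64), torch.randint(0, 10, (32,))
    opt.zero_grad()
    loss = pipeline_step(stages, x, y, nn.CrossEntropyLoss(), n_micro=4)
    opt.step()
    print(f"step loss: {loss:.4f}")
```

In an actual multi-accelerator deployment each stage would live on a separate device and micro-batches would keep all stages busy concurrently; the single-process sketch above only emulates the schedule and the balanced split, which is why its comments flag it as an illustration rather than a pipelined runtime.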
Related papers
- Faster Multi-GPU Training with PPLL: A Pipeline Parallelism Framework Leveraging Local Learning [8.628231789161577]
We present PPLL (Pipeline Parallelism based on Local Learning), a novel framework that leverages local learning algorithms to enable effective parallel training across multiple GPUs.
By utilizing queues to manage data transfers between GPUs, PPLL ensures seamless cross-GPU communication, allowing multiple blocks to execute forward and backward passes in a pipelined manner.
Our results demonstrate that PPLL significantly enhances the training speed of the local learning method while achieving training speed comparable to or even better than traditional pipeline parallelism.
arXiv Detail & Related papers (2024-11-19T08:09:18Z)
- GraphPipe: Improving Performance and Scalability of DNN Training with Graph Pipeline Parallelism [20.44114440511298]
Deep neural networks (DNNs) continue to grow rapidly in size, making them infeasible to train on a single device.
This paper presents a new pipeline-parallel scheme that partitions a DNN into pipeline stages whose dependencies are identified by a directed acyclic graph.
We also develop GraphPipe, a distributed system that exploits MME strategies to enable performant and scalable DNN training.
arXiv Detail & Related papers (2024-06-24T21:32:51Z)
- 2BP: 2-Stage Backpropagation [0.0]
This paper introduces 2-stage backpropagation (2BP).
By splitting the backward propagation step into two separate stages, we can reduce idle compute time.
Using 2BP, we were able to achieve a 1.70x increase in throughput compared to traditional methods.
arXiv Detail & Related papers (2024-05-28T11:02:01Z)
- Intelligence Processing Units Accelerate Neuromorphic Learning [52.952192990802345]
Spiking neural networks (SNNs) have achieved orders-of-magnitude improvements in energy consumption and latency.
We present an IPU-optimized release of our custom SNN Python package, snnTorch.
arXiv Detail & Related papers (2022-11-19T15:44:08Z)
- PARTIME: Scalable and Parallel Processing Over Time with Deep Neural Networks [68.96484488899901]
We present PARTIME, a library designed to speed up neural networks whenever data is continuously streamed over time.
PARTIME starts processing each data sample as soon as it becomes available from the stream.
Experiments are performed in order to empirically compare PARTIME with classic non-parallel neural computations in online learning.
arXiv Detail & Related papers (2022-10-17T14:49:14Z)
- Receptive Field-based Segmentation for Distributed CNN Inference Acceleration in Collaborative Edge Computing [93.67044879636093]
We study inference acceleration using distributed convolutional neural networks (CNNs) in a collaborative edge computing network.
We propose a novel collaborative edge computing scheme that uses fused-layer parallelization to partition a CNN model into multiple blocks of convolutional layers.
arXiv Detail & Related papers (2022-07-22T18:38:11Z)
- Accelerating Training and Inference of Graph Neural Networks with Fast Sampling and Pipelining [58.10436813430554]
Mini-batch training of graph neural networks (GNNs) requires substantial computation and data movement.
We argue in favor of performing mini-batch training with neighborhood sampling in a distributed multi-GPU environment.
We present a sequence of improvements to mitigate these bottlenecks, including a performance-engineered neighborhood sampler.
We also conduct an empirical analysis that supports the use of sampling for inference, showing that test accuracies are not materially compromised.
arXiv Detail & Related papers (2021-10-16T02:41:35Z)
- Parareal Neural Networks Emulating a Parallel-in-time Algorithm [1.988145627448243]
As deep neural networks (DNNs) become deeper, the training time increases.
In this paper, we introduce a novel methodology to construct a parallel neural network.
arXiv Detail & Related papers (2021-03-16T02:03:39Z)
- TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models [60.23234205219347]
TeraPipe is a high-performance token-level pipeline parallel algorithm for synchronous model-parallel training of Transformer-based language models.
We show that TeraPipe can speed up the training by 5.0x for the largest GPT-3 model with 175 billion parameters on an AWS cluster.
arXiv Detail & Related papers (2021-02-16T07:34:32Z)
- Parallel Training of Deep Networks with Local Updates [84.30918922367442]
Local parallelism is a framework which parallelizes training of individual layers in deep networks by replacing global backpropagation with truncated layer-wise backpropagation.
We show results in both vision and language domains across a diverse set of architectures, and find that local parallelism is particularly effective in the high-compute regime.
arXiv Detail & Related papers (2020-12-07T16:38:45Z)
- A Linear Algebraic Approach to Model Parallelism in Deep Learning [0.0]
Training deep neural networks (DNNs) in large-cluster computing environments is increasingly necessary, as networks grow in size and complexity.
We propose a linear-algebraic approach to model parallelism in deep learning, which allows parallel distribution of any tensor in the DNN.
We build distributed DNN layers using these parallel primitives, composed with sequential layer implementations, and demonstrate their application by building and training a distributed DNN using DistDL, a PyTorch and MPI-based distributed deep learning toolkit.
arXiv Detail & Related papers (2020-06-04T19:38:05Z)