Breadth-First Pipeline Parallelism
- URL: http://arxiv.org/abs/2211.05953v2
- Date: Thu, 6 Jul 2023 19:03:41 GMT
- Title: Breadth-First Pipeline Parallelism
- Authors: Joel Lamy-Poirier
- Abstract summary: Breadth-First Pipeline Parallelism lowers training time, cost and memory usage.
It combines high GPU utilization with a small batch size per GPU and makes use of fully sharded data parallelism.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We introduce Breadth-First Pipeline Parallelism, a novel training schedule
which optimizes the combination of pipeline and data parallelism. Breadth-First
Pipeline Parallelism lowers training time, cost and memory usage by combining a
high GPU utilization with a small batch size per GPU, and by making use of
fully sharded data parallelism. Experimentally, we observed an increase of up
to 43% in training throughput for a 52 billion-parameter model using a small
batch size per GPU compared to Megatron-LM, which would reduce the training
time and cost by the same amount on a large GPU cluster.
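The schedule itself is not spelled out in the abstract, but the "breadth-first" idea can be illustrated with a small ordering sketch. The following is a minimal sketch, assuming an interleaved pipeline in which each GPU holds several stage chunks; it only enumerates forward-pass work items and omits the backward passes, gradient reduction, and network-overlap details of the actual schedule.

```python
# Minimal sketch (not the paper's implementation): enumerate the forward-pass
# order of (micro_batch, stage_chunk) work items on one GPU for a depth-first
# vs. a breadth-first interleaved pipeline schedule. Assumes `num_chunks`
# pipeline-stage chunks are assigned to this GPU and `num_micro` micro-batches
# per training step; backward passes and communication are omitted.

def depth_first_order(num_micro: int, num_chunks: int):
    """Each micro-batch runs through all local chunks before the next starts."""
    return [(mb, chunk) for mb in range(num_micro) for chunk in range(num_chunks)]

def breadth_first_order(num_micro: int, num_chunks: int):
    """All micro-batches run through a chunk before moving to the next chunk,
    which lets a small per-GPU batch keep every pipeline stage busy."""
    return [(mb, chunk) for chunk in range(num_chunks) for mb in range(num_micro)]

if __name__ == "__main__":
    print("depth-first:  ", depth_first_order(num_micro=3, num_chunks=2))
    print("breadth-first:", breadth_first_order(num_micro=3, num_chunks=2))
```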
Related papers
- Faster Multi-GPU Training with PPLL: A Pipeline Parallelism Framework Leveraging Local Learning [8.628231789161577]
We present PPLL (Pipeline Parallelism based on Local Learning), a novel framework that leverages local learning algorithms to enable effective parallel training across multiple GPUs.
By utilizing queues to manage data transfers between GPUs, PPLL ensures seamless cross-GPU communication, allowing multiple blocks to execute forward and backward passes in a pipelined manner.
Our results demonstrate that PPLL significantly enhances the training speed of the local learning method while achieving comparable or even superior training speed to traditional pipeline parallelism.
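A minimal sketch of the queue-based pipelining described for PPLL, assuming toy single-process "blocks" and Python queues in place of real cross-GPU transfers; the local-learning losses and backward passes that PPLL pipelines are omitted.

```python
# Minimal sketch (not the PPLL implementation): queue-based pipelining of model
# blocks, where each block runs in its own worker and forwards activations to
# the next block through a queue. Real cross-GPU transfers, local losses, and
# backward passes are omitted; the block functions here are toy stand-ins.
import queue, threading

def run_block(block_fn, in_q, out_q):
    while True:
        x = in_q.get()
        if x is None:            # sentinel: shut the pipeline down
            out_q.put(None)
            break
        out_q.put(block_fn(x))   # forward the block's output downstream

blocks = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]   # toy "blocks"
queues = [queue.Queue() for _ in range(len(blocks) + 1)]
workers = [threading.Thread(target=run_block, args=(fn, queues[i], queues[i + 1]))
           for i, fn in enumerate(blocks)]
for w in workers:
    w.start()

for sample in range(5):          # feed micro-batches into the first queue
    queues[0].put(sample)
queues[0].put(None)

while (y := queues[-1].get()) is not None:
    print("pipeline output:", y)
for w in workers:
    w.join()
```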
arXiv Detail & Related papers (2024-11-19T08:09:18Z) - BitPipe: Bidirectional Interleaved Pipeline Parallelism for Accelerating Large Models Training [5.7294516069851475]
BitPipe is a bidirectional interleaved pipeline parallelism for accelerating large models training.
We show that BitPipe improves the training throughput of GPT-style and BERT-style models by 1.05x-1.28x compared to the state-of-the-art synchronous approaches.
arXiv Detail & Related papers (2024-10-25T08:08:51Z) - PipeFill: Using GPUs During Bubbles in Pipeline-parallel LLM Training [8.637147484753948]
PipeFill fills pipeline bubbles with execution of other pending jobs.
Experiments show that PipeFill can increase overall utilization by up to 63% for GPUs used in large-scale LLM training.
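A minimal sketch of the bubble-filling idea attributed to PipeFill, using a toy time-slot schedule and a queue of hypothetical filler jobs; the real system schedules actual GPU work from other pending jobs into measured pipeline bubbles.

```python
# Minimal sketch (not PipeFill itself): when a pipeline stage hits a bubble
# (no main-job micro-batch scheduled for that time slot), run work from a
# queue of pending filler jobs instead of idling. Time slots, bubbles, and
# filler tasks are all toy stand-ins.
from collections import deque

main_schedule = ["fwd0", None, "fwd1", None, None, "bwd1", "bwd0"]  # None = bubble
filler_jobs = deque(["jobA-step0", "jobA-step1", "jobB-step0"])     # hypothetical jobs

for t, slot in enumerate(main_schedule):
    if slot is not None:
        print(f"t={t}: main training work {slot}")
    elif filler_jobs:
        print(f"t={t}: bubble filled with {filler_jobs.popleft()}")
    else:
        print(f"t={t}: idle (no pending filler work)")
```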
arXiv Detail & Related papers (2024-09-23T22:39:05Z) - MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models [58.3342517278868]
This paper describes the design of Mixed-precision AutoRegressive LINear kernels.
It shows that batch sizes up to 16-32 can be supported with close to the maximum ($4\times$) quantization speedup.
MARLIN accomplishes this via a combination of techniques, such as asynchronous memory access, complex task scheduling and pipelining.
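A numerics-only sketch of group-wise 4-bit weight quantization with a dequantize-then-matmul reference, assuming NumPy; this is not the MARLIN CUDA kernel, which fuses dequantization into the GEMM and relies on the asynchronous memory access, task scheduling and pipelining mentioned above.

```python
# Minimal numerics-only sketch (nothing like the MARLIN CUDA kernel): group-wise
# 4-bit weight quantization with per-group floating-point scales, plus a
# dequantize-then-matmul reference for checking the error.
import numpy as np

def quantize_4bit(w: np.ndarray, group_size: int = 128):
    """Symmetric 4-bit quantization of weight columns in groups of `group_size` rows."""
    k, n = w.shape
    w_grouped = w.reshape(k // group_size, group_size, n)
    scales = np.abs(w_grouped).max(axis=1, keepdims=True) / 7.0   # int4 range [-7, 7]
    q = np.clip(np.round(w_grouped / scales), -7, 7).astype(np.int8)
    return q, scales

def dequant_matmul(x: np.ndarray, q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    w_hat = (q * scales).reshape(-1, q.shape[-1])   # dequantize back to float
    return x @ w_hat

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 512)).astype(np.float32)     # a small batch of tokens
w = rng.standard_normal((512, 256)).astype(np.float32)
q, scales = quantize_4bit(w)
print("max abs error vs. full precision:",
      np.abs(dequant_matmul(x, q, scales) - x @ w).max())
```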
arXiv Detail & Related papers (2024-08-21T16:10:41Z) - GSplit: Scaling Graph Neural Network Training on Large Graphs via Split-Parallelism [6.3568605707961]
Mini-batch training is commonly used to train graph neural networks (GNNs) on large graphs.
In this paper, we introduce a hybrid parallel mini-batch training paradigm called split parallelism.
We show that split parallelism outperforms state-of-the-art mini-batch training systems like DGL, Quiver, and $P^3$.
arXiv Detail & Related papers (2023-03-24T03:28:05Z) - SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient [69.61083127540776]
Deep learning applications benefit from using large models with billions of parameters.
Training these models is notoriously expensive due to the need for specialized HPC clusters.
We consider alternative setups for training large models: using cheap "preemptible" instances or pooling existing resources from multiple regions.
arXiv Detail & Related papers (2023-01-27T18:55:19Z) - PARTIME: Scalable and Parallel Processing Over Time with Deep Neural Networks [68.96484488899901]
We present PARTIME, a library designed to speed up neural networks whenever data is continuously streamed over time.
PARTIME starts processing each data sample at the time in which it becomes available from the stream.
Experiments are performed in order to empirically compare PARTIME with classic non-parallel neural computations in online learning.
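A minimal sketch of the pipelining-over-time idea, assuming toy "layers": at each stream tick every layer works on a different in-flight sample, so processing of a new sample begins the moment it becomes available; the real PARTIME library parallelizes actual neural layers across devices and handles online learning, not just this forward flow.

```python
# Minimal sketch (not the PARTIME library): layer-wise pipelining over a data
# stream. At every stream tick, layer i works on the sample that entered the
# pipeline i ticks earlier, so all layers stay busy in parallel instead of
# waiting for one sample to traverse the whole network. Samples still in
# flight when the stream ends are not drained in this toy version.
layers = [lambda x: x + 1, lambda x: x * 10, lambda x: x - 5]   # toy layers
in_flight = [None] * len(layers)                                # one slot per layer

def stream():
    for t in range(6):
        yield t   # a new sample becomes available at each tick

for sample in stream():
    # each layer processes its current input; outputs shift to the next layer
    outputs = [layers[i](in_flight[i]) if in_flight[i] is not None else None
               for i in range(len(layers))]
    in_flight = [sample] + outputs[:-1]
    if outputs[-1] is not None:
        print("completed sample ->", outputs[-1])
```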
arXiv Detail & Related papers (2022-10-17T14:49:14Z) - Accelerating Training and Inference of Graph Neural Networks with Fast Sampling and Pipelining [58.10436813430554]
Mini-batch training of graph neural networks (GNNs) requires a lot of computation and data movement.
We argue in favor of performing mini-batch training with neighborhood sampling in a distributed multi-GPU environment.
We present a sequence of improvements to mitigate these bottlenecks, including a performance-engineered neighborhood sampler.
We also conduct an empirical analysis that supports the use of sampling for inference, showing that test accuracies are not materially compromised.
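A minimal sketch of uniform neighborhood sampling for one GNN mini-batch, assuming a toy adjacency dictionary; the paper's performance-engineered sampler and the distributed multi-GPU data movement it pipelines are not represented here.

```python
# Minimal sketch (not the paper's performance-engineered sampler): uniform
# neighborhood sampling for one GNN mini-batch. For each node in the current
# frontier we keep at most `fanout` random neighbors per hop, so the mini-batch
# subgraph stays small regardless of node degree.
import random

def sample_neighborhood(adj, seeds, fanouts):
    """adj: {node: [neighbors]}, seeds: batch of target nodes,
    fanouts: neighbors to keep per hop, e.g. [10, 5] for a 2-layer GNN."""
    layers, frontier = [], list(seeds)
    for fanout in fanouts:
        sampled = {}
        for node in frontier:
            nbrs = adj.get(node, [])
            sampled[node] = random.sample(nbrs, min(fanout, len(nbrs)))
        layers.append(sampled)
        frontier = sorted({n for nbrs in sampled.values() for n in nbrs})
    return layers   # per-hop sampled edges, innermost hop last

adj = {0: [1, 2, 3], 1: [0, 4], 2: [0], 3: [0, 4], 4: [1, 3]}
print(sample_neighborhood(adj, seeds=[0, 4], fanouts=[2, 2]))
```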
arXiv Detail & Related papers (2021-10-16T02:41:35Z) - TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models [60.23234205219347]
TeraPipe is a high-performance token-level pipeline parallel algorithm for synchronous model-parallel training of Transformer-based language models.
We show that TeraPipe can speed up the training by 5.0x for the largest GPT-3 model with 175 billion parameters on an AWS cluster.
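A minimal sketch of a token-level pipeline schedule in the spirit of TeraPipe, assuming causal attention so that a stage can process a token slice once the previous stage has produced the slices up to it; how TeraPipe actually chooses slice sizes and overlaps communication is not modeled.

```python
# Minimal sketch (not the TeraPipe system): a token-level pipeline schedule for
# a Transformer split into `num_stages` groups of layers. The input sequence is
# cut into `num_slices` chunks along the token dimension; with causal attention,
# work items form a wavefront where cell (stage, slice) can run at step
# stage + slice, so later stages start well before the full sequence is done.
def token_level_schedule(num_stages: int, num_slices: int):
    steps = []
    for step in range(num_stages + num_slices - 1):
        steps.append([(s, t) for s in range(num_stages)
                      for t in range(num_slices) if s + t == step])
    return steps

for step, work in enumerate(token_level_schedule(num_stages=3, num_slices=4)):
    print(f"step {step}: run (stage, token_slice) {work}")
```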
arXiv Detail & Related papers (2021-02-16T07:34:32Z) - BaPipe: Exploration of Balanced Pipeline Parallelism for DNN Training [9.551339069298011]
BaPipe is a pipeline parallelism training framework for distributed deep learning.
It automatically explores pipeline parallelism training methods and balanced partition strategies for distributed training.
BaPipe provides up to 3.2x speedup and 4x memory reduction in various platforms.
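A minimal sketch of one balanced-partition heuristic (binary search on the bottleneck cost over contiguous layer groups), assuming per-layer costs are already known; BaPipe's automatic exploration of pipeline schedules and partition strategies goes well beyond this single heuristic.

```python
# Minimal sketch (one simple heuristic, not BaPipe's search): split a list of
# per-layer costs into at most `num_stages` contiguous pipeline stages so that
# the most expensive stage is as cheap as possible, via binary search on the
# bottleneck cost.
def balanced_partition(layer_costs, num_stages):
    def stages_needed(limit):
        count, current = 1, 0
        for c in layer_costs:
            if c > limit:
                return float("inf")       # a single layer exceeds the limit
            if current + c > limit:
                count, current = count + 1, c
            else:
                current += c
        return count

    lo, hi = max(layer_costs), sum(layer_costs)
    while lo < hi:                        # smallest bottleneck that still fits
        mid = (lo + hi) // 2
        if stages_needed(mid) <= num_stages:
            hi = mid
        else:
            lo = mid + 1
    # rebuild the actual stage boundaries for the chosen bottleneck `lo`
    stages, current = [[]], 0
    for c in layer_costs:
        if current + c > lo:
            stages.append([])
            current = 0
        stages[-1].append(c)
        current += c
    return stages

print(balanced_partition([4, 2, 7, 3, 5, 6, 1, 4], num_stages=3))
```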
arXiv Detail & Related papers (2020-12-23T08:57:39Z) - Unsupervised Parallel Corpus Mining on Web Data [53.74427402568838]
We present a pipeline to mine the parallel corpus from the Internet in an unsupervised manner.
Our system produces new state-of-the-art results, 39.81 and 38.95 BLEU scores, even compared with supervised approaches.
arXiv Detail & Related papers (2020-09-18T02:38:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.