PipeTransformer: Automated Elastic Pipelining for Distributed Training
of Transformers
- URL: http://arxiv.org/abs/2102.03161v1
- Date: Fri, 5 Feb 2021 13:39:31 GMT
- Title: PipeTransformer: Automated Elastic Pipelining for Distributed Training
of Transformers
- Authors: Chaoyang He, Shen Li, Mahdi Soltanolkotabi, Salman Avestimehr
- Abstract summary: PipeTransformer is a distributed training algorithm for Transformer models.
It automatically adjusts the pipelining and data parallelism by identifying and freezing some layers during training.
We evaluate PipeTransformer using Vision Transformer (ViT) on ImageNet and BERT on GLUE and SQuAD datasets.
- Score: 47.194426122333205
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The size of Transformer models is growing at an unprecedented pace. It has
taken less than one year since the release of GPT-3 (175B) to reach
trillion-level parameters. Training such models requires both substantial
engineering efforts and enormous computing resources, which are luxuries most
research teams cannot afford. In this paper, we propose PipeTransformer, which
leverages automated and elastic pipelining and data parallelism for efficient
distributed training of Transformer models. PipeTransformer automatically
adjusts the pipelining and data parallelism by identifying and freezing some
layers during training, and instead allocates resources to training the
remaining active layers. More specifically, PipeTransformer dynamically
excludes converged layers from the pipeline, packs active layers into fewer
GPUs, and forks more replicas to increase data-parallel width. We evaluate
PipeTransformer using Vision Transformer (ViT) on ImageNet and BERT on GLUE and
SQuAD datasets. Our results show that PipeTransformer attains a 2.4-fold
speedup compared to the state-of-the-art baseline. We also provide various
performance analyses for a more comprehensive understanding of our algorithmic
and system-wise design. We also develop open-source, flexible APIs for
PipeTransformer, which offer a clean separation among the freeze algorithm,
model definitions, and training accelerations, allowing them to be applied
to other algorithms that require similar freezing strategies.
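As a rough illustration of the elastic pipelining loop described in the abstract, the sketch below freezes converged front layers, shrinks the pipeline to the remaining active layers, and uses the freed GPUs for extra data-parallel replicas. All names (ClusterState, choose_frozen_layers, repack) and the gradient-norm freezing heuristic are hypothetical stand-ins under stated assumptions, not PipeTransformer's actual API or freeze algorithm.

```python
# Hypothetical sketch of elastic pipelining: freeze converged front layers,
# repack the active layers onto fewer GPUs, and fork data-parallel replicas
# onto the GPUs that were freed. Names and heuristics are illustrative only.

import math
from dataclasses import dataclass
from typing import List


@dataclass
class ClusterState:
    num_gpus: int        # total GPUs available to this job
    pipeline_gpus: int   # GPUs occupied by one pipeline replica
    dp_replicas: int     # number of data-parallel pipeline replicas


def choose_frozen_layers(layer_grad_norms: List[float], threshold: float) -> int:
    """Count leading layers whose gradients look converged (front-to-back freezing)."""
    frozen = 0
    for norm in layer_grad_norms:
        if norm >= threshold:
            break
        frozen += 1
    return frozen


def repack(state: ClusterState, total_layers: int, frozen_layers: int) -> ClusterState:
    """Shrink the pipeline to cover only active layers and widen data parallelism."""
    active_layers = total_layers - frozen_layers
    layers_per_gpu = max(1, total_layers // state.num_gpus)        # original balanced split
    needed_gpus = max(1, math.ceil(active_layers / layers_per_gpu))
    replicas = max(1, state.num_gpus // needed_gpus)               # fork replicas onto freed GPUs
    return ClusterState(state.num_gpus, needed_gpus, replicas)


if __name__ == "__main__":
    # 16-layer model on 8 GPUs; the first 8 layers have (hypothetically) converged.
    state = ClusterState(num_gpus=8, pipeline_gpus=8, dp_replicas=1)
    grad_norms = [0.002] * 8 + [0.9] * 8
    frozen = choose_frozen_layers(grad_norms, threshold=0.01)
    state = repack(state, total_layers=len(grad_norms), frozen_layers=frozen)
    print(f"frozen={frozen} pipeline_gpus={state.pipeline_gpus} dp_replicas={state.dp_replicas}")
    # -> frozen=8 pipeline_gpus=4 dp_replicas=2
```

In the real system, freeze decisions come from PipeTransformer's freeze algorithm and repacking is coordinated with the pipeline-parallel and data-parallel runtimes; the sketch only captures the resource-accounting idea.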
Related papers
- Parallelizing Linear Transformers with the Delta Rule over Sequence Length [49.88826673324244]
This work describes a hardware-efficient algorithm for training linear transformers with the delta rule.
We train a 1.3B model for 100B tokens and find that it outperforms recent linear-time baselines.
arXiv Detail & Related papers (2024-06-10T17:24:42Z)
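For context on the delta rule named in the entry above, here is a minimal sequential sketch of the delta-rule fast-weight update as it is usually written. The paper's actual contribution, a hardware-efficient chunkwise parallelization over sequence length, is not shown, and the function name and tensor shapes are assumptions.

```python
import torch


def delta_rule_reference(q, k, v, beta):
    """Sequential reference for delta-rule linear attention.

    q, k, v: (seq_len, d) tensors; beta: (seq_len,) gate values in [0, 1].
    The fast-weight state S is corrected toward each new (k_t, v_t) pair
    instead of being purely additive as in vanilla linear attention.
    """
    seq_len, d = q.shape
    S = q.new_zeros(d, d)                 # fast-weight state mapping keys to values
    outputs = []
    for t in range(seq_len):
        pred = S @ k[t]                                    # what S currently stores for k_t
        S = S + beta[t] * torch.outer(v[t] - pred, k[t])   # delta-rule correction
        outputs.append(S @ q[t])                           # read out with the query
    return torch.stack(outputs)
```

The correction term beta[t] * (v[t] - S @ k[t]) is what distinguishes the delta rule from the purely additive state update of vanilla linear attention.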
- PipeFusion: Patch-level Pipeline Parallelism for Diffusion Transformers Inference [5.704297874096985]
PipeFusion partitions images into patches and distributes the model layers across multiple GPUs.
It employs a patch-level pipeline parallel strategy to orchestrate communication and computation efficiently.
arXiv Detail & Related papers (2024-05-23T11:00:07Z)
- Transformer as Linear Expansion of Learngene [38.16612771203953]
Linear Expansion of learnGene (TLEG) is a novel approach for flexibly producing and initializing Transformers of diverse depths.
Experiments on ImageNet-1K demonstrate that TLEG achieves performance comparable to or better than many individual models trained from scratch.
arXiv Detail & Related papers (2023-12-09T17:01:18Z)
- Emergent Agentic Transformer from Chain of Hindsight Experience [96.56164427726203]
We show, for the first time, that a simple transformer-based model performs competitively with both temporal-difference and imitation-learning-based approaches.
arXiv Detail & Related papers (2023-05-26T00:43:02Z)
- Deep Pipeline Embeddings for AutoML [11.168121941015015]
AutoML is a promising direction for democratizing AI by automatically deploying Machine Learning systems with minimal human expertise.
Existing Pipeline Optimization techniques fail to explore deep interactions between pipeline stages/components.
This paper proposes a novel neural architecture that captures the deep interaction between the components of a Machine Learning pipeline.
arXiv Detail & Related papers (2023-05-23T12:40:38Z)
- A Fast Post-Training Pruning Framework for Transformers [74.59556951906468]
Pruning is an effective way to reduce the huge inference cost of large Transformer models.
Prior work on model pruning requires retraining the model.
We propose a fast post-training pruning framework for Transformers that does not require any retraining.
arXiv Detail & Related papers (2022-03-29T07:41:11Z)
- Where Is My Training Bottleneck? Hidden Trade-Offs in Deep Learning Preprocessing Pipelines [77.45213180689952]
Preprocessing pipelines in deep learning aim to provide sufficient data throughput to keep the training processes busy.
We introduce a new perspective on efficiently preparing datasets for end-to-end deep learning pipelines.
We obtain a 3x to 13x increase in throughput compared to an untuned system.
arXiv Detail & Related papers (2022-02-17T14:31:58Z)
- PnP-DETR: Towards Efficient Visual Analysis with Transformers [146.55679348493587]
Recently, DETR pioneered solving vision tasks with transformers; it directly translates the image feature map into the object detection result.
The approach also shows consistent efficiency gains on the recent transformer-based image recognition model ViT.
arXiv Detail & Related papers (2021-09-15T01:10:30Z)
- AutoWeka4MCPS-AVATAR: Accelerating Automated Machine Learning Pipeline Composition and Optimisation [13.116806430326513]
We propose a novel method to evaluate the validity of ML pipelines, without their execution, using a surrogate model (AVATAR).
The AVATAR generates a knowledge base by automatically learning the capabilities and effects of ML algorithms on datasets' characteristics.
Instead of executing the original ML pipeline to evaluate its validity, the AVATAR evaluates its surrogate model constructed by capabilities and effects of the ML pipeline components.
arXiv Detail & Related papers (2020-11-21T14:05:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.