DynaPipe: Optimizing Multi-task Training through Dynamic Pipelines
- URL: http://arxiv.org/abs/2311.10418v1
- Date: Fri, 17 Nov 2023 09:48:45 GMT
- Title: DynaPipe: Optimizing Multi-task Training through Dynamic Pipelines
- Authors: Chenyu Jiang, Zhen Jia, Shuai Zheng, Yida Wang, Chuan Wu
- Abstract summary: This paper proposes a dynamic micro-batching approach to tackle sequence length variation and enable efficient multi-task model training.
We optimize micro-batch construction using a dynamic programming-based approach, and handle micro-batch execution time variation through dynamic pipeline and communication scheduling.
- Score: 15.332562681746081
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-task model training has been adopted to enable a single deep neural
network model (often a large language model) to handle multiple tasks (e.g.,
question answering and text summarization). Multi-task training commonly
receives input sequences of highly different lengths due to the diverse
contexts of different tasks. Padding (to the same sequence length) or packing
(short examples into long sequences of the same length) is usually adopted to
prepare input samples for model training, which is nonetheless not space or
computation efficient. This paper proposes a dynamic micro-batching approach to
tackle sequence length variation and enable efficient multi-task model
training. We advocate pipeline-parallel training of the large model with
variable-length micro-batches, each of which potentially comprises a different
number of samples. We optimize micro-batch construction using a dynamic
programming-based approach, and handle micro-batch execution time variation
through dynamic pipeline and communication scheduling, enabling highly
efficient pipeline training. Extensive evaluation on the FLANv2 dataset
demonstrates up to 4.39x higher training throughput when training T5, and 3.25x
when training GPT, as compared with packing-based baselines. DynaPipe's source
code is publicly available at
https://github.com/awslabs/optimizing-multitask-training-through-dynamic-pipelines.
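For intuition on the dynamic-programming step described in the abstract, the sketch below partitions length-sorted sequences into contiguous micro-batches under a toy cost model (padded tokens plus a fixed per-micro-batch overhead) and a padded-token budget. The function name, cost model, and parameter values are assumptions for illustration only; DynaPipe's actual planner additionally accounts for execution time variation and scheduling, per the abstract.

```python
# A minimal sketch of dynamic-programming micro-batch construction under a
# simple cost model (padded tokens plus a fixed per-micro-batch overhead).
# The function name, cost model, and parameters are illustrative assumptions,
# not DynaPipe's actual API or planner.
from functools import lru_cache
from typing import List, Tuple

def construct_micro_batches(seq_lens: List[int],
                            token_budget: int,
                            batch_overhead: int) -> List[Tuple[int, int]]:
    """Partition length-sorted sequences into contiguous micro-batches.

    Each micro-batch pads its members to its longest sequence; the DP
    minimizes total (padded tokens + per-batch overhead) subject to a
    per-micro-batch padded-token budget.
    """
    lens = sorted(seq_lens)
    n = len(lens)

    @lru_cache(maxsize=None)
    def best(i: int):
        # Best (cost, split point) for the suffix lens[i:].
        if i == n:
            return 0, n
        best_cost, best_j = float("inf"), i + 1
        for j in range(i + 1, n + 1):
            padded = (j - i) * lens[j - 1]   # tokens after padding to max
            if padded > token_budget:
                break  # lens is sorted, so larger j only grows the batch
            cost = batch_overhead + padded + best(j)[0]
            if cost < best_cost:
                best_cost, best_j = cost, j
        return best_cost, best_j

    # Walk the memoized split points to recover the partition.
    batches, i = [], 0
    while i < n:
        j = best(i)[1]
        batches.append((i, j))  # micro-batch covers lens[i:j]
        i = j
    return batches

# Skewed lengths, as in multi-task inputs: short prompts get grouped,
# long ones stay alone -> [(0, 3), (3, 4), (4, 5), (5, 6)]
print(construct_micro_batches([32, 40, 48, 512, 768, 1024],
                              token_budget=4096, batch_overhead=256))
```

With highly skewed lengths, this toy DP groups the short sequences into one micro-batch and isolates the long ones, illustrating why variable-size micro-batches can waste far fewer padded tokens than padding every sample to a single global maximum length.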
Related papers
- Code Review Automation Via Multi-task Federated LLM -- An Empirical Study [4.8342038441006805]
The study explores five simple techniques for multi-task training, including two sequential methods, one parallel method, and two cumulative methods.
The results indicate that sequentially training a federated LLM (FedLLM) for our code review multi-task use case is less efficient in terms of time, computation, and performance metrics, compared to training separate models for each task.
arXiv Detail & Related papers (2024-12-20T08:46:46Z)
- Instruction Pre-Training: Language Models are Supervised Multitask Learners [115.95022434390181]
In this paper, we propose a framework that augments massive raw corpora with instruction-response pairs to pre-train language models (LMs).
In our experiments, we synthesize 200M instruction-response pairs covering 40+ task categories to verify the effectiveness of Instruction Pre-Training.
arXiv Detail & Related papers (2024-06-20T16:55:33Z)
- Prototype-based HyperAdapter for Sample-Efficient Multi-task Tuning [30.251155072822055]
Prototype-based HyperAdapter (PHA) is a novel framework built on the adapter-tuning and hypernetwork.
It introduces an instance-dense retriever and prototypical hypernetwork to generate conditional modules in a sample-efficient manner.
We show that PHA strikes a better trade-off between trainable parameters, accuracy on stream tasks, and sample efficiency.
arXiv Detail & Related papers (2023-10-18T02:42:17Z)
- Scalarization for Multi-Task and Multi-Domain Learning at Scale [15.545810422759295]
Training a single model on multiple input domains and/or output tasks allows for compressing information from multiple sources into a unified backbone.
However, optimizing such networks is a challenge due to discrepancies between the different tasks or domains.
arXiv Detail & Related papers (2023-10-13T07:31:04Z)
- Diffusion Model is an Effective Planner and Data Synthesizer for Multi-Task Reinforcement Learning [101.66860222415512]
Multi-Task Diffusion Model (MTDiff) is a diffusion-based method that incorporates Transformer backbones and prompt learning for generative planning and data synthesis.
For generative planning, we find MTDiff outperforms state-of-the-art algorithms across 50 tasks on Meta-World and 8 maps on Maze2D.
arXiv Detail & Related papers (2023-05-29T05:20:38Z)
- Polyhistor: Parameter-Efficient Multi-Task Adaptation for Dense Vision Tasks [36.34331439747556]
We propose Polyhistor and Polyhistor-Lite to share information across different tasks with a few trainable parameters.
Specifically, Polyhistor achieves competitive accuracy compared to the state-of-the-art while only using 10% of their trainable parameters.
arXiv Detail & Related papers (2022-10-07T00:25:02Z)
- DiSparse: Disentangled Sparsification for Multitask Model Compression [92.84435347164435]
DiSparse is a simple, effective, and first-of-its-kind multitask pruning and sparse training scheme.
Our experimental results demonstrate superior performance on various configurations and settings.
arXiv Detail & Related papers (2022-06-09T17:57:46Z)
- PolyViT: Co-training Vision Transformers on Images, Videos and Audio [80.0913507142036]
We present PolyViT, a model trained on image, audio and video.
By co-training different tasks on a single modality, we are able to improve the accuracy of each individual task.
We show that co-training is simple and practical to implement.
arXiv Detail & Related papers (2021-11-25T10:01:05Z)
- ExT5: Towards Extreme Multi-Task Scaling for Transfer Learning [56.54359715403561]
This paper introduces ExMix, a massive collection of 107 supervised NLP tasks across diverse domains and task-families.
Using ExMix, we study the effect of multi-task pre-training at the largest scale to date, and analyze co-training transfer amongst common families of tasks.
We propose ExT5, a model pre-trained using a multi-task objective of self-supervised span denoising and supervised ExMix.
arXiv Detail & Related papers (2021-11-22T02:34:46Z)
- TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models [60.23234205219347]
TeraPipe is a high-performance token-level pipeline parallel algorithm for synchronous model-parallel training of Transformer-based language models.
We show that TeraPipe can speed up the training by 5.0x for the largest GPT-3 model with 175 billion parameters on an AWS cluster.
arXiv Detail & Related papers (2021-02-16T07:34:32Z)