On Optimizing the Communication of Model Parallelism
- URL: http://arxiv.org/abs/2211.05322v2
- Date: Sun, 18 Aug 2024 16:24:54 GMT
- Title: On Optimizing the Communication of Model Parallelism
- Authors: Yonghao Zhuang, Hexu Zhao, Lianmin Zheng, Zhuohan Li, Eric P. Xing, Qirong Ho, Joseph E. Gonzalez, Ion Stoica, Hao Zhang,
- Abstract summary: We study a novel and important communication pattern in large-scale model-parallel deep learning (DL)
In cross-mesh resharding, a sharded tensor needs to be sent from a source device mesh to a destination device mesh.
We propose two contributions to address cross-mesh resharding: an efficient broadcast-based communication system, and an "overlapping-friendly" pipeline schedule.
- Score: 74.15423270435949
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study a novel and important communication pattern in large-scale model-parallel deep learning (DL), which we call cross-mesh resharding. This pattern emerges when the two paradigms of model parallelism - intra-operator and inter-operator parallelism - are combined to support large models on large clusters. In cross-mesh resharding, a sharded tensor needs to be sent from a source device mesh to a destination device mesh, on which the tensor may be distributed with the same or different layouts. We formalize this as a many-to-many multicast communication problem, and show that existing approaches either are sub-optimal or do not generalize to different network topologies or tensor layouts, which result from different model architectures and parallelism strategies. We then propose two contributions to address cross-mesh resharding: an efficient broadcast-based communication system, and an "overlapping-friendly" pipeline schedule. On microbenchmarks, our overall system outperforms existing ones by up to 10x across various tensor and mesh layouts. On end-to-end training of two large models, GPT-3 and U-Transformer, we improve throughput by 10% and 50%, respectively.
Related papers
- ATOM: Asynchronous Training of Massive Models for Deep Learning in a Decentralized Environment [7.916080032572087]
atom is a resilient distributed training framework designed for asynchronous training of vast models in a decentralized setting.
atom aims to accommodate a complete LLM on one host (peer) through seamlessly model swapping and concurrently trains multiple copies across various peers to optimize training throughput.
Our experiments using different GPT-3 model configurations reveal that, in scenarios with suboptimal network connections, atom can enhance training efficiency up to $20 times$ when juxtaposed with the state-of-the-art decentralized pipeline parallelism approaches.
arXiv Detail & Related papers (2024-03-15T17:43:43Z) - Subnetwork-to-go: Elastic Neural Network with Dynamic Training and
Customizable Inference [16.564868336748503]
We propose a simple way to train a large network and flexibly extract a subnetwork from it given a model size or complexity constraint.
Experiment results on a music source separation model show that our proposed method can effectively improve the separation performance across different subnetwork sizes and complexities.
arXiv Detail & Related papers (2023-12-06T12:40:06Z) - AMT: All-Pairs Multi-Field Transforms for Efficient Frame Interpolation [80.33846577924363]
We present All-Pairs Multi-Field Transforms (AMT), a new network architecture for video framegithub.
It is based on two essential designs. First, we build bidirectional volumes for all pairs of pixels, and use the predicted bilateral flows to retrieve correlations.
Second, we derive multiple groups of fine-grained flow fields from one pair of updated coarse flows for performing backward warping on the input frames separately.
arXiv Detail & Related papers (2023-04-19T16:18:47Z) - SWARM Parallelism: Training Large Models Can Be Surprisingly
Communication-Efficient [69.61083127540776]
Deep learning applications benefit from using large models with billions of parameters.
Training these models is notoriously expensive due to the need for specialized HPC clusters.
We consider alternative setups for training large models: using cheap "preemptible" instances or pooling existing resources from multiple regions.
arXiv Detail & Related papers (2023-01-27T18:55:19Z) - Decentralized Training of Foundation Models in Heterogeneous
Environments [77.47261769795992]
Training foundation models, such as GPT-3 and PaLM, can be extremely expensive.
We present the first study of training large foundation models with model parallelism in a decentralized regime over a heterogeneous network.
arXiv Detail & Related papers (2022-06-02T20:19:51Z) - Densely connected multidilated convolutional networks for dense
prediction tasks [25.75557472306157]
We propose a novel CNN architecture called densely connected multidilated DenseNet (D3Net)
D3Net involves a novel multidilated convolution that has different dilation factors in a single layer to model different resolutions simultaneously.
Experiments on the image semantic segmentation task using Cityscapes and the audio source separation task using MUSDB18 show that the proposed method has superior performance over state-of-the-art methods.
arXiv Detail & Related papers (2020-11-21T05:15:12Z) - A Multi-Semantic Metapath Model for Large Scale Heterogeneous Network
Representation Learning [52.83948119677194]
We propose a multi-semantic metapath (MSM) model for large scale heterogeneous representation learning.
Specifically, we generate multi-semantic metapath-based random walks to construct the heterogeneous neighborhood to handle the unbalanced distributions.
We conduct systematical evaluations for the proposed framework on two challenging datasets: Amazon and Alibaba.
arXiv Detail & Related papers (2020-07-19T22:50:20Z) - A Linear Algebraic Approach to Model Parallelism in Deep Learning [0.0]
Training deep neural networks (DNNs) in large-cluster computing environments is increasingly necessary, as networks grow in size and complexity.
We propose a linear-algebraic approach to model parallelism in deep learning, which allows parallel distribution of any tensor in the DNN.
We build distributed DNN layers using these parallel primitives, composed with sequential layer implementations, and demonstrate their application by building and training a distributed DNN using DistDL, a PyTorch and MPI-based distributed deep learning toolkit.
arXiv Detail & Related papers (2020-06-04T19:38:05Z) - Model Fusion via Optimal Transport [64.13185244219353]
We present a layer-wise model fusion algorithm for neural networks.
We show that this can successfully yield "one-shot" knowledge transfer between neural networks trained on heterogeneous non-i.i.d. data.
arXiv Detail & Related papers (2019-10-12T22:07:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.