Lancet: Accelerating Mixture-of-Experts Training via Whole Graph Computation-Communication Overlapping
- URL: http://arxiv.org/abs/2404.19429v1
- Date: Tue, 30 Apr 2024 10:17:21 GMT
- Title: Lancet: Accelerating Mixture-of-Experts Training via Whole Graph Computation-Communication Overlapping
- Authors: Chenyu Jiang, Ye Tian, Zhen Jia, Shuai Zheng, Chuan Wu, Yida Wang,
- Abstract summary: MoE technique plays crucial role in expanding the size of DNN model parameters.
Existing methods attempt to mitigate this issue by overlapping all-to-all with expert computation.
In our study, we extend the scope of this challenge by considering overlap at the broader training graph level.
We implement these techniques in Lancet, a system using compiler-based optimization to automatically enhance MoE model training.
- Score: 14.435637320909663
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Mixture-of-Expert (MoE) technique plays a crucial role in expanding the size of DNN model parameters. However, it faces the challenge of extended all-to-all communication latency during the training process. Existing methods attempt to mitigate this issue by overlapping all-to-all with expert computation. Yet, these methods frequently fall short of achieving sufficient overlap, consequently restricting the potential for performance enhancements. In our study, we extend the scope of this challenge by considering overlap at the broader training graph level. During the forward pass, we enable non-MoE computations to overlap with all-to-all through careful partitioning and pipelining. In the backward pass, we achieve overlap with all-to-all by scheduling gradient weight computations. We implement these techniques in Lancet, a system using compiler-based optimization to automatically enhance MoE model training. Our extensive evaluation reveals that Lancet significantly reduces the time devoted to non-overlapping communication, by as much as 77%. Moreover, it achieves a notable end-to-end speedup of up to 1.3 times when compared to the state-of-the-art solutions.
Related papers
- LocMoE: A Low-Overhead MoE for Large Language Model Training [13.153904674287546]
We propose a novel routing strategy that combines load balance and locality by converting partial inter-node communication to that of intra-node.
The proposed LocMoE reduces training time per epoch by 12.68% to 22.24% compared to classical routers.
arXiv Detail & Related papers (2024-01-25T03:36:39Z) - Exploiting Inter-Layer Expert Affinity for Accelerating
Mixture-of-Experts Model Inference [3.217776693788795]
We propose a lightweight optimization technique called ExFlow to largely accelerate the inference of pre-trained MoE models.
By exploiting the inter-layer expert affinity, our solution can be directly applied to pre-trained MoE models without any fine-tuning or accuracy degradation.
Our solution beats the cutting-edge MoE implementations with experts from 8 to 64, with up to 2.2x improvement in inference throughput.
arXiv Detail & Related papers (2024-01-16T14:16:47Z) - Fractional Deep Reinforcement Learning for Age-Minimal Mobile Edge
Computing [11.403989519949173]
This work focuses on the timeliness of computational-intensive updates, measured by Age-ofInformation (AoI)
We study how to jointly optimize the task updating and offloading policies for AoI with fractional form.
Experimental results show that our proposed algorithms reduce the average AoI by up to 57.6% compared with several non-fractional benchmarks.
arXiv Detail & Related papers (2023-12-16T11:13:40Z) - Diffusion for Natural Image Matting [93.86689168212241]
We present DiffMatte, a solution designed to overcome the challenges of image matting.
First, DiffMatte decouples the decoder from the intricately coupled matting network design, involving only one lightweight decoder in the iterations of the diffusion process.
Second, we employ a self-aligned training strategy with uniform time intervals, ensuring a consistent noise sampling between training and inference across the entire time domain.
arXiv Detail & Related papers (2023-12-10T15:28:56Z) - DeAR: Accelerating Distributed Deep Learning with Fine-Grained
All-Reduce Pipelining [22.168137965177284]
Communication scheduling has been shown to be effective in accelerating distributed training.
We propose a novel scheduling algorithm, DeAR, that decouples the all-reduce primitive into two continuous operations.
We show that DeAR achieves up to 83% and 15% training speedup over the state-of-the-art solutions.
arXiv Detail & Related papers (2023-02-24T04:11:18Z) - Q-Ensemble for Offline RL: Don't Scale the Ensemble, Scale the Batch
Size [58.762959061522736]
We show that scaling mini-batch sizes with appropriate learning rate adjustments can speed up the training process by orders of magnitude.
We show that scaling the mini-batch size and naively adjusting the learning rate allows for (1) a reduced size of the Q-ensemble, (2) stronger penalization of out-of-distribution actions, and (3) improved convergence time.
arXiv Detail & Related papers (2022-11-20T21:48:25Z) - Learning to Optimize Permutation Flow Shop Scheduling via Graph-based
Imitation Learning [70.65666982566655]
Permutation flow shop scheduling (PFSS) is widely used in manufacturing systems.
We propose to train the model via expert-driven imitation learning, which accelerates convergence more stably and accurately.
Our model's network parameters are reduced to only 37% of theirs, and the solution gap of our model towards the expert solutions decreases from 6.8% to 1.3% on average.
arXiv Detail & Related papers (2022-10-31T09:46:26Z) - Multi-task Over-the-Air Federated Learning: A Non-Orthogonal
Transmission Approach [52.85647632037537]
We propose a multi-task over-theair federated learning (MOAFL) framework, where multiple learning tasks share edge devices for data collection and learning models under the coordination of a edge server (ES)
Both the convergence analysis and numerical results demonstrate that the MOAFL framework can significantly reduce the uplink bandwidth consumption of multiple tasks without causing substantial learning performance degradation.
arXiv Detail & Related papers (2021-06-27T13:09:32Z) - Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z) - Straggler-aware Distributed Learning: Communication Computation Latency
Trade-off [56.08535873173518]
Straggling workers can be tolerated by assigning redundant computations and coding across data and computations.
In most existing schemes, each non-straggling worker transmits one message per iteration to the parameter server (PS) after completing all its computations.
Imposing such a limitation results in two main drawbacks; over-computation due to inaccurate prediction of the straggling behaviour, and under-utilization due to treating workers as straggler/non-straggler.
arXiv Detail & Related papers (2020-04-10T08:39:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.