Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts
- URL: http://arxiv.org/abs/2502.19811v3
- Date: Tue, 04 Mar 2025 09:54:37 GMT
- Title: Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts
- Authors: Shulai Zhang, Ningxin Zheng, Haibin Lin, Ziheng Jiang, Wenlei Bao, Chengquan Jiang, Qi Hou, Weihao Cui, Size Zheng, Li-Wen Chang, Quan Chen, Xin Liu
- Abstract summary: Mixture-of-experts (MoE) has been extensively employed to scale large language models to trillion-plus parameters. Existing methods suggest pipelining the communication in a MoE layer with the computation so that the two overlap. We present COMET, an optimized MoE system with fine-grained communication-computation overlapping.
- Score: 8.80408909878008
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Mixture-of-experts (MoE) has been extensively employed to scale large language models to trillion-plus parameters while maintaining a fixed computational cost. Developing large MoE models in a distributed setting, however, incurs large communication overhead: the inter-device communication of a MoE layer can occupy 47% of the entire model execution time with popular models and frameworks. Existing methods therefore suggest pipelining the communication in a MoE layer with the computation so that the two overlap. However, these coarse-grained overlapping schemes notably impair computational efficiency, and the resulting latency concealment is sub-optimal. To this end, we present COMET, an optimized MoE system with fine-grained communication-computation overlapping. Leveraging data dependency analysis and task rescheduling, COMET achieves precise fine-grained overlapping of communication and computation. Through adaptive workload assignment, COMET effectively eliminates fine-grained communication bottlenecks and enhances its adaptability across various scenarios. Our evaluation shows that COMET accelerates the execution of a single MoE layer by $1.96\times$ and delivers a $1.71\times$ end-to-end speedup on average. COMET has been adopted in production on clusters with tens of thousands of GPUs, saving millions of GPU hours.
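
To make the overlapping idea concrete, below is a minimal chunk-level pipeline sketch in PyTorch: the token batch is split into chunks so that the all-to-all dispatch of later chunks proceeds while the expert FFN runs on earlier ones. This is only an illustration of the general communication-computation overlapping technique the abstract discusses, not COMET's implementation (COMET overlaps at a much finer granularity via data-dependency analysis, task rescheduling, and adaptive workload assignment); `expert_ffn`, `num_chunks`, and the omission of gating and the combine all-to-all are assumptions made for brevity.

```python
import torch
import torch.distributed as dist

def moe_dispatch_compute_overlap(tokens, expert_ffn, num_chunks=4):
    """Chunk-level overlap of the MoE all-to-all dispatch with expert computation.
    Assumes an initialized NCCL process group and that tokens.shape[0] is divisible
    by num_chunks * world_size. Illustrative sketch only; routing/gating and the
    combine all-to-all are omitted."""
    chunks = list(tokens.chunk(num_chunks, dim=0))
    recv_bufs = [torch.empty_like(c) for c in chunks]

    # Launch every chunk's dispatch asynchronously; NCCL executes them on its own
    # stream, so they can run concurrently with the compute issued below.
    handles = [
        dist.all_to_all_single(recv_bufs[i], chunks[i].contiguous(), async_op=True)
        for i in range(num_chunks)
    ]

    outputs = []
    for i in range(num_chunks):
        handles[i].wait()                         # block only on chunk i's dispatch
        outputs.append(expert_ffn(recv_bufs[i]))  # overlaps with dispatch of chunks > i
    return torch.cat(outputs, dim=0)
```

With `num_chunks = 1` this degenerates to the serialized dispatch-then-compute baseline; larger values hide more communication but shrink each expert GEMM, which is exactly the efficiency-versus-overlap tension the abstract attributes to coarse-grained schemes and that COMET's finer-grained scheduling aims to resolve.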
Related papers
- EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference [49.94169109038806]
This paper introduces EPS-MoE, a novel expert pipeline scheduler for MoE that surpasses existing parallelism schemes. Our results demonstrate up to 52.4% improvement in prefill throughput compared to existing parallel inference methods.
arXiv Detail & Related papers (2024-10-16T05:17:49Z)
- Parm: Efficient Training of Large Sparsely-Activated Models with Dedicated Schedules [15.680276212483292]
We propose Parm, a system that accelerates MP+EP+ESP training by designing two dedicated schedules for placing communication tasks.
Parm achieves 1.13$\times$ to 5.77$\times$ speedup on 1296 manually configured MoE layers and approximately 3$\times$ improvement on two real-world MoE models.
arXiv Detail & Related papers (2024-06-30T05:55:11Z)
- ACCO: Accumulate while you Communicate, Hiding Communications in Distributed LLM Training [16.560270624096706]
We propose a memory-efficient optimization algorithm tailored for distributed training of Large Language Models.
Our method relies on a novel technique to mitigate the one-step delay inherent in parallel execution of gradient computations and communications.
arXiv Detail & Related papers (2024-06-03T08:23:45Z)
- Shortcut-connected Expert Parallelism for Accelerating Mixture-of-Experts [4.629608387540524]
We present a novel shortcut-connected MoE (ScMoE) architecture with an overlapping parallel strategy.
ScMoE allows a substantial portion, 70% to 100%, of the communication to be overlapped with computation.
Building on the ScMoE architecture, we further implement an expert offloading strategy to facilitate memory-limited inference.
arXiv Detail & Related papers (2024-04-07T17:17:23Z)
- LocMoE: A Low-Overhead MoE for Large Language Model Training [13.153904674287546]
We propose a novel routing strategy that combines load balance and locality by converting part of the inter-node communication into intra-node communication.
The proposed LocMoE reduces training time per epoch by 12.68% to 22.24% compared to classical routers.
arXiv Detail & Related papers (2024-01-25T03:36:39Z)
- A Multi-Head Ensemble Multi-Task Learning Approach for Dynamical Computation Offloading [62.34538208323411]
We propose a multi-head ensemble multi-task learning (MEMTL) approach with a shared backbone and multiple prediction heads (PHs).
MEMTL outperforms benchmark methods in both the inference accuracy and mean square error without requiring additional training data.
arXiv Detail & Related papers (2023-09-02T11:01:16Z)
- Asynchronous Parallel Incremental Block-Coordinate Descent for Decentralized Machine Learning [55.198301429316125]
Machine learning (ML) is a key technique for big-data-driven modelling and analysis of massive Internet of Things (IoT) based intelligent and ubiquitous computing.
For fast-increasing applications and data amounts, distributed learning is a promising emerging paradigm since it is often impractical or inefficient to share/aggregate data.
This paper studies the problem of training an ML model over decentralized systems, where data are distributed over many user devices.
arXiv Detail & Related papers (2022-02-07T15:04:15Z)
- An Adaptive Device-Edge Co-Inference Framework Based on Soft Actor-Critic [72.35307086274912]
High-dimensional parameter models and large-scale mathematical calculations restrict execution efficiency, especially for Internet of Things (IoT) devices.
We propose a new Deep Reinforcement Learning (DRL)-Soft Actor Critic for discrete (SAC-d), which generates the exit point, exit point, and compressing bits by soft policy iterations.
Based on the latency- and accuracy-aware reward design, such a computation can adapt well to complex environments like dynamic wireless channels and arbitrary processing, and is capable of supporting 5G URLLC.
arXiv Detail & Related papers (2022-01-09T09:31:50Z)
- Coded Stochastic ADMM for Decentralized Consensus Optimization with Edge Computing [113.52575069030192]
Big data, including data from applications with high security requirements, are often collected and stored on multiple heterogeneous devices, such as mobile devices, drones, and vehicles.
Due to the limitations of communication costs and security requirements, it is of paramount importance to extract information in a decentralized manner instead of aggregating data to a fusion center.
We consider the problem of learning model parameters in a multi-agent system with data locally processed via distributed edge nodes.
A class of mini-batch alternating direction method of multipliers (ADMM) algorithms is explored to develop the distributed learning model.
arXiv Detail & Related papers (2020-10-02T10:41:59Z)
- Straggler-aware Distributed Learning: Communication Computation Latency Trade-off [56.08535873173518]
Straggling workers can be tolerated by assigning redundant computations and coding across data and computations.
In most existing schemes, each non-straggling worker transmits one message per iteration to the parameter server (PS) after completing all its computations.
Imposing such a limitation results in two main drawbacks: over-computation due to inaccurate prediction of the straggling behaviour, and under-utilization due to treating workers strictly as straggler or non-straggler.
arXiv Detail & Related papers (2020-04-10T08:39:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.