Related papers: EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference

EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference

URL: http://arxiv.org/abs/2410.12247v2
Date: Fri, 03 Jan 2025 06:19:14 GMT
Title: EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference
Authors: Yulei Qian, Fengcun Li, Xiangyang Ji, Xiaoyu Zhao, Jianchao Tan, Kefeng Zhang, Xunliang Cai,
Abstract summary: This paper introduces EPS-MoE, a novel expert pipeline scheduler for MoE that surpasses the existing parallelism schemes.<n>Our results demonstrate at most 52.4% improvement in prefill throughput compared to existing parallel inference methods.
Score: 49.94169109038806
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The Mixture-of-Experts (MoE) model has emerged as a prominent architecture in the field of Large Language Models (LLMs), providing a better balance between model performance and computational efficiency. However the General Matrix Multiply (GEMM) operations and large parameters introduce challenges related to computational efficiency and communication overhead, which become throughput bottlenecks during inference. Applying a single parallelism strategy like EP, DP, TP or a straightforward combination of them to MoE usually achieves sub-optimal inference throughput. This paper introduces EPS-MoE, a novel expert pipeline scheduler for MoE that surpasses the existing parallelism schemes. Our approach optimizes the computation of MoE FeedForward Network (FFN) modules by dynamically selecting the best kernel implementation of GroupGemm and DenseGemm for different loads and adaptively overlapping these computations with communication, leading to a substantial increase in throughput. Our experimental results demonstrate at most 52.4\% improvement in prefill throughput compared to existing parallel inference methods. Specifically, our method accelerated the highly optimized DeepSeekV2 model from a claimed 100K tokens per second to at least 120K tokens per second.

Related papers

dInfer: An Efficient Inference Framework for Diffusion Language Models [54.80918957287927]
Diffusion-based large language models (dLLMs) have emerged as a promising alternative to autoregressive (AR) LLMs.<n>We present dInfer, an efficient and framework for dLLM inference.
arXiv Detail & Related papers (2025-10-09T16:19:42Z)
AdaPtis: Reducing Pipeline Bubbles with Adaptive Pipeline Parallelism on Heterogeneous Models [59.7059443712562]
AdaPtis is a training system for large language models (LLMs) that supports adaptive pipeline parallelism.<n>Extensive experiments show that AdaPtis achieves an average speedup of 1.42x (up to 2.14x) over Megatron-LM I-1F1B.
arXiv Detail & Related papers (2025-09-28T08:05:13Z)
Pangu Ultra MoE: How to Train Your Big MoE on Ascend NPUs [111.69640966866059]
Sparse large language models (LLMs) with Mixture of Experts (MoE) and close to a trillion parameters are dominating the realm of most capable language models.<n>In this paper, we aim to uncover a recipe to harness such scale on Ascend NPUs.<n>The key goals are better usage of the computing resources under the dynamic sparse model structures and materializing the expected performance gain on the actual hardware.
arXiv Detail & Related papers (2025-05-07T15:46:36Z)
MoE Parallel Folding: Heterogeneous Parallelism Mappings for Efficient Large-Scale MoE Model Training with Megatron Core [11.40633051522406]
We introduce an end-to-end training framework for large-scale MoE models. MoE Parallel Folding is a novel strategy that decouples the parallelization of attention and MoE in Transformer models. A flexible token-level dispatcher supports both token-dropping and token-dropless MoE training.
arXiv Detail & Related papers (2025-04-21T08:39:47Z)
Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts [8.80408909878008]
Mixture-of-experts (MoE) has been extensively employed to scale large language models to trillion-plus parameters. Existing methods suggest the communication in a MoE layer to be pipelined with the computation for overlapping. We present COMET, an optimized MoE system with fine-grained communication-computation overlapping.
arXiv Detail & Related papers (2025-02-27T06:36:45Z)
FSMoE: A Flexible and Scalable Training System for Sparse Mixture-of-Experts Models [21.96960353910023]
We introduce FSMoE, a flexible training system optimizing task scheduling with three novel techniques. We conduct extensive experiments with configured MoE layers and real-world MoE models on two GPU clusters. FSMoE supports four popular types of MoE routing functions and is more efficient than existing implementations.
arXiv Detail & Related papers (2025-01-18T10:14:37Z)
Read-ME: Refactorizing LLMs as Router-Decoupled Mixture of Experts with System Co-Design [59.00758127310582]
We propose a novel framework Read-ME that transforms pre-trained dense LLMs into smaller MoE models. Our approach employs activation sparsity to extract experts. Read-ME outperforms other popular open-source dense models of similar scales.
arXiv Detail & Related papers (2024-10-24T19:48:51Z)
Duo-LLM: A Framework for Studying Adaptive Computation in Large Language Models [16.16372459671255]
Large Language Models (LLMs) typically generate outputs token by token using a fixed compute budget. We propose a novel framework that integrates smaller auxiliary modules within each Feed-Forward Network layer of the LLM. We show that trained routers operate differently from oracles and often yield suboptimal solutions.
arXiv Detail & Related papers (2024-10-01T16:10:21Z)
Layerwise Recurrent Router for Mixture-of-Experts [42.36093735411238]
Mixture-of-Experts (MoE) architecture stands out for its ability to scale model size without significantly increasing training costs. Current MoE models often display parameter inefficiency. We introduce the Layerwise Recurrent Router for Mixture-of-Experts (RMoE)
arXiv Detail & Related papers (2024-08-13T10:25:13Z)
Parm: Efficient Training of Large Sparsely-Activated Models with Dedicated Schedules [15.680276212483292]
We propose Parm, a system that accelerates MP+EP+ESP training by designing two dedicated schedules for placing communication tasks. Parm achieves 1.13$times$ to 5.77$times$ speedup on 1296 manually configured MoE layers and approximately 3$times$ improvement on two real-world MoE models.
arXiv Detail & Related papers (2024-06-30T05:55:11Z)
Bypass Back-propagation: Optimization-based Structural Pruning for Large Language Models via Policy Gradient [57.9629676017527]
We propose an optimization-based structural pruning on Large-Language Models. We learn the pruning masks in a probabilistic space directly by optimizing the loss of the pruned model. Our method operates for 2.7 hours with around 35GB memory for the 13B models on a single A100 GPU.
arXiv Detail & Related papers (2024-06-15T09:31:03Z)
Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models [62.4691912312317]
Mixture-of-Experts (MoE) language models can reduce computational costs by 2-4$times$ compared to dense models without sacrificing performance. We propose a hybrid dense training and sparse inference framework for MoE models (DS-MoE) which achieves strong computation and parameter efficiency.
arXiv Detail & Related papers (2024-04-08T14:39:49Z)
Shortcut-connected Expert Parallelism for Accelerating Mixture-of-Experts [4.629608387540524]
We present a novel shortcut-connected MoE (ScMoE) architecture with an overlapping parallel strategy. ScMoE allows for a substantial overlap of 70% to 100% with computation. Building on the ScMoE architecture, we further implement an expert offloading strategy to facilitate memory-limited inference.
arXiv Detail & Related papers (2024-04-07T17:17:23Z)
A Multi-Head Ensemble Multi-Task Learning Approach for Dynamical Computation Offloading [62.34538208323411]
We propose a multi-head ensemble multi-task learning (MEMTL) approach with a shared backbone and multiple prediction heads (PHs) MEMTL outperforms benchmark methods in both the inference accuracy and mean square error without requiring additional training data.
arXiv Detail & Related papers (2023-09-02T11:01:16Z)
Pipeline MoE: A Flexible MoE Implementation with Pipeline Parallelism [91.9372563527801]
Existing MoE models suffer from tremendous inner-node and inter-node communication overhead. We propose a novel MoE architecture called Pipeline MoE (PPMoE) to tackle them. PPMoE builds expert parallel incorporating with tensor parallel and replaces communication-intensive all-to-all dispatching and gathering.
arXiv Detail & Related papers (2023-04-22T14:09:14Z)
Energy-efficient Task Adaptation for NLP Edge Inference Leveraging Heterogeneous Memory Architectures [68.91874045918112]
adapter-ALBERT is an efficient model optimization for maximal data reuse across different tasks. We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
arXiv Detail & Related papers (2023-03-25T14:40:59Z)
Tutel: Adaptive Mixture-of-Experts at Scale [20.036168971435306]
Sparsely-gated mixture-of-experts (MoE) has been widely adopted to scale deep learning models to trillion-plus parameters with fixed computational cost. We present Flex, a highly scalable stack design and implementation for MoE with dynamically adaptive parallelism and pipelining. Our evaluation shows that Flex efficiently and effectively runs a real-world MoE-based model named SwinV2-MoE, built upon Swin Transformer V2, a state-of-the-art computer vision architecture.
arXiv Detail & Related papers (2022-06-07T15:20:20Z)
MoEfication: Conditional Computation of Transformer Models for Efficient Inference [66.56994436947441]
Transformer-based pre-trained language models can achieve superior performance on most NLP tasks due to large parameter capacity, but also lead to huge computation cost. We explore to accelerate large-model inference by conditional computation based on the sparse activation phenomenon. We propose to transform a large model into its mixture-of-experts (MoE) version with equal model size, namely MoEfication.
arXiv Detail & Related papers (2021-10-05T02:14:38Z)
CPM-2: Large-scale Cost-effective Pre-trained Language Models [71.59893315671997]
We present a suite of cost-effective techniques for the use of PLMs to deal with the efficiency issues of pre-training, fine-tuning, and inference. We introduce knowledge inheritance to accelerate the pre-training process by exploiting existing PLMs instead of training models from scratch. We implement a new inference toolkit, namely InfMoE, for using large-scale PLMs with limited computational resources.
arXiv Detail & Related papers (2021-06-20T15:43:54Z)

This list is automatically generated from the titles and abstracts of the papers in this site.