MoE-Gen: High-Throughput MoE Inference on a Single GPU with Module-Based Batching
- URL: http://arxiv.org/abs/2503.09716v1
- Date: Wed, 12 Mar 2025 18:08:01 GMT
- Title: MoE-Gen: High-Throughput MoE Inference on a Single GPU with Module-Based Batching
- Authors: Tairan Xu, Leyang Xue, Zhan Lu, Adrian Jackson, Luo Mai
- Abstract summary: MoE-Gen is a high-throughput MoE inference system for single-GPU execution. We introduce module-based batching, which accumulates tokens in host memory and dynamically launches large batches on GPUs to maximize utilization. MoE-Gen achieves 8-31x higher throughput compared to state-of-the-art systems.
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: This paper presents MoE-Gen, a high-throughput MoE inference system optimized for single-GPU execution. Existing inference systems rely on model-based or continuous batching strategies, originally designed for interactive inference, which result in excessively small batches for MoE's key modules (attention and expert modules), leading to poor throughput. To address this, we introduce module-based batching, which accumulates tokens in host memory and dynamically launches large batches on GPUs to maximize utilization. Additionally, we optimize the choice of batch sizes for each module in an MoE to fully overlap GPU computation and communication, maximizing throughput. Evaluation demonstrates that MoE-Gen achieves 8-31x higher throughput compared to state-of-the-art systems employing model-based batching (FlexGen, MoE-Lightning, DeepSpeed), and offers even greater throughput improvements over continuous batching systems (e.g., vLLM and Ollama) on popular MoE models (DeepSeek and Mixtral) across offline inference tasks. MoE-Gen's source code is publicly available at https://github.com/EfficientMoE/MoE-Gen
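As a concrete illustration of module-based batching, here is a minimal PyTorch sketch following the abstract's description: activations accumulate in host memory until a per-module threshold is met, then one large batch is launched on the GPU. The class, method, and threshold names are illustrative, not MoE-Gen's actual API.
```python
import torch

class ModuleBatcher:
    """Buffers a module's inputs in host memory and launches one large
    GPU batch once a module-specific threshold is reached."""

    def __init__(self, module: torch.nn.Module, batch_threshold: int, device: str = "cuda"):
        self.module = module.to(device)
        self.batch_threshold = batch_threshold  # tuned per module (attention vs. expert)
        self.device = device
        self.pending: list[torch.Tensor] = []   # activations held in host (CPU) memory

    def submit(self, activation: torch.Tensor) -> None:
        # Pinned host memory would allow truly asynchronous host-to-device copies.
        self.pending.append(activation.cpu())

    def ready(self) -> bool:
        return sum(t.shape[0] for t in self.pending) >= self.batch_threshold

    def flush(self) -> torch.Tensor:
        # Launch one large batch to maximize GPU utilization.
        batch = torch.cat(self.pending, dim=0).to(self.device, non_blocking=True)
        self.pending.clear()
        with torch.no_grad():
            return self.module(batch)
```
Choosing `batch_threshold` separately for attention and expert modules is the knob the paper optimizes to overlap GPU computation and communication.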
Related papers
- MoE-Lens: Towards the Hardware Limit of High-Throughput MoE LLM Serving Under Resource Constraints
MoE-Lens is an inference system designed through holistic performance modeling for resource-constrained environments.
It captures the system execution mechanisms to identify the key hardware bottlenecks and accurately predict the achievable throughput.
Evaluated on diverse MoE models and datasets, MoE-Lens outperforms the state-of-the-art solution by 4.6x on average (up to 25.5x).
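To illustrate the flavor of such performance modeling, the sketch below computes a generic roofline-style throughput bound from per-token compute, HBM, and PCIe costs; it is an assumption-laden stand-in, not MoE-Lens's actual model.
```python
def predicted_tokens_per_s(flops_per_token: float, hbm_bytes_per_token: float,
                           pcie_bytes_per_token: float, peak_flops: float,
                           peak_hbm_bw: float, pcie_bw: float) -> float:
    """Roofline-style bound: throughput is set by the slowest of compute,
    GPU memory traffic, and host-GPU transfer, assuming perfect overlap."""
    t_compute = flops_per_token / peak_flops
    t_hbm = hbm_bytes_per_token / peak_hbm_bw
    t_pcie = pcie_bytes_per_token / pcie_bw
    return 1.0 / max(t_compute, t_hbm, t_pcie)
```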
arXiv Detail & Related papers (2025-04-12T21:26:56Z)
- HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference
HybriMoE is a hybrid CPU-GPU inference framework that improves resource utilization through a novel CPU-GPU scheduling and cache management system.
We implement HybriMoE on top of the kTransformers framework and evaluate it on three widely used MoE-based LLMs.
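To make the scheduling idea concrete, here is a hedged sketch of an LRU expert cache plus a cost-based CPU/GPU dispatch rule; both the policy and the cost model are illustrative stand-ins, not HybriMoE's actual algorithm.
```python
from collections import OrderedDict

class GPUExpertCache:
    """LRU cache of expert weights resident in GPU memory."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.resident = OrderedDict()  # expert_id -> weights, in LRU order

    def hit(self, expert_id) -> bool:
        if expert_id in self.resident:
            self.resident.move_to_end(expert_id)  # refresh recency on a hit
            return True
        return False

    def admit(self, expert_id, weights) -> None:
        if len(self.resident) >= self.capacity:
            self.resident.popitem(last=False)     # evict the coldest expert
        self.resident[expert_id] = weights

def dispatch(cached: bool, n_tokens: int, transfer_ms: float,
             gpu_ms_per_token: float, cpu_ms_per_token: float) -> str:
    """Run on GPU if cached; otherwise pick the cheaper of copy+GPU vs. CPU."""
    if cached:
        return "gpu"
    gpu_cost = transfer_ms + n_tokens * gpu_ms_per_token
    cpu_cost = n_tokens * cpu_ms_per_token
    return "gpu" if gpu_cost < cpu_cost else "cpu"
```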
arXiv Detail & Related papers (2025-04-08T10:47:37Z)
- MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism
Mixture-of-Experts (MoE) showcases tremendous potential to scale large language models (LLMs) with enhanced performance and reduced computational complexity.
We present MegaScale-Infer, an efficient and cost-effective system for serving large-scale MoE models.
arXiv Detail & Related papers (2025-04-03T04:20:44Z)
- MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs
The MoE architecture is renowned for its ability to increase model capacity without a proportional increase in inference cost.
MoE-Lightning introduces a novel CPU-GPU-I/O pipelining schedule, CGOPipe, with paged weights to achieve high resource utilization.
MoE-Lightning can achieve up to 10.3x higher throughput than state-of-the-art offloading-enabled LLM inference systems for Mixtral 8x7B on a single T4 GPU (16GB).
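The CPU-GPU half of such a pipeline can be sketched with two CUDA streams and double-buffered weights, as below; the real CGOPipe schedule also pipelines I/O and uses paged weights, and the matrix multiply is a stand-in for the layer's compute.
```python
import torch

def run_layers_streamed(layers_cpu: list, x: torch.Tensor, device: str = "cuda") -> torch.Tensor:
    """Double-buffered weight streaming: copy layer i+1 on a side stream
    while layer i computes on the default stream."""
    compute_stream = torch.cuda.current_stream(device)
    copy_stream = torch.cuda.Stream(device=device)
    bufs = [None, None]

    def prefetch(i: int) -> None:
        with torch.cuda.stream(copy_stream):
            copy_stream.wait_stream(compute_stream)  # don't clobber a buffer still in use
            # Host tensors should be pinned for the copy to be truly asynchronous.
            bufs[i % 2] = layers_cpu[i].to(device, non_blocking=True)

    prefetch(0)
    for i in range(len(layers_cpu)):
        compute_stream.wait_stream(copy_stream)  # ensure this layer's weights arrived
        w = bufs[i % 2]
        if i + 1 < len(layers_cpu):
            prefetch(i + 1)  # next layer's copy overlaps this layer's compute
        x = x @ w.T          # stand-in for the real layer computation
    return x
```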
arXiv Detail & Related papers (2024-11-18T01:06:12Z)
- EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference
This paper introduces EPS-MoE, a novel expert pipeline scheduler for MoE that surpasses existing parallelism schemes. Our results demonstrate up to a 52.4% improvement in prefill throughput compared to existing parallel inference methods.
arXiv Detail & Related papers (2024-10-16T05:17:49Z)
- MoE Jetpack: From Dense Checkpoints to Adaptive Mixture of Experts for Vision Tasks
Training MoE models from scratch requires extensive data and computational resources.
We introduce MoE Jetpack, an effective method for fine-tuning dense checkpoints into MoE models.
Our experiments on vision tasks demonstrate that MoE Jetpack significantly improves convergence speed and accuracy.
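The starting point, initializing experts from a dense checkpoint (often called upcycling), can be sketched as follows; MoE Jetpack's actual checkpoint recycling and adaptive routing are more sophisticated than this.
```python
import copy
import torch.nn as nn

def upcycle_ffn(dense_ffn: nn.Module, d_model: int, num_experts: int):
    """Clone a dense FFN into N identical experts and add a fresh router."""
    experts = nn.ModuleList(copy.deepcopy(dense_ffn) for _ in range(num_experts))
    router = nn.Linear(d_model, num_experts)  # trained from scratch during fine-tuning
    return experts, router
```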
arXiv Detail & Related papers (2024-06-07T10:05:42Z)
- Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment
Large language models (LLMs) have revolutionized Natural Language Processing (NLP), but their size creates computational bottlenecks.
We introduce a novel approach to create accurate, sparse foundational versions of performant LLMs.
We show a total speedup on CPUs for sparse-quantized LLaMA models of up to 8.6x.
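As a hedged illustration of the sparsification half of that recipe, the sketch below performs one-shot magnitude pruning to a target sparsity; the paper's method relies on sparse pretraining rather than this simple post-hoc step.
```python
import torch

def magnitude_prune_(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Zero out the smallest-magnitude fraction of weights, in place."""
    k = int(weight.numel() * sparsity)
    if k > 0:
        threshold = weight.abs().flatten().kthvalue(k).values
        weight[weight.abs() <= threshold] = 0.0
    return weight
```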
arXiv Detail & Related papers (2024-05-06T16:03:32Z)
- MoE-Infinity: Efficient MoE Inference on Personal Machines with Sparsity-Aware Expert Cache
MoE-Infinity is an efficient MoE inference system designed for personal machines with limited GPU memory capacity. By analyzing selected traces, MoE-Infinity guides the replacement and prefetching of the expert cache, providing 3.1-16.7x per-token latency improvements.
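A minimal sketch of the trace idea, with illustrative data structures rather than MoE-Infinity's actual ones: record which expert tends to follow which, then prefetch the most likely successors.
```python
from collections import Counter, defaultdict

class ExpertTrace:
    """First-order trace of expert activations: which expert follows which."""

    def __init__(self):
        self.successors = defaultdict(Counter)
        self.prev = None

    def record(self, expert_id) -> None:
        if self.prev is not None:
            self.successors[self.prev][expert_id] += 1
        self.prev = expert_id

    def prefetch_candidates(self, expert_id, k: int = 2) -> list:
        # Prefetch the experts most often observed after this one.
        return [e for e, _ in self.successors[expert_id].most_common(k)]
```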
arXiv Detail & Related papers (2024-01-25T18:07:50Z)
- SqueezeLLM: Dense-and-Sparse Quantization
The main bottleneck for single-batch generative inference with LLMs is memory bandwidth rather than compute.
We introduce SqueezeLLM, a post-training quantization framework that enables lossless compression to ultra-low precisions as low as 3-bit.
Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format.
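Idea (ii) can be sketched directly: pull a small fraction of outlier weights into a sparse full-precision matrix and quantize the dense remainder. The uniform quantizer below is a simplification; SqueezeLLM itself uses sensitivity-based non-uniform codebooks.
```python
import torch

def dense_and_sparse(weight: torch.Tensor, outlier_frac: float = 0.005, bits: int = 3):
    """Split outliers into a sparse fp matrix; uniform-quantize the rest."""
    k = max(1, int(weight.numel() * outlier_frac))
    threshold = weight.abs().flatten().topk(k).values.min()
    outlier_mask = weight.abs() >= threshold
    sparse_outliers = (weight * outlier_mask).to_sparse()   # full-precision outliers
    dense = weight.masked_fill(outlier_mask, 0.0)
    scale = dense.abs().max() / (2 ** (bits - 1) - 1)
    q = torch.clamp((dense / scale).round(), -(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
    # Approximate reconstruction: q.float() * scale + sparse_outliers.to_dense()
    return q.to(torch.int8), scale, sparse_outliers
```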
arXiv Detail & Related papers (2023-06-13T08:57:54Z)
- FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU
FlexGen is a generation engine for running large language model (LLM) inference on a single commodity GPU.
When running OPT-175B on a single 16GB GPU, FlexGen achieves significantly higher throughput compared to state-of-the-art offloading systems.
On the HELM benchmark, FlexGen can benchmark a 30B model with a 16GB GPU on 7 representative sub-scenarios in 21 hours.
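The throughput lever behind such offloading engines is batch amortization of weight transfers; the back-of-envelope sketch below shows it, with every number a placeholder rather than FlexGen's measured costs.
```python
def amortized_ms_per_token(weight_gb_per_layer: float, pcie_gb_per_s: float,
                           batch_size: int, compute_ms_per_token: float,
                           n_layers: int) -> float:
    """Per-layer weight transfer overlaps with compute over the whole batch,
    so larger batches amortize the fixed transfer cost."""
    transfer_ms = weight_gb_per_layer / pcie_gb_per_s * 1000.0
    per_layer_ms = max(transfer_ms, batch_size * compute_ms_per_token)
    return n_layers * per_layer_ms / batch_size

# e.g. amortized_ms_per_token(2.0, 8.0, 256, 0.01, 96) vs. batch_size=1
```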
arXiv Detail & Related papers (2023-03-13T05:19:28Z)
- MoESys: A Distributed and Efficient Mixture-of-Experts Training and Inference System for Internet Services
We present MoESys, a novel system that boosts efficiency in both large-scale training and inference.
Specifically, in the training procedure, the proposed MoESys adopts an Elastic MoE training strategy with 2D prefetch and Fusion communication over Hierarchical storage.
For scalable inference in a single node, MoESys builds the CPU-GPU memory jointly into a ring of sections to load the model, and executes the computation tasks across the memory sections in a round-robin manner for efficient inference.
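A hedged sketch of that ring-of-sections scheme, with illustrative section contents and a compute stub standing in for the real kernels:
```python
class RingOfSections:
    """Model weights split into sections placed across CPU and GPU memory."""

    def __init__(self, sections):
        self.sections = sections
        self.idx = 0

    def next_section(self):
        section = self.sections[self.idx]
        self.idx = (self.idx + 1) % len(self.sections)  # round-robin traversal
        return section

def forward_round_robin(ring, x, num_sections, apply_section):
    # Each step executes one section's computation where its weights live.
    for _ in range(num_sections):
        x = apply_section(ring.next_section(), x)
    return x
```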
arXiv Detail & Related papers (2022-05-20T09:09:27Z)
- FastMoE: A Fast Mixture-of-Expert Training System
Mixture-of-Expert (MoE) presents strong potential for enlarging language models to trillions of parameters.
FastMoE is a distributed MoE training system based on PyTorch with common accelerators.
arXiv Detail & Related papers (2021-03-24T15:27:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.