MoE-Infinity: Offloading-Efficient MoE Model Serving
- URL: http://arxiv.org/abs/2401.14361v2
- Date: Thu, 1 Aug 2024 13:21:24 GMT
- Title: MoE-Infinity: Offloading-Efficient MoE Model Serving
- Authors: Leyang Xue, Yao Fu, Zhan Lu, Luo Mai, Mahesh Marina
- Abstract summary: MoE-Infinity is an offloading-efficient serving system for sparse mixture-of-experts (MoE) models.
To optimize offloading, MoE-Infinity performs novel request-level tracing of expert activation.
MoE-Infinity exhibits superior latency performance, providing 2-20X improvements when serving various MoE models.
- Score: 15.826989637041907
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: This paper presents MoE-Infinity, an offloading-efficient serving system for sparse mixture-of-experts (MoE) models. To optimize offloading, MoE-Infinity achieves novel request-level tracing for expert activation, capturing MoE's sparse execution patterns such as selective activation, group activation, and skewed reuse. Leveraging the request-level trace, MoE-Infinity performs effective expert prefetching and expert caching, achieving high efficiency in transferring model parameters from host memory to GPU memory. Experimental results demonstrate that MoE-Infinity achieves low latency comparable to expensive full-GPU deployments, which require up to 4X more GPU resources than MoE-Infinity. Compared to offloading-supporting LLM serving systems such as DeepSpeed-Inference, Llama.cpp, Mixtral Offloading, and BrainStorm, MoE-Infinity exhibits superior latency performance, providing 2-20X improvements when serving various MoE models for a large collection of LLM tasks. MoE-Infinity's source code is publicly available at https://github.com/TorchMoE/MoE-Infinity
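The abstract describes the key mechanism only at a high level: a request-level trace of expert activation that captures selective activation, group activation, and skewed reuse, which then drives expert prefetching and caching between host and GPU memory. The sketch below is a minimal illustration of that idea in plain Python; all class, method, and variable names are hypothetical and do not reflect MoE-Infinity's actual implementation or API.

```python
# Minimal illustrative sketch (NOT MoE-Infinity's actual code or API) of how a
# request-level activation trace could drive expert prefetching and GPU caching
# for an offloaded MoE model. All names below are hypothetical.
from collections import Counter, defaultdict


class RequestLevelTrace:
    """Records which experts each MoE layer activated for past requests."""

    def __init__(self, num_layers: int):
        self.num_layers = num_layers
        # Per-layer reuse counts across requests (captures skewed reuse).
        self.reuse = [Counter() for _ in range(num_layers)]
        # Co-activation counts between layer l and layer l+1
        # (captures group activation along a request's execution path).
        self.coact = [defaultdict(Counter) for _ in range(num_layers - 1)]

    def record(self, activations: list[set[int]]) -> None:
        """activations[l] = set of expert ids selected at layer l for one request."""
        for layer, experts in enumerate(activations):
            self.reuse[layer].update(experts)
            if layer + 1 < len(activations):
                for expert in experts:
                    self.coact[layer][expert].update(activations[layer + 1])

    def prefetch_candidates(self, layer: int, active: set[int], k: int) -> list[int]:
        """Experts most likely to be needed at layer+1, given the experts active
        at `layer`; a serving loop would start copying these to GPU early."""
        votes = Counter()
        for expert in active:
            votes.update(self.coact[layer][expert])
        return [eid for eid, _ in votes.most_common(k)]

    def cache_priority(self, layer: int, expert: int) -> int:
        """Higher value = keep this expert resident in GPU memory longer."""
        return self.reuse[layer][expert]


# Example: trace two requests, then pick prefetch targets for a new request.
trace = RequestLevelTrace(num_layers=3)
trace.record([{1, 5}, {2, 5}, {0, 7}])
trace.record([{1, 3}, {2, 6}, {4, 7}])
print(trace.prefetch_candidates(layer=0, active={1}, k=2))  # expert 2 first, then 5 or 6
```

One plausible reading of the abstract is that tracing at request granularity lets cross-layer co-activation and cross-request reuse skew be aggregated cheaply, so prefetch and cache decisions can be made before an expert's parameters are actually needed; the sketch above only illustrates that general idea.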
Related papers
- ProMoE: Fast MoE-based LLM Serving using Proactive Caching [2.041412657843408]
Mixture-of-Experts (MoE) models help mitigate the resource demands of large language models by activating only a subset of the model's parameters during computation.
We propose ProMoE, a novel proactive caching system that leverages intermediate model results to predict subsequent parameter usage.
Our evaluations demonstrate that ProMoE achieves an average speedup of 2.13x and 2.84x in the prefill and decode stages respectively.
arXiv Detail & Related papers (2024-10-29T15:31:27Z) - MoDification: Mixture of Depths Made Easy [36.3113087767816]
Mixture of depths (MoD) is proposed as a natural fit for reducing both latency and memory.
MoDification can achieve up to 1.2x speedup in latency and 1.8x reduction in memory.
arXiv Detail & Related papers (2024-10-18T08:22:07Z) - EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference [49.94169109038806]
This paper introduces EPS-MoE, a novel expert pipeline scheduler for MoE.
Our results demonstrate an average 21% improvement in prefill throughput over existing parallel inference methods.
arXiv Detail & Related papers (2024-10-16T05:17:49Z) - MoE Jetpack: From Dense Checkpoints to Adaptive Mixture of Experts for Vision Tasks [58.075367597860044]
Training MoE models from scratch requires extensive data and computational resources.
We introduce MoE Jetpack, an effective method for fine-tuning dense checkpoints into MoE models.
Our experiments on vision tasks demonstrate that MoE Jetpack significantly improves convergence speed and accuracy.
arXiv Detail & Related papers (2024-06-07T10:05:42Z) - Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment [56.44025052765861]
Large language models (LLMs) have revolutionized Natural Language Processing (NLP), but their size creates computational bottlenecks.
We introduce a novel approach to create accurate, sparse foundational versions of performant LLMs.
We show a total speedup on CPUs for sparse-quantized LLaMA models of up to 8.6x.
arXiv Detail & Related papers (2024-05-06T16:03:32Z) - Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models [62.4691912312317]
Mixture-of-Experts (MoE) language models can reduce computational costs by 2-4x compared to dense models without sacrificing performance.
We propose a hybrid dense training and sparse inference framework for MoE models (DS-MoE) which achieves strong computation and parameter efficiency.
arXiv Detail & Related papers (2024-04-08T14:39:49Z) - Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models [90.14693869269519]
MoE LLMs can achieve higher performance with fewer activated parameters, but they remain hard to deploy due to their immense total parameter sizes.
This paper mainly aims to enhance the deployment efficiency of MoE LLMs by introducing plug-and-play expert-level sparsification techniques.
arXiv Detail & Related papers (2024-02-22T18:56:07Z) - SiDA-MoE: Sparsity-Inspired Data-Aware Serving for Efficient and Scalable Large Mixture-of-Experts Models [20.16600129902895]
Mixture-of-Experts (MoE) has emerged as a favorable architecture in the era of large models.
Yet, the realization of such benefits often results in ineffective GPU memory utilization.
We introduce SiDA-MoE, an efficient inference approach tailored for large MoE models.
arXiv Detail & Related papers (2023-10-29T01:08:55Z) - Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference [23.207326766883405]
Mixture-of-Experts (MoE) is able to scale its model size without proportionally scaling up its computational requirements.
Pre-gated MoE employs our novel pre-gating function which alleviates the dynamic nature of sparse expert activation.
We demonstrate that Pre-gated MoE improves performance and reduces GPU memory consumption while maintaining the same level of model quality.
arXiv Detail & Related papers (2023-08-23T11:25:37Z) - Task-Specific Expert Pruning for Sparse Mixture-of-Experts [105.20605021416276]
The Mixture-of-Experts (MoE) model is powerful for large-scale pre-training.
However, MoE is hard to deploy in cloud or mobile environments.
We propose a general method to progressively drop the non-professional experts for the target downstream task.
arXiv Detail & Related papers (2022-06-01T07:09:01Z)