MoE-Infinity: Efficient MoE Inference on Personal Machines with Sparsity-Aware Expert Cache
- URL: http://arxiv.org/abs/2401.14361v3
- Date: Wed, 12 Mar 2025 18:14:21 GMT
- Title: MoE-Infinity: Efficient MoE Inference on Personal Machines with Sparsity-Aware Expert Cache
- Authors: Leyang Xue, Yao Fu, Zhan Lu, Luo Mai, Mahesh Marina
- Abstract summary: MoE-Infinity is an efficient MoE inference system designed for personal machines with limited GPU memory capacity. By analyzing selected traces, MoE-Infinity guides the replacement and prefetching of the expert cache, providing 3.1-16.7x per-token latency improvements.
- Score: 15.826989637041907
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: This paper presents MoE-Infinity, an efficient MoE inference system designed for personal machines with limited GPU memory capacity. The key idea for MoE-Infinity is that on personal machines, which are often single-user environments, MoE-based LLMs typically operate with a batch size of one. In this setting, MoE models exhibit a high degree of activation sparsity, meaning a small number of experts are frequently reused in generating tokens during the decode phase. Leveraging this idea, we design a sparsity-aware expert cache, which can trace the sparse activation of experts during inference and carefully select the trace that represents the sparsity pattern. By analyzing these selected traces, MoE-Infinity guides the replacement and prefetching of the expert cache, providing 3.1-16.7x per-token latency improvements over numerous state-of-the-art systems, including vLLM, Ollama, DeepSpeed and BrainStorm across various MoE models (DeepSeek and Mixtral) when handling different LLM tasks. MoE-Infinity's source code is publicly available at https://github.com/EfficientMoE/MoE-Infinity
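To make the cache idea above concrete, here is a minimal Python sketch of a sparsity-aware expert cache: it traces which experts each decode step activates, evicts the least-reused expert when GPU memory is full, and ranks prefetch candidates by traced activation frequency. The class and method names are illustrative assumptions only and do not reflect MoE-Infinity's actual implementation or API.

```python
# Illustrative sketch of a sparsity-aware expert cache. All class and method
# names here are hypothetical and do NOT mirror MoE-Infinity's actual code.
from collections import defaultdict


class SparsityAwareExpertCache:
    """Keeps resident the experts that recent decode traces show are reused most."""

    def __init__(self, capacity: int):
        self.capacity = capacity                    # experts that fit in GPU memory
        self.cached = set()                         # expert ids resident on the GPU
        self.activation_counts = defaultdict(int)   # reuse statistics from traces

    def record_trace(self, activated_experts):
        """Trace the sparse activation pattern of one decode step (batch size 1)."""
        for expert_id in activated_experts:
            self.activation_counts[expert_id] += 1

    def ensure_resident(self, expert_id):
        """Load an expert on demand, evicting the least-reused expert when full."""
        if expert_id in self.cached:
            return                                  # cache hit, no transfer needed
        if len(self.cached) >= self.capacity:
            victim = min(self.cached, key=lambda e: self.activation_counts[e])
            self.cached.discard(victim)             # evict the coldest expert
        self.cached.add(expert_id)                  # stand-in for a host-to-GPU copy

    def prefetch_candidates(self, k: int):
        """Experts most likely to be needed next, ranked by traced frequency."""
        ranked = sorted(self.activation_counts,
                        key=self.activation_counts.get, reverse=True)
        return [e for e in ranked[:k] if e not in self.cached]
```

In a real system the host-to-GPU copies behind `ensure_resident` would be overlapped with computation of earlier layers; this sketch only illustrates the replacement and prefetching policy that the traced sparsity statistics drive.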
Related papers
- Mixture of Lookup Experts [63.787712153454464]
Mixture-of-Experts (MoE) activates only a subset of experts during inference.
MoLE is a new MoE architecture that is efficient in both communication and VRAM usage.
arXiv Detail & Related papers (2025-03-20T02:31:57Z) - ResMoE: Space-efficient Compression of Mixture of Experts LLMs via Residual Restoration [61.579842548990754]
Mixture-of-Experts (MoE) Transformer, the backbone of multiple phenomenal language models, leverages sparsity by activating only a fraction of model parameters for each input token.
We introduce ResMoE, an innovative MoE approximation framework that utilizes the Wasserstein barycenter to extract a common expert (barycenter expert) and approximate the residuals between this barycenter expert and the original ones.
arXiv Detail & Related papers (2025-03-10T03:15:54Z) - fMoE: Fine-Grained Expert Offloading for Large Mixture-of-Experts Serving [9.956997242640728]
fMoE is a fine-grained expert offloading system for MoE serving.
We show that fMoE reduces inference latency by 47% and improves expert hit rate by 36% over state-of-the-art solutions.
arXiv Detail & Related papers (2025-02-07T22:51:17Z) - Mixture of Cache-Conditional Experts for Efficient Mobile Device Inference [14.57414071160821]
We introduce a novel cache-aware routing strategy that leverages expert reuse during token generation to improve cache locality.
We present on-device results demonstrating 2x speedups on mobile devices.
arXiv Detail & Related papers (2024-11-27T18:59:48Z) - Condense, Don't Just Prune: Enhancing Efficiency and Performance in MoE Layer Pruning [26.945473092961123]
We propose ConDense-MoE, which condenses the large, sparse MoE layer into a smaller, denser layer with only a few experts activated for all tokens.
Our approach is specifically designed for fine-grained MoE with shared experts, where Feed-Forward Networks are split into many small experts.
arXiv Detail & Related papers (2024-11-26T00:56:18Z) - MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs [55.95879347182669]
The MoE architecture is renowned for its ability to increase model capacity without a proportional increase in inference cost.
MoE-Lightning introduces a novel CPU-GPU-I/O pipelining schedule, CGOPipe, with paged weights to achieve high resource utilization.
MoE-Lightning can achieve up to 10.3x higher throughput than state-of-the-art offloading-enabled LLM inference systems for Mixtral 8x7B on a single T4 GPU (16GB).
arXiv Detail & Related papers (2024-11-18T01:06:12Z) - HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference [54.40808356999408]
We present HOBBIT, a mixed precision expert offloading system to enable flexible and efficient MoE inference.
Our key insight is that dynamically replacing less critical cache-miss experts with low precision versions can substantially reduce expert-loading latency.
HOBBIT achieves up to a 9.93x speedup in decoding compared to state-of-the-art MoE offloading systems.
arXiv Detail & Related papers (2024-11-03T04:25:46Z) - ProMoE: Fast MoE-based LLM Serving using Proactive Caching [2.041412657843408]
Mixture-of-Experts (MoE) models help mitigate the cost of serving large LLMs by activating only a subset of the model's parameters during computation.
We propose ProMoE, a novel proactive caching system that leverages intermediate model results to predict subsequent parameter usage.
Our evaluations demonstrate that ProMoE achieves an average speedup of 2.13x and 2.84x in the prefill and decode stages respectively.
arXiv Detail & Related papers (2024-10-29T15:31:27Z) - EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference [49.94169109038806]
This paper introduces EPS-MoE, a novel expert pipeline scheduler for MoE.
Our results demonstrate an average 21% improvement in prefill throughput over existing parallel inference methods.
arXiv Detail & Related papers (2024-10-16T05:17:49Z) - MoE++: Accelerating Mixture-of-Experts Methods with Zero-Computation Experts [63.67734699877724]
MoE++ is a general and heterogeneous MoE framework that integrates both Feed-Forward Network (FFN) and zero-computation experts.
MoE++ achieves better performance while delivering 1.1-2.1x expert forward throughput compared to a vanilla MoE model of the same size.
arXiv Detail & Related papers (2024-10-09T18:01:27Z) - A Closer Look into Mixture-of-Experts in Large Language Models [26.503570706063634]
Mixture-of-experts (MoE) is gaining increasing attention due to its unique properties and remarkable performance.
The MoE architecture can increase model size without sacrificing computational efficiency.
We make an initial attempt to understand the inner workings of MoE-based large language models.
arXiv Detail & Related papers (2024-06-26T10:07:57Z) - MoE Jetpack: From Dense Checkpoints to Adaptive Mixture of Experts for Vision Tasks [58.075367597860044]
Training MoE models from scratch requires extensive data and computational resources.
We introduce MoE Jetpack, an effective method for fine-tuning dense checkpoints into MoE models.
Our experiments on vision tasks demonstrate that MoE Jetpack significantly improves convergence speed and accuracy.
arXiv Detail & Related papers (2024-06-07T10:05:42Z) - Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models [62.4691912312317]
Mixture-of-Experts (MoE) language models can reduce computational costs by 2-4x compared to dense models without sacrificing performance.
We propose a hybrid dense training and sparse inference framework for MoE models (DS-MoE) which achieves strong computation and parameter efficiency.
arXiv Detail & Related papers (2024-04-08T14:39:49Z) - Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models [90.14693869269519]
MoE LLMs can achieve higher performance with fewer active parameters, but they are still hard to deploy due to their immense total parameter sizes.
This paper mainly aims to enhance the deployment efficiency of MoE LLMs by introducing plug-and-play expert-level sparsification techniques.
arXiv Detail & Related papers (2024-02-22T18:56:07Z) - SiDA-MoE: Sparsity-Inspired Data-Aware Serving for Efficient and Scalable Large Mixture-of-Experts Models [20.16600129902895]
Mixture-of-Experts (MoE) has emerged as a favorable architecture in the era of large models.
Yet, realizing these benefits often results in ineffective GPU memory utilization.
We introduce SiDA-MoE, an efficient inference approach tailored for large MoE models.
arXiv Detail & Related papers (2023-10-29T01:08:55Z) - EdgeMoE: Empowering Sparse Large Language Models on Mobile Devices [3.3947808667959536]
EdgeMoE is an on-device inference engine for mixture-of-expert (MoE) LLMs.
Non-expert weights are held in device memory, while expert weights are held on external storage and fetched into memory only when activated (a generic sketch of this on-demand loading pattern appears after this list).
arXiv Detail & Related papers (2023-08-28T06:56:08Z) - Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference [23.207326766883405]
Mixture-of-Experts (MoE) is able to scale its model size without proportionally scaling up its computational requirements.
Pre-gated MoE employs our novel pre-gating function which alleviates the dynamic nature of sparse expert activation.
We demonstrate that Pre-gated MoE improves performance and reduces GPU memory consumption while maintaining the same level of model quality.
arXiv Detail & Related papers (2023-08-23T11:25:37Z) - Task-Specific Expert Pruning for Sparse Mixture-of-Experts [105.20605021416276]
The Mixture-of-Experts (MoE) model is powerful for large-scale pre-training, but MoE models are hard to deploy in cloud or mobile environments.
We propose a general method to progressively drop the non-professional experts for the target downstream task.
arXiv Detail & Related papers (2022-06-01T07:09:01Z)
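As a companion to the offloading-based designs listed above (e.g. EdgeMoE, fMoE, HOBBIT), the following is a generic, hypothetical sketch of on-demand expert loading: non-expert weights stay resident on the device while expert weights live on slower storage and are fetched only when the router activates them. The per-expert file layout, class, and method names are assumptions for illustration, not any listed system's real API.

```python
# Generic sketch of on-demand expert offloading, in the spirit of the
# offloading systems above (e.g. EdgeMoE): non-expert weights stay resident,
# expert weights live on slower storage and are fetched only when activated.
# The per-expert file layout and all names below are assumptions.
import os

import torch


class OffloadedExpertStore:
    def __init__(self, expert_dir: str, device: str = "cuda"):
        self.expert_dir = expert_dir  # assumed layout: one state_dict file per expert
        self.device = device
        self.resident = {}            # expert_id -> weights currently on the device

    def fetch(self, expert_id: int):
        """Return an expert's weights, loading them from storage on first use."""
        if expert_id not in self.resident:
            path = os.path.join(self.expert_dir, f"expert_{expert_id}.pt")
            state_dict = torch.load(path, map_location="cpu")  # read from storage
            self.resident[expert_id] = {name: tensor.to(self.device)
                                        for name, tensor in state_dict.items()}
        return self.resident[expert_id]

    def release(self, expert_id: int):
        """Drop an expert from device memory once it is unlikely to be reused soon."""
        self.resident.pop(expert_id, None)
```

A production system would additionally bound the number of resident experts and prefetch likely experts asynchronously, as the sparsity-aware cache sketch earlier illustrates.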