Related papers: BuddyMoE: Exploiting Expert Redundancy to Accelerate Memory-Constrained Mixture-of-Experts Inference

URL: http://arxiv.org/abs/2511.10054v1
Date: Fri, 14 Nov 2025 01:29:18 GMT
Title: BuddyMoE: Exploiting Expert Redundancy to Accelerate Memory-Constrained Mixture-of-Experts Inference
Authors: Yun Wang, Lingyun Yang, Senhao Yu, Yixiao Wang, Ruixing Li, Zhixiang Wei, James Yen, Zhengwei Qi
Abstract summary: The growing size of modern MoE models causes their full parameter sets to exceed GPU memory capacity. Prefetching heuristics aim to hide expert-transfer latency by predicting which experts are needed, but prefetch failures introduce significant stalls and amplify inference latency. The critical challenge is to maintain both high inference speed and model accuracy when prefetching fails.
Score: 11.5035097836611
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Mixture-of-Experts (MoE) architectures scale language models by activating only a subset of specialized expert networks for each input token, thereby reducing the number of floating-point operations. However, the growing size of modern MoE models causes their full parameter sets to exceed GPU memory capacity; for example, Mixtral-8x7B has 45 billion parameters and requires 87 GB of memory even though only 14 billion parameters are used per token. Existing systems alleviate this limitation by offloading inactive experts to CPU memory, but transferring experts across the PCIe interconnect incurs significant latency (about 10 ms). Prefetching heuristics aim to hide this latency by predicting which experts are needed, but prefetch failures introduce significant stalls and amplify inference latency. In the event of a prefetch failure, prior work offers two primary solutions: either fetch the expert on demand, which incurs a long stall due to the PCIe bottleneck, or drop the expert from the computation, which significantly degrades model accuracy. The critical challenge, therefore, is to maintain both high inference speed and model accuracy when prefetching fails.
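The two fallback options that the abstract attributes to prior work can be illustrated with a small sketch. This is not the BuddyMoE method or any system's actual code: the function `handle_step`, the `StepResult` type, the policy names, and the renormalization of gate weights after a drop are all assumptions made for illustration. The 10 ms figure is the approximate PCIe transfer latency quoted in the abstract, and cache eviction is ignored.

```python
from __future__ import annotations

from dataclasses import dataclass

# Approximate per-expert PCIe transfer latency quoted in the abstract.
PCIE_FETCH_MS = 10.0


@dataclass
class StepResult:
    experts_used: list[int]   # experts that actually contribute to the output
    weights: list[float]      # gate weights, renormalized to sum to 1
    stall_ms: float           # time spent waiting on PCIe transfers


def handle_step(routed, gpu_cache, policy):
    """Resolve one token's routing decision against the GPU expert cache.

    routed:    list of (expert_id, gate_weight) pairs chosen by the router
    gpu_cache: set of expert ids currently resident in GPU memory
    policy:    "fetch" (load missing experts on demand) or
               "drop"  (skip missing experts); both names are hypothetical
    """
    used, weights, stall = [], [], 0.0
    for expert_id, weight in routed:
        if expert_id in gpu_cache:
            used.append(expert_id)
            weights.append(weight)
        elif policy == "fetch":
            # Prefetch miss: stall until the expert arrives over PCIe.
            stall += PCIE_FETCH_MS
            gpu_cache.add(expert_id)  # eviction is not modeled here
            used.append(expert_id)
            weights.append(weight)
        elif policy == "drop":
            # Prefetch miss: omit the expert, trading accuracy for speed.
            continue
        else:
            raise ValueError(f"unknown policy: {policy}")

    # Assumption: surviving gate weights are rescaled to sum to 1.
    total = sum(weights)
    if total > 0:
        weights = [w / total for w in weights]
    return StepResult(used, weights, stall)


if __name__ == "__main__":
    routed = [(3, 0.6), (5, 0.4)]  # expert 5 is not cached
    for policy in ("fetch", "drop"):
        print(policy, handle_step(routed, {3}, policy))
```

In this toy model the "fetch" policy keeps both routed experts but pays the transfer stall for each miss, while the "drop" policy incurs no stall and computes the output from a reduced expert set. This is the speed-versus-accuracy tension the abstract describes.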

Related papers

ECO: Quantized Training without Full-Precision Master Weights [58.97082407934466]
Error-Compensating (ECO) eliminates master weights by applying updates directly to quantized parameters. We show that ECO converges to a constant-radius neighborhood of the optimum, while naive master-weight removal can incur an error that is inversely proportional to the learning rate.
arXiv Detail & Related papers (2026-01-29T18:35:01Z)
MoE-SpeQ: Speculative Quantized Decoding with Proactive Expert Prefetching and Offloading for Mixture-of-Experts [29.437264687850874]
We present MoE-SpeQ, a new inference system built on a novel co-design of speculative execution and expert offloading. MoE-SpeQ employs a small, on-device draft model to predict the sequence of required experts for future tokens. Our evaluation on memory-constrained devices shows that for the Phi-MoE model, MoE-SpeQ achieves a speedup of up to 2.34x over the state-of-the-art offloading framework.
arXiv Detail & Related papers (2025-11-18T03:40:19Z)
ExpertFlow: Adaptive Expert Scheduling and Memory Coordination for Efficient MoE Inference [8.296993547783808]
ExpertFlow is a runtime system for MoE inference that combines adaptive expert prefetching and cache-aware routing. Our evaluation demonstrates that ExpertFlow reduces model stall time to less than 0.1% of the baseline.
arXiv Detail & Related papers (2025-10-30T17:29:27Z)
MC#: Mixture Compressor for Mixture-of-Experts Large Models [86.64315380917827]
Mixture-of-Experts (MoE) effectively scales large language models (LLMs) and vision-language models (VLMs) by increasing capacity through sparse activation. We propose MC# (Mixture-Compressor-sharp), a framework that combines static quantization and dynamic expert pruning.
arXiv Detail & Related papers (2025-10-13T03:12:46Z)
Enabling MoE on the Edge via Importance-Driven Expert Scheduling [21.860330824352527]
MoE is a key technique for scaling Large Language Models by activating only a subset of experts per query. We leverage expert importance to guide decisions, substituting low-cached activated experts with functionally similar ones already cached in GPU memory. This design reduces memory usage and data transfer, while largely eliminating PCIe overhead.
arXiv Detail & Related papers (2025-08-26T12:32:09Z)
FloE: On-the-Fly MoE Inference on Memory-constrained GPU [22.2581000412208]
FloE is built on the insight that there exists substantial untapped redundancy within sparsely activated experts. FloE achieves a 9.3x compression of parameters per expert in Mixtral-8x7B. It enables deployment on a GPU with only 11GB VRAM, reducing the memory footprint by up to 8.5x.
arXiv Detail & Related papers (2025-05-09T10:53:47Z)
Mixture of Lookup Experts [63.787712153454464]
Mixture-of-Experts (MoE) activates only a subset of experts during inference. MoLE is a new MoE architecture that is efficient in both communication and VRAM usage.
arXiv Detail & Related papers (2025-03-20T02:31:57Z)
eMoE: Task-aware Memory Efficient Mixture-of-Experts-Based (MoE) Model Inference [6.642099288463585]
We propose eMoE, a memory efficient inference system for large language models (LLMs). eMoE reduces memory usage by predicting and loading only the required experts based on recurrent patterns in expert routing. It also enables processing prompts 40x longer, batches 4.5x larger, and achieves 1.5x higher throughput.
arXiv Detail & Related papers (2025-03-10T01:11:52Z)
HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference [54.40808356999408]
We present HOBBIT, a mixed precision expert offloading system to enable flexible and efficient MoE inference. Our key insight is that dynamically replacing less critical cache-miss experts with low precision versions can substantially reduce expert-loading latency. HOBBIT achieves up to a 9.93x speedup in decoding compared to state-of-the-art MoE offloading systems.
arXiv Detail & Related papers (2024-11-03T04:25:46Z)
Mixture Compressor for Mixture-of-Experts LLMs Gains More [71.0473038084673]
We propose a training-free Mixture-Compressor for Mixture-of-Experts large language models (MoE-LLMs). Our MC integrates static quantization and dynamic pruning to collaboratively achieve extreme compression for MoE-LLMs with less accuracy loss. For instance, at 2.54 bits, MC compresses 76.6% of the model, with only a 3.8% average accuracy loss.
arXiv Detail & Related papers (2024-10-08T18:09:38Z)
DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures using Lookup Tables [49.965024476651706]
DeepGEMM is a lookup table based approach for the execution of ultra low-precision convolutional neural networks on SIMD hardware. Our implementation outperforms corresponding 8-bit integer kernels by up to 1.74x on x86 platforms.
arXiv Detail & Related papers (2023-04-18T15:13:10Z)

This list is automatically generated from the titles and abstracts of the papers in this site.