MoE-Beyond: Learning-Based Expert Activation Prediction on Edge Devices
- URL: http://arxiv.org/abs/2508.17137v1
- Date: Sat, 23 Aug 2025 20:28:32 GMT
- Title: MoE-Beyond: Learning-Based Expert Activation Prediction on Edge Devices
- Authors: Nishant Gavhane, Arush Mehrotra, Rohit Chawla, Peter Proenca,
- Abstract summary: We introduce MoE-Beyond, a learning-based expert activation predictor trained to predict expert activations during autoregressive decoding. Our predictor generalizes effectively across unseen prompts from the WebGLM-QA dataset, achieving 97.5% accuracy and an 86.6% F1-score.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The deployment of large-scale Mixture-of-Experts (MoE) models on edge devices presents significant challenges due to memory constraints. While MoE architectures enable efficient utilization of computational resources by activating only a subset of experts per inference, they require careful memory management to operate efficiently in resource-constrained environments. Traditional heuristic-based expert caching strategies such as MoE-Infinity struggle to maintain high cache hit rates as model parameters scale. In this work, we introduce MoE-Beyond, a learning-based expert activation predictor trained to predict expert activations during autoregressive decoding. By framing the task as a multi-label sequence prediction problem, we train a lightweight transformer model on 66 million expert activation traces extracted from the LDJnr-Puffin dataset [5] using DeepSeek-V2-Chat-Lite MoE. Our predictor generalizes effectively across unseen prompts from the WebGLM-QA dataset [6], achieving 97.5% accuracy and an 86.6% F1-score. Simulation results show that MoE-Beyond improves the GPU cache hit rate from 17% to 72% when only 10% of experts fit in the GPU cache, outperforming heuristic baselines.
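As a toy illustration of why learned activation prediction lifts cache hit rate, the sketch below simulates an LRU expert cache with and without prefetching. The expert count, trace length, and cache size are arbitrary assumptions, and the "predictor" is an oracle that always guesses the next activation; this is not the paper's transformer model or its simulator, only a minimal sketch of the mechanism.

```python
import random
from collections import OrderedDict

def simulate_cache(trace, cache_size, predict=None):
    """Return the hit rate of an LRU expert cache over an activation trace.

    trace      : sequence of expert ids requested per decoding step.
    cache_size : number of experts that fit in GPU memory.
    predict    : optional callable(step) -> iterable of expert ids to
                 prefetch before that step's access.
    """
    cache = OrderedDict()  # recency order, least recently used first
    hits = 0
    for step, expert in enumerate(trace):
        if predict is not None:
            for e in predict(step):
                cache[e] = True
                cache.move_to_end(e)  # prefetched experts become most recent
        if expert in cache:
            hits += 1
            cache.move_to_end(expert)
        else:
            cache[expert] = True  # load on miss
        while len(cache) > cache_size:
            cache.popitem(last=False)  # evict least recently used
    return hits / len(trace)

random.seed(0)
trace = [random.randrange(64) for _ in range(5000)]  # 64 experts, uniform routing
lru = simulate_cache(trace, cache_size=6)  # ~10% of experts fit
oracle = simulate_cache(trace, cache_size=6, predict=lambda s: [trace[s]])
```

With 64 experts and room for only 6, plain LRU on a uniform trace hits on the order of one access in ten, while the oracle prefetcher hits every access; a learned predictor lands between those two extremes, which is the gap the paper's 17%-to-72% result targets.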
Related papers
- Expertise need not monopolize: Action-Specialized Mixture of Experts for Vision-Language-Action Learning [56.129822832095726]
AdaMoE is a Mixture-of-Experts (MoE) architecture that inherits pretrained weights from dense VLA models. A substantial 21.5% improvement in real-world experiments validates its practical effectiveness for robotic manipulation tasks.
arXiv Detail & Related papers (2025-10-16T04:52:57Z) - MoBiLE: Efficient Mixture-of-Experts Inference on Consumer GPU with Mixture of Big Little Experts [17.518573710849513]
MoBiLE is a plug-and-play offloading-based MoE inference framework with a mixture of big-little experts. MoBiLE achieves a speedup of 1.60x to 1.72x compared to the baseline on a consumer GPU system, with negligible degradation in accuracy.
arXiv Detail & Related papers (2025-10-14T10:22:44Z) - MC#: Mixture Compressor for Mixture-of-Experts Large Models [86.64315380917827]
Mixture-of-Experts (MoE) effectively scales large language models (LLMs) and vision-language models (VLMs) by increasing capacity through sparse activation. We propose MC# (Mixture-Compressor-sharp), a framework that combines static quantization and dynamic expert pruning.
arXiv Detail & Related papers (2025-10-13T03:12:46Z) - LExI: Layer-Adaptive Active Experts for Efficient MoE Model Inference [2.8653469160349077]
We introduce LExI, a data-free optimization technique that determines the optimal number of active experts per layer in a pretrained MoE model. Experiments on state-of-the-art language and vision MoE benchmarks demonstrate that LExI significantly outperforms traditional MoE pruning approaches in terms of inference efficiency.
arXiv Detail & Related papers (2025-09-02T19:01:46Z) - Enabling MoE on the Edge via Importance-Driven Expert Scheduling [21.860330824352527]
MoE is a key technique for scaling Large Language Models by activating only a subset of experts per query. We leverage expert importance to guide decisions, substituting activated experts that miss the cache with functionally similar ones already cached in GPU memory. This design reduces memory usage and data transfer, while largely eliminating PCIe overhead.
arXiv Detail & Related papers (2025-08-26T12:32:09Z) - MoTE: Mixture of Ternary Experts for Memory-efficient Large Multimodal Models [36.730689832979365]
MoTE is a scalable and memory-efficient approach to train Mixture-of-Ternary-Experts models from a dense checkpoint. MoTE achieves performance comparable to the full-precision baseline MoE-LLaVA while offering a lower memory footprint.
arXiv Detail & Related papers (2025-06-17T11:53:49Z) - PreMoe: Lightening MoEs on Constrained Memory by Expert Pruning and Retrieval [36.9586523272496]
Mixture-of-experts (MoE) architectures enable scaling large language models (LLMs) to vast parameter counts without a proportional rise in computational costs. However, the significant memory demands of large MoE models hinder their deployment across various computational environments. We introduce PreMoe, a novel framework that enables efficient deployment of massive MoE models in memory-constrained environments.
arXiv Detail & Related papers (2025-05-23T08:59:16Z) - eMoE: Task-aware Memory Efficient Mixture-of-Experts-Based (MoE) Model Inference [6.642099288463585]
We propose eMoE, a memory-efficient inference system for large language models (LLMs). eMoE reduces memory usage by predicting and loading only the required experts based on recurrent patterns in expert routing. It also enables processing prompts 40x longer, batches 4.5x larger, and achieves 1.5x higher throughput.
arXiv Detail & Related papers (2025-03-10T01:11:52Z) - fMoE: Fine-Grained Expert Offloading for Large Mixture-of-Experts Serving [9.956997242640728]
fMoE is a fine-grained expert offloading system for MoE serving. We show that fMoE reduces inference latency by 47% and improves expert hit rate by 36% over state-of-the-art solutions.
arXiv Detail & Related papers (2025-02-07T22:51:17Z) - CMoE: Converting Mixture-of-Experts from Dense to Accelerate LLM Inference [33.871080938643566]
We present CMoE, a framework that rapidly transforms dense language models into mixture-of-experts (MoEs) without training. Experiments demonstrate that, at an activation ratio of 75%, it achieves remarkable perplexity results. A CMoE configuration activating just 25% of parameters reduces end-to-end latency by 1.5x while preserving usable perplexity without additional training.
arXiv Detail & Related papers (2025-02-06T14:05:30Z) - HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference [54.40808356999408]
We present HOBBIT, a mixed precision expert offloading system to enable flexible and efficient MoE inference.
Our key insight is that dynamically replacing less critical cache-miss experts with low precision versions can substantially reduce expert-loading latency.
HOBBIT achieves up to a 9.93x speedup in decoding compared to state-of-the-art MoE offloading systems.
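HOBBIT's key insight can be illustrated with a back-of-the-envelope transfer-cost model: serving a cache miss from a low-precision copy moves proportionally fewer bytes. The bit-widths, PCIe bandwidth, and criticality threshold below are illustrative assumptions, not HOBBIT's actual parameters.

```python
# Bytes per parameter at each precision (illustrative choices).
BYTES = {"fp16": 2.0, "int4": 0.5}

def miss_latency(n_params, precision, bandwidth_gbs=16.0):
    """Seconds to fetch one expert's weights to the GPU at `bandwidth_gbs` GB/s."""
    return n_params * BYTES[precision] / (bandwidth_gbs * 1e9)

def fetch_precision(criticality, threshold=0.5):
    """Keep critical experts at full precision; demote the rest on a cache miss."""
    return "fp16" if criticality >= threshold else "int4"

fp16_ms = miss_latency(100e6, "fp16") * 1e3  # 100M-parameter expert
int4_ms = miss_latency(100e6, "int4") * 1e3
```

Under these assumed numbers, a miss on a 100M-parameter expert drops from 12.5 ms at fp16 to about 3.1 ms at 4-bit, a 4x cut in expert-loading latency for each expert the system judges less critical.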
arXiv Detail & Related papers (2024-11-03T04:25:46Z) - Mixture Compressor for Mixture-of-Experts LLMs Gains More [71.0473038084673]
We propose a training-free Mixture-Compressor for Mixture-of-Experts large language models (MoE-LLMs). Our MC integrates static quantization and dynamic pruning to collaboratively achieve extreme compression for MoE-LLMs with less accuracy loss. For instance, at 2.54 bits, MC compresses 76.6% of the model, with only a 3.8% average accuracy loss.
arXiv Detail & Related papers (2024-10-08T18:09:38Z) - GRIN: GRadient-INformed MoE [132.87651078514122]
Mixture-of-Experts (MoE) models scale more effectively than dense models due to sparse computation through expert routing.
We introduce GRIN (GRadient-INformed MoE training), which incorporates sparse gradient estimation for expert routing.
Our model, with only 6.6B activated parameters, outperforms a 7B dense model and matches the performance of a 14B dense model trained on the same data.
arXiv Detail & Related papers (2024-09-18T17:00:20Z) - SEER-MoE: Sparse Expert Efficiency through Regularization for Mixture-of-Experts [49.01990048827639]
We introduce SEER-MoE, a framework for reducing both the memory footprint and compute requirements of pre-trained MoE models.
The first stage involves pruning the total number of experts using a heavy-hitters counting guidance, while the second stage employs a regularization-based fine-tuning strategy to recover accuracy loss.
Our empirical studies demonstrate the effectiveness of our method, resulting in a sparse MoEs model optimized for inference efficiency with minimal accuracy trade-offs.
arXiv Detail & Related papers (2024-04-07T22:13:43Z)
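The "heavy-hitters counting" guidance in SEER-MoE's first stage can be sketched as a frequency count over routing traces: experts the router rarely selects become pruning candidates. The trace format and keep ratio below are illustrative assumptions, not the paper's procedure.

```python
from collections import Counter

def heavy_hitter_experts(routing_trace, keep_ratio=0.5):
    """Return the most frequently activated experts from a routing trace.

    routing_trace : iterable of expert ids chosen by the router.
    keep_ratio    : fraction of distinct experts to keep after pruning.
    """
    counts = Counter(routing_trace)
    keep_k = max(1, int(len(counts) * keep_ratio))
    return {e for e, _ in counts.most_common(keep_k)}

trace = [0, 0, 0, 1, 1, 2, 3, 0, 1, 2]  # toy trace over 4 experts
kept = heavy_hitter_experts(trace, keep_ratio=0.5)
```

Here experts 0 and 1 carry most of the activations, so halving the expert count keeps exactly those two; the second stage's regularization-based fine-tuning would then recover the accuracy lost by dropping the cold experts.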
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.