MoE-Beyond: Learning-Based Expert Activation Prediction on Edge Devices
- URL: http://arxiv.org/abs/2508.17137v1
- Date: Sat, 23 Aug 2025 20:28:32 GMT
- Title: MoE-Beyond: Learning-Based Expert Activation Prediction on Edge Devices
- Authors: Nishant Gavhane, Arush Mehrotra, Rohit Chawla, Peter Proenca,
- Abstract summary: We introduce MoE-Beyond, a learning-based expert activation predictor trained to predict expert activations during autoregressive decoding. Our predictor generalizes effectively across unseen prompts from the WebGLM-QA dataset, achieving 97.5% accuracy and an 86.6% F1-score.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The deployment of large-scale Mixture-of-Experts (MoE) models on edge devices presents significant challenges due to memory constraints. While MoE architectures enable efficient utilization of computational resources by activating only a subset of experts per inference, they require careful memory management to operate efficiently in resource-constrained environments. Traditional heuristic-based expert caching strategies such as MoE-Infinity struggle to maintain high cache hit rates as model parameters scale. In this work, we introduce MoE-Beyond, a learning-based expert activation predictor trained to predict expert activations during autoregressive decoding. By framing the task as a multi-label sequence prediction problem, we train a lightweight transformer model on 66 million expert activation traces extracted from the LDJnr-Puffin dataset [5] using DeepSeek-V2-Chat-Lite MoE. Our predictor generalizes effectively across unseen prompts from the WebGLM-QA dataset [6], achieving 97.5% accuracy and an 86.6% F1-score. Simulation results show that MoE-Beyond improves the GPU cache hit rate from 17% to 72% when only 10% of experts fit in the GPU cache, outperforming heuristic baselines.
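As a toy illustration of why learned activation prediction lifts cache hit rate, the sketch below simulates an LRU expert cache with and without prefetching. The expert count, trace length, and cache size are arbitrary assumptions, and the "predictor" is an oracle that always guesses the next activation; this is not the paper's transformer model or its simulator, only a minimal sketch of the mechanism.

```python
import random
from collections import OrderedDict

def simulate_cache(trace, cache_size, predict=None):
    """Return the hit rate of an LRU expert cache over an activation trace.

    trace      : sequence of expert ids requested per decoding step.
    cache_size : number of experts that fit in GPU memory.
    predict    : optional callable(step) -> iterable of expert ids to
                 prefetch before that step's access.
    """
    cache = OrderedDict()  # recency order, least recently used first
    hits = 0
    for step, expert in enumerate(trace):
        if predict is not None:
            for e in predict(step):
                cache[e] = True
                cache.move_to_end(e)  # prefetched experts become most recent
        if expert in cache:
            hits += 1
            cache.move_to_end(expert)
        else:
            cache[expert] = True  # load on miss
        while len(cache) > cache_size:
            cache.popitem(last=False)  # evict least recently used
    return hits / len(trace)

random.seed(0)
trace = [random.randrange(64) for _ in range(5000)]  # 64 experts, uniform routing
lru = simulate_cache(trace, cache_size=6)  # ~10% of experts fit
oracle = simulate_cache(trace, cache_size=6, predict=lambda s: [trace[s]])
```

With 64 experts and room for only 6, plain LRU on a uniform trace hits on the order of one access in ten, while the oracle prefetcher hits every access; a learned predictor lands between those two extremes, which is the gap the paper's 17%-to-72% result targets.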
Related papers
- Expertise need not monopolize: Action-Specialized Mixture of Experts for Vision-Language-Action Learning [56.129822832095726]
AdaMoE is a Mixture-of-Experts (MoE) architecture that inherits pretrained weights from dense VLA models. A substantial 21.5% improvement in real-world experiments validates its practical effectiveness for robotic manipulation tasks.
arXiv Detail & Related papers (2025-10-16T04:52:57Z) - MoBiLE: Efficient Mixture-of-Experts Inference on Consumer GPU with Mixture of Big Little Experts [17.518573710849513]
MoBiLE is a plug-and-play offloading-based MoE inference framework with a mixture of big-little experts. MoBiLE achieves a speedup of 1.60x to 1.72x compared to the baseline on a consumer GPU system, with negligible degradation in accuracy.
arXiv Detail & Related papers (2025-10-14T10:22:44Z) - MC#: Mixture Compressor for Mixture-of-Experts Large Models [86.64315380917827]
Mixture-of-Experts (MoE) effectively scales large language models (LLMs) and vision-language models (VLMs) by increasing capacity through sparse activation. We propose MC# (Mixture-Compressor-sharp), a framework that combines static quantization and dynamic expert pruning.
arXiv Detail & Related papers (2025-10-13T03:12:46Z) - LExI: Layer-Adaptive Active Experts for Efficient MoE Model Inference [2.8653469160349077]
We introduce LExI, a data-free optimization technique that determines the optimal number of active experts per layer in a pretrained MoE model. Experiments on state-of-the-art language and vision MoE benchmarks demonstrate that LExI significantly outperforms traditional MoE pruning approaches in terms of inference efficiency.
arXiv Detail & Related papers (2025-09-02T19:01:46Z) - Enabling MoE on the Edge via Importance-Driven Expert Scheduling [21.860330824352527]
MoE is a key technique for scaling Large Language Models by activating only a subset of experts per query. We leverage expert importance to guide decisions, substituting activated experts that miss the cache with functionally similar ones already cached in GPU memory. This design reduces memory usage and data transfer, while largely eliminating PCIe overhead.
arXiv Detail & Related papers (2025-08-26T12:32:09Z) - MoTE: Mixture of Ternary Experts for Memory-efficient Large Multimodal Models [36.730689832979365]
MoTE is a scalable and memory-efficient approach to train Mixture-of-Ternary-Experts models from a dense checkpoint. MoTE achieves performance comparable to the full-precision baseline MoE-LLaVA while offering a lower memory footprint.
arXiv Detail & Related papers (2025-06-17T11:53:49Z) - PreMoe: Lightening MoEs on Constrained Memory by Expert Pruning and Retrieval [36.9586523272496]
Mixture-of-experts (MoE) architectures enable scaling large language models (LLMs) to vast parameter counts without a proportional rise in computational costs. However, the significant memory demands of large MoE models hinder their deployment across various computational environments. We introduce PreMoe, a novel framework that enables efficient deployment of massive MoE models in memory-constrained environments.
arXiv Detail & Related papers (2025-05-23T08:59:16Z) - eMoE: Task-aware Memory Efficient Mixture-of-Experts-Based (MoE) Model Inference [6.642099288463585]
We propose eMoE, a memory-efficient inference system for large language models (LLMs). eMoE reduces memory usage by predicting and loading only the required experts based on recurrent patterns in expert routing. It also enables processing prompts 40x longer, batches 4.5x larger, and achieves 1.5x higher throughput.
arXiv Detail & Related papers (2025-03-10T01:11:52Z) - fMoE: Fine-Grained Expert Offloading for Large Mixture-of-Experts Serving [9.956997242640728]
fMoE is a fine-grained expert offloading system for MoE serving. We show that fMoE reduces inference latency by 47% and improves expert hit rate by 36% over state-of-the-art solutions.
arXiv Detail & Related papers (2025-02-07T22:51:17Z) - CMoE: Converting Mixture-of-Experts from Dense to Accelerate LLM Inference [33.871080938643566]
We present CMoE, a framework that rapidly transforms dense language models into mixture-of-experts (MoEs) without training. Experiments demonstrate that, at an activation ratio of 75%, it achieves remarkable perplexity results. A CMoE configuration activating just 25% of parameters reduces end-to-end latency by 1.5x while preserving usable perplexity without additional training.
arXiv Detail & Related papers (2025-02-06T14:05:30Z) - HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference [54.40808356999408]
We present HOBBIT, a mixed precision expert offloading system to enable flexible and efficient MoE inference.
Our key insight is that dynamically replacing less critical cache-miss experts with low precision versions can substantially reduce expert-loading latency.
HOBBIT achieves up to a 9.93x speedup in decoding compared to state-of-the-art MoE offloading systems.
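HOBBIT's key insight can be illustrated with a back-of-the-envelope transfer-cost model: serving a cache miss from a low-precision copy moves proportionally fewer bytes. The bit-widths, PCIe bandwidth, and criticality threshold below are illustrative assumptions, not HOBBIT's actual parameters.

```python
# Bytes per parameter at each precision (illustrative choices).
BYTES = {"fp16": 2.0, "int4": 0.5}

def miss_latency(n_params, precision, bandwidth_gbs=16.0):
    """Seconds to fetch one expert's weights to the GPU at `bandwidth_gbs` GB/s."""
    return n_params * BYTES[precision] / (bandwidth_gbs * 1e9)

def fetch_precision(criticality, threshold=0.5):
    """Keep critical experts at full precision; demote the rest on a cache miss."""
    return "fp16" if criticality >= threshold else "int4"

fp16_ms = miss_latency(100e6, "fp16") * 1e3  # 100M-parameter expert
int4_ms = miss_latency(100e6, "int4") * 1e3
```

Under these assumed numbers, a miss on a 100M-parameter expert drops from 12.5 ms at fp16 to about 3.1 ms at 4-bit, a 4x cut in expert-loading latency for each expert the system judges less critical.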
arXiv Detail & Related papers (2024-11-03T04:25:46Z) - Mixture Compressor for Mixture-of-Experts LLMs Gains More [71.0473038084673]
We propose a training-free Mixture-Compressor for Mixture-of-Experts large language models (MoE-LLMs). Our MC integrates static quantization and dynamic pruning to collaboratively achieve extreme compression for MoE-LLMs with less accuracy loss. For instance, at 2.54 bits, MC compresses 76.6% of the model, with only a 3.8% average accuracy loss.
arXiv Detail & Related papers (2024-10-08T18:09:38Z) - GRIN: GRadient-INformed MoE [132.87651078514122]
Mixture-of-Experts (MoE) models scale more effectively than dense models due to sparse computation through expert routing.
We introduce GRIN (GRadient-INformed MoE training), which incorporates sparse gradient estimation for expert routing.
Our model, with only 6.6B activated parameters, outperforms a 7B dense model and matches the performance of a 14B dense model trained on the same data.
arXiv Detail & Related papers (2024-09-18T17:00:20Z) - SEER-MoE: Sparse Expert Efficiency through Regularization for Mixture-of-Experts [49.01990048827639]
We introduce SEER-MoE, a framework for reducing both the memory footprint and compute requirements of pre-trained MoE models.
The first stage involves pruning the total number of experts using a heavy-hitters counting guidance, while the second stage employs a regularization-based fine-tuning strategy to recover accuracy loss.
Our empirical studies demonstrate the effectiveness of our method, resulting in a sparse MoEs model optimized for inference efficiency with minimal accuracy trade-offs.
arXiv Detail & Related papers (2024-04-07T22:13:43Z)
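The "heavy-hitters counting" guidance in SEER-MoE's first stage can be sketched as a frequency count over routing traces: experts the router rarely selects become pruning candidates. The trace format and keep ratio below are illustrative assumptions, not the paper's procedure.

```python
from collections import Counter

def heavy_hitter_experts(routing_trace, keep_ratio=0.5):
    """Return the most frequently activated experts from a routing trace.

    routing_trace : iterable of expert ids chosen by the router.
    keep_ratio    : fraction of distinct experts to keep after pruning.
    """
    counts = Counter(routing_trace)
    keep_k = max(1, int(len(counts) * keep_ratio))
    return {e for e, _ in counts.most_common(keep_k)}

trace = [0, 0, 0, 1, 1, 2, 3, 0, 1, 2]  # toy trace over 4 experts
kept = heavy_hitter_experts(trace, keep_ratio=0.5)
```

Here experts 0 and 1 carry most of the activations, so halving the expert count keeps exactly those two; the second stage's regularization-based fine-tuning would then recover the accuracy lost by dropping the cold experts.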
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.