SpecMD: A Comprehensive Study On Speculative Expert Prefetching
- URL: http://arxiv.org/abs/2602.03921v1
- Date: Tue, 03 Feb 2026 18:36:56 GMT
- Title: SpecMD: A Comprehensive Study On Speculative Expert Prefetching
- Authors: Duc Hoang, Ajay Jaiswal, Mohammad Samragh, Minsik Cho,
- Abstract summary: Mixture-of-Experts (MoE) models enable sparse expert activation, meaning that only a subset of the model's parameters is used during each inference. We propose \textbf{Least-Stale}, a novel eviction policy that exploits MoE's predictable expert access patterns to reduce collision misses by up to $85\times$ over LRU.
- Score: 15.35374861966937
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Mixture-of-Experts (MoE) models enable sparse expert activation, meaning that only a subset of the model's parameters is used during each inference. However, to translate this sparsity into practical performance, an expert caching mechanism is required. Previous works have proposed hardware-centric caching policies, but how these policies interact with each other and with different hardware specifications remains poorly understood. To address this gap, we develop \textbf{SpecMD}, a standardized framework for benchmarking ad-hoc cache policies on various hardware configurations. Using SpecMD, we perform an exhaustive benchmarking of several MoE caching strategies, reproducing and extending prior approaches in controlled settings with realistic constraints. Our experiments reveal that MoE expert access is not consistent with temporal-locality assumptions (e.g., LRU, LFU). Motivated by this observation, we propose \textbf{Least-Stale}, a novel eviction policy that exploits MoE's predictable expert access patterns to reduce collision misses by up to $85\times$ over LRU. With such gains, we achieve over $88\%$ hit rates with up to $34.7\%$ time-to-first-token (TTFT) reduction on OLMoE at only $5\%$, or $0.6$ GB, of VRAM cache capacity.
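As a rough illustration of how a staleness-driven policy could be wired into an expert cache, the sketch below ages cached experts between decode steps and evicts the most-stale expert on a miss, so the cache retains the least-stale entries. The `ExpertCache` class, the staleness definition (steps since an expert was last predicted to be needed), and the `load_fn` loader are assumptions for illustration only; the paper's actual Least-Stale implementation may differ.

```python
from typing import Callable, Dict, Set


class ExpertCache:
    """Staleness-driven expert cache (illustrative sketch, not the paper's code).

    Staleness is assumed to be the number of decode steps since an expert
    was last predicted (or observed) to be needed.
    """

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.weights: Dict[int, object] = {}   # expert_id -> weights resident in VRAM
        self.staleness: Dict[int, int] = {}    # expert_id -> steps since last predicted use

    def step(self, predicted: Set[int]) -> None:
        # Age every cached expert; reset those the router predicts it will
        # need soon, so they become the least-stale entries.
        for eid in self.weights:
            self.staleness[eid] = 0 if eid in predicted else self.staleness[eid] + 1

    def access(self, eid: int, load_fn: Callable[[int], object]) -> object:
        if eid not in self.weights:                        # cache miss
            if len(self.weights) >= self.capacity:
                victim = max(self.weights, key=self.staleness.__getitem__)
                del self.weights[victim], self.staleness[victim]
            self.weights[eid] = load_fn(eid)               # e.g., host-to-device copy
        self.staleness[eid] = 0                            # just used: freshest possible
        return self.weights[eid]
```

In use, one would call `cache.step(predicted_experts)` once per decode step with the router's lookahead predictions, then `cache.access(eid, load_expert)` for each activated expert; unlike LRU, an expert untouched for many steps survives as long as it keeps being predicted.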
Related papers
- $\nabla$-Reasoner: LLM Reasoning via Test-Time Gradient Descent in Latent Space [71.23672814629448]
$\nabla$-Reasoner is an iterative generation framework that integrates differentiable optimization over token logits into the decoding loop. $\nabla$-Reasoner achieves over 20% accuracy improvement on a challenging mathematical reasoning benchmark.
arXiv Detail & Related papers (2026-03-05T08:42:54Z) - MoE-SpeQ: Speculative Quantized Decoding with Proactive Expert Prefetching and Offloading for Mixture-of-Experts [29.437264687850874]
We present MoE-SpeQ, a new inference system built on a novel co-design of speculative execution and expert offloading. MoE-SpeQ employs a small, on-device draft model to predict the sequence of required experts for future tokens (see the illustrative sketch after this list). Our evaluation on memory-constrained devices shows that for the Phi-MoE model, MoE-SpeQ achieves up to a 2.34x speedup over the state-of-the-art offloading framework.
arXiv Detail & Related papers (2025-11-18T03:40:19Z) - Accelerating Mixture-of-Expert Inference with Adaptive Expert Split Mechanism [29.862588578556366]
MoEpic is an efficient MoE inference system with a novel expert split mechanism. Experiments on popular MoE LLMs demonstrate that MoEpic can save about half of the GPU cost.
arXiv Detail & Related papers (2025-09-10T07:28:24Z) - Cache Management for Mixture-of-Experts LLMs -- extended version [29.858964433575906]
Large language models (LLMs) have demonstrated remarkable capabilities across a variety of tasks. One of the main challenges towards the successful deployment of LLMs is memory management. We introduce and study a new paging problem that models expert management optimization.
arXiv Detail & Related papers (2025-09-02T15:19:06Z) - Cost-Aware Contrastive Routing for LLMs [57.30288453580456]
We introduce Cost-Spectrum Contrastive Routing (CSCR), a lightweight framework that maps both prompts and models into a shared embedding space. CSCR consistently outperforms baselines, improving the accuracy-cost tradeoff by up to 25%.
arXiv Detail & Related papers (2025-08-17T20:16:44Z) - $\texttt{SPECS}$: Faster Test-Time Scaling through Speculative Drafts [55.231201692232894]
$\texttt{SPECS}$ is a latency-aware test-time scaling method inspired by speculative decoding. Our results show that $\texttt{SPECS}$ matches or surpasses beam search accuracy while reducing latency by up to $\sim$19.1%.
arXiv Detail & Related papers (2025-06-15T05:50:05Z) - TrimLLM: Progressive Layer Dropping for Domain-Specific LLMs [11.615399679746675]
Specializing large language models (LLMs) for local deployment in domain-specific use cases is necessary for strong performance. We develop TrimLLM based on the layer-wise specialization phenomenon we empirically observed and verified on contemporary LLMs. We show it retains LLMs' capacity in specific domains and achieves inference speedup irrespective of hardware and deep learning frameworks.
arXiv Detail & Related papers (2024-12-15T16:47:16Z) - Mixture of Cache-Conditional Experts for Efficient Mobile Device Inference [14.57414071160821]
We introduce a novel cache-aware routing strategy that leverages expert reuse during token generation to improve cache locality. We present on-device results demonstrating 2$\times$ speedups on mobile devices.
arXiv Detail & Related papers (2024-11-27T18:59:48Z) - HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference [54.40808356999408]
We present HOBBIT, a mixed precision expert offloading system to enable flexible and efficient MoE inference.
Our key insight is that dynamically replacing less critical cache-miss experts with low precision versions can substantially reduce expert-loading latency.
HOBBIT achieves up to a 9.93x speedup in decoding compared to state-of-the-art MoE offloading systems.
arXiv Detail & Related papers (2024-11-03T04:25:46Z) - Multilinear Mixture of Experts: Scalable Expert Specialization through Factorization [51.98792406392873]
Mixture of Experts (MoE) provides a powerful way to decompose dense layers into smaller, modular computations.
A major challenge lies in the computational cost of scaling the number of experts high enough to achieve fine-grained specialization.
We propose the Multilinear Mixture of Experts ($\mu$MoE) layer to address this, focusing on vision models.
arXiv Detail & Related papers (2024-02-19T21:20:22Z) - Computationally Budgeted Continual Learning: What Does Matter? [128.0827987414154]
Continual Learning (CL) aims to sequentially train models on streams of incoming data that vary in distribution by preserving previous knowledge while adapting to new data.
Current CL literature focuses on restricted access to previously seen data, while imposing no constraints on the computational budget for training.
We revisit this problem with a large-scale benchmark and analyze the performance of traditional CL approaches in a compute-constrained setting.
arXiv Detail & Related papers (2023-03-20T14:50:27Z)
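For the MoE-SpeQ entry above, the following hypothetical sketch shows how draft-model speculation could drive expert prefetching. All interfaces here (`draft_model.generate`, `router.top_k_experts`, `cache.prefetch`) are assumed for illustration and are not that system's published API.

```python
def prefetch_via_draft(draft_model, router, cache, context, lookahead: int = 4):
    """Speculative expert prefetch in the spirit of MoE-SpeQ (illustrative sketch).

    A small draft model speculates the next few tokens; the MoE router maps
    each speculated token to its top-k experts, and those experts are copied
    into the cache while the large model is still busy with the current step.
    """
    speculated = draft_model.generate(context, max_new_tokens=lookahead)
    needed = set()
    for token in speculated:
        needed.update(router.top_k_experts(token))   # experts likely required soon
    for eid in needed:
        cache.prefetch(eid)                          # overlap H2D transfer with compute
    return needed
```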