FlashMoE: Reducing SSD I/O Bottlenecks via ML-Based Cache Replacement for Mixture-of-Experts Inference on Edge Devices
- URL: http://arxiv.org/abs/2601.17063v1
- Date: Thu, 22 Jan 2026 17:07:33 GMT
- Title: FlashMoE: Reducing SSD I/O Bottlenecks via ML-Based Cache Replacement for Mixture-of-Experts Inference on Edge Devices
- Authors: Byeongju Kim, Jungwan Lee, Donghyeon Han, Hoi-Jun Yoo, Sangyeob Kim
- Abstract summary: Mixture-of-Experts (MoE) models have gained attention for efficiently scaling large language models. Although MoE models are extremely large, their sparse activation enables inference to be performed by accessing only a fraction of the model at a time. We propose FlashMoE, a system that offloads inactive experts to SSD, enabling efficient MoE inference under limited RAM.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, Mixture-of-Experts (MoE) models have gained attention for efficiently scaling large language models. Although these models are extremely large, their sparse activation enables inference to be performed by accessing only a fraction of the model at a time. This property opens the possibility of on-device inference of MoE, which was previously considered infeasible for such large models. Consequently, various systems have been proposed to leverage this sparsity and enable efficient MoE inference for edge devices. However, previous MoE inference systems like Fiddler[8] or DAOP[13] rely on DRAM-based offloading and are not suitable for memory-constrained on-device environments. As recent MoE models grow to hundreds of gigabytes, RAM-offloading solutions become impractical. To address this, we propose FlashMoE, a system that offloads inactive experts to SSD, enabling efficient MoE inference under limited RAM. FlashMoE incorporates a lightweight ML-based caching strategy that adaptively combines recency and frequency signals to maximize expert reuse, significantly reducing storage I/O. In addition, we built a user-grade desktop platform to demonstrate the practicality of FlashMoE. On this real hardware setup, FlashMoE improves cache hit rate by up to 51% over well-known offloading policies such as LRU and LFU, and achieves up to 2.6x speedup compared to existing MoE inference systems.
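The abstract does not spell out the learned replacement policy, but the following minimal sketch illustrates the general idea of an expert cache that blends recency and frequency signals with an adaptively tuned mixing weight. The class name, the scoring formula, and the online update of `alpha` are illustrative assumptions, not FlashMoE's actual algorithm.

```python
import time
from collections import defaultdict

class AdaptiveExpertCache:
    """Illustrative expert cache mixing recency and frequency signals.

    NOT FlashMoE's actual policy: the mixing weight `alpha` and its crude
    online update rule are assumptions made purely for illustration.
    """

    def __init__(self, capacity: int, alpha: float = 0.5, lr: float = 0.05):
        self.capacity = capacity          # max experts resident in RAM
        self.alpha = alpha                # weight between recency and frequency
        self.lr = lr                      # step size for the online update
        self.cache = {}                   # expert_id -> weights
        self.last_access = {}             # expert_id -> timestamp of last use
        self.freq = defaultdict(int)      # expert_id -> access count

    def _score(self, expert_id: int, now: float) -> float:
        # Higher score = more worth keeping: recency decays with idle time,
        # frequency grows with accumulated hits.
        recency = 1.0 / (1.0 + now - self.last_access[expert_id])
        return self.alpha * recency + (1.0 - self.alpha) * self.freq[expert_id]

    def access(self, expert_id: int, load_from_ssd):
        now = time.monotonic()
        hit = expert_id in self.cache
        if not hit:
            if len(self.cache) >= self.capacity:
                victim = min(self.cache, key=lambda e: self._score(e, now))
                del self.cache[victim]
            self.cache[expert_id] = load_from_ssd(expert_id)  # SSD read on miss
            # Crude online adaptation (assumption): nudge alpha toward whichever
            # signal would have ranked the missed expert higher.
            step = self.lr if self.freq[expert_id] == 0 else -self.lr
            self.alpha = min(1.0, max(0.0, self.alpha + step))
        self.last_access[expert_id] = now
        self.freq[expert_id] += 1
        return self.cache[expert_id], hit
```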
Related papers
- ZipMoE: Efficient On-Device MoE Serving via Lossless Compression and Cache-Affinity Scheduling [56.88966608455977]
ZipMoE exploits the synergy between the hardware properties of edge devices and the statistical redundancy inherent to MoE parameters. ZipMoE achieves up to 72.77% inference latency reduction and up to 6.76x higher throughput than state-of-the-art systems.
arXiv Detail & Related papers (2026-01-29T02:51:59Z) - SonicMoE: Accelerating MoE with IO and Tile-aware Optimizations [54.303301888915406]
Mixture of Experts (MoE) models have emerged as the de facto architecture for scaling up language models without significantly increasing the computational cost. We propose a memory-efficient algorithm to compute the forward and backward passes of MoEs with minimal activation caching. We also propose a novel "token rounding" method that minimizes the wasted compute due to padding in Grouped GEMM kernels.
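The summary only names the "token rounding" method; as a rough illustration, the sketch below rounds each expert's token count to the Grouped GEMM tile size, dropping a handful of stragglers when the final tile would be almost empty. The tile size, threshold, and drop criterion are assumptions, not SonicMoE's actual rule.

```python
def round_tokens_to_tiles(tokens_per_expert, tile_m=128, keep_ratio=0.95):
    """Hypothetical sketch of 'token rounding' for Grouped GEMM padding.

    Grouped GEMM kernels pad each expert's token count up to a multiple of
    tile_m, so a mostly-empty final tile wastes compute. This sketch rounds
    down (dropping stragglers) when the final tile would be less than
    (1 - keep_ratio) full, and pads up otherwise. Illustrative only.
    """
    rounded = []
    for n in tokens_per_expert:
        remainder = n % tile_m
        if remainder and remainder < tile_m * (1.0 - keep_ratio):
            rounded.append(n - remainder)          # round down: drop stragglers
        else:
            rounded.append(n + (-n) % tile_m)      # round up: pad to a full tile
    return rounded

# Example: with tile_m=128, an expert with 130 tokens rounds down to 128,
# while an expert with 250 tokens pads up to 256.
print(round_tokens_to_tiles([130, 250, 64]))
```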
arXiv Detail & Related papers (2025-12-16T04:39:10Z) - SlimMoE: Structured Compression of Large MoE Models via Expert Slimming and Distillation [82.53411922988039]
We introduce SlimMoE, a multi-stage compression framework for transforming large MoE models into much smaller, efficient variants. Using this framework, we compress Phi 3.5-MoE (41.9B total/6.6B activated parameters) to create Phi-mini-MoE (7.6B total/2.4B activated parameters) and Phi-tiny-MoE (3.8B total/1.1B activated parameters). Our experiments demonstrate that these compressed models outperform others of similar size and remain competitive with larger models.
arXiv Detail & Related papers (2025-06-23T07:15:59Z) - FlashMoE: Fast Distributed MoE in a Single Kernel [1.866526462692252]
FlashMoE is a fully GPU-resident MoE operator that fuses expert computation and inter-GPU communication into a single persistent GPU kernel. We show that FlashMoE achieves up to 9x higher GPU utilization, 6x lower latency, 5.7x higher throughput, and 4x better overlap efficiency compared to state-of-the-art baselines.
arXiv Detail & Related papers (2025-06-05T06:29:14Z) - ResMoE: Space-efficient Compression of Mixture of Experts LLMs via Residual Restoration [61.579842548990754]
Mixture-of-Experts (MoE) Transformer, the backbone of multiple phenomenal language models, leverages sparsity by activating only a fraction of model parameters for each input token. We introduce ResMoE, an innovative MoE approximation framework that utilizes the Wasserstein barycenter to extract a common expert (barycenter expert) and approximate the residuals between this barycenter expert and the original ones.
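As a rough illustration of the residual-restoration idea, the sketch below approximates the barycenter expert with an elementwise mean (a stand-in for the Wasserstein barycenter) and stores each expert as a low-rank residual on top of it. The rank and the barycenter approximation are assumptions for illustration only, not ResMoE's actual procedure.

```python
import numpy as np

def compress_experts(expert_weights, rank=8):
    """Illustrative sketch of ResMoE-style residual compression.

    The barycenter expert is approximated here by the elementwise mean, and
    each residual W_i - W_bar is truncated to a rank-`rank` factorization via
    SVD. Both choices are assumptions made for illustration.
    """
    stack = np.stack(expert_weights)          # (num_experts, d_out, d_in)
    w_bar = stack.mean(axis=0)                # stand-in for the barycenter expert
    factors = []
    for w in expert_weights:
        u, s, vt = np.linalg.svd(w - w_bar, full_matrices=False)
        factors.append((u[:, :rank] * s[:rank], vt[:rank]))  # low-rank residual
    return w_bar, factors

def reconstruct(w_bar, factor):
    u_s, vt = factor
    return w_bar + u_s @ vt                   # approximate the original expert
```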
arXiv Detail & Related papers (2025-03-10T03:15:54Z) - Taming Latency-Memory Trade-Off in MoE-Based LLM Serving via Fine-Grained Expert Offloading [7.9192039061119255]
FineMoE is a fine-grained expert offloading system for MoE serving. We show that FineMoE reduces inference latency by 47% and improves expert hit rate by 39% over state-of-the-art solutions.
arXiv Detail & Related papers (2025-02-07T22:51:17Z) - MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs [55.95879347182669]
MoE architecture is renowned for its ability to increase model capacity without a proportional increase in inference cost.
MoE-Lightning introduces a novel CPU-GPU-I/O pipelining schedule, CGOPipe, with paged weights to achieve high resource utilization.
MoE-Lightning can achieve up to 10.3x higher throughput than state-of-the-art offloading-enabled LLM inference systems for Mixtral 8x7B on a single T4 GPU (16GB).
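As a rough picture of what a CPU-GPU-I/O pipelining schedule looks like, the sketch below overlaps paged-weight loading, a CPU stage, and a GPU stage across layers with worker threads and bounded queues. The stage granularity, queue depth, and callable interfaces are assumptions, not CGOPipe's actual schedule.

```python
import queue, threading

def cgopipe_sketch(layers, load_page, cpu_stage, gpu_stage, depth=2):
    """Minimal CPU-GPU-I/O pipeline sketch in the spirit of CGOPipe.

    Two worker threads overlap paged-weight I/O and CPU compute with GPU
    compute on the main thread; bounded queues keep the stages in step.
    All interfaces here are illustrative assumptions.
    """
    io_q, cpu_q, out = queue.Queue(depth), queue.Queue(depth), []

    def io_worker():
        for layer in layers:
            io_q.put((layer, load_page(layer)))        # fetch paged weights
        io_q.put(None)

    def cpu_worker():
        while (item := io_q.get()) is not None:
            layer, page = item
            cpu_q.put((layer, cpu_stage(layer, page)))  # e.g. attention on CPU
        cpu_q.put(None)

    workers = [threading.Thread(target=io_worker), threading.Thread(target=cpu_worker)]
    for t in workers:
        t.start()
    while (item := cpu_q.get()) is not None:
        layer, acts = item
        out.append(gpu_stage(layer, acts))              # e.g. expert FFN on GPU
    for t in workers:
        t.join()
    return out
```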
arXiv Detail & Related papers (2024-11-18T01:06:12Z) - HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference [54.40808356999408]
We present HOBBIT, a mixed precision expert offloading system to enable flexible and efficient MoE inference.
Our key insight is that dynamically replacing less critical cache-miss experts with low precision versions can substantially reduce expert-loading latency.
HOBBIT achieves up to a 9.93x speedup in decoding compared to state-of-the-art MoE offloading systems.
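The insight above can be pictured with a small sketch: on a cache miss, a non-critical expert is served from a low-precision copy instead of waiting for the full-precision load. The criticality score, threshold, and loader interfaces are assumptions; HOBBIT's actual scoring and precision ladder may differ.

```python
def fetch_expert(expert_id, criticality, cache, load_fp16, load_int4, threshold=0.3):
    """Sketch of mixed-precision expert fetching on a cache miss.

    Non-critical misses are served from a low-precision (e.g. int4) copy to
    cut loading latency; critical experts fall back to full precision. The
    scoring and threshold are illustrative assumptions.
    """
    if expert_id in cache:
        return cache[expert_id]               # hit: use the resident copy
    if criticality < threshold:
        weights = load_int4(expert_id)        # small, fast read for minor experts
    else:
        weights = load_fp16(expert_id)        # full-precision load for critical ones
    cache[expert_id] = weights
    return weights
```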
arXiv Detail & Related papers (2024-11-03T04:25:46Z) - ProMoE: Fast MoE-based LLM Serving using Proactive Caching [4.305883455005433]
We introduce ProMoE, a novel proactive caching system that utilizes intermediate results to predict subsequent expert usage. ProMoE achieves an average speedup of 2.20x (up to 3.21x) and 2.07x (up to 5.02x) in the prefill and decode stages, respectively.
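As a rough sketch of proactive caching, the snippet below uses a small linear predictor over the current hidden state to guess which experts the next layer will route to and issues asynchronous loads for the top candidates. The predictor form, the top-k choice, and the async-load interface are assumptions, not ProMoE's actual design.

```python
import numpy as np

class ExpertPrefetcher:
    """Illustrative proactive prefetcher in the spirit of ProMoE.

    A linear projection (learned offline, assumed here) scores which experts
    the next layer is likely to use from the current hidden state, and the
    top-k candidates are loaded in the background to overlap I/O with compute.
    """

    def __init__(self, proj: np.ndarray, async_load, top_k: int = 2):
        self.proj = proj                  # (num_experts, hidden_dim) score matrix
        self.async_load = async_load      # starts a background expert fetch
        self.top_k = top_k

    def prefetch(self, hidden_state: np.ndarray):
        scores = self.proj @ hidden_state               # predict next-layer routing
        for expert_id in np.argsort(scores)[-self.top_k:]:
            self.async_load(int(expert_id))             # overlap load with compute
```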
arXiv Detail & Related papers (2024-10-29T15:31:27Z) - MoNDE: Mixture of Near-Data Experts for Large-Scale Sparse Models [15.346491299728463]
MoNDE reduces the volume of MoE parameter movement by transferring only the "hot" experts to the GPU.
MoNDE enables far more communication-efficient MoE inference, thereby resulting in substantial speedups.
arXiv Detail & Related papers (2024-05-29T07:23:29Z)