Remoe: Towards Efficient and Low-Cost MoE Inference in Serverless Computing
- URL: http://arxiv.org/abs/2512.18674v1
- Date: Sun, 21 Dec 2025 10:27:50 GMT
- Title: Remoe: Towards Efficient and Low-Cost MoE Inference in Serverless Computing
- Authors: Wentao Liu, Yuhao Hu, Ruiting Zhou, Baochun Li, Ne Wang,
- Abstract summary: Mixture-of-Experts (MoE) has become a dominant architecture in large language models.<n>MoEs incurs high inference costs due to memory-intensive parameter caching.<n>We propose Remoe, a heterogeneous MoE inference system tailored for serverless computing.
- Score: 29.98726492279776
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Mixture-of-Experts (MoE) has become a dominant architecture in large language models (LLMs) due to its ability to scale model capacity via sparse expert activation. Meanwhile, serverless computing, with its elasticity and pay-per-use billing, is well-suited for deploying MoEs with bursty workloads. However, the large number of experts in MoE models incurs high inference costs due to memory-intensive parameter caching. These costs are difficult to mitigate via simple model partitioning due to input-dependent expert activation. To address these issues, we propose Remoe, a heterogeneous MoE inference system tailored for serverless computing. Remoe assigns non-expert modules to GPUs and expert modules to CPUs, and further offloads infrequently activated experts to separate serverless functions to reduce memory overhead and enable parallel execution. We incorporate three key techniques: (1) a Similar Prompts Searching (SPS) algorithm to predict expert activation patterns based on semantic similarity of inputs; (2) a Main Model Pre-allocation (MMP) algorithm to ensure service-level objectives (SLOs) via worst-case memory estimation; and (3) a joint memory and replica optimization framework leveraging Lagrangian duality and the Longest Processing Time (LPT) algorithm. We implement Remoe on Kubernetes and evaluate it across multiple LLM benchmarks. Experimental results show that Remoe reduces inference cost by up to 57% and cold start latency by 47% compared to state-of-the-art baselines.
Related papers
- MELINOE: Fine-Tuning Enables Memory-Efficient Inference for Mixture-of-Experts Models [13.907916161242794]
Mixture-of-Experts (MoE) model architectures can significantly reduce the number of activated parameters per token.<n>Their large overall parameter counts and model sizes have precluded their widespread usage in resource-constrained settings.<n>We present MELINOE, a method that fine-tunes an MoE model to more strongly prefer activating a smaller number of experts per sequence.
arXiv Detail & Related papers (2026-01-30T14:40:18Z) - ZipMoE: Efficient On-Device MoE Serving via Lossless Compression and Cache-Affinity Scheduling [56.88966608455977]
ZipMoE exploits the synergy between the hardware properties of edge devices and the statistical redundancy inherent to MoE parameters.<n>ZipMoE achieves up to $72.77%$ inference latency reduction and up to $6.76times$ higher throughput than the state-of-the-art systems.
arXiv Detail & Related papers (2026-01-29T02:51:59Z) - MoE-SpeQ: Speculative Quantized Decoding with Proactive Expert Prefetching and Offloading for Mixture-of-Experts [29.437264687850874]
We present MoE-SpeQ, a new inference system built on a novel co-design of speculative execution and expert offloading.<n>MoE-SpeQ employs a small, on-device draft model to predict the sequence of required experts for future tokens.<n>Our evaluation on memory-constrained devices shows that for the Phi-MoE model, MoE-SpeQ achieves at most 2.34x speedup over the state-of-the-art offloading framework.
arXiv Detail & Related papers (2025-11-18T03:40:19Z) - MC#: Mixture Compressor for Mixture-of-Experts Large Models [86.64315380917827]
Mixture-of-Experts (MoE) effectively scales large language models (LLMs) and vision-language models (VLMs) by increasing capacity through sparse activation.<n>We propose MC# (Mixture-Compressor-sharp), a framework that combines static quantization and dynamic expert pruning.
arXiv Detail & Related papers (2025-10-13T03:12:46Z) - Cluster Topology-Driven Placement of Experts Reduces Network Traffic in MoE Inference [49.141930185079325]
We propose an integer linear program (ILP) that determines the optimal placement of experts, minimizing the expected number of transmissions.<n>We demonstrate that ILP-based placement strategy yields lower network traffic than competitors for small-scale (DeepSeekMoE16B) and large-scale (DeepSeek-R1671B) models.
arXiv Detail & Related papers (2025-08-12T07:08:48Z) - D$^{2}$MoE: Dual Routing and Dynamic Scheduling for Efficient On-Device MoE-based LLM Serving [14.607254882119507]
Combination of experts (MoE) model is a sparse variant of large language models (LLMs)<n>Despite its benefits, MoE is still too expensive to deploy on resource-constrained edge devices.<n>We propose D$2$MoE, an algorithm-system co-design framework that matches diverse task requirements by dynamically allocating the most proper bit-width to each expert.
arXiv Detail & Related papers (2025-04-17T05:37:35Z) - eMoE: Task-aware Memory Efficient Mixture-of-Experts-Based (MoE) Model Inference [6.642099288463585]
We propose eMoE, a memory efficient inference system for large language models (LLMs)<n>eMoE reduces memory usage by predicting and loading only the required experts based on recurrent patterns in expert routing.<n>It also enables processing prompts 40x longer, batches 4.5x larger, and achieves 1.5x higher throughput.
arXiv Detail & Related papers (2025-03-10T01:11:52Z) - Taming Latency-Memory Trade-Off in MoE-Based LLM Serving via Fine-Grained Expert Offloading [7.9192039061119255]
FineMoE is a fine-grained expert offloading system for MoE serving.<n>We show that FineMoE reduces inference latency by 47% and improves expert hit rate by 39% over state-of-the-art solutions.
arXiv Detail & Related papers (2025-02-07T22:51:17Z) - Optimizing Distributed Deployment of Mixture-of-Experts Model Inference in Serverless Computing [9.217991144854851]
Mixture-of-Experts (MoE) models have been a dominant type of model architectures nowadays.<n>We study optimized MoE model deployment and distributed inference serving on a serverless platform.<n>Our designs reduce the billed cost of all MoE layers by at least 75.67% compared to CPU clusters.
arXiv Detail & Related papers (2025-01-09T15:29:33Z) - HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference [54.40808356999408]
We present HOBBIT, a mixed precision expert offloading system to enable flexible and efficient MoE inference.
Our key insight is that dynamically replacing less critical cache-miss experts with low precision versions can substantially reduce expert-loading latency.
HOBBIT achieves up to a 9.93x speedup in decoding compared to state-of-the-art MoE offloading systems.
arXiv Detail & Related papers (2024-11-03T04:25:46Z) - EdgeMoE: Empowering Sparse Large Language Models on Mobile Devices [3.3947808667959536]
EdgeMoE is an on-device inference engine for mixture-of-expert (MoE) LLMs.<n>Non-expert weights are held in device memory; while expert weights are held on external storage and fetched to memory only when activated.
arXiv Detail & Related papers (2023-08-28T06:56:08Z) - SqueezeLLM: Dense-and-Sparse Quantization [80.32162537942138]
Main bottleneck for generative inference with LLMs is memory bandwidth, rather than compute, for single batch inference.
We introduce SqueezeLLM, a post-training quantization framework that enables lossless compression to ultra-low precisions of up to 3-bit.
Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format.
arXiv Detail & Related papers (2023-06-13T08:57:54Z) - Cheaply Evaluating Inference Efficiency Metrics for Autoregressive
Transformer APIs [66.30706841821123]
Large language models (LLMs) power many state-of-the-art systems in natural language processing.
LLMs are extremely computationally expensive, even at inference time.
We propose a new metric for comparing inference efficiency across models.
arXiv Detail & Related papers (2023-05-03T21:51:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.