TriMoE: Augmenting GPU with AMX-Enabled CPU and DIMM-NDP for High-Throughput MoE Inference via Offloading
- URL: http://arxiv.org/abs/2603.01058v1
- Date: Sun, 01 Mar 2026 11:27:37 GMT
- Title: TriMoE: Augmenting GPU with AMX-Enabled CPU and DIMM-NDP for High-Throughput MoE Inference via Offloading
- Authors: Yudong Pan, Yintao He, Tianhua Han, Lian Liu, Shixin Zhao, Zhirong Chen, Mengdi Wang, Cangyuan Li, Yinhe Han, Ying Wang
- Abstract summary: TriMoE is a novel GPU-CPU-NDP architecture that exploits an AMX-enabled CPU to map hot, warm, and cold experts onto their optimal compute units. Experiments demonstrate that TriMoE achieves up to 2.83x speedup over state-of-the-art solutions.
- Score: 38.243293392367086
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To deploy large Mixture-of-Experts (MoE) models cost-effectively, offloading-based single-GPU heterogeneous inference is crucial. While GPU-CPU architectures that offload cold experts are constrained by host memory bandwidth, emerging GPU-NDP architectures utilize DIMM-NDP to offload non-hot experts. However, non-hot experts are not a homogeneous memory-bound group: a significant subset of warm experts exists that is severely penalized by high GPU I/O latency yet can saturate NDP compute throughput, exposing a critical compute gap. We present TriMoE, a novel GPU-CPU-NDP architecture that fills this gap by synergistically leveraging an AMX-enabled CPU to precisely map hot, warm, and cold experts onto their optimal compute units. We further introduce a bottleneck-aware expert scheduling policy and a prediction-driven dynamic relayout/rebalancing scheme. Experiments demonstrate that TriMoE achieves up to 2.83x speedup over state-of-the-art solutions.
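To make the hot/warm/cold mapping concrete, here is a minimal Python sketch that partitions experts by activation frequency and assigns each tier to a compute unit. The tier fractions, tier names, and placement API are illustrative assumptions, not TriMoE's published policy.

```python
# Hypothetical sketch of tiered expert placement in the spirit of TriMoE.
# Thresholds and the placement API are illustrative assumptions.
from collections import Counter

def place_experts(activation_counts: Counter,
                  hot_frac: float = 0.05, warm_frac: float = 0.30):
    """Map each expert to GPU (hot), AMX CPU (warm), or DIMM-NDP (cold)."""
    ranked = [e for e, _ in activation_counts.most_common()]
    n_hot = max(1, int(len(ranked) * hot_frac))
    n_warm = max(1, int(len(ranked) * warm_frac))
    placement = {}
    for rank, expert in enumerate(ranked):
        if rank < n_hot:
            placement[expert] = "gpu"       # latency-critical, activated most often
        elif rank < n_hot + n_warm:
            placement[expert] = "cpu_amx"   # busy enough to outrun NDP compute
        else:
            placement[expert] = "dimm_ndp"  # memory-bound, executed near the data
    return placement

if __name__ == "__main__":
    counts = Counter({f"expert_{i}": 2 ** (16 - i) for i in range(16)})
    for expert, unit in place_experts(counts).items():
        print(expert, "->", unit)
```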
Related papers
- ZipMoE: Efficient On-Device MoE Serving via Lossless Compression and Cache-Affinity Scheduling [56.88966608455977]
ZipMoE exploits the synergy between the hardware properties of edge devices and the statistical redundancy inherent to MoE parameters. ZipMoE achieves up to 72.77% inference latency reduction and up to 6.76x higher throughput than state-of-the-art systems.
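As a rough illustration of cache-affinity scheduling, the sketch below routes each expert's work to a unit that already caches its weights; the cache structure and cost model are assumptions, not ZipMoE's actual design.

```python
# Hedged sketch of cache-affinity scheduling: prefer running an expert where
# its (decompressed) weights are already cached. Costs are illustrative.
def schedule(expert_ids, caches, load_cost=10.0, run_cost=1.0):
    """caches: dict unit -> set of cached expert ids. Returns (assignment, cost)."""
    assignment, total = {}, 0.0
    for e in expert_ids:
        # Pick a unit that already caches the expert; otherwise fall back
        # to an arbitrary unit (a real scheduler would pick the least loaded).
        hits = [u for u, held in caches.items() if e in held]
        unit = hits[0] if hits else next(iter(caches))
        assignment[e] = unit
        total += run_cost if hits else load_cost + run_cost
        caches[unit].add(e)  # warm the cache for subsequent layers
    return assignment, total

if __name__ == "__main__":
    caches = {"big_core": {0, 3}, "little_core": {1}}
    print(schedule([0, 1, 2, 3], caches))
```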
arXiv Detail & Related papers (2026-01-29T02:51:59Z)
- A Scheduling Framework for Efficient MoE Inference on Edge GPU-NDP Systems [28.86723467729703]
Mixture-of-Experts (MoE) models facilitate edge deployment by decoupling model capacity from active computation, yet their large memory footprint drives the need for GPU systems with near-data processing capabilities that offload experts to dedicated processing units. Deploying MoE models on such edge-based GPU-NDP systems faces three critical challenges: 1) severe load imbalance across NDP units due to non-uniform expert selection and expert parallelism, 2) insufficient GPU utilization during expert computation within NDP units, and 3) extensive data pre-profiling necessitated by unpredictable expert activation patterns for pre-fetching.
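For challenge 1), a generic longest-processing-time heuristic conveys the flavor of load balancing across NDP units. This is a textbook heuristic offered as a sketch, not the framework's actual scheduler.

```python
# Hedged sketch: greedily assign each expert's token batch (sorted largest
# first) to the least-loaded NDP unit. Token counts proxy for work.
import heapq

def balance(expert_loads: dict, num_ndp_units: int):
    """expert_loads: expert id -> number of routed tokens."""
    heap = [(0, unit, []) for unit in range(num_ndp_units)]
    heapq.heapify(heap)
    for expert, load in sorted(expert_loads.items(), key=lambda kv: -kv[1]):
        work, unit, assigned = heapq.heappop(heap)  # least-loaded unit
        assigned.append(expert)
        heapq.heappush(heap, (work + load, unit, assigned))
    return {unit: (work, assigned) for work, unit, assigned in heap}

if __name__ == "__main__":
    loads = {"e0": 90, "e1": 70, "e2": 40, "e3": 35, "e4": 5}
    for unit, (work, experts) in sorted(balance(loads, 2).items()):
        print(f"NDP {unit}: work={work} experts={experts}")
```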
arXiv Detail & Related papers (2026-01-07T15:02:57Z)
- WarmServe: Enabling One-for-Many GPU Prewarming for Multi-LLM Serving [17.92164698813269]
Existing multi-LLM serving systems optimize GPU utilization at the cost of worse inference performance. We propose universal GPU workers to enable one-for-many GPU prewarming that loads models with knowledge of future workloads. WarmServe improves TTFT by up to 50.8x compared to the state-of-the-art autoscaling-based system.
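A minimal sketch of the one-for-many prewarming idea, assuming idle universal workers preload the most frequently requested models; the frequency-based predictor and worker abstraction are hypothetical, not WarmServe's actual design.

```python
# Hedged sketch: when workers go idle, prewarm them with the hottest models
# that are not yet resident anywhere.
from collections import Counter

class UniversalWorker:
    def __init__(self, wid):
        self.wid, self.loaded = wid, None
    def prewarm(self, model):
        self.loaded = model  # stand-in for copying weights into GPU memory

def prewarm_idle(idle_workers, request_history, already_loaded):
    ranked = [m for m, _ in Counter(request_history).most_common()
              if m not in already_loaded]
    for worker, model in zip(idle_workers, ranked):
        worker.prewarm(model)
        print(f"worker {worker.wid} prewarmed {model}")

if __name__ == "__main__":
    workers = [UniversalWorker(i) for i in range(2)]
    prewarm_idle(workers, ["llama-8b", "llama-8b", "qwen-7b", "phi-3"], {"phi-3"})
```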
arXiv Detail & Related papers (2025-12-10T09:47:40Z)
- Context-Aware Mixture-of-Experts Inference on CXL-Enabled GPU-NDP Systems [13.222990686403962]
Mixture-of-Experts (MoE) models scale large language models through conditional computation, but inference becomes memory-bound once expert weights exceed the capacity of GPU memory. We address this by adopting CXL-attached near-data processing (CXL-NDP) as the offloading tier to execute cold experts in place. We develop a context-aware MoE system that uses prefill-stage activation statistics to guide decoding-stage expert placement.
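The placement step can be sketched as follows: count expert activations during prefill, then pin the hottest experts in GPU memory for decoding. The capacity parameter and tier names are illustrative assumptions, not the paper's exact mechanism.

```python
# Hedged sketch: prefill-stage routing statistics decide which experts stay
# GPU-resident for decoding; the rest execute in place on CXL-NDP.
from collections import Counter

def plan_decode_placement(prefill_routing, gpu_capacity: int):
    """prefill_routing: iterable of expert ids chosen per token during prefill."""
    counts = Counter(prefill_routing)
    gpu_set = {e for e, _ in counts.most_common(gpu_capacity)}
    return {e: ("gpu" if e in gpu_set else "cxl_ndp") for e in counts}

if __name__ == "__main__":
    routing = [0, 0, 2, 5, 0, 2, 7, 2, 0, 5]
    print(plan_decode_placement(routing, gpu_capacity=2))
```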
arXiv Detail & Related papers (2025-12-04T05:30:53Z)
- Accelerating Mixture-of-Expert Inference with Adaptive Expert Split Mechanism [29.862588578556366]
MoEpic is an efficient MoE inference system with a novel expert split mechanism. Experiments on popular MoE LLMs demonstrate that MoEpic can save about half of the GPU cost.
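One plausible reading of an expert split, offered purely as a sketch: keep a slice of each expert's weights GPU-resident and fetch the remainder only when the expert is activated. The row-wise split and fixed ratio are assumptions, not MoEpic's published mechanism.

```python
# Hedged sketch: split an expert's weight matrix so the resident slice can
# start computing while the offloaded slice is fetched (here, sequential).
import numpy as np

class SplitExpert:
    def __init__(self, weight: np.ndarray, resident_rows: int):
        self.resident = weight[:resident_rows]   # hot slice kept on GPU
        self.offloaded = weight[resident_rows:]  # cold slice fetched on demand

    def forward(self, x: np.ndarray) -> np.ndarray:
        y_top = self.resident @ x      # starts immediately
        y_bottom = self.offloaded @ x  # in a real system, overlapped with the fetch
        return np.concatenate([y_top, y_bottom])

if __name__ == "__main__":
    w = np.random.randn(8, 4)
    e = SplitExpert(w, resident_rows=4)
    x = np.random.randn(4)
    assert np.allclose(e.forward(x), w @ x)
    print("split forward matches full forward")
```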
arXiv Detail & Related papers (2025-09-10T07:28:24Z)
- Mixture of Lookup Experts [63.787712153454464]
Mixture-of-Experts (MoE) activates only a subset of experts during inference. MoLE is a new MoE architecture that is efficient in both communication and VRAM usage.
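A minimal sketch of the lookup idea, assuming an expert whose input is the fixed token embedding: its outputs can then be precomputed over the vocabulary and served by table lookup. The shapes and the stand-in single-layer expert are illustrative, not MoLE's exact architecture.

```python
# Hedged sketch: precompute an embedding-input expert into a lookup table,
# so inference pays no expert FLOPs and loads no expert weights.
import numpy as np

vocab, d = 100, 16
embed = np.random.randn(vocab, d)
w_expert = np.random.randn(d, d)  # stand-in for a trained expert

# Offline: evaluate the expert on every token embedding once.
lut = np.tanh(embed @ w_expert)   # shape (vocab, d)

def lookup_expert(token_ids):
    return lut[token_ids]         # table lookup at inference time

token_ids = np.array([3, 42, 7])
assert np.allclose(lookup_expert(token_ids), np.tanh(embed[token_ids] @ w_expert))
print(lookup_expert(token_ids).shape)  # (3, 16)
```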
arXiv Detail & Related papers (2025-03-20T02:31:57Z)
- DAOP: Data-Aware Offloading and Predictive Pre-Calculation for Efficient MoE Inference [14.676716521856813]
Mixture-of-Experts (MoE) models face significant deployment challenges on memory-constrained devices. We present DAOP, an on-device MoE inference engine that optimizes parallel GPU-CPU execution. DAOP outperforms traditional expert caching and prefetching methods by up to 8.20x and offloading techniques by 1.35x while maintaining accuracy.
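A hedged sketch of predictive pre-calculation: guess upcoming experts from current gate scores and prefetch the ones not yet GPU-resident. The reuse-current-top-k predictor is an illustrative assumption, not DAOP's algorithm.

```python
# Hedged sketch: treat the current layer's top-k gate scores as a prediction
# of imminent expert use and start (mock) async loads for cache misses.
import numpy as np

def predict_and_prefetch(gate_logits: np.ndarray, k: int, gpu_cache: set):
    """gate_logits: shape (num_experts,) for the current layer."""
    predicted = np.argsort(gate_logits)[-k:][::-1]   # top-k expert ids
    to_fetch = [int(e) for e in predicted if e not in gpu_cache]
    gpu_cache.update(to_fetch)    # stand-in for an async host-to-GPU copy
    return to_fetch

if __name__ == "__main__":
    cache = {1, 4}
    logits = np.array([0.1, 2.0, -0.5, 1.7, 0.3, 0.9])
    print("prefetching experts:", predict_and_prefetch(logits, k=2, gpu_cache=cache))
```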
arXiv Detail & Related papers (2024-12-16T07:59:21Z)
- MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs [55.95879347182669]
The MoE architecture is renowned for its ability to increase model capacity without a proportional increase in inference cost.
MoE-Lightning introduces a novel CPU-GPU-I/O pipelining schedule, CGOPipe, with paged weights to achieve high resource utilization.
MoE-Lightning can achieve up to 10.3x higher throughput than state-of-the-art offloading-enabled LLM inference systems for Mixtral 8x7B on a single T4 GPU (16GB).
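The pipelining idea can be sketched with generic double buffering: overlap the next expert's weight transfer with the current expert's compute. CGOPipe's actual schedule and paged-weight layout are more involved; this only shows the overlap.

```python
# Hedged sketch: a background I/O thread fetches the next expert's weights
# while the foreground "GPU" computes with the current ones.
import time
from concurrent.futures import ThreadPoolExecutor

def load_weights(expert_id):
    time.sleep(0.02)              # stand-in for a host-to-GPU page transfer
    return f"weights[{expert_id}]"

def compute(weights, tokens):
    time.sleep(0.02)              # stand-in for the GPU GEMM
    return f"out({weights}, {tokens} tokens)"

def pipeline(expert_order, tokens):
    outputs = []
    with ThreadPoolExecutor(max_workers=1) as io:
        pending = io.submit(load_weights, expert_order[0])
        for i, expert in enumerate(expert_order):
            weights = pending.result()        # wait for this expert's page
            if i + 1 < len(expert_order):     # start fetching the next page
                pending = io.submit(load_weights, expert_order[i + 1])
            outputs.append(compute(weights, tokens[expert]))  # overlaps with I/O
    return outputs

if __name__ == "__main__":
    start = time.perf_counter()
    print(pipeline([0, 1, 2, 3], {0: 5, 1: 3, 2: 8, 3: 2}))
    print(f"{time.perf_counter() - start:.3f}s (vs ~0.16s unpipelined)")
```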
arXiv Detail & Related papers (2024-11-18T01:06:12Z)
- HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference [54.40808356999408]
We present HOBBIT, a mixed precision expert offloading system to enable flexible and efficient MoE inference.
Our key insight is that dynamically replacing less critical cache-miss experts with low precision versions can substantially reduce expert-loading latency.
HOBBIT achieves up to a 9.93x speedup in decoding compared to state-of-the-art MoE offloading systems.
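The key insight can be sketched as a cache whose misses are served by a pre-quantized low-precision copy instead of a blocking full-precision load; the naive int8 quantization below is illustrative only, not HOBBIT's actual scheme.

```python
# Hedged sketch: serve cache hits at full precision and cache misses from a
# cheap pre-quantized int8 copy, trading accuracy for loading latency.
import numpy as np

class MixedPrecisionExpertCache:
    def __init__(self, full_weights: dict, capacity: int):
        self.full = full_weights                                  # host masters
        self.cache = dict(list(full_weights.items())[:capacity])  # "GPU-resident"
        # Precompute cheap int8 versions of every expert.
        self.low = {e: (np.round(w / np.abs(w).max() * 127).astype(np.int8),
                        np.abs(w).max() / 127)
                    for e, w in full_weights.items()}

    def get(self, expert_id):
        if expert_id in self.cache:              # hit: full precision
            return self.cache[expert_id]
        q, scale = self.low[expert_id]           # miss: dequantized int8
        return q.astype(np.float32) * scale

if __name__ == "__main__":
    weights = {e: np.random.randn(4, 4).astype(np.float32) for e in range(4)}
    cache = MixedPrecisionExpertCache(weights, capacity=2)
    err = np.abs(cache.get(3) - weights[3]).max()
    print(f"miss served at low precision, max abs error {err:.4f}")
```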
arXiv Detail & Related papers (2024-11-03T04:25:46Z)
- FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs [57.12856172329322]
We envision a decentralized system unlocking the potential of vast untapped consumer-level GPUs.
This system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, and high variability across heterogeneous peers and devices.
arXiv Detail & Related papers (2023-09-03T13:27:56Z)