TriMoE: Augmenting GPU with AMX-Enabled CPU and DIMM-NDP for High-Throughput MoE Inference via Offloading
- URL: http://arxiv.org/abs/2603.01058v1
- Date: Sun, 01 Mar 2026 11:27:37 GMT
- Title: TriMoE: Augmenting GPU with AMX-Enabled CPU and DIMM-NDP for High-Throughput MoE Inference via Offloading
- Authors: Yudong Pan, Yintao He, Tianhua Han, Lian Liu, Shixin Zhao, Zhirong Chen, Mengdi Wang, Cangyuan Li, Yinhe Han, Ying Wang
- Abstract summary: TriMoE is a novel GPU-CPU-NDP architecture that exploits an AMX-enabled CPU to map hot, warm, and cold experts onto their optimal compute units. Experiments demonstrate that TriMoE achieves up to 2.83x speedup over state-of-the-art solutions.
- Score: 38.243293392367086
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To deploy large Mixture-of-Experts (MoE) models cost-effectively, offloading-based single-GPU heterogeneous inference is crucial. While GPU-CPU architectures that offload cold experts are constrained by host memory bandwidth, emerging GPU-NDP architectures utilize DIMM-NDP to offload non-hot experts. However, non-hot experts are not a homogeneous memory-bound group: a significant subset of warm experts exists that is severely penalized by high GPU I/O latency yet can saturate NDP compute throughput, exposing a critical compute gap. We present TriMoE, a novel GPU-CPU-NDP architecture that fills this gap by synergistically leveraging an AMX-enabled CPU to precisely map hot, warm, and cold experts onto their optimal compute units. We further introduce a bottleneck-aware expert scheduling policy and a prediction-driven dynamic relayout/rebalancing scheme. Experiments demonstrate that TriMoE achieves up to 2.83x speedup over state-of-the-art solutions.
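To make the hot/warm/cold mapping concrete, here is a minimal Python sketch that partitions experts by activation frequency and assigns each tier to a compute unit. The tier fractions, tier names, and placement API are illustrative assumptions, not TriMoE's published policy.

```python
# Hypothetical sketch of tiered expert placement in the spirit of TriMoE.
# Thresholds and the placement API are illustrative assumptions.
from collections import Counter

def place_experts(activation_counts: Counter,
                  hot_frac: float = 0.05, warm_frac: float = 0.30):
    """Map each expert to GPU (hot), AMX CPU (warm), or DIMM-NDP (cold)."""
    ranked = [e for e, _ in activation_counts.most_common()]
    n_hot = max(1, int(len(ranked) * hot_frac))
    n_warm = max(1, int(len(ranked) * warm_frac))
    placement = {}
    for rank, expert in enumerate(ranked):
        if rank < n_hot:
            placement[expert] = "gpu"       # latency-critical, activated most often
        elif rank < n_hot + n_warm:
            placement[expert] = "cpu_amx"   # busy enough to outrun NDP compute
        else:
            placement[expert] = "dimm_ndp"  # memory-bound, executed near the data
    return placement

if __name__ == "__main__":
    counts = Counter({f"expert_{i}": 2 ** (16 - i) for i in range(16)})
    for expert, unit in place_experts(counts).items():
        print(expert, "->", unit)
```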
Related papers
- ZipMoE: Efficient On-Device MoE Serving via Lossless Compression and Cache-Affinity Scheduling [56.88966608455977]
ZipMoE exploits the synergy between the hardware properties of edge devices and the statistical redundancy inherent to MoE parameters. ZipMoE achieves up to 72.77% inference latency reduction and up to 6.76x higher throughput than state-of-the-art systems.
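As a rough illustration of cache-affinity scheduling, the sketch below routes each expert's work to a unit that already caches its weights; the cache structure and cost model are assumptions, not ZipMoE's actual design.

```python
# Hedged sketch of cache-affinity scheduling: prefer running an expert where
# its (decompressed) weights are already cached. Costs are illustrative.
def schedule(expert_ids, caches, load_cost=10.0, run_cost=1.0):
    """caches: dict unit -> set of cached expert ids. Returns (assignment, cost)."""
    assignment, total = {}, 0.0
    for e in expert_ids:
        # Pick a unit that already caches the expert; otherwise fall back
        # to an arbitrary unit (a real scheduler would pick the least loaded).
        hits = [u for u, held in caches.items() if e in held]
        unit = hits[0] if hits else next(iter(caches))
        assignment[e] = unit
        total += run_cost if hits else load_cost + run_cost
        caches[unit].add(e)  # warm the cache for subsequent layers
    return assignment, total

if __name__ == "__main__":
    caches = {"big_core": {0, 3}, "little_core": {1}}
    print(schedule([0, 1, 2, 3], caches))
```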
arXiv Detail & Related papers (2026-01-29T02:51:59Z)
- A Scheduling Framework for Efficient MoE Inference on Edge GPU-NDP Systems [28.86723467729703]
Mixture-of-Experts (MoE) models facilitate edge deployment by decoupling model capacity from active computation, yet their large memory footprint drives the need for GPU systems with near-data processing capabilities that offload experts to dedicated processing units. Deploying MoE models on such edge-based GPU-NDP systems faces three critical challenges: 1) severe load imbalance across NDP units due to non-uniform expert selection and expert parallelism, 2) insufficient GPU utilization during expert computation within NDP units, and 3) extensive data pre-profiling necessitated by unpredictable expert activation patterns for pre-fetching.
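For challenge 1), a generic longest-processing-time heuristic conveys the flavor of load balancing across NDP units. This is a textbook heuristic offered as a sketch, not the framework's actual scheduler.

```python
# Hedged sketch: greedily assign each expert's token batch (sorted largest
# first) to the least-loaded NDP unit. Token counts proxy for work.
import heapq

def balance(expert_loads: dict, num_ndp_units: int):
    """expert_loads: expert id -> number of routed tokens."""
    heap = [(0, unit, []) for unit in range(num_ndp_units)]
    heapq.heapify(heap)
    for expert, load in sorted(expert_loads.items(), key=lambda kv: -kv[1]):
        work, unit, assigned = heapq.heappop(heap)  # least-loaded unit
        assigned.append(expert)
        heapq.heappush(heap, (work + load, unit, assigned))
    return {unit: (work, assigned) for work, unit, assigned in heap}

if __name__ == "__main__":
    loads = {"e0": 90, "e1": 70, "e2": 40, "e3": 35, "e4": 5}
    for unit, (work, experts) in sorted(balance(loads, 2).items()):
        print(f"NDP {unit}: work={work} experts={experts}")
```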
arXiv Detail & Related papers (2026-01-07T15:02:57Z)
- WarmServe: Enabling One-for-Many GPU Prewarming for Multi-LLM Serving [17.92164698813269]
Existing multi-LLM serving systems optimize GPU utilization at the cost of worse inference performance. We propose universal GPU workers to enable one-for-many GPU prewarming that loads models with knowledge of future workloads. WarmServe improves TTFT by up to 50.8x compared to the state-of-the-art autoscaling-based system.
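A minimal sketch of the one-for-many prewarming idea, assuming idle universal workers preload the most frequently requested models; the frequency-based predictor and worker abstraction are hypothetical, not WarmServe's actual design.

```python
# Hedged sketch: when workers go idle, prewarm them with the hottest models
# that are not yet resident anywhere.
from collections import Counter

class UniversalWorker:
    def __init__(self, wid):
        self.wid, self.loaded = wid, None
    def prewarm(self, model):
        self.loaded = model  # stand-in for copying weights into GPU memory

def prewarm_idle(idle_workers, request_history, already_loaded):
    ranked = [m for m, _ in Counter(request_history).most_common()
              if m not in already_loaded]
    for worker, model in zip(idle_workers, ranked):
        worker.prewarm(model)
        print(f"worker {worker.wid} prewarmed {model}")

if __name__ == "__main__":
    workers = [UniversalWorker(i) for i in range(2)]
    prewarm_idle(workers, ["llama-8b", "llama-8b", "qwen-7b", "phi-3"], {"phi-3"})
```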
arXiv Detail & Related papers (2025-12-10T09:47:40Z)
- Context-Aware Mixture-of-Experts Inference on CXL-Enabled GPU-NDP Systems [13.222990686403962]
Mixture-of-Experts (MoE) models scale large language models through conditional computation, but inference becomes memory-bound once expert weights exceed the capacity of GPU memory. We address this by adopting CXL-attached near-data processing (CXL-NDP) as the offloading tier to execute cold experts in place. We develop a context-aware MoE system that uses prefill-stage activation statistics to guide decoding-stage expert placement.
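The placement step can be sketched as follows: count expert activations during prefill, then pin the hottest experts in GPU memory for decoding. The capacity parameter and tier names are illustrative assumptions, not the paper's exact mechanism.

```python
# Hedged sketch: prefill-stage routing statistics decide which experts stay
# GPU-resident for decoding; the rest execute in place on CXL-NDP.
from collections import Counter

def plan_decode_placement(prefill_routing, gpu_capacity: int):
    """prefill_routing: iterable of expert ids chosen per token during prefill."""
    counts = Counter(prefill_routing)
    gpu_set = {e for e, _ in counts.most_common(gpu_capacity)}
    return {e: ("gpu" if e in gpu_set else "cxl_ndp") for e in counts}

if __name__ == "__main__":
    routing = [0, 0, 2, 5, 0, 2, 7, 2, 0, 5]
    print(plan_decode_placement(routing, gpu_capacity=2))
```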
arXiv Detail & Related papers (2025-12-04T05:30:53Z)
- Accelerating Mixture-of-Expert Inference with Adaptive Expert Split Mechanism [29.862588578556366]
MoEpic is an efficient MoE inference system with a novel expert split mechanism. Experiments on popular MoE LLMs demonstrate that MoEpic can save about half of the GPU cost.
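One plausible reading of an expert split, offered purely as a sketch: keep a slice of each expert's weights GPU-resident and fetch the remainder only when the expert is activated. The row-wise split and fixed ratio are assumptions, not MoEpic's published mechanism.

```python
# Hedged sketch: split an expert's weight matrix so the resident slice can
# start computing while the offloaded slice is fetched (here, sequential).
import numpy as np

class SplitExpert:
    def __init__(self, weight: np.ndarray, resident_rows: int):
        self.resident = weight[:resident_rows]   # hot slice kept on GPU
        self.offloaded = weight[resident_rows:]  # cold slice fetched on demand

    def forward(self, x: np.ndarray) -> np.ndarray:
        y_top = self.resident @ x      # starts immediately
        y_bottom = self.offloaded @ x  # in a real system, overlapped with the fetch
        return np.concatenate([y_top, y_bottom])

if __name__ == "__main__":
    w = np.random.randn(8, 4)
    e = SplitExpert(w, resident_rows=4)
    x = np.random.randn(4)
    assert np.allclose(e.forward(x), w @ x)
    print("split forward matches full forward")
```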
arXiv Detail & Related papers (2025-09-10T07:28:24Z)
- Mixture of Lookup Experts [63.787712153454464]
Mixture-of-Experts (MoE) activates only a subset of experts during inference. MoLE is a new MoE architecture that is efficient in both communication and VRAM usage.
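A minimal sketch of the lookup idea, assuming an expert whose input is the fixed token embedding: its outputs can then be precomputed over the vocabulary and served by table lookup. The shapes and the stand-in single-layer expert are illustrative, not MoLE's exact architecture.

```python
# Hedged sketch: precompute an embedding-input expert into a lookup table,
# so inference pays no expert FLOPs and loads no expert weights.
import numpy as np

vocab, d = 100, 16
embed = np.random.randn(vocab, d)
w_expert = np.random.randn(d, d)  # stand-in for a trained expert

# Offline: evaluate the expert on every token embedding once.
lut = np.tanh(embed @ w_expert)   # shape (vocab, d)

def lookup_expert(token_ids):
    return lut[token_ids]         # table lookup at inference time

token_ids = np.array([3, 42, 7])
assert np.allclose(lookup_expert(token_ids), np.tanh(embed[token_ids] @ w_expert))
print(lookup_expert(token_ids).shape)  # (3, 16)
```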
arXiv Detail & Related papers (2025-03-20T02:31:57Z)
- DAOP: Data-Aware Offloading and Predictive Pre-Calculation for Efficient MoE Inference [14.676716521856813]
Mixture-of-Experts (MoE) models face significant deployment challenges on memory-constrained devices. We present DAOP, an on-device MoE inference engine that optimizes parallel GPU-CPU execution. DAOP outperforms traditional expert caching and prefetching methods by up to 8.20x and offloading techniques by 1.35x while maintaining accuracy.
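A hedged sketch of predictive pre-calculation: guess upcoming experts from current gate scores and prefetch the ones not yet GPU-resident. The reuse-current-top-k predictor is an illustrative assumption, not DAOP's algorithm.

```python
# Hedged sketch: treat the current layer's top-k gate scores as a prediction
# of imminent expert use and start (mock) async loads for cache misses.
import numpy as np

def predict_and_prefetch(gate_logits: np.ndarray, k: int, gpu_cache: set):
    """gate_logits: shape (num_experts,) for the current layer."""
    predicted = np.argsort(gate_logits)[-k:][::-1]   # top-k expert ids
    to_fetch = [int(e) for e in predicted if e not in gpu_cache]
    gpu_cache.update(to_fetch)    # stand-in for an async host-to-GPU copy
    return to_fetch

if __name__ == "__main__":
    cache = {1, 4}
    logits = np.array([0.1, 2.0, -0.5, 1.7, 0.3, 0.9])
    print("prefetching experts:", predict_and_prefetch(logits, k=2, gpu_cache=cache))
```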
arXiv Detail & Related papers (2024-12-16T07:59:21Z)
- MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs [55.95879347182669]
The MoE architecture is renowned for its ability to increase model capacity without a proportional increase in inference cost.
MoE-Lightning introduces a novel CPU-GPU-I/O pipelining schedule, CGOPipe, with paged weights to achieve high resource utilization.
MoE-Lightning can achieve up to 10.3x higher throughput than state-of-the-art offloading-enabled LLM inference systems for Mixtral 8x7B on a single T4 GPU (16GB).
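The pipelining idea can be sketched with generic double buffering: overlap the next expert's weight transfer with the current expert's compute. CGOPipe's actual schedule and paged-weight layout are more involved; this only shows the overlap.

```python
# Hedged sketch: a background I/O thread fetches the next expert's weights
# while the foreground "GPU" computes with the current ones.
import time
from concurrent.futures import ThreadPoolExecutor

def load_weights(expert_id):
    time.sleep(0.02)              # stand-in for a host-to-GPU page transfer
    return f"weights[{expert_id}]"

def compute(weights, tokens):
    time.sleep(0.02)              # stand-in for the GPU GEMM
    return f"out({weights}, {tokens} tokens)"

def pipeline(expert_order, tokens):
    outputs = []
    with ThreadPoolExecutor(max_workers=1) as io:
        pending = io.submit(load_weights, expert_order[0])
        for i, expert in enumerate(expert_order):
            weights = pending.result()        # wait for this expert's page
            if i + 1 < len(expert_order):     # start fetching the next page
                pending = io.submit(load_weights, expert_order[i + 1])
            outputs.append(compute(weights, tokens[expert]))  # overlaps with I/O
    return outputs

if __name__ == "__main__":
    start = time.perf_counter()
    print(pipeline([0, 1, 2, 3], {0: 5, 1: 3, 2: 8, 3: 2}))
    print(f"{time.perf_counter() - start:.3f}s (vs ~0.16s unpipelined)")
```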
arXiv Detail & Related papers (2024-11-18T01:06:12Z)
- HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference [54.40808356999408]
We present HOBBIT, a mixed precision expert offloading system to enable flexible and efficient MoE inference.
Our key insight is that dynamically replacing less critical cache-miss experts with low precision versions can substantially reduce expert-loading latency.
HOBBIT achieves up to a 9.93x speedup in decoding compared to state-of-the-art MoE offloading systems.
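The key insight can be sketched as a cache whose misses are served by a pre-quantized low-precision copy instead of a blocking full-precision load; the naive int8 quantization below is illustrative only, not HOBBIT's actual scheme.

```python
# Hedged sketch: serve cache hits at full precision and cache misses from a
# cheap pre-quantized int8 copy, trading accuracy for loading latency.
import numpy as np

class MixedPrecisionExpertCache:
    def __init__(self, full_weights: dict, capacity: int):
        self.full = full_weights                                  # host masters
        self.cache = dict(list(full_weights.items())[:capacity])  # "GPU-resident"
        # Precompute cheap int8 versions of every expert.
        self.low = {e: (np.round(w / np.abs(w).max() * 127).astype(np.int8),
                        np.abs(w).max() / 127)
                    for e, w in full_weights.items()}

    def get(self, expert_id):
        if expert_id in self.cache:              # hit: full precision
            return self.cache[expert_id]
        q, scale = self.low[expert_id]           # miss: dequantized int8
        return q.astype(np.float32) * scale

if __name__ == "__main__":
    weights = {e: np.random.randn(4, 4).astype(np.float32) for e in range(4)}
    cache = MixedPrecisionExpertCache(weights, capacity=2)
    err = np.abs(cache.get(3) - weights[3]).max()
    print(f"miss served at low precision, max abs error {err:.4f}")
```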
arXiv Detail & Related papers (2024-11-03T04:25:46Z)
- FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs [57.12856172329322]
We envision a decentralized system unlocking the potential of vast untapped consumer-level GPUs.
This system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, and high variability across heterogeneous peers and devices.
arXiv Detail & Related papers (2023-09-03T13:27:56Z)