MoEBlaze: Breaking the Memory Wall for Efficient MoE Training on Modern GPUs
- URL: http://arxiv.org/abs/2601.05296v1
- Date: Thu, 08 Jan 2026 08:38:23 GMT
- Title: MoEBlaze: Breaking the Memory Wall for Efficient MoE Training on Modern GPUs
- Authors: Jiyuan Zhang, Yining Liu, Siqi Yan, Lisen Deng, Jennifer Cao, Shuqi Yang, Min Ni, Bi Xue, Shen Li,
- Abstract summary: "Memory wall" bottleneck is significantly amplified in modern large-scale Mixture-of-Experts (MoE) architectures.<n>We present MoEBlaze, a memory-efficient MoE training framework that addresses these issues through a co-designed system approach.<n>We demonstrate that MoEBlaze can achieve over 4x speedups and over 50% memory savings compared to existing MoE frameworks.
- Score: 9.086910335841772
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The pervasive "memory wall" bottleneck is significantly amplified in modern large-scale Mixture-of-Experts (MoE) architectures. MoE's inherent architectural sparsity leads to sparse arithmetic compute and also introduces substantial activation memory overheads -- driven by large token routing buffers and the need to materialize and buffer intermediate tensors. This memory pressure limits the maximum batch size and sequence length that can fit on GPUs, and also results in excessive data movements that hinders performance and efficient model scaling. We present MoEBlaze, a memory-efficient MoE training framework that addresses these issues through a co-designed system approach: (i) an end-to-end token dispatch and MoE training method with optimized data structures to eliminate intermediate buffers and activation materializing, and (ii) co-designed kernels with smart activation checkpoint to mitigate memory footprint while simultaneously achieving better performance. We demonstrate that MoEBlaze can achieve over 4x speedups and over 50% memory savings compared to existing MoE frameworks.
Related papers
- AllMem: A Memory-centric Recipe for Efficient Long-context Modeling [32.025154452526856]
Large Language Models (LLMs) encounter significant performance bottlenecks in long-sequence tasks.<n>We introduce textscAllMem, a novel and efficient hybrid architecture that integrates Sliding Window Attention (SWA) with non-linear Test-Time Training (TTT) memory networks.
arXiv Detail & Related papers (2026-02-14T09:04:28Z) - Out of the Memory Barrier: A Highly Memory Efficient Training System for LLMs with Million-Token Contexts [68.79341332280062]
Training Large Language Models (LLMs) on long contexts is severely constrained by prohibitive GPU memory overhead, not training time.<n>We introduce OOMB, a highly memory-efficient training system that directly confronts this barrier.<n>Our approach employs a chunk-recurrent training framework with on-the-fly activation recomputation, which maintains a constant activation memory footprint.
arXiv Detail & Related papers (2026-02-02T13:52:40Z) - ZipMoE: Efficient On-Device MoE Serving via Lossless Compression and Cache-Affinity Scheduling [56.88966608455977]
ZipMoE exploits the synergy between the hardware properties of edge devices and the statistical redundancy inherent to MoE parameters.<n>ZipMoE achieves up to $72.77%$ inference latency reduction and up to $6.76times$ higher throughput than the state-of-the-art systems.
arXiv Detail & Related papers (2026-01-29T02:51:59Z) - SonicMoE: Accelerating MoE with IO and Tile-aware Optimizations [54.303301888915406]
Mixture of Experts (MoE) models have emerged as the de facto architecture for scaling up language models without significantly increasing the computational cost.<n>We propose a memory-efficient algorithm to compute the forward and backward passes of MoEs with minimal activation caching.<n>We also propose a novel "token rounding" method that minimizes the wasted compute due to padding in Grouped GEMM kernels.
arXiv Detail & Related papers (2025-12-16T04:39:10Z) - Mixture-of-Channels: Exploiting Sparse FFNs for Efficient LLMs Pre-Training and Inference [16.71963410333802]
Large language models (LLMs) have demonstrated remarkable success across diverse artificial intelligence tasks.<n>MoC substantially reduces activation memory during pre-training.<n>MoC delivers significant memory savings and throughput gains while maintaining competitive model performance.
arXiv Detail & Related papers (2025-11-12T13:30:57Z) - Decoder-Hybrid-Decoder Architecture for Efficient Reasoning with Long Generation [108.0657508755532]
We introduce the Gated Memory Unit (GMU), a simple yet effective mechanism for efficient memory sharing across layers.<n>We apply it to create SambaY, a decoder-hybrid-decoder architecture that incorporates GMUs to share memory readout states from a Samba-based self-decoder.
arXiv Detail & Related papers (2025-07-09T07:27:00Z) - MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs [55.95879347182669]
MoE architecture is renowned for its ability to increase model capacity without a proportional increase in inference cost.
MoE-Lightning introduces a novel CPU-GPU-I/O pipelining schedule, CGOPipe, with paged weights to achieve high resource utilization.
MoE-Lightning can achieve up to 10.3x higher throughput than state-of-the-art offloading-enabled LLM inference systems for Mixtral 8x7B on a single T4 GPU (16GB)
arXiv Detail & Related papers (2024-11-18T01:06:12Z) - vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving [53.972175896814505]
Large Language Models (LLMs) are widely used across various domains, processing millions of daily requests.
Large Language Models (LLMs) are widely used across various domains, processing millions of daily requests.
arXiv Detail & Related papers (2024-07-22T14:37:58Z) - Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative
Model Inference with Unstructured Sparsity [12.663030430488922]
We propose Flash-LLM for enabling low-cost and highly-efficient large generative model inference on high-performance Cores.
At SpMM kernel level, Flash-LLM significantly outperforms the state-of-the-art library, i.e., Sputnik and SparTA by an average of 2.9x and 1.5x, respectively.
arXiv Detail & Related papers (2023-09-19T03:20:02Z) - Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference [23.207326766883405]
Mixture-of-Experts (MoE) is able to scale its model size without proportionally scaling up its computational requirements.
Pre-gated MoE employs our novel pre-gating function which alleviates the dynamic nature of sparse expert activation.
We demonstrate that Pre-gated MoE is able to improve performance, reduce GPU memory consumption, while also maintaining the same level of model quality.
arXiv Detail & Related papers (2023-08-23T11:25:37Z) - MegaBlocks: Efficient Sparse Training with Mixture-of-Experts [19.541303844245835]
MegaBlocks is a system for efficient Mixture-of-Experts (MoE) training on GPUs.
We reformulate MoE in terms of block-sparse operations and develop new block-sparse GPU kernels.
Our approach never drops tokens and maps efficiently to modern hardware, enabling end-to-end training speedups of up to 40% over MoEs.
arXiv Detail & Related papers (2022-11-29T00:27:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.