SonicMoE: Accelerating MoE with IO and Tile-aware Optimizations
- URL: http://arxiv.org/abs/2512.14080v1
- Date: Tue, 16 Dec 2025 04:39:10 GMT
- Title: SonicMoE: Accelerating MoE with IO and Tile-aware Optimizations
- Authors: Wentao Guo, Mayank Mishra, Xinle Cheng, Ion Stoica, Tri Dao
- Abstract summary: Mixture of Experts (MoE) models have emerged as the de facto architecture for scaling up language models without significantly increasing the computational cost. We propose a memory-efficient algorithm to compute the forward and backward passes of MoEs with minimal activation caching. We also propose a novel "token rounding" method that minimizes the wasted compute due to padding in Grouped GEMM kernels.
- Score: 54.303301888915406
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Mixture of Experts (MoE) models have emerged as the de facto architecture for scaling up language models without significantly increasing the computational cost. Recent MoE models demonstrate a clear trend towards higher expert granularity (smaller expert intermediate dimension) and higher sparsity (a constant number of activated experts drawn from a larger number of total experts), which improve model quality per FLOP. However, fine-grained MoEs suffer from increased activation memory footprint and reduced hardware efficiency due to higher IO costs, while sparser MoEs suffer from wasted computation due to padding in Grouped GEMM kernels. In response, we propose a memory-efficient algorithm to compute the forward and backward passes of MoEs with minimal activation caching for the backward pass. We also design GPU kernels that overlap memory IO with computation, benefiting all MoE architectures. Finally, we propose a novel "token rounding" method that minimizes the wasted compute due to padding in Grouped GEMM kernels. As a result, our method SonicMoE reduces activation memory by 45% and achieves a 1.86x compute throughput improvement on Hopper GPUs compared to ScatterMoE's BF16 MoE kernel for a fine-grained 7B MoE. Concretely, SonicMoE on 64 H100s achieves a training throughput of 213 billion tokens per day, comparable to ScatterMoE's 225 billion tokens per day on 96 H100s, for 7B MoE model training with FSDP-2 using the lm-engine codebase. Under high MoE sparsity settings, our tile-aware token rounding algorithm yields an additional 1.16x speedup in kernel execution time compared to vanilla top-$K$ routing while maintaining similar downstream performance. We open-source all our kernels to enable faster MoE model training.
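The padding waste that "token rounding" targets comes from tiled Grouped GEMM kernels: each expert's token batch is padded up to a multiple of the kernel's row-tile size, so an expert that barely spills into a fresh tile pays for a nearly empty one. The sketch below illustrates the effect in plain Python; the tile size, the slack threshold, and the drop-the-marginal-tokens rule are illustrative assumptions for exposition, not SonicMoE's published algorithm.

```python
# Illustrative sketch only: TILE_M, `slack`, and the rounding rule are
# assumptions, not SonicMoE's actual tile-aware token rounding.
import math

TILE_M = 128  # assumed row-tile size of the Grouped GEMM kernel

def padded_rows(counts):
    """Rows the tiled kernel actually computes: each expert's token
    count is padded up to the next multiple of TILE_M."""
    return sum(math.ceil(c / TILE_M) * TILE_M for c in counts if c > 0)

def round_counts(counts, slack=16):
    """Toy 'token rounding': if an expert spills fewer than `slack`
    tokens into a fresh tile, drop those marginal tokens so its batch
    snaps back to the tile boundary and the extra tile disappears."""
    out = []
    for c in counts:
        spill = c % TILE_M
        out.append(c - spill if TILE_M < c and 0 < spill < slack else c)
    return out

tokens_per_expert = [130, 257, 5, 1024, 70]          # e.g., after top-K routing
print(padded_rows(tokens_per_expert))                # 1920 rows computed
print(padded_rows(round_counts(tokens_per_expert)))  # 1664 rows computed
```

Here, dropping just three marginal tokens removes two full tiles of wasted rows; the 1.16x kernel-time speedup reported under high sparsity comes from the same effect at scale.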
Related papers
- OmniMoE: An Efficient MoE by Orchestrating Atomic Experts at Scale [11.733927781098805]
We propose OmniMoE, a system-algorithm co-designed framework that pushes expert granularity to its logical extreme. OmniMoE introduces scalable routing and execution within a single MoE layer, while retaining a shared dense branch for general-purpose processing. We show that OmniMoE achieves 50.9% zero-shot accuracy across seven benchmarks, outperforming coarse-grained (e.g., DeepSeekMoE) and fine-grained (e.g., PEER) baselines.
arXiv Detail & Related papers (2026-02-05T14:37:32Z)
- ZipMoE: Efficient On-Device MoE Serving via Lossless Compression and Cache-Affinity Scheduling [56.88966608455977]
ZipMoE exploits the synergy between the hardware properties of edge devices and the statistical redundancy inherent to MoE parameters. ZipMoE achieves up to $72.77\%$ inference latency reduction and up to $6.76\times$ higher throughput than state-of-the-art systems.
arXiv Detail & Related papers (2026-01-29T02:51:59Z)
- MoEBlaze: Breaking the Memory Wall for Efficient MoE Training on Modern GPUs [9.086910335841772]
"Memory wall" bottleneck is significantly amplified in modern large-scale Mixture-of-Experts (MoE) architectures.<n>We present MoEBlaze, a memory-efficient MoE training framework that addresses these issues through a co-designed system approach.<n>We demonstrate that MoEBlaze can achieve over 4x speedups and over 50% memory savings compared to existing MoE frameworks.
arXiv Detail & Related papers (2026-01-08T08:38:23Z)
- BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity [66.94629945519125]
We introduce a novel MoE architecture, BlockFFN, as well as its efficient training and deployment techniques. Specifically, we use a router integrating ReLU activation and RMSNorm for differentiable and flexible routing. Next, to promote both token-level sparsity (TLS) and chunk-level sparsity (CLS), CLS-aware training objectives are designed, making BlockFFN more acceleration-friendly.
arXiv Detail & Related papers (2025-07-11T17:28:56Z)
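The BlockFFN entry above names one concrete component: a router combining ReLU activation with RMSNorm. A minimal PyTorch sketch of that combination follows; the composition order, dimensions, and epsilon are assumptions for illustration, not BlockFFN's exact design.

```python
# Hedged sketch of a ReLU + RMSNorm router in the spirit of BlockFFN's
# description; order of operations and shapes are assumptions.
import torch
import torch.nn as nn

class ReluRMSNormRouter(nn.Module):
    def __init__(self, d_model: int, n_experts: int, eps: float = 1e-6):
        super().__init__()
        self.proj = nn.Linear(d_model, n_experts, bias=False)
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # ReLU keeps gates non-negative and exactly zero for unselected
        # experts (token-level sparsity); RMS normalization rescales the
        # surviving gate magnitudes while staying differentiable.
        g = torch.relu(self.proj(x))                                 # (batch, n_experts)
        rms = g.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return g * rms                                               # sparse gates

router = ReluRMSNormRouter(d_model=512, n_experts=64)
gates = router(torch.randn(4, 512))
print((gates > 0).float().mean())   # fraction of active expert gates
```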
- FlashMoE: Fast Distributed MoE in a Single Kernel [1.866526462692252]
FlashMoE is a fully GPU-resident MoE operator that fuses expert computation and inter-GPU communication into a single persistent GPU kernel. We show that FlashMoE achieves up to 9x higher GPU utilization, 6x lower latency, 5.7x higher throughput, and 4x better overlap efficiency compared to state-of-the-art baselines.
arXiv Detail & Related papers (2025-06-05T06:29:14Z)
- MoE-Gen: High-Throughput MoE Inference on a Single GPU with Module-Based Batching [2.543762777822215]
MoE-Gen is a high-throughput MoE inference system for single-GPU execution. We introduce module-based batching, which accumulates tokens in host memory and dynamically launches large batches on the GPU to maximize utilization. MoE-Gen achieves 8-31x higher throughput compared to state-of-the-art systems.
arXiv Detail & Related papers (2025-03-12T18:08:01Z)
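The MoE-Gen summary above describes module-based batching: tokens accumulate in host memory and are launched to the GPU as one large batch per module. A hedged sketch of that accumulate-then-flush pattern follows; the threshold, flush policy, and API are illustrative assumptions, not MoE-Gen's implementation.

```python
# Hedged sketch of an accumulate-then-flush batcher; `min_batch`, the
# flush policy, and this API are illustrative, not MoE-Gen's code.
import torch

class ModuleBatcher:
    def __init__(self, module, min_batch=1024, device="cuda"):
        self.module = module.to(device)   # use device="cpu" without a GPU
        self.min_batch = min_batch
        self.device = device
        self.pending = []                 # staging buffer kept in host memory

    def submit(self, x_cpu):
        """Queue a small request; run the module only once enough
        tokens have accumulated to saturate the GPU."""
        self.pending.append(x_cpu)
        if sum(t.shape[0] for t in self.pending) >= self.min_batch:
            return self.flush()
        return None                       # caller polls again later

    def flush(self):
        if not self.pending:
            return None
        batch = torch.cat(self.pending).to(self.device, non_blocking=True)
        self.pending.clear()
        return self.module(batch)         # one large, high-utilization launch
```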
- MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs [55.95879347182669]
The MoE architecture is renowned for its ability to increase model capacity without a proportional increase in inference cost.
MoE-Lightning introduces a novel CPU-GPU-I/O pipelining schedule, CGOPipe, with paged weights to achieve high resource utilization.
MoE-Lightning can achieve up to 10.3x higher throughput than state-of-the-art offloading-enabled LLM inference systems for Mixtral 8x7B on a single T4 GPU (16GB).
arXiv Detail & Related papers (2024-11-18T01:06:12Z)
- EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference [49.94169109038806]
This paper introduces EPS-MoE, a novel expert pipeline scheduler for MoE that surpasses existing parallelism schemes. Our results demonstrate up to 52.4% improvement in prefill throughput compared to existing parallel inference methods.
arXiv Detail & Related papers (2024-10-16T05:17:49Z)
- Revisiting MoE and Dense Speed-Accuracy Comparisons for LLM Training [45.97480866595295]
Mixture-of-Experts (MoE) enjoys performance gains by increasing model capacity while keeping computation cost constant.
We adopt a 3D sharding method that keeps the dense-to-MoE step time increase within a healthy range.
arXiv Detail & Related papers (2024-05-23T21:00:53Z)
- Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models [62.4691912312317]
Mixture-of-Experts (MoE) language models can reduce computational costs by 2-4$\times$ compared to dense models without sacrificing performance.
We propose a hybrid dense training and sparse inference framework for MoE models (DS-MoE), which achieves strong computation and parameter efficiency.
arXiv Detail & Related papers (2024-04-08T14:39:49Z)
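The DS-MoE entry above hinges on a single mechanism: run every expert during training (dense) but only the top-K gated experts at inference (sparse). A minimal PyTorch sketch of that switch follows; the gating function, K, expert shapes, and the naive dispatch loop are assumptions for clarity, not DS-MoE's code.

```python
# Illustrative dense-train / sparse-infer MoE layer; gating, K, and the
# per-token dispatch loop are assumptions, not DS-MoE's implementation.
import torch
import torch.nn as nn

class DenseTrainSparseInferMoE(nn.Module):
    def __init__(self, d, n_experts, k=2):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(d, d) for _ in range(n_experts))
        self.gate = nn.Linear(d, n_experts)
        self.k = k

    def forward(self, x):                          # x: (batch, d)
        w = torch.softmax(self.gate(x), dim=-1)    # (batch, n_experts)
        if self.training:
            # Dense: every expert processes every token, weighted by its gate.
            outs = torch.stack([e(x) for e in self.experts], dim=-1)
            return (outs * w.unsqueeze(1)).sum(dim=-1)
        # Sparse: keep only the top-k gates per token at inference time.
        topw, topi = w.topk(self.k, dim=-1)
        y = torch.zeros_like(x)
        for b in range(x.shape[0]):                # naive dispatch, for clarity
            for j in range(self.k):
                y[b] += topw[b, j] * self.experts[int(topi[b, j])](x[b])
        return y
```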
- MegaBlocks: Efficient Sparse Training with Mixture-of-Experts [19.541303844245835]
MegaBlocks is a system for efficient Mixture-of-Experts (MoE) training on GPUs.
We reformulate MoE in terms of block-sparse operations and develop new block-sparse GPU kernels.
Our approach never drops tokens and maps efficiently to modern hardware, enabling end-to-end training speedups of up to 40% over MoEs.
arXiv Detail & Related papers (2022-11-29T00:27:08Z)
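MegaBlocks' reformulation above can be made concrete: once tokens are grouped by expert, the whole MoE layer is one block-diagonal matmul with variable-sized blocks, so no tokens need to be padded or dropped. The emulation below computes those blocks with torch.split; top-1 routing and the tiny shapes are illustrative assumptions, and the real kernels execute the same variable-sized blocks block-sparsely on the GPU without materializing anything.

```python
# Emulation of the block-diagonal view of an MoE layer; top-1 routing and
# the tiny shapes are illustrative. Real block-sparse kernels compute the
# same variable-sized blocks without padding or dropping tokens.
import torch

d, n_experts, n_tokens = 8, 3, 10
tokens = torch.randn(n_tokens, d)
assign = torch.randint(n_experts, (n_tokens,))      # expert id per token
weights = [torch.randn(d, d) for _ in range(n_experts)]

order = torch.argsort(assign)                       # group tokens by expert
counts = torch.bincount(assign, minlength=n_experts)

grouped = tokens[order]
outs = [x_e @ W_e                                   # one variable-sized block
        for x_e, W_e in zip(torch.split(grouped, counts.tolist()), weights)]

y = torch.empty_like(tokens)
y[order] = torch.cat(outs)                          # restore token order
print(y.shape)                                      # (10, 8): no token dropped
```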