MoESD: Unveil Speculative Decoding's Potential for Accelerating Sparse MoE
- URL: http://arxiv.org/abs/2505.19645v2
- Date: Fri, 13 Jun 2025 14:54:40 GMT
- Title: MoESD: Unveil Speculative Decoding's Potential for Accelerating Sparse MoE
- Authors: Zongle Huang, Lei Zhu, Zongyuan Zhan, Ting Hu, Weikai Mao, Xianzhi Yu, Yongpan Liu, Tianyu Zhang,
- Abstract summary: Speculative decoding (SD) is a widely used technique to accelerate LLM inference without accuracy loss. We show that under medium batch sizes, MoE surprisingly benefits more from SD than dense models. We introduce a new metric 'target efficiency' that characterizes these effects.
- Score: 16.413800846658564
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) have achieved remarkable success across many applications, with Mixture of Experts (MoE) models demonstrating great potential. Compared to traditional dense models, MoEs achieve better performance with less computation. Speculative decoding (SD) is a widely used technique to accelerate LLM inference without accuracy loss, but it has been considered efficient only for dense models. In this work, we first demonstrate that, under medium batch sizes, MoE surprisingly benefits more from SD than dense models. Furthermore, as MoE becomes sparser -- the prevailing trend in MoE design -- the batch size range in which SD acceleration is expected to be effective becomes broader. To quantitatively understand the tradeoffs involved in SD, we develop a reliable model based on theoretical analysis. While current SD research primarily focuses on improving algorithms' acceptance rates, changes in workload and model architecture can still degrade SD acceleration even when acceptance rates are high. To address this limitation, we introduce a new metric, 'target efficiency', which characterizes these effects, helping researchers identify system bottlenecks and understand SD acceleration more comprehensively. For scenarios like private serving, this work unveils a new perspective on speeding up MoE inference, where existing solutions struggle. Experiments on different GPUs show up to a 2.29x speedup for Qwen2-57B-A14B at medium batch sizes and validate our theoretical predictions.
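To make the speedup tradeoff concrete, the sketch below implements the standard speculative-decoding cost model (expected accepted tokens per verification step under a geometric acceptance assumption). It is a minimal illustration with assumed parameters (`alpha`, `gamma`, `c_draft`, `c_target`), not the paper's own modeling, which additionally accounts for batch size, MoE sparsity, and the proposed 'target efficiency' metric.

```python
# Minimal sketch of the standard speculative-decoding cost model, assuming a
# geometric acceptance model: a draft model proposes `gamma` tokens per step,
# each accepted independently with probability `alpha`. `c_draft` and `c_target`
# are illustrative per-call latencies, not measured numbers from the paper.

def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens committed per target verification step."""
    if alpha >= 1.0:
        return gamma + 1
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

def sd_speedup(alpha: float, gamma: int, c_draft: float, c_target: float) -> float:
    """Plain autoregressive cost divided by speculative-decoding cost for the
    same number of generated tokens, under this simplified model."""
    tokens = expected_tokens_per_step(alpha, gamma)
    step_cost = gamma * c_draft + c_target  # gamma draft calls + 1 verification call
    return tokens * c_target / step_cost

if __name__ == "__main__":
    # Example: acceptance rate 0.8, 4 draft tokens, draft 10x cheaper than target.
    print(round(sd_speedup(alpha=0.8, gamma=4, c_draft=0.1, c_target=1.0), 2))  # ~2.4
```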
Related papers
- Faster MoE LLM Inference for Extremely Large Models [75.57674991584608]
Fine-grained MoE models are gaining popularity, yet research on them remains limited. Reducing the number of activated experts can lead to substantial efficiency improvements in certain scenarios. Our method can increase throughput by at least 10% without any performance degradation.
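As a rough sketch of what "reducing the number of activated experts" looks like in code, assuming standard top-k softmax routing; the shapes, renormalization choice, and names below are illustrative, not this paper's method.

```python
import torch

def route_topk(router_logits: torch.Tensor, k: int):
    """Pick the top-k experts per token and renormalize their gate weights.
    Serving with a smaller k than the model was trained with reduces expert
    compute, at some potential cost in quality."""
    probs = torch.softmax(router_logits, dim=-1)            # [tokens, num_experts]
    weights, expert_ids = torch.topk(probs, k, dim=-1)      # [tokens, k]
    weights = weights / weights.sum(dim=-1, keepdim=True)   # renormalize over kept experts
    return weights, expert_ids

# Example: a router trained with k=8 served with k=6 (illustrative sizes).
logits = torch.randn(4, 64)                                 # 4 tokens, 64 experts
w_train, e_train = route_topk(logits, k=8)
w_fast, e_fast = route_topk(logits, k=6)
```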
arXiv Detail & Related papers (2025-05-06T13:41:17Z)
- fMoE: Fine-Grained Expert Offloading for Large Mixture-of-Experts Serving [9.956997242640728]
fMoE is a fine-grained expert offloading system for MoE serving. We show that fMoE reduces inference latency by 47% and improves expert hit rate by 36% over state-of-the-art solutions.
arXiv Detail & Related papers (2025-02-07T22:51:17Z)
- HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference [54.40808356999408]
We present HOBBIT, a mixed precision expert offloading system to enable flexible and efficient MoE inference.
Our key insight is that dynamically replacing less critical cache-miss experts with low precision versions can substantially reduce expert-loading latency.
HOBBIT achieves up to a 9.93x speedup in decoding compared to state-of-the-art MoE offloading systems.
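A minimal sketch of the mechanism this insight suggests, assuming an LRU cache of full-precision experts on the GPU and pre-quantized low-precision copies to fall back on during misses; the class and method names are illustrative and do not reflect HOBBIT's actual implementation.

```python
from collections import OrderedDict

class MixedPrecisionExpertCache:
    """On a cache miss, serve a low-precision copy of the expert immediately
    instead of stalling on the full-precision host-to-GPU transfer."""

    def __init__(self, capacity: int, fp_experts: dict, lowprec_experts: dict):
        self.capacity = capacity
        self.fp_experts = fp_experts            # host-side full-precision weights
        self.lowprec_experts = lowprec_experts  # GPU-resident quantized copies
        self.gpu_cache = OrderedDict()          # LRU cache of full-precision experts

    def get(self, expert_id):
        if expert_id in self.gpu_cache:
            self.gpu_cache.move_to_end(expert_id)   # hit: reuse full precision
            return self.gpu_cache[expert_id]
        self._prefetch(expert_id)                    # miss: schedule the real weights...
        return self.lowprec_experts[expert_id]       # ...but answer with the cheap copy now

    def _prefetch(self, expert_id):
        if len(self.gpu_cache) >= self.capacity:
            self.gpu_cache.popitem(last=False)       # evict least recently used expert
        # Stand-in for an asynchronous host-to-device copy.
        self.gpu_cache[expert_id] = self.fp_experts[expert_id]
```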
arXiv Detail & Related papers (2024-11-03T04:25:46Z)
- QuantMoE-Bench: Examining Post-Training Quantization for Mixture-of-Experts [47.01697456105496]
Mixture-of-Experts (MoE) is a promising way to scale up the learning capacity of large language models. MoE suffers from significant memory overheads due to the vast parameter size. Post-training quantization offers a powerful approach for model compression.
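For context, a minimal per-channel symmetric int8 weight-quantization sketch of the kind post-training quantization builds on; this is generic PTQ, not QuantMoE-Bench's benchmark protocol or any MoE-specific scheme.

```python
import torch

def quantize_int8(weight: torch.Tensor):
    """Symmetric per-output-channel int8 quantization of a 2D weight matrix.
    Returns the int8 tensor plus per-channel scales for dequantization."""
    scale = weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(weight / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

# Example: quantize one (illustrative) expert weight matrix and check the error.
w = torch.randn(1024, 4096)
q, s = quantize_int8(w)
max_err = (dequantize_int8(q, s) - w).abs().max()
```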
arXiv Detail & Related papers (2024-06-12T12:44:48Z)
- SEER-MoE: Sparse Expert Efficiency through Regularization for Mixture-of-Experts [49.01990048827639]
We introduce SEER-MoE, a framework for reducing both the memory footprint and compute requirements of pre-trained MoE models.
The first stage prunes the total number of experts using heavy-hitters counting guidance, while the second stage employs a regularization-based fine-tuning strategy to recover the accuracy loss.
Our empirical studies demonstrate the effectiveness of our method, resulting in a sparse MoE model optimized for inference efficiency with minimal accuracy trade-offs.
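A minimal sketch of the first (pruning) stage as described: count how often each expert is routed to on calibration data and keep only the heaviest hitters. The recovery fine-tuning stage is not shown, and the names and numbers are illustrative rather than SEER-MoE's actual procedure.

```python
import numpy as np

def heavy_hitter_experts(routing_counts: np.ndarray, keep_ratio: float) -> np.ndarray:
    """Return the indices of the most frequently selected experts.
    routing_counts[i] = how often expert i was chosen on calibration data."""
    num_keep = max(1, int(len(routing_counts) * keep_ratio))
    return np.argsort(routing_counts)[::-1][:num_keep]

# Example: keep 6 of 8 experts by usage (illustrative counts).
counts = np.array([120, 5, 300, 80, 0, 210, 40, 95])
kept = heavy_hitter_experts(counts, keep_ratio=0.75)  # -> array([2, 5, 0, 7, 3, 6])
```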
arXiv Detail & Related papers (2024-04-07T22:13:43Z)
- Enhancing Inference Efficiency of Large Language Models: Investigating Optimization Strategies and Architectural Innovations [0.0]
This thesis explores methods of model compression.
We empirically demonstrate that simply skipping the latter attention sublayers in Transformer LLMs is an effective means of model compression.
We observed a 21% speed increase in single-token generation for Llama 2 7B, while, surprisingly, also improving performance on several common benchmarks.
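A minimal sketch of the skipping idea, assuming a generic pre-norm transformer block rather than Llama 2's actual module layout; the depth, widths, and choice of which blocks to skip are illustrative.

```python
import torch.nn as nn

class Block(nn.Module):
    """Generic pre-norm transformer block; skip_attn bypasses the attention sublayer."""

    def __init__(self, d_model: int, n_heads: int, skip_attn: bool = False):
        super().__init__()
        self.skip_attn = skip_attn
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, x):
        if not self.skip_attn:
            h = self.norm1(x)
            x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))

# Example: skip attention in the last 8 of 32 blocks (illustrative depth/width).
blocks = nn.ModuleList(Block(512, 8, skip_attn=(i >= 24)) for i in range(32))
```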
arXiv Detail & Related papers (2024-04-02T19:53:54Z)
- Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models [90.14693869269519]
MoE LLMs can achieve higher performance with fewer activated parameters, but they remain hard to deploy due to their immense total parameter sizes.
This paper aims to enhance the deployment efficiency of MoE LLMs by introducing plug-and-play expert-level sparsification techniques.
arXiv Detail & Related papers (2024-02-22T18:56:07Z)
- A-SDM: Accelerating Stable Diffusion through Redundancy Removal and Performance Optimization [54.113083217869516]
In this work, we first explore the computationally redundant parts of the network.
We then prune the redundant blocks of the model while maintaining network performance.
Thirdly, we propose a global-regional interactive (GRI) attention to speed up the computationally intensive attention part.
arXiv Detail & Related papers (2023-12-24T15:37:47Z)
- EBJR: Energy-Based Joint Reasoning for Adaptive Inference [10.447353952054492]
State-of-the-art deep learning models have achieved significant performance levels on various benchmarks.
Lightweight architectures, on the other hand, achieve moderate accuracy at a much more favorable latency.
This paper presents a new method of jointly using large accurate models together with small fast ones.
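Joint use of a large accurate model and a small fast one is commonly realized as a confidence-gated cascade. The sketch below uses an assumed max-softmax gate for illustration; EBJR's actual energy-based routing criterion is not reproduced here.

```python
import torch

@torch.no_grad()
def cascade_predict(x, small_model, large_model, threshold: float = 0.9):
    """Run the small model on everything; re-run only low-confidence inputs
    through the large model. `threshold` is an illustrative hyperparameter."""
    logits = small_model(x)
    conf, pred = torch.softmax(logits, dim=-1).max(dim=-1)
    uncertain = conf < threshold
    if uncertain.any():
        pred[uncertain] = large_model(x[uncertain]).argmax(dim=-1)
    return pred
```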
arXiv Detail & Related papers (2021-10-20T02:33:31Z)
- MixKD: Towards Efficient Distillation of Large-scale Language Models [129.73786264834894]
We propose MixKD, a data-agnostic distillation framework, to endow the resulting model with stronger generalization ability.
We prove from a theoretical perspective that, under reasonable conditions, MixKD yields a smaller gap between the generalization error and the empirical error.
Experiments under a limited-data setting and ablation studies further demonstrate the advantages of the proposed approach.
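A minimal sketch of combining mixup with distillation at the embedding level (tokens being discrete); the way teacher predictions are interpolated, the temperature, and all names are assumptions for illustration and may differ from MixKD's exact objective.

```python
import torch
import torch.nn.functional as F

def mixup_distill_loss(student, teacher, emb_a, emb_b, lam: float, T: float = 2.0):
    """Train the student on a mixup of two input embeddings to match the same
    interpolation of the teacher's softened predictions."""
    mixed = lam * emb_a + (1 - lam) * emb_b
    with torch.no_grad():
        t_probs = (lam * F.softmax(teacher(emb_a) / T, dim=-1)
                   + (1 - lam) * F.softmax(teacher(emb_b) / T, dim=-1))
    s_logp = F.log_softmax(student(mixed) / T, dim=-1)
    return F.kl_div(s_logp, t_probs, reduction="batchmean") * (T * T)
```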
arXiv Detail & Related papers (2020-11-01T18:47:51Z)