Related papers: SiDA-MoE: Sparsity-Inspired Data-Aware Serving for Efficient and Scalable Large Mixture-of-Experts Models

SiDA-MoE: Sparsity-Inspired Data-Aware Serving for Efficient and Scalable Large Mixture-of-Experts Models

URL: http://arxiv.org/abs/2310.18859v2
Date: Fri, 17 May 2024 23:33:23 GMT
Title: SiDA-MoE: Sparsity-Inspired Data-Aware Serving for Efficient and Scalable Large Mixture-of-Experts Models
Authors: Zhixu Du, Shiyu Li, Yuhao Wu, Xiangyu Jiang, Jingwei Sun, Qilin Zheng, Yongkai Wu, Ang Li, Hai "Helen" Li, Yiran Chen,
Abstract summary: Mixture-of-Experts (MoE) has emerged as a favorable architecture in the era of large models. Yet, the realization of such benefits often results in ineffective GPU memory utilization. We introduce SiDA-MoE, an efficient inference approach tailored for large MoE models.
Score: 20.16600129902895
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Mixture-of-Experts (MoE) has emerged as a favorable architecture in the era of large models due to its inherent advantage, i.e., enlarging model capacity without incurring notable computational overhead. Yet, the realization of such benefits often results in ineffective GPU memory utilization, as large portions of the model parameters remain dormant during inference. Moreover, the memory demands of large models consistently outpace the memory capacity of contemporary GPUs. Addressing this, we introduce SiDA-MoE ($\textbf{S}$parsity-$\textbf{i}$nspired $\textbf{D}$ata-$\textbf{A}$ware), an efficient inference approach tailored for large MoE models. SiDA-MoE judiciously exploits both the system's main memory, which is now abundant and readily scalable, and GPU memory by capitalizing on the inherent sparsity on expert activation in MoE models. By adopting a data-aware perspective, SiDA-MoE achieves enhanced model efficiency with a neglectable performance drop. Specifically, SiDA-MoE attains a remarkable speedup in MoE inference with up to $3.93\times$ throughput increasing, up to $72\%$ latency reduction, and up to $80\%$ GPU memory saving with down to $1\%$ performance drop. This work paves the way for scalable and efficient deployment of large MoE models, even with constrained resources. Code is available at: https://github.com/timlee0212/SiDA-MoE.

Related papers

SlimMoE: Structured Compression of Large MoE Models via Expert Slimming and Distillation [82.53411922988039]
We introduce SlimMoE, a multi-stage compression framework for transforming large MoE models into much smaller, efficient variants.<n>Using this framework, we compress Phi 3.5-MoE (41.9B total/6.6B activated parameters) to create Phi-mini-MoE (7.6B total/2.4B activated parameters) and Phi-tiny-MoE (3.8B total/1.1B activated parameters)<n>Our experiments demonstrate that these compressed models outperform others of similar size and remain competitive with larger models.
arXiv Detail & Related papers (2025-06-23T07:15:59Z)
Pangu Ultra MoE: How to Train Your Big MoE on Ascend NPUs [111.69640966866059]
Sparse large language models (LLMs) with Mixture of Experts (MoE) and close to a trillion parameters are dominating the realm of most capable language models.<n>In this paper, we aim to uncover a recipe to harness such scale on Ascend NPUs.<n>The key goals are better usage of the computing resources under the dynamic sparse model structures and materializing the expected performance gain on the actual hardware.
arXiv Detail & Related papers (2025-05-07T15:46:36Z)
70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float [71.43026659686679]
Large Language Models (LLMs) have grown rapidly in size, creating challenges for efficient deployment on resource-constrained hardware. We introduce Dynamic-Length Float (DFloat11), a compression framework that reduces LLM size by 30% while preserving outputs that are bit-for-bit identical to the original model.
arXiv Detail & Related papers (2025-04-15T22:38:38Z)
MoE-Gen: High-Throughput MoE Inference on a Single GPU with Module-Based Batching [2.543762777822215]
MoE-Gen is a high- throughput MoE inference system for singleGPU execution. We introduce module-based tokens, which accumulates in host memory and dynamically launches large batches on to maximize utilization. MoE-Gen achieves 8-31x higher throughput compared to state-of-the-art systems.
arXiv Detail & Related papers (2025-03-12T18:08:01Z)
fMoE: Fine-Grained Expert Offloading for Large Mixture-of-Experts Serving [9.956997242640728]
fMoE is a fine-grained expert offloading system for MoE serving. We show that fMoE reduces inference latency by 47% and improves expert hit rate by 36% over state-of-the-art solutions.
arXiv Detail & Related papers (2025-02-07T22:51:17Z)
APOLLO: SGD-like Memory, AdamW-level Performance [61.53444035835778]
Large language models (LLMs) are notoriously memory-intensive during training. Various memory-efficient Scals have been proposed to reduce memory usage. They face critical challenges: (i) costly SVD operations; (ii) significant performance trade-offs compared to AdamW; and (iii) still substantial memory overhead to maintain competitive performance.
arXiv Detail & Related papers (2024-12-06T18:55:34Z)
MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs [55.95879347182669]
MoE architecture is renowned for its ability to increase model capacity without a proportional increase in inference cost. MoE-Lightning introduces a novel CPU-GPU-I/O pipelining schedule, CGOPipe, with paged weights to achieve high resource utilization. MoE-Lightning can achieve up to 10.3x higher throughput than state-of-the-art offloading-enabled LLM inference systems for Mixtral 8x7B on a single T4 GPU (16GB)
arXiv Detail & Related papers (2024-11-18T01:06:12Z)
S3D: A Simple and Cost-Effective Self-Speculative Decoding Scheme for Low-Memory GPUs [7.816840847892339]
Speculative decoding (SD) has attracted a significant amount of research attention due to the substantial speedup it can achieve for LLM inference. We propose Skippy Simultaneous Speculative Decoding (or S3D), a cost-effective self-speculative SD method based on simultaneous multi-token decoding and mid-layer skipping. Our method has achieved one of the top performance-memory ratios while requiring minimal architecture changes and training data.
arXiv Detail & Related papers (2024-05-30T17:54:35Z)
DiG: Scalable and Efficient Diffusion Models with Gated Linear Attention [82.24166963631949]
We introduce Diffusion Gated Linear Attention Transformers (DiG), a simple, adoptable solution with minimal parameter overhead, following the Diffusion Transformers (DiT) design. In addition to better performance than DiT, DiG-S/2 exhibits $2.5times$ higher training speed than DiT-S/2 and saves $75.7%$ memory resolution $179times 1792$. With the same model size, DiG-XL/2 is $4.2times$ faster than the recent Mamba-based diffusion model at a $1024$ resolution, and is $1.8times$ faster than DiT with FlashAttention-2
arXiv Detail & Related papers (2024-05-28T17:59:33Z)
Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models [62.4691912312317]
Mixture-of-Experts (MoE) language models can reduce computational costs by 2-4$times$ compared to dense models without sacrificing performance. We propose a hybrid dense training and sparse inference framework for MoE models (DS-MoE) which achieves strong computation and parameter efficiency.
arXiv Detail & Related papers (2024-04-08T14:39:49Z)
FFSplit: Split Feed-Forward Network For Optimizing Accuracy-Efficiency Trade-off in Language Model Inference [57.119047493787185]
This paper shows how to reduce model size by 43.1% and bring $1.25sim1.56times$ wall clock time speedup on different hardware with negligible accuracy drop. In practice, our method can reduce model size by 43.1% and bring $1.25sim1.56times$ wall clock time speedup on different hardware with negligible accuracy drop.
arXiv Detail & Related papers (2024-01-08T17:29:16Z)
KOALA: Empirical Lessons Toward Memory-Efficient and Fast Diffusion Models for Text-to-Image Synthesis [52.42320594388199]
We present three key practices in building an efficient text-to-image model. Based on these findings, we build two types of efficient text-to-image models, called KOALA-Turbo &-Lightning. Unlike SDXL, our KOALA models can generate 1024px high-resolution images on consumer-grade GPUs with 8GB of VRAMs (3060Ti)
arXiv Detail & Related papers (2023-12-07T02:46:18Z)
Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference [23.207326766883405]
Mixture-of-Experts (MoE) is able to scale its model size without proportionally scaling up its computational requirements. Pre-gated MoE employs our novel pre-gating function which alleviates the dynamic nature of sparse expert activation. We demonstrate that Pre-gated MoE is able to improve performance, reduce GPU memory consumption, while also maintaining the same level of model quality.
arXiv Detail & Related papers (2023-08-23T11:25:37Z)
Towards MoE Deployment: Mitigating Inefficiencies in Mixture-of-Expert (MoE) Inference [7.743308058511418]
We provide a characterization of two MoE workloads, namely Language Modeling (LM) and Machine Translation (MT) We propose three optimization techniques to mitigate sources of inefficiencies, namely (1) Dynamic gating, (2) Expert Buffering, and (3) Expert load balancing.
arXiv Detail & Related papers (2023-03-10T19:30:15Z)
Efficient Large Scale Language Modeling with Mixtures of Experts [61.45159383372181]
Mixture of Experts layers (MoEs) enable efficient scaling of language models through conditional computation. This paper presents a detailed empirical study of how autoregressive MoE language models scale in comparison with dense models in a wide range of settings.
arXiv Detail & Related papers (2021-12-20T17:05:11Z)

This list is automatically generated from the titles and abstracts of the papers in this site.