Towards MoE Deployment: Mitigating Inefficiencies in Mixture-of-Expert
(MoE) Inference
- URL: http://arxiv.org/abs/2303.06182v2
- Date: Sun, 18 Jun 2023 01:33:19 GMT
- Title: Towards MoE Deployment: Mitigating Inefficiencies in Mixture-of-Expert
(MoE) Inference
- Authors: Haiyang Huang, Newsha Ardalani, Anna Sun, Liu Ke, Hsien-Hsin S. Lee,
Anjali Sridhar, Shruti Bhosale, Carole-Jean Wu, Benjamin Lee
- Abstract summary: We provide a characterization of two MoE workloads, namely Language Modeling (LM) and Machine Translation (MT)
We propose three optimization techniques to mitigate sources of inefficiencies, namely (1) Dynamic gating, (2) Expert Buffering, and (3) Expert load balancing.
- Score: 7.743308058511418
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Mixture-of-Experts (MoE) models have gained popularity in achieving
state-of-the-art performance in a wide range of tasks in computer vision and
natural language processing. They effectively expand the model capacity while
incurring a minimal increase in computation cost during training. However,
deploying such models for inference is difficult due to their large size and
complex communication pattern. In this work, we provide a characterization of
two MoE workloads, namely Language Modeling (LM) and Machine Translation (MT)
and identify their sources of inefficiencies at deployment. We propose three
optimization techniques to mitigate sources of inefficiencies, namely (1)
Dynamic gating, (2) Expert Buffering, and (3) Expert load balancing. We show
that dynamic gating improves maximum throughput by 6.21-11.23$\times$ for LM,
5.75-10.98$\times$ for MT Encoder and 2.58-5.71$\times$ for MT Decoder. It also
reduces memory usage by up to 1.36$\times$ for LM and up to 1.1$\times$ for MT.
We further propose Expert Buffering, a new caching mechanism that only keeps
hot, active experts in GPU memory while buffering the rest in CPU memory. This
reduces static memory allocation by up to 1.47$\times$. We finally propose a
load balancing methodology that provides additional scalability to the
workload.
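As an illustration of the Expert Buffering idea (hot experts resident on the GPU, cold experts paged in from CPU memory), a minimal sketch could look like the following. The LRU policy, class name, and sizes are assumptions for illustration, not the paper's implementation.

```python
from collections import OrderedDict

import torch
import torch.nn as nn


class ExpertBuffer:
    """Sketch of Expert Buffering: experts start in CPU memory, hot ones are
    moved into a fixed number of GPU slots, and the least recently used expert
    is moved back to CPU on eviction. Illustrative assumption only."""

    def __init__(self, experts, gpu_slots, device="cuda"):
        self.experts = [e.to("cpu") for e in experts]   # cold experts live in CPU memory
        self.gpu_slots = gpu_slots                      # bounded static GPU allocation
        self.device = device
        self.gpu_cache = OrderedDict()                  # expert_id -> GPU-resident module

    def get(self, idx):
        if idx in self.gpu_cache:                       # hit: expert already on the GPU
            self.gpu_cache.move_to_end(idx)
            return self.gpu_cache[idx]
        if len(self.gpu_cache) >= self.gpu_slots:       # miss with full cache: evict LRU
            old_idx, old_expert = self.gpu_cache.popitem(last=False)
            self.experts[old_idx] = old_expert.to("cpu")
        expert = self.experts[idx].to(self.device)      # page the requested expert in
        self.gpu_cache[idx] = expert
        return expert


# Example: 64 experts, only 8 GPU slots.
experts = [nn.Linear(1024, 1024) for _ in range(64)]
device = "cuda" if torch.cuda.is_available() else "cpu"
buf = ExpertBuffer(experts, gpu_slots=8, device=device)
out = buf.get(3)(torch.randn(2, 1024, device=device))
```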
Related papers
- HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference [54.40808356999408]
We present HOBBIT, a mixed precision expert offloading system to enable flexible and efficient MoE inference.
Our key insight is that dynamically replacing less critical cache-miss experts with low precision versions can substantially reduce expert-loading latency.
HOBBIT achieves up to a 9.93x speedup in decoding compared to state-of-the-art MoE offloading systems.
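A rough sketch of the mixed-precision fallback HOBBIT describes: on a cache miss for a less critical expert, the request is answered from a low-precision copy instead of loading the full-precision weights. The criticality threshold, names, and caching policy below are assumptions, not HOBBIT's implementation.

```python
import torch.nn as nn


def fetch_expert(idx, gate_score, gpu_cache, low_precision, cpu_experts,
                 criticality_threshold=0.3):
    """Sketch: serve less critical cache-miss experts from a resident
    low-precision copy to avoid the full-precision loading latency."""
    if idx in gpu_cache:                         # hit: full-precision expert is resident
        return gpu_cache[idx]
    if gate_score < criticality_threshold:       # miss, but the expert is not critical:
        return low_precision[idx]                # use the cheap (e.g. fp16/int4) copy
    expert = cpu_experts[idx].to("cuda")         # miss on a critical expert: pay the load cost
    gpu_cache[idx] = expert
    return expert
```

In practice the `low_precision` table would hold quantized copies of each expert that are cheap enough to keep resident; how they are produced is beyond this sketch.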
arXiv Detail & Related papers (2024-11-03T04:25:46Z)
- Harnessing Your DRAM and SSD for Sustainable and Accessible LLM Inference with Mixed-Precision and Multi-level Caching [35.83447642182576]
Large Language Models (LLMs) have demonstrated remarkable capabilities.
LLM deployment is now the main source of carbon emissions among today's AI applications.
This paper proposes a model modularization algorithm to enable LLM inference on outdated hardware.
arXiv Detail & Related papers (2024-10-17T08:33:39Z)
- MoNDE: Mixture of Near-Data Experts for Large-Scale Sparse Models [15.346491299728463]
MoNDE reduces the volume of MoE parameter movement by transferring only the $\textit{hot}$ experts to the GPU.
MoNDE enables far more communication-efficient MoE inference, thereby resulting in substantial speedups.
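One simple way to approximate the hot-expert selection such systems rely on is to count recent routing decisions and move only the busiest experts to the GPU. The window size and top-k cutoff below are illustrative assumptions, not MoNDE's actual policy.

```python
import torch


def select_hot_experts(expert_ids, num_experts, k):
    """Sketch: count how often the gate chose each expert over a recent window
    and keep the k busiest on the GPU; the rest stay in host/near-data memory."""
    counts = torch.bincount(expert_ids, minlength=num_experts)
    return torch.topk(counts, k).indices.tolist()


# Example: routing decisions for 10k recent tokens over 64 experts; keep the 8 hottest.
recent_ids = torch.randint(0, 64, (10_000,))
hot_experts = select_hot_experts(recent_ids, num_experts=64, k=8)
```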
arXiv Detail & Related papers (2024-05-29T07:23:29Z)
- PIM-Opt: Demystifying Distributed Optimization Algorithms on a Real-World Processing-In-Memory System [21.09681871279162]
Modern Machine Learning (ML) training on large-scale datasets is a time-consuming workload.
It relies on the Stochastic Gradient Descent (SGD) optimization algorithm due to its effectiveness, simplicity, and generalization performance.
However, processor-centric architectures suffer from low performance and high energy consumption while executing ML training workloads.
Processing-In-Memory (PIM) is a promising solution to alleviate the data movement bottleneck.
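For reference, the SGD step such systems execute is, in standard notation, $\theta_{t+1} = \theta_t - \eta \nabla_\theta \mathcal{L}(\theta_t; B_t)$, where $\eta$ is the learning rate and $B_t$ the mini-batch at step $t$; each step streams parameters and gradients between memory and the processor, which is the data movement PIM aims to keep near memory.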
arXiv Detail & Related papers (2024-04-10T17:00:04Z)
- Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models [62.4691912312317]
Mixture-of-Experts (MoE) language models can reduce computational costs by 2-4$\times$ compared to dense models without sacrificing performance.
We propose a hybrid dense training and sparse inference framework for MoE models (DS-MoE) which achieves strong computation and parameter efficiency.
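The dense-training / sparse-inference pattern can be sketched as a gate that mixes all experts in training mode but keeps only the top-k at inference. The module below illustrates that pattern under assumed layer sizes and gating details; it is not the DS-MoE implementation.

```python
import torch
import torch.nn as nn


class DenseTrainSparseInferMoE(nn.Module):
    """Sketch: every expert contributes (softmax-weighted) during training,
    only the top-k highest-scoring experts run at inference."""

    def __init__(self, d_model, num_experts, top_k=2):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(num_experts))
        self.top_k = top_k

    def forward(self, x):                                    # x: (tokens, d_model)
        scores = self.gate(x).softmax(dim=-1)                # (tokens, num_experts)
        if self.training:                                     # dense path: run every expert
            outs = torch.stack([e(x) for e in self.experts], dim=1)
            return (scores.unsqueeze(-1) * outs).sum(dim=1)
        topv, topi = scores.topk(self.top_k, dim=-1)          # sparse path: top-k experts only
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e_idx in topi[:, slot].unique().tolist():
                mask = topi[:, slot] == e_idx
                out[mask] += topv[mask, slot].unsqueeze(-1) * self.experts[e_idx](x[mask])
        return out


layer = DenseTrainSparseInferMoE(d_model=256, num_experts=8)
layer.eval()                                                  # take the sparse inference path
y = layer(torch.randn(4, 256))
```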
arXiv Detail & Related papers (2024-04-08T14:39:49Z)
- FFSplit: Split Feed-Forward Network For Optimizing Accuracy-Efficiency Trade-off in Language Model Inference [57.119047493787185]
This paper shows how to reduce model size by 43.1% and bring a $1.25\sim1.56\times$ wall-clock time speedup on different hardware with a negligible accuracy drop.
arXiv Detail & Related papers (2024-01-08T17:29:16Z)
- SiDA-MoE: Sparsity-Inspired Data-Aware Serving for Efficient and Scalable Large Mixture-of-Experts Models [20.16600129902895]
Mixture-of-Experts (MoE) has emerged as a favorable architecture in the era of large models.
Yet, the realization of such benefits often results in ineffective GPU memory utilization.
We introduce SiDA-MoE, an efficient inference approach tailored for large MoE models.
arXiv Detail & Related papers (2023-10-29T01:08:55Z)
- Task-Specific Expert Pruning for Sparse Mixture-of-Experts [105.20605021416276]
Mixture-of-Experts (MoE) models are powerful for large-scale pre-training.
However, MoE models are hard to deploy in cloud or mobile environments.
We propose a general method to progressively drop the non-professional experts for the target downstream task.
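A crude version of this progressive pruning is to rank experts by how often the downstream task's tokens are routed to them and drop the least-used ones in stages. The schedule and frequency criterion below are illustrative assumptions, not the paper's method.

```python
import torch


def progressive_prune(routing_counts, keep_schedule):
    """Sketch: rank experts by routing frequency on the downstream task and
    shrink the kept set stage by stage (e.g. 64 -> 32 -> 16 -> 8)."""
    kept_per_stage = []
    for keep in keep_schedule:
        kept = torch.topk(routing_counts, keep).indices.tolist()
        kept_per_stage.append(sorted(kept))
        # a real pipeline would fine-tune the pruned model before the next stage
    return kept_per_stage


# Example: 64 experts, pruned down to 32, then 16, then 8.
counts = torch.randint(0, 1000, (64,)).float()
stages = progressive_prune(counts, keep_schedule=[32, 16, 8])
```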
arXiv Detail & Related papers (2022-06-01T07:09:01Z)
- Scalable and Efficient MoE Training for Multitask Multilingual Models [55.987536562357086]
We develop a system capable of scaling MoE models efficiently to trillions of parameters.
We also present new training methods to improve MoE sample efficiency and leverage expert pruning strategy to improve time efficiency.
A model trained with 10 billion parameters on 50 languages can achieve state-of-the-art performance in Machine Translation (MT) and multilingual natural language generation tasks.
arXiv Detail & Related papers (2021-09-22T00:57:46Z)
- CPM-2: Large-scale Cost-effective Pre-trained Language Models [71.59893315671997]
We present a suite of cost-effective techniques for the use of PLMs to deal with the efficiency issues of pre-training, fine-tuning, and inference.
We introduce knowledge inheritance to accelerate the pre-training process by exploiting existing PLMs instead of training models from scratch.
We implement a new inference toolkit, namely InfMoE, for using large-scale PLMs with limited computational resources.
arXiv Detail & Related papers (2021-06-20T15:43:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.