DeepSeekMoE: Towards Ultimate Expert Specialization in
Mixture-of-Experts Language Models
- URL: http://arxiv.org/abs/2401.06066v1
- Date: Thu, 11 Jan 2024 17:31:42 GMT
- Title: DeepSeekMoE: Towards Ultimate Expert Specialization in
Mixture-of-Experts Language Models
- Authors: Damai Dai, Chengqi Deng, Chenggang Zhao, R.X. Xu, Huazuo Gao, Deli
Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Y. Wu, Zhenda Xie, Y.K. Li,
Panpan Huang, Fuli Luo, Chong Ruan, Zhifang Sui, Wenfeng Liang
- Abstract summary: We propose the DeepSeekMoE architecture towards ultimate expert specialization.
It involves two principal strategies: (1) finely segmenting the experts into $mN$ ones and activating $mK$ of them, allowing for a more flexible combination of activated experts; and (2) isolating $K_s$ experts as shared ones to capture common knowledge and mitigate redundancy among routed experts.
We show that DeepSeekMoE 2B achieves comparable performance with GShard 2.9B, which has 1.5 times the expert parameters and computation.
- Score: 26.447210565680116
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the era of large language models, Mixture-of-Experts (MoE) is a promising
architecture for managing computational costs when scaling up model parameters.
However, conventional MoE architectures like GShard, which activate the top-$K$
out of $N$ experts, face challenges in ensuring expert specialization, i.e.
each expert acquires non-overlapping and focused knowledge. In response, we
propose the DeepSeekMoE architecture towards ultimate expert specialization. It
involves two principal strategies: (1) finely segmenting the experts into $mN$
ones and activating $mK$ of them, allowing for a more flexible combination of
activated experts; (2) isolating $K_s$ experts as shared ones, aiming at
capturing common knowledge and mitigating redundancy in routed experts.
Starting from a modest scale with 2B parameters, we demonstrate that
DeepSeekMoE 2B achieves comparable performance with GShard 2.9B, which has 1.5
times the expert parameters and computation. In addition, DeepSeekMoE 2B nearly
approaches the performance of its dense counterpart with the same number of
total parameters, which sets the upper bound of MoE models. Subsequently, we
scale up DeepSeekMoE to 16B parameters and show that it achieves comparable
performance with LLaMA2 7B, with only about 40% of computations. Further, our
preliminary efforts to scale up DeepSeekMoE to 145B parameters consistently
validate its substantial advantages over the GShard architecture, and show that its
performance is comparable with DeepSeek 67B while using only 28.5% (and potentially
as little as 18.2%) of computations.
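
To make the two strategies concrete, the following is a minimal, hedged PyTorch sketch of a DeepSeekMoE-style layer: every token is processed by the $K_s$ shared experts and, in addition, by the top-$mK$ of the $mN$ fine-grained routed experts selected by a softmax gate. The class and hyperparameter names (`DeepSeekMoELayer`, `n_routed`, `n_shared`, `top_k`, `d_expert`) are illustrative assumptions rather than the authors' implementation, and details such as the paper's load-balancing losses are omitted.

```python
# Hedged sketch of a DeepSeekMoE-style MoE layer (illustrative, not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeedForward(nn.Module):
    """One fine-grained expert: a small two-layer FFN (hidden size shrunk by the segmentation factor m)."""
    def __init__(self, d_model, d_expert):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_expert)
        self.w2 = nn.Linear(d_expert, d_model)

    def forward(self, x):
        return self.w2(F.gelu(self.w1(x)))

class DeepSeekMoELayer(nn.Module):
    def __init__(self, d_model=512, d_expert=128, n_routed=64, n_shared=2, top_k=6):
        super().__init__()
        # Fine-grained segmentation: many small routed experts (the paper's mN),
        # of which top_k (the paper's mK) are activated per token.
        self.routed = nn.ModuleList([FeedForward(d_model, d_expert) for _ in range(n_routed)])
        # Shared-expert isolation: K_s experts that every token always passes through.
        self.shared = nn.ModuleList([FeedForward(d_model, d_expert) for _ in range(n_shared)])
        self.gate = nn.Linear(d_model, n_routed, bias=False)
        self.top_k = top_k

    def forward(self, x):                                   # x: (n_tokens, d_model)
        scores = F.softmax(self.gate(x), dim=-1)            # token-to-expert affinities
        topv, topi = scores.topk(self.top_k, dim=-1)        # select mK routed experts per token
        out = sum(e(x) for e in self.shared)                # shared experts: always active
        for k in range(self.top_k):                         # routed experts: gated contributions
            idx, w = topi[:, k], topv[:, k:k + 1]
            for e_id in idx.unique().tolist():
                mask = idx == e_id
                out[mask] += w[mask] * self.routed[e_id](x[mask])
        return out

# Usage: y = DeepSeekMoELayer()(torch.randn(4, 512))
```

Activating many small experts plus a few always-on shared ones keeps per-token compute close to a conventional top-$K$ MoE while allowing far more combinations of activated experts, which is the flexibility the abstract refers to.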
Related papers
- MoE-I$^2$: Compressing Mixture of Experts Models through Inter-Expert Pruning and Intra-Expert Low-Rank Decomposition [32.97035551579975]
We introduce a two-stage compression method tailored for MoE to reduce the model size and decrease the computational cost.
Experiments on Qwen1.5-MoE-A2.7B, DeepSeek-V2-Lite, and Mixtral-8$\times$7B demonstrate that our proposed methods can both reduce the model size and enhance inference efficiency.
arXiv Detail & Related papers (2024-11-01T20:37:58Z)
- BAM! Just Like That: Simple and Efficient Parameter Upcycling for Mixture of Experts [41.83123857437985]
Training MoEs from scratch in a large-scale regime is prohibitively expensive.
We propose BAM (Branch-Attend-Mix), a simple yet effective method that addresses this shortcoming.
Our experiments on seed models ranging from 590 million to 2 billion parameters demonstrate that BAM surpasses baselines in both perplexity and downstream task performance.
arXiv Detail & Related papers (2024-08-15T17:19:12Z)
- Let the Expert Stick to His Last: Expert-Specialized Fine-Tuning for Sparse Architectural Large Language Models [24.915387910764082]
Expert-Specialized Fine-Tuning, or ESFT, tunes the experts most relevant to downstream tasks while freezing the other experts and modules (a hedged sketch of this selective-freezing idea appears after this list).
MoE models with finer-grained experts have an advantage in selecting the combination of experts most relevant to downstream tasks.
arXiv Detail & Related papers (2024-07-02T03:11:13Z)
- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model [118.06260386652778]
We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference.
DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE.
Compared with DeepSeek 67B, DeepSeek-V2 achieves significantly stronger performance, and meanwhile saves 42.5% of training costs.
arXiv Detail & Related papers (2024-05-07T15:56:43Z)
- Lory: Fully Differentiable Mixture-of-Experts for Autoregressive Language Model Pre-training [73.90260246781435]
We present Lory, the first approach that scales fully differentiable MoE architectures to autoregressive language model pre-training.
We show significant performance gains over parameter-matched dense models on both perplexity and a variety of downstream tasks.
Despite segment-level routing, Lory models achieve competitive performance compared to state-of-the-art MoE models with token-level routing.
arXiv Detail & Related papers (2024-05-06T03:06:33Z)
- Multilinear Mixture of Experts: Scalable Expert Specialization through Factorization [51.98792406392873]
Mixture of Experts (MoE) provides a powerful way to decompose dense layers into smaller, modular computations.
A major challenge lies in the computational cost of scaling the number of experts high enough to achieve fine-grained specialization.
We propose the Multilinear Mixture of Experts ($\mu$MoE) layer to address this, focusing on vision models.
arXiv Detail & Related papers (2024-02-19T21:20:22Z)
- Merging Experts into One: Improving Computational Efficiency of Mixture of Experts [71.44422347502409]
A sparse Mixture of Experts (MoE) can reduce computational cost by activating only a small subset of its parameters.
Can we retain the advantages of adding more experts without substantially increasing the computational costs?
We propose a computation-efficient approach called Merging Experts into One (MEO), which reduces the computation cost to that of a single expert (a hedged sketch of this merging idea appears after this list).
arXiv Detail & Related papers (2023-10-15T13:28:42Z)
- Merge, Then Compress: Demystify Efficient SMoE with Hints from Its Routing Policy [84.11508381847929]
Sparsely activated Mixture-of-Experts (SMoE) has shown promise to scale up the learning capacity of neural networks.
We propose M-SMoE, which leverages routing statistics to guide expert merging.
Our MC-SMoE achieves up to an 80% memory reduction and a 20% FLOPs reduction, with virtually no loss in performance.
arXiv Detail & Related papers (2023-10-02T16:51:32Z)
- Pushing Mixture of Experts to the Limit: Extremely Parameter Efficient MoE for Instruction Tuning [7.094820944028638]
We propose an extremely parameter-efficient MoE by combining MoE architecture with lightweight experts.
Our method generalizes to unseen tasks as it does not depend on any prior task knowledge.
Our research underscores the versatility of the mixture of experts architecture, showcasing its ability to deliver robust performance even when subjected to rigorous parameter constraints.
arXiv Detail & Related papers (2023-09-11T13:31:00Z)
- MoEC: Mixture of Expert Clusters [93.63738535295866]
Sparsely gated Mixture of Experts (MoE) has received great interest due to its promising scaling capability with affordable computational overhead.
MoE converts dense layers into sparse experts, and utilizes a gated routing network to make experts conditionally activated.
However, as the number of experts grows, an MoE with an outrageous number of parameters suffers from overfitting and sparse data allocation.
arXiv Detail & Related papers (2022-07-19T06:09:55Z)
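
The ESFT entry above describes tuning only the experts most relevant to a downstream task while freezing everything else. Below is a minimal, hedged sketch of that selective-freezing idea, not the ESFT authors' method or code; the function name, the `keep_ratio` parameter, and the use of routing statistics as the relevance signal are illustrative assumptions.

```python
# Hedged sketch of the selective-freezing idea behind expert-specialized fine-tuning
# (illustrative, not the ESFT authors' code). It assumes an MoE layer that exposes a
# `routed` ModuleList of experts, as in the DeepSeekMoELayer sketch above, and that
# `relevance` holds per-expert routing statistics gathered on downstream-task data.
import torch

def freeze_all_but_relevant_experts(moe_layer, relevance, keep_ratio=0.25):
    """Keep only the most task-relevant routed experts trainable; freeze everything else."""
    n_keep = max(1, int(keep_ratio * len(moe_layer.routed)))
    keep = torch.topk(relevance, n_keep).indices.tolist()   # indices of most-used experts
    for p in moe_layer.parameters():                         # freeze the whole layer first
        p.requires_grad_(False)
    for e_id in keep:                                         # then unfreeze the chosen experts
        for p in moe_layer.routed[e_id].parameters():
            p.requires_grad_(True)

# Usage: relevance could be each expert's mean gate weight over task tokens, e.g.
# freeze_all_but_relevant_experts(DeepSeekMoELayer(), torch.rand(64))
```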
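
The MEO entry above states that the computation cost is reduced to that of a single expert by merging experts into one. One hedged reading of that idea, sketched below under our own assumptions rather than the MEO paper's exact formulation, is to combine the experts' weight matrices with gate-derived weights, collapse them into a single FFN, and run one forward pass.

```python
# Hedged sketch of the "merge experts into one" idea (illustrative; not the MEO
# paper's exact method). Expert weights are combined with gate-derived weights so
# that the forward pass costs the same as a single expert.
import torch
import torch.nn.functional as F

def merged_expert_forward(x, gate_logits, w1_stack, w2_stack):
    """
    x:           (n_tokens, d_model)
    gate_logits: (n_tokens, n_experts) router scores for the batch
    w1_stack:    (n_experts, d_model, d_hidden) first-layer weights of all experts
    w2_stack:    (n_experts, d_hidden, d_model) second-layer weights of all experts
    """
    g = F.softmax(gate_logits.mean(dim=0), dim=-1)      # one merging weight per expert
    w1 = torch.einsum("e,eio->io", g, w1_stack)         # collapse experts into one FFN
    w2 = torch.einsum("e,eio->io", g, w2_stack)
    return F.gelu(x @ w1) @ w2                          # single-expert compute cost

# Usage (shapes only):
# x = torch.randn(8, 512); logits = torch.randn(8, 4)
# y = merged_expert_forward(x, logits, torch.randn(4, 512, 1024), torch.randn(4, 1024, 512))
```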
This list is automatically generated from the titles and abstracts of the papers on this site.