Merging Experts into One: Improving Computational Efficiency of Mixture
  of Experts
- URL: http://arxiv.org/abs/2310.09832v3
- Date: Tue, 21 Nov 2023 20:30:00 GMT
- Title: Merging Experts into One: Improving Computational Efficiency of Mixture
  of Experts
- Authors: Shwai He, Run-Ze Fan, Liang Ding, Li Shen, Tianyi Zhou, Dacheng Tao
- Abstract summary: A sparse Mixture of Experts (MoE) can reduce the cost by activating a small subset of parameters.
Can we retain the advantages of adding more experts without substantially increasing the computational costs?
We propose a computation-efficient approach called Merging Experts into One (MEO), which reduces the computation cost to that of a single expert.
- Score: 71.44422347502409
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract:   Scaling the size of language models usually leads to remarkable advancements
in NLP tasks. But it often comes with a price of growing computational cost.
Although a sparse Mixture of Experts (MoE) can reduce the cost by activating a
small subset of parameters (e.g., one expert) for each input, its computation
escalates significantly if increasing the number of activated experts, limiting
its practical utility. Can we retain the advantages of adding more experts
without substantially increasing the computational costs? In this paper, we
first demonstrate the superiority of selecting multiple experts and then
propose a computation-efficient approach called Merging Experts
into One (MEO), which reduces the computation cost to that of a single
expert. Extensive experiments show that MEO significantly improves
computational efficiency, e.g., FLOPS drops from 72.0G of vanilla MoE to 28.6G
(MEO). Moreover, we propose a token-level attention block that further enhances
the efficiency and performance of token-level MEO, e.g., 83.3% (MEO) vs.
82.6% (vanilla MoE) average score on the GLUE benchmark. Our code will be
released upon acceptance. Code will be released at:
https://github.com/Shwai-He/MEO.
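
The abstract's claim that several experts can be run at the cost of one can be illustrated with a small linear-algebra sketch. This is not the authors' implementation. It assumes each expert is a single linear layer and that one set of gate scores is shared by the whole input batch; the function names `moe_forward` and `merged_forward` and all shapes are hypothetical. Under those assumptions, mixing the expert outputs and mixing the expert parameters first give the same result, but the second route needs only one matrix multiplication over the input.

```python
import numpy as np

def moe_forward(x, weights, biases, gates):
    """Vanilla multi-expert pass: run every selected expert, then mix the outputs."""
    outputs = [x @ W + b for W, b in zip(weights, biases)]
    return sum(g * out for g, out in zip(gates, outputs))

def merged_forward(x, weights, biases, gates):
    """Merge-first pass: mix the expert parameters, then run a single expert."""
    W_merged = sum(g * W for g, W in zip(gates, weights))
    b_merged = sum(g * b for g, b in zip(gates, biases))
    return x @ W_merged + b_merged

# Hypothetical sizes: 5 inputs, 8 input features, 4 output features, 3 experts.
rng = np.random.default_rng(0)
d_in, d_out, k = 8, 4, 3
x = rng.normal(size=(5, d_in))
weights = [rng.normal(size=(d_in, d_out)) for _ in range(k)]
biases = [rng.normal(size=d_out) for _ in range(k)]
gates = np.array([0.5, 0.3, 0.2])  # assumed gate scores, shared by the batch

# Both routes agree up to floating-point error.
assert np.allclose(
    moe_forward(x, weights, biases, gates),
    merged_forward(x, weights, biases, gates),
)
```

The equivalence relies on linearity, so in this sketch it holds per linear layer; a nonlinearity between layers would have to be applied after the merged layer rather than inside each expert.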
 
      
Related papers
- eMoE: Task-aware Memory Efficient Mixture-of-Experts-Based (MoE) Model Inference [6.642099288463585]
 We propose eMoE, a memory efficient inference system for large language models (LLMs).
eMoE reduces memory usage by predicting and loading only the required experts based on recurrent patterns in expert routing.
It also enables processing prompts 40x longer, batches 4.5x larger, and achieves 1.5x higher throughput.
 arXiv  Detail & Related papers  (2025-03-10T01:11:52Z)
- Capacity-Aware Inference: Mitigating the Straggler Effect in Mixture of Experts [9.393481672669564]
 Mixture of Experts (MoE) is an effective architecture for scaling large language models by leveraging sparse expert activation.
MoE suffers from inference inefficiencies due to imbalanced token-to-expert assignment, where some experts are overloaded while others remain underutilized.
We propose Capacity-Aware Inference, including two key techniques: (1) Capacity-Aware Token Drop, which discards overloaded tokens to regulate the maximum latency of MoE, and (2) Capacity-Aware Token Reroute, which reallocates overflowed tokens to underutilized experts.
 arXiv  Detail & Related papers  (2025-03-07T01:11:39Z)
- MoE-I$^2$: Compressing Mixture of Experts Models through Inter-Expert Pruning and Intra-Expert Low-Rank Decomposition [32.97035551579975]
 We introduce a two-stage compression method tailored for MoE to reduce the model size and decrease the computational cost.
Experiments on Qwen1.5-MoE-A2.7B, DeepSeek-V2-Lite, and Mixtral-8$\times$7B demonstrate that our proposed methods can both reduce the model size and enhance inference efficiency.
 arXiv  Detail & Related papers  (2024-11-01T20:37:58Z)
- Mixture of Parrots: Experts improve memorization more than reasoning [72.445819694797]
 We show that as we increase the number of experts, the memorization performance consistently increases while the reasoning capabilities saturate.
We find that increasing the number of experts helps solve knowledge-intensive tasks, but fails to yield the same benefits for reasoning tasks.
 arXiv  Detail & Related papers  (2024-10-24T17:54:41Z)
- MoE++: Accelerating Mixture-of-Experts Methods with Zero-Computation Experts [63.67734699877724]
 MoE++ is a general and heterogeneous MoE framework that integrates both Feed-Forward Network (FFN) and zero-computation experts.
MoE++ achieves better performance while delivering 1.1-2.1x expert forward throughput compared to a vanilla MoE model of the same size.
 arXiv  Detail & Related papers  (2024-10-09T18:01:27Z)
- Mixture of A Million Experts [1.240096657086732]
 This paper introduces PEER, a novel layer design that utilizes the product key technique for sparse retrieval from a vast pool of experts.
 Experiments on language modeling tasks demonstrate that PEER layers outperform dense FFWs and coarse-grained MoEs in terms of performance-compute trade-off.
 arXiv  Detail & Related papers  (2024-07-04T20:59:20Z)
- Efficient Expert Pruning for Sparse Mixture-of-Experts Language Models: Enhancing Performance and Reducing Inference Costs [30.07344792770254]
 We introduce a gradient-free evolutionary strategy named EEP (Efficient Expert Pruning) to enhance the pruning of experts in SMoE models.
EEP relies solely on model inference (i.e., no gradient computation) and achieves greater sparsity while maintaining or even improving performance on downstream tasks.
We demonstrate that pruning up to 75% of experts in Mixtral $8\times7$B-Instruct results in a substantial reduction in parameters with minimal performance loss.
 arXiv  Detail & Related papers  (2024-07-01T03:57:35Z)
- Merge, Then Compress: Demystify Efficient SMoE with Hints from Its Routing Policy [84.11508381847929]
 Sparsely activated Mixture-of-Experts (SMoE) has shown promise to scale up the learning capacity of neural networks.
We propose M-SMoE, which leverages routing statistics to guide expert merging.
Our MC-SMoE achieves up to 80% memory and a 20% FLOPs reduction, with virtually no loss in performance.
 arXiv  Detail & Related papers  (2023-10-02T16:51:32Z)
- MoEC: Mixture of Expert Clusters [93.63738535295866]
 Sparse Mixture of Experts (MoE) has received great interest due to its promising scaling capability with affordable computational overhead.
MoE converts dense layers into sparse experts, and utilizes a gated routing network to make experts conditionally activated.
However, as the number of experts grows, MoE with outrageous parameters suffers from overfitting and sparse data allocation.
 arXiv  Detail & Related papers  (2022-07-19T06:09:55Z)
- Task-Specific Expert Pruning for Sparse Mixture-of-Experts [105.20605021416276]
 Mixture-of-Experts (MoE) model is powerful for large-scale pre-training.
MoE is hard to deploy in cloud or mobile environments.
We propose a general method to progressively drop the non-professional experts for the target downstream task.
 arXiv  Detail & Related papers  (2022-06-01T07:09:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
       
     