Merging Experts into One: Improving Computational Efficiency of Mixture
of Experts
- URL: http://arxiv.org/abs/2310.09832v3
- Date: Tue, 21 Nov 2023 20:30:00 GMT
- Title: Merging Experts into One: Improving Computational Efficiency of Mixture
of Experts
- Authors: Shwai He, Run-Ze Fan, Liang Ding, Li Shen, Tianyi Zhou, Dacheng Tao
- Abstract summary: A sparse Mixture of Experts (MoE) can reduce the cost by activating a small subset of parameters.
Can we retain the advantages of adding more experts without substantially increasing the computational costs?
We propose a computation-efficient approach called Merging Experts into One (MEO), which reduces the computation cost to that of a single expert.
- Score: 71.44422347502409
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Scaling the size of language models usually leads to remarkable advancements
in NLP tasks, but it often comes at the price of growing computational cost.
Although a sparse Mixture of Experts (MoE) can reduce the cost by activating a
small subset of parameters (e.g., one expert) for each input, its computation
escalates significantly as the number of activated experts increases, limiting
its practical utility. Can we retain the advantages of adding more experts
without substantially increasing the computational costs? In this paper, we
first demonstrate the superiority of selecting multiple experts and then
propose a computation-efficient approach called \textbf{\texttt{Merging Experts
into One}} (MEO), which reduces the computation cost to that of a single
expert. Extensive experiments show that MEO significantly improves
computational efficiency, e.g., FLOPs drop from 72.0G (vanilla MoE) to 28.6G
(MEO). Moreover, we propose a token-level attention block that further enhances
the efficiency and performance of token-level MEO, e.g., 83.3\% (MEO) vs.
82.6\% (vanilla MoE) average score on the GLUE benchmark. Our code is released
at: \url{https://github.com/Shwai-He/MEO}.
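To make the idea concrete, below is a minimal PyTorch sketch (not the authors' implementation) contrasting a vanilla MoE layer, where every activated expert runs its own forward pass, with an MEO-style layer that first merges the selected experts' parameters and then runs a single forward pass. The two-layer FFN experts, the gate-weighted merging rule, and all names are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FFNExpert(nn.Module):
    """One feed-forward expert (illustrative two-layer FFN)."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_hidden)
        self.fc2 = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        return self.fc2(F.gelu(self.fc1(x)))

def vanilla_moe(x, experts, selected, gate_weights):
    """Vanilla MoE: every selected expert runs its own forward pass,
    so FLOPs grow roughly linearly with the number of activated experts."""
    return sum(w * experts[i](x) for i, w in zip(selected, gate_weights))

def meo_forward(x, experts, selected, gate_weights):
    """MEO-style computation: merge the selected experts' parameters once,
    then run a SINGLE forward pass, keeping the cost at one expert.
    Assumption: merging = gate-weighted average of weights and biases."""
    ref = experts[selected[0]]
    merged = FFNExpert(ref.fc1.in_features, ref.fc1.out_features)
    with torch.no_grad():
        for name, p in merged.named_parameters():
            p.copy_(sum(w * dict(experts[i].named_parameters())[name]
                        for i, w in zip(selected, gate_weights)))
    return merged(x)

# Toy usage: 8 experts, 2 activated for this input.
experts = nn.ModuleList([FFNExpert(512, 2048) for _ in range(8)])
x = torch.randn(4, 16, 512)                       # (batch, tokens, hidden)
y_moe = vanilla_moe(x, experts, [1, 5], [0.6, 0.4])
y_meo = meo_forward(x, experts, [1, 5], [0.6, 0.4])
```

Note that merging weights across the intermediate nonlinearity is an approximation rather than an exact reformulation of mixing expert outputs; the efficiency and accuracy figures quoted in the abstract refer to the paper's actual method, not this toy sketch.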
Related papers
- MoE-I$^2$: Compressing Mixture of Experts Models through Inter-Expert Pruning and Intra-Expert Low-Rank Decomposition [32.97035551579975]
We introduce a two-stage compression method tailored for MoE to reduce the model size and decrease the computational cost.
Experiments on Qwen1.5-MoE-A2.7B, DeepSeek-V2-Lite, and Mixtral-8$\times$7B demonstrate that our proposed methods can both reduce the model size and enhance inference efficiency.
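As a rough illustration of the intra-expert low-rank step, the sketch below factorizes one expert projection with a truncated SVD and replaces it with two smaller linear layers; the rank choice and all names are illustrative assumptions, and the inter-expert pruning stage is omitted.

```python
import torch
import torch.nn as nn

def low_rank_factorize(linear: nn.Linear, rank: int) -> nn.Sequential:
    """Replace one expert projection W (out x in) with two smaller layers
    B (out x r) @ A (r x in) obtained from a truncated SVD.
    Illustrative sketch; the full MoE-I^2 pipeline also prunes whole experts."""
    W = linear.weight.data                         # (out_features, in_features)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    A = torch.diag(S[:rank]) @ Vh[:rank]           # (rank, in_features)
    B = U[:, :rank]                                # (out_features, rank)

    down = nn.Linear(linear.in_features, rank, bias=False)
    up = nn.Linear(rank, linear.out_features, bias=linear.bias is not None)
    with torch.no_grad():
        down.weight.copy_(A)
        up.weight.copy_(B)
        if linear.bias is not None:
            up.bias.copy_(linear.bias)
    return nn.Sequential(down, up)

# Toy usage: compress a 1024 -> 4096 expert projection to rank 128.
expert_fc = nn.Linear(1024, 4096)
compressed = low_rank_factorize(expert_fc, rank=128)
```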
arXiv Detail & Related papers (2024-11-01T20:37:58Z)
- Mixture of Parrots: Experts improve memorization more than reasoning [72.445819694797]
We show that as we increase the number of experts, the memorization performance consistently increases while the reasoning capabilities saturate.
We find that increasing the number of experts helps solve knowledge-intensive tasks, but fails to yield the same benefits for reasoning tasks.
arXiv Detail & Related papers (2024-10-24T17:54:41Z)
- MoE++: Accelerating Mixture-of-Experts Methods with Zero-Computation Experts [63.67734699877724]
MoE++ is a general and heterogeneous MoE framework that integrates both Feed-Forward Network (FFN) and zero-computation experts.
MoE++ achieves better performance while delivering 1.1-2.1x expert forward throughput compared to a vanilla MoE model of the same size.
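The contrast with standard FFN experts can be sketched as follows; the specific zero/copy/constant variants and all names are illustrative assumptions about what "zero-computation experts" look like, not the paper's exact design.

```python
import torch
import torch.nn as nn

class ZeroExpert(nn.Module):
    """Zero-computation expert: discards the token's update (returns zeros)."""
    def forward(self, x):
        return torch.zeros_like(x)

class CopyExpert(nn.Module):
    """Zero-computation expert: passes the token through unchanged."""
    def forward(self, x):
        return x

class ConstantExpert(nn.Module):
    """Near-zero-computation expert: returns a single learned vector."""
    def __init__(self, d_model: int):
        super().__init__()
        self.const = nn.Parameter(torch.zeros(d_model))
    def forward(self, x):
        return self.const.expand_as(x)

class FFNExpert(nn.Module):
    """Ordinary FFN expert for comparison."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                                 nn.Linear(d_hidden, d_model))
    def forward(self, x):
        return self.net(x)

# A heterogeneous expert pool: tokens routed to the cheap experts cost
# (almost) no FLOPs, which is where the throughput gain comes from.
d_model, d_hidden = 512, 2048
experts = nn.ModuleList([FFNExpert(d_model, d_hidden) for _ in range(4)] +
                        [ZeroExpert(), CopyExpert(), ConstantExpert(d_model)])
```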
arXiv Detail & Related papers (2024-10-09T18:01:27Z)
- Mixture of A Million Experts [1.240096657086732]
This paper introduces PEER, a novel layer design that utilizes the product key technique for sparse retrieval from a vast pool of experts.
Experiments on language modeling tasks demonstrate that PEER layers outperform dense FFWs and coarse-grained MoEs in terms of performance-compute trade-off.
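The product key technique can be sketched roughly as follows: the query is split into two halves, each half is scored against a small set of sub-keys, and the top-k experts are recovered from the implied Cartesian product, so n^2 experts are addressed with only 2n key comparisons. Shapes and names are illustrative assumptions, and PEER-specific details (tiny experts, multi-head retrieval) are omitted.

```python
import torch

def product_key_topk(query: torch.Tensor, subkeys1: torch.Tensor,
                     subkeys2: torch.Tensor, k: int) -> torch.Tensor:
    """Retrieve top-k expert indices from a pool of n*n experts using
    product keys: score two query halves against n sub-keys each, then
    take the top-k over the k*k candidate combinations."""
    q1, q2 = query.chunk(2, dim=-1)            # split the query in half
    s1 = subkeys1 @ q1                         # (n,) scores for the first half
    s2 = subkeys2 @ q2                         # (n,) scores for the second half
    v1, i1 = s1.topk(k)                        # only k*k candidates need checking
    v2, i2 = s2.topk(k)
    cand = (v1[:, None] + v2[None, :]).flatten()
    best = cand.topk(k).indices
    n2 = subkeys2.size(0)
    return i1[best // k] * n2 + i2[best % k]   # flat expert ids in [0, n*n)

# Toy usage: 1,048,576 experts addressed with 2 x 1024 sub-keys.
d, n = 256, 1024
sub1 = torch.randn(n, d // 2)
sub2 = torch.randn(n, d // 2)
expert_ids = product_key_topk(torch.randn(d), sub1, sub2, k=16)
```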
arXiv Detail & Related papers (2024-07-04T20:59:20Z)
- Efficient Expert Pruning for Sparse Mixture-of-Experts Language Models: Enhancing Performance and Reducing Inference Costs [30.07344792770254]
We introduce a gradient-free evolutionary strategy named EEP (Efficient Expert Pruning) to enhance the pruning of experts in SMoE models.
EEP relies solely on model inference (i.e., no gradient computation) and achieves greater sparsity while maintaining or even improving performance on downstream tasks.
We demonstrate that pruning up to 75% of experts in Mixtral $8\times7$B-Instruct results in a substantial reduction in parameters with minimal performance loss.
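A heavily simplified, gradient-free view of this kind of expert pruning: score candidate expert subsets by inference-only evaluation and keep the best one. The random search below is a stand-in for EEP's evolutionary strategy, and the set_active_experts hook and data loader are hypothetical.

```python
import random
import torch

@torch.no_grad()
def evaluate(model, loader, active_experts):
    """Placeholder: run inference with only `active_experts` enabled and
    return a task metric (higher is better). No gradients are needed."""
    model.set_active_experts(active_experts)   # hypothetical hook on the MoE model
    correct = total = 0
    for batch in loader:
        preds = model(batch["input_ids"]).argmax(-1)
        correct += (preds == batch["labels"]).sum().item()
        total += batch["labels"].numel()
    return correct / total

def prune_experts(model, loader, num_experts=8, keep=2, trials=50, seed=0):
    """Gradient-free search over expert subsets (random search as a stand-in
    for EEP's evolutionary strategy)."""
    rng = random.Random(seed)
    best_subset, best_score = None, float("-inf")
    for _ in range(trials):
        subset = sorted(rng.sample(range(num_experts), keep))
        score = evaluate(model, loader, subset)
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset, best_score
```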
arXiv Detail & Related papers (2024-07-01T03:57:35Z)
- Merge, Then Compress: Demystify Efficient SMoE with Hints from Its Routing Policy [84.11508381847929]
Sparsely activated Mixture-of-Experts (SMoE) has shown promise to scale up the learning capacity of neural networks.
We propose M-SMoE, which leverages routing statistics to guide expert merging.
Our MC-SMoE (M-SMoE followed by compression of the merged experts) achieves up to an 80% memory reduction and a 20% FLOPs reduction, with virtually no loss in performance.
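A minimal sketch of routing-statistics-guided merging, assuming experts in a group are averaged with coefficients proportional to how often the router selected them; the grouping, the frequency-weighted average, and all names are illustrative simplifications of the actual method.

```python
import copy
import torch
import torch.nn as nn

def merge_expert_group(experts: nn.ModuleList, group: list,
                       routing_counts: torch.Tensor) -> nn.Module:
    """Merge a group of experts into one, weighting each expert by its
    routing frequency (how many tokens the router sent to it)."""
    freqs = routing_counts[group].float()
    weights = freqs / freqs.sum()              # frequency-weighted coefficients
    merged = copy.deepcopy(experts[group[0]])
    with torch.no_grad():
        for name, p in merged.named_parameters():
            p.copy_(sum(w * dict(experts[i].named_parameters())[name]
                        for w, i in zip(weights, group)))
    return merged

# Toy usage: merge experts {2, 5, 7}, which received 120, 30, and 10 tokens.
experts = nn.ModuleList([nn.Sequential(nn.Linear(64, 256), nn.ReLU(),
                                       nn.Linear(256, 64)) for _ in range(8)])
counts = torch.tensor([50, 80, 120, 10, 5, 30, 70, 10])
merged = merge_expert_group(experts, group=[2, 5, 7], routing_counts=counts)
```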
arXiv Detail & Related papers (2023-10-02T16:51:32Z)
- MoEC: Mixture of Expert Clusters [93.63738535295866]
Sparse Mixture of Experts (MoE) has received great interest due to its promising scaling capability with affordable computational overhead.
MoE converts dense layers into sparse experts, and utilizes a gated routing network to make experts conditionally activated.
However, as the number of experts grows, MoE with an outrageously large number of parameters suffers from overfitting and sparse data allocation.
arXiv Detail & Related papers (2022-07-19T06:09:55Z)
- Task-Specific Expert Pruning for Sparse Mixture-of-Experts [105.20605021416276]
The Mixture-of-Experts (MoE) model is powerful for large-scale pre-training, but MoE is hard to deploy in cloud or mobile environments.
We propose a general method to progressively drop the non-professional experts for the target downstream task.
arXiv Detail & Related papers (2022-06-01T07:09:01Z)