Related papers: AdaMoE: Token-Adaptive Routing with Null Experts for Mixture-of-Experts Language Models

URL: http://arxiv.org/abs/2406.13233v2
Date: Mon, 14 Oct 2024 03:20:02 GMT
Title: AdaMoE: Token-Adaptive Routing with Null Experts for Mixture-of-Experts Language Models
Authors: Zihao Zeng, Yibo Miao, Hongcheng Gao, Hao Zhang, Zhijie Deng
Abstract summary: We introduce AdaMoE to realize token-adaptive routing for MoE. AdaMoE does not force each token to occupy a fixed number of null experts. It can reduce average expert load (FLOPs) while achieving superior performance.
Score: 14.646419975663367
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Mixture of experts (MoE) has become the standard for constructing production-level large language models (LLMs) due to its promise to boost model capacity without causing significant overheads. Nevertheless, existing MoE methods usually enforce a constant top-k routing for all tokens, which is arguably restrictive because different tokens (e.g., "<EOS>" vs. "apple") may require different numbers of experts for feature abstraction. Lifting such a constraint can help make the most of limited resources and unleash the potential of the model for downstream tasks. In this sense, we introduce AdaMoE to realize token-adaptive routing for MoE, where different tokens are permitted to select varying numbers of experts. AdaMoE makes minimal modifications to the vanilla MoE with top-k routing -- it simply introduces a fixed number of null experts, which do not consume any FLOPs, to the expert set and increases the value of k. AdaMoE does not force each token to occupy a fixed number of null experts but ensures the average usage of the null experts with a load-balancing loss, leading to an adaptive number of null/true experts used by each token. AdaMoE exhibits a strong resemblance to MoEs with expert choice routing while allowing for trivial auto-regressive modeling. AdaMoE is easy to implement and can be effectively applied to pre-trained (MoE-)LLMs. Extensive studies show that AdaMoE can reduce average expert load (FLOPs) while achieving superior performance. For example, on the ARC-C dataset, applying our method to fine-tuning Mixtral-8x7B can reduce FLOPs by 14.5% while increasing accuracy by 1.69%.
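
Illustration (not from the paper): the routing idea in the abstract above can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the authors' implementation. The function name `route_with_null_experts`, the convention that null experts occupy the last columns of the router logits, and the choice to renormalize weights over the selected true experts are all assumptions made for illustration. The load-balancing loss that controls average null-expert usage is omitted.

```python
import numpy as np

def route_with_null_experts(logits, n_true, k):
    """Top-k routing over true + null experts (hypothetical helper).

    logits: (tokens, n_true + n_null) router scores; columns with index
            >= n_true are treated as null experts that cost no FLOPs.
    Returns (mask, weights), both of shape (tokens, n_true).
    """
    logits = np.asarray(logits, dtype=float)

    # Numerically stable softmax over ALL experts, true and null.
    shifted = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=1, keepdims=True)

    # Mark the k highest-scoring experts for each token.
    topk = np.argsort(-probs, axis=1)[:, :k]
    selected = np.zeros_like(probs, dtype=bool)
    np.put_along_axis(selected, topk, True, axis=1)

    # Only true experts do any work; slots taken by null experts are dropped.
    mask = selected[:, :n_true]
    weights = np.where(mask, probs[:, :n_true], 0.0)

    # Assumption: renormalize over the selected true experts. Tokens that
    # selected only null experts keep all-zero weights.
    denom = weights.sum(axis=1, keepdims=True)
    weights = np.divide(weights, denom, out=np.zeros_like(weights), where=denom > 0)
    return mask, weights

# Toy example: 4 true experts, 2 null experts, k = 2.
logits = np.array([
    [5.0, 4.0, 0.0, 0.0, 1.0, 1.0],  # picks two true experts
    [5.0, 0.0, 0.0, 0.0, 4.0, 3.0],  # picks one true and one null expert
    [0.0, 0.0, 0.0, 0.0, 4.0, 3.0],  # picks two null experts
])
mask, weights = route_with_null_experts(logits, n_true=4, k=2)
print(mask.sum(axis=1))  # number of true experts used by each token
```

In this toy input every token fills the same k = 2 slots, yet the number of true experts actually executed differs per token, which is the adaptive behavior the abstract describes.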

Related papers

ResMoE: Space-efficient Compression of Mixture of Experts LLMs via Residual Restoration [61.579842548990754]
The Mixture-of-Experts (MoE) Transformer, the backbone of multiple phenomenal language models, leverages sparsity by activating only a fraction of model parameters for each input token. We introduce ResMoE, an innovative MoE approximation framework that utilizes the Wasserstein barycenter to extract a common expert (barycenter expert) and approximate the residuals between this barycenter expert and the original ones.
arXiv Detail & Related papers (2025-03-10T03:15:54Z)
ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing [28.73697327316267]
Sparsely activated Mixture-of-Experts (MoE) models are widely adopted to scale up model capacity without increasing the computation budget. We propose ReMoE, a fully differentiable MoE architecture that offers a simple yet effective drop-in replacement for the conventional TopK+Softmax routing. ReMoE consistently outperforms vanilla TopK-routed MoE across various model sizes, expert counts, and levels of granularity.
arXiv Detail & Related papers (2024-12-19T10:21:20Z)
Ada-K Routing: Boosting the Efficiency of MoE-based LLMs [6.954735360168147]
We propose a novel Ada-K routing strategy that dynamically adjusts the number of activated experts for each token. Our strategy incorporates learnable and lightweight allocator modules that decide customized expert resource allocation tailored to the contextual needs for each token.
arXiv Detail & Related papers (2024-10-14T12:50:04Z)
MoE++: Accelerating Mixture-of-Experts Methods with Zero-Computation Experts [63.67734699877724]
MoE++ is a general and heterogeneous MoE framework that integrates both Feed-Forward Network (FFN) and zero-computation experts. MoE++ achieves better performance while delivering 1.1-2.1x expert forward throughput compared to a vanilla MoE model of the same size.
arXiv Detail & Related papers (2024-10-09T18:01:27Z)
BAM! Just Like That: Simple and Efficient Parameter Upcycling for Mixture of Experts [41.83123857437985]
Training MoEs from scratch in a large-scale regime is prohibitively expensive. We propose BAM (Branch-Attend-Mix), a simple yet effective method that addresses this shortcoming. Our experiments on seed models ranging from 590 million to 2 billion parameters demonstrate that BAM surpasses baselines in both perplexity and downstream task performance.
arXiv Detail & Related papers (2024-08-15T17:19:12Z)
GW-MoE: Resolving Uncertainty in MoE Router with Global Workspace Theory [49.536752342048075]
Mixture-of-Experts (MoE) has been demonstrated as an efficient method to scale up models. We propose a new fine-tuning method, GW-MoE, to address uncertainty in the MoE router.
arXiv Detail & Related papers (2024-06-18T08:03:51Z)
Unchosen Experts Can Contribute Too: Unleashing MoE Models' Power by Self-Contrast [58.98411447739218]
Mixture-of-Experts (MoE) has emerged as a prominent architecture for scaling model size while maintaining computational efficiency. We propose Self-Contrast Mixture-of-Experts (SCMoE), a training-free strategy that utilizes unchosen experts in a self-contrast manner during inference. Our method is conceptually simple and computationally lightweight, as it incurs minimal latency compared to greedy decoding.
arXiv Detail & Related papers (2024-05-23T12:45:29Z)
Pushing Mixture of Experts to the Limit: Extremely Parameter Efficient MoE for Instruction Tuning [7.094820944028638]
We propose an extremely parameter-efficient MoE by combining MoE architecture with lightweight experts. Our method generalizes to unseen tasks as it does not depend on any prior task knowledge. Our research underscores the versatility of the mixture of experts architecture, showcasing its ability to deliver robust performance even when subjected to rigorous parameter constraints.
arXiv Detail & Related papers (2023-09-11T13:31:00Z)
Patch-level Routing in Mixture-of-Experts is Provably Sample-efficient for Convolutional Neural Networks [74.68583356645276]
In deep learning, mixture-of-experts (MoE) activates one or a few experts (sub-networks) on a per-sample or per-token basis. We show for the first time that patch-level routing in MoE (pMoE) provably reduces the required number of training samples to achieve desirable generalization.
arXiv Detail & Related papers (2023-06-07T00:16:10Z)
AutoMoE: Heterogeneous Mixture-of-Experts with Adaptive Computation for Efficient Neural Machine Translation [104.0979785739202]
Mixture-of-Experts (MoE) models have obtained state-of-the-art performance in Neural Machine Translation (NMT) tasks. Existing MoE models mostly consider a homogeneous design where the same number of experts of the same size are placed uniformly throughout the network. We develop AutoMoE -- a framework for designing heterogeneous MoEs under computational constraints.
arXiv Detail & Related papers (2022-10-14T05:32:17Z)
Task-Specific Expert Pruning for Sparse Mixture-of-Experts [105.20605021416276]
The Mixture-of-Experts (MoE) model is powerful for large-scale pre-training. However, MoE is hard to deploy in cloud or mobile environments. We propose a general method to progressively drop the non-professional experts for the target downstream task.
arXiv Detail & Related papers (2022-06-01T07:09:01Z)

This list is automatically generated from the titles and abstracts of the papers on this site.