Unchosen Experts Can Contribute Too: Unleashing MoE Models' Power by Self-Contrast
- URL: http://arxiv.org/abs/2405.14507v1
- Date: Thu, 23 May 2024 12:45:29 GMT
- Title: Unchosen Experts Can Contribute Too: Unleashing MoE Models' Power by Self-Contrast
- Authors: Chufan Shi, Cheng Yang, Xinyu Zhu, Jiahao Wang, Taiqiang Wu, Siheng Li, Deng Cai, Yujiu Yang, Yu Meng
- Abstract summary: Mixture-of-Experts (MoE) has emerged as a prominent architecture for scaling model size while maintaining computational efficiency.
We propose Self-Contrast Mixture-of-Experts (SCMoE), a training-free strategy that utilizes unchosen experts in a self-contrast manner during inference.
Our method is conceptually simple and computationally lightweight, as it incurs minimal latency compared to greedy decoding.
- Score: 58.98411447739218
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Mixture-of-Experts (MoE) has emerged as a prominent architecture for scaling model size while maintaining computational efficiency. In MoE, each token in the input sequence activates a different subset of experts determined by a routing mechanism. However, the unchosen experts in MoE models do not contribute to the output, potentially leading to underutilization of the model's capacity. In this work, we first conduct exploratory studies to demonstrate that increasing the number of activated experts does not necessarily improve and can even degrade the output quality. Then, we show that output distributions from an MoE model using different routing strategies substantially differ, indicating that different experts do not always act synergistically. Motivated by these findings, we propose Self-Contrast Mixture-of-Experts (SCMoE), a training-free strategy that utilizes unchosen experts in a self-contrast manner during inference. In SCMoE, the next-token probabilities are determined by contrasting the outputs from strong and weak activation using the same MoE model. Our method is conceptually simple and computationally lightweight, as it incurs minimal latency compared to greedy decoding. Experiments on several benchmarks (GSM8K, StrategyQA, MBPP and HumanEval) demonstrate that SCMoE can consistently enhance Mixtral 8x7B's reasoning capability across various domains. For example, it improves the accuracy on GSM8K from 61.79 to 66.94. Moreover, combining SCMoE with self-consistency yields additional gains, increasing major@20 accuracy from 75.59 to 78.31.
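To make the self-contrast idea concrete, below is a minimal decoding sketch in the spirit of SCMoE. The `model(input_ids, top_k=...)` routing override and the exact way the strong and weak distributions are combined (plausibility cutoff `alpha`, contrast strength `beta`) are assumptions for illustration, not the paper's verbatim formulation.
```python
# Minimal sketch of self-contrast decoding in the spirit of SCMoE.
# Assumed interface: an MoE LM whose forward pass accepts a `top_k`
# routing override; alpha/beta are illustrative hyperparameters.
import torch

@torch.no_grad()
def self_contrast_next_token(model, input_ids, strong_k=2, weak_k=1,
                             alpha=0.1, beta=0.5):
    # Same model, two routing strengths: strong (e.g. top-2) vs weak (e.g. top-1).
    strong_logits = model(input_ids, top_k=strong_k)[:, -1, :]
    weak_logits = model(input_ids, top_k=weak_k)[:, -1, :]

    log_p_strong = torch.log_softmax(strong_logits, dim=-1)
    log_p_weak = torch.log_softmax(weak_logits, dim=-1)

    # Plausibility constraint: keep only tokens reasonably likely under strong routing.
    cutoff = torch.log(torch.tensor(alpha)) + log_p_strong.max(dim=-1, keepdim=True).values
    plausible = log_p_strong >= cutoff

    # Contrast: reward tokens the strong activation prefers over the weak one.
    scores = log_p_strong - beta * log_p_weak
    scores = scores.masked_fill(~plausible, float("-inf"))
    return scores.argmax(dim=-1)  # greedy pick over contrasted scores
```
In this reading, the weak activation acts as an internal contrast baseline: tokens that only the stronger routing supports get boosted, without any additional training.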
Related papers
- Diversifying the Expert Knowledge for Task-Agnostic Pruning in Sparse Mixture-of-Experts [75.85448576746373]
We propose a method of grouping and pruning similar experts to improve the model's parameter efficiency; a rough sketch of the grouping idea follows this entry.
We validate our method by pruning two state-of-the-art MoE models, Mixtral-8x7B and Mixtral-8x22B.
Our method outperforms other model pruning methods on a range of natural language tasks.
arXiv Detail & Related papers (2024-07-12T17:25:02Z)
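As a rough illustration of grouping and pruning similar experts, the hypothetical sketch below clusters experts by cosine similarity of their flattened weights and keeps one representative per group; the paper's actual grouping criterion and merging rule may differ.
```python
# Hypothetical similarity-based expert grouping/pruning sketch.
import torch

def prune_similar_experts(expert_weights, sim_threshold=0.9):
    """expert_weights: list of same-shape weight tensors, one per expert.
    Returns indices of experts to keep (one representative per group)."""
    W = torch.stack([w.flatten() for w in expert_weights])
    W = torch.nn.functional.normalize(W, dim=-1)
    sim = W @ W.T                      # pairwise cosine similarity
    keep, removed = [], set()
    for i in range(len(expert_weights)):
        if i in removed:
            continue
        keep.append(i)                 # expert i represents its group
        for j in range(i + 1, len(expert_weights)):
            if sim[i, j] >= sim_threshold:
                removed.add(j)         # j is a near-duplicate of i; prune it
    return keep
```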
- A Closer Look into Mixture-of-Experts in Large Language Models [26.503570706063634]
Mixture-of-experts (MoE) is gaining increasing attention due to its unique properties and remarkable performance.
The MoE architecture can increase model size without sacrificing computational efficiency.
We make an initial attempt to understand the inner workings of MoE-based large language models.
arXiv Detail & Related papers (2024-06-26T10:07:57Z)
- Examining Post-Training Quantization for Mixture-of-Experts: A Benchmark [46.72960840801211]
The Mixture-of-Experts (MoE) approach offers a promising way to scale Large Language Models (LLMs).
MoE suffers from significant memory overheads, necessitating model compression techniques.
This paper explores several MoE structure-aware quantization schemes, ranging from coarse to fine granularity, from the whole MoE block down to individual linear weights; a toy illustration of the granularity axis follows this entry.
arXiv Detail & Related papers (2024-06-12T12:44:48Z)
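The coarse-to-fine granularity axis can be illustrated with a toy uniform quantizer: the only difference between granularities is how widely a single scale is shared. This is an illustration only, not one of the PTQ methods benchmarked in the paper.
```python
# Illustrative-only toy quantizer for showing MoE quantization granularity.
import torch

def quantize_symmetric(w, n_bits=4):
    # Symmetric uniform quantizer with one scale per call; calling it on a
    # whole MoE block vs. one expert vs. one linear weight changes the
    # granularity at which the scale is shared.
    qmax = 2 ** (n_bits - 1) - 1
    scale = w.abs().max() / qmax
    return (w / scale).round().clamp(-qmax, qmax) * scale

experts = [torch.randn(64, 64) for _ in range(8)]
coarse = quantize_symmetric(torch.stack(experts))   # one shared scale for all experts
fine = [quantize_symmetric(w) for w in experts]     # one scale per expert
```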
- Harder Tasks Need More Experts: Dynamic Routing in MoE Models [58.18526590138739]
We introduce a novel dynamic expert selection framework for Mixture of Experts (MoE) models.
Our method dynamically selects experts for each input based on the router's confidence in its expert selection; one plausible instantiation is sketched after this entry.
arXiv Detail & Related papers (2024-03-12T13:41:15Z)
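The sketch below activates the smallest expert set whose cumulative router probability clears a threshold, so harder inputs recruit more experts; the details here are assumptions, not the paper's exact mechanism.
```python
# Assumed confidence-based dynamic routing: expert count varies per input.
import torch

def dynamic_expert_selection(router_logits, p_threshold=0.6, max_experts=4):
    probs = torch.softmax(router_logits, dim=-1)        # (num_experts,)
    sorted_p, sorted_idx = probs.sort(descending=True)
    cum = sorted_p.cumsum(dim=-1)
    # Smallest number of experts whose cumulative confidence clears the threshold.
    k = int((cum < p_threshold).sum().item()) + 1
    k = min(k, max_experts)
    return sorted_idx[:k], sorted_p[:k] / sorted_p[:k].sum()  # renormalized gates
```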
- LocMoE: A Low-Overhead MoE for Large Language Model Training [13.153904674287546]
We propose a novel routing strategy that combines load balancing and locality by converting part of the inter-node communication into intra-node communication; the locality idea is sketched after this entry.
The proposed LocMoE reduces training time per epoch by 12.68% to 22.24% compared to classical routers.
arXiv Detail & Related papers (2024-01-25T03:36:39Z)
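A hedged sketch of the locality idea: biasing the router toward experts hosted on the current node converts some inter-node expert traffic into intra-node traffic. The actual LocMoE router and its load-balancing machinery are more involved than this.
```python
# Assumed locality-biased routing; not LocMoE's actual router.
import torch

def locality_biased_router(router_logits, local_experts, locality_bias=1.0):
    """router_logits: (num_experts,); local_experts: bool mask of experts on this node."""
    biased = router_logits + locality_bias * local_experts.float()
    return biased.argmax(dim=-1)  # top-1 routing over locality-biased scores

logits = torch.randn(8)
local = torch.tensor([True, True, True, True, False, False, False, False])
expert = locality_biased_router(logits, local)  # tends to pick a local expert (0-3)
```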
- Diversifying the Mixture-of-Experts Representation for Language Models with Orthogonal Optimizer [62.41501243027603]
The Mixture of Experts (MoE) has emerged as a highly successful technique in deep learning.
In this study, we shed light on the homogeneous representation problem, wherein experts in the MoE fail to specialize and lack diversity.
We propose a straightforward yet highly effective solution: OMoE, an orthogonal optimizer that encourages diverse expert representations; a toy penalty in this spirit is sketched after this entry.
arXiv Detail & Related papers (2023-10-15T07:20:28Z)
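As a guess at the flavor of objective behind orthogonal optimization (the paper's optimizer-level mechanism is not reproduced here), one can penalize overlap between expert representations:
```python
# Illustrative orthogonality penalty for expert representations.
import torch

def orthogonality_penalty(expert_outputs):
    """expert_outputs: (num_experts, d) -- one representation per expert.
    Penalizes pairwise overlap, pushing experts toward diverse subspaces."""
    H = torch.nn.functional.normalize(expert_outputs, dim=-1)
    gram = H @ H.T                                   # pairwise cosine overlaps
    off_diag = gram - torch.diag_embed(gram.diagonal())
    return (off_diag ** 2).sum()                     # zero iff experts are orthogonal

# Would be added to the task loss, e.g. loss = task_loss + lam * penalty.
```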
- Task-Specific Expert Pruning for Sparse Mixture-of-Experts [105.20605021416276]
The Mixture-of-Experts (MoE) model is powerful for large-scale pre-training.
However, MoE is hard to deploy in cloud or mobile environments.
We propose a general method to progressively drop the non-professional experts, i.e., those not specialized for the target downstream task.
arXiv Detail & Related papers (2022-06-01T07:09:01Z)
- Taming Sparsely Activated Transformer with Stochastic Experts [76.0711573018493]
Sparsely activated models (SAMs) can easily scale to have outrageously large amounts of parameters without a significant increase in computational cost.
In this paper, we propose a new expert-based model, THOR (Transformer witH StOchastic ExpeRts).
Unlike classic expert-based models, such as the Switch Transformer, experts in THOR are randomly activated for each input during training and inference; a sketch of this scheme follows the entry.
arXiv Detail & Related papers (2021-10-08T17:15:47Z)
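A sketch of THOR-style stochastic expert activation: two experts are drawn at random per batch and trained with the task loss plus a consistency regularizer that encourages them to agree. Hyperparameters and the exact loss composition here are assumptions, not the paper's.
```python
# Assumed THOR-style training step: random expert pair + consistency term.
import torch

def thor_step(experts, x, labels, task_loss_fn, alpha=0.5):
    # Randomly activate two experts for this batch (no learned router).
    i, j = torch.randperm(len(experts))[:2].tolist()
    logits_i, logits_j = experts[i](x), experts[j](x)

    task_loss = 0.5 * (task_loss_fn(logits_i, labels) + task_loss_fn(logits_j, labels))

    # Consistency regularizer: the two randomly chosen experts should agree.
    p_i = torch.log_softmax(logits_i, dim=-1)
    p_j = torch.log_softmax(logits_j, dim=-1)
    consistency = 0.5 * (
        torch.nn.functional.kl_div(p_i, p_j, log_target=True, reduction="batchmean")
        + torch.nn.functional.kl_div(p_j, p_i, log_target=True, reduction="batchmean")
    )
    return task_loss + alpha * consistency
```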