Related papers: Sparse Mixture of Experts as Unified Competitive Learning

URL: http://arxiv.org/abs/2503.22996v1
Date: Sat, 29 Mar 2025 07:15:12 GMT
Title: Sparse Mixture of Experts as Unified Competitive Learning
Authors: Giang Do, Hung Le, Truyen Tran
Abstract summary: Sparse Mixture of Experts (SMoE) improves the efficiency of large language model training by directing input tokens to a subset of experts. Current SMoEs struggle with tasks such as the Massive Text Embedding Benchmark (MTEB). We propose Unified Competitive Learning SMoE, a novel and efficient framework designed to improve the performance of existing SMoEs.
Score: 34.20340688374905
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Sparse Mixture of Experts (SMoE) improves the efficiency of large language model training by directing input tokens to a subset of experts. Despite its success in generation tasks, its generalization ability remains an open question. In this paper, we demonstrate that current SMoEs, which fall into two categories, (1) Token Choice and (2) Expert Choice, struggle with tasks such as the Massive Text Embedding Benchmark (MTEB). By analyzing their mechanism through the lens of competitive learning, our study finds that the Token Choice approach may overly focus on irrelevant experts, while the Expert Choice approach risks discarding important tokens, potentially affecting performance. Motivated by this analysis, we propose Unified Competitive Learning SMoE (USMoE), a novel and efficient framework designed to improve the performance of existing SMoEs in both scenarios: with and without training. Extensive experiments across various tasks show that USMoE achieves up to a 10% improvement over traditional approaches or reduces computational inference costs by 14% while maintaining strong performance.
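
The two routing categories named in the abstract can be illustrated with a small sketch. This is not the USMoE method and is not taken from the paper; it is a toy example, assuming a softmax router over made-up logits, and the function names `token_choice` and `expert_choice`, the `capacity` parameter, and the numbers are all hypothetical. It only shows how the two selection directions can differ on the same score matrix.

```python
import math

def softmax(row):
    # Numerically stable softmax over one token's router logits.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def token_choice(scores, k):
    # Token Choice: every token selects its k highest-scoring experts.
    assignment = []
    for row in scores:
        ranked = sorted(range(len(row)), key=lambda e: row[e], reverse=True)
        assignment.append(sorted(ranked[:k]))
    return assignment  # assignment[t] lists the experts serving token t

def expert_choice(scores, capacity):
    # Expert Choice: every expert selects its `capacity` highest-scoring tokens.
    n_tokens, n_experts = len(scores), len(scores[0])
    assignment = [[] for _ in range(n_tokens)]
    for e in range(n_experts):
        ranked = sorted(range(n_tokens), key=lambda t: scores[t][e], reverse=True)
        for t in ranked[:capacity]:
            assignment[t].append(e)
    return assignment  # tokens with an empty list were not picked by any expert

# Hypothetical router logits: 4 tokens (rows) x 3 experts (columns).
logits = [
    [2.0, 0.1, 0.0],
    [1.8, 0.2, 0.1],
    [1.5, 0.3, 0.2],
    [0.1, 0.2, 0.15],  # nearly uniform: no expert is a clear match
]
scores = [softmax(row) for row in logits]

tc = token_choice(scores, k=1)
ec = expert_choice(scores, capacity=1)
print(tc)  # [[0], [0], [0], [1]]
print(ec)  # [[0], [], [], [1, 2]]
```

In this toy setting, Token Choice still sends the last token to expert 1 even though its scores are almost flat, while Expert Choice leaves the second and third tokens unassigned. These are small-scale analogues of the two failure modes the abstract describes, not a reproduction of the paper's analysis.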

Related papers

Enhancing CTR Prediction with De-correlated Expert Networks [53.05653547330796]
We propose a De-Correlated MoE (D-MoE) framework, which introduces a Cross-Expert De-Correlation loss to minimize expert correlations. Extensive experiments have been conducted to validate the effectiveness of D-MoE and the de-correlation principle.
arXiv Detail & Related papers (2025-05-23T14:04:38Z)
CompeteSMoE -- Statistically Guaranteed Mixture of Experts Training via Competition [33.34992335920672]
We argue that effective SMoE training remains challenging because of the suboptimal routing process. In this work, we propose competition, a novel mechanism to route tokens to experts with the highest neural response. We develop CompeteSMoE, a simple yet effective algorithm to train large language models by deploying a router to learn the competition policy.
arXiv Detail & Related papers (2025-05-19T17:24:26Z)
Unveiling Hidden Collaboration within Mixture-of-Experts in Large Language Models [5.211806751260724]
We propose a hierarchical sparse dictionary learning (HSDL) method that uncovers the collaboration patterns among experts. We also introduce the Contribution-Aware Expert Pruning (CAEP) algorithm, which effectively prunes low-contribution experts.
arXiv Detail & Related papers (2025-04-16T04:06:15Z)
Finding Fantastic Experts in MoEs: A Unified Study for Expert Dropping Strategies and Observations [86.90549830760513]
Sparsely activated Mixture-of-Experts (SMoE) has shown promise in scaling up the learning capacity of neural networks. We propose MoE Experts Compression Suite (MC-Suite) to provide a benchmark for estimating expert importance from diverse perspectives. We present an experimentally validated conjecture that, during expert dropping, SMoEs' instruction-following capabilities are predominantly hurt.
arXiv Detail & Related papers (2025-04-08T00:49:08Z)
S2MoE: Robust Sparse Mixture of Experts via Stochastic Learning [34.20340688374905]
Sparse Mixture of Experts (SMoE) enables efficient training of large language models by routing input tokens to a select number of experts. Recent studies have focused on improving the router to mitigate this problem, but existing approaches face two key limitations. We propose a novel approach called Sparse Mixture of Experts via Robust Learning (S2MoE), which is a mixture of experts designed to learn from both deterministic and nondeterministic inputs.
arXiv Detail & Related papers (2025-03-29T08:14:27Z)
On the effectiveness of discrete representations in sparse mixture of experts [33.809432499123275]
We propose a new architecture dubbed Vector-Quantized Mixture of Experts (VQMoE). VQMoE is an effective solution for scaling up model capacity without increasing the computational costs. We show that VQMoE achieves a 28% improvement in routers compared to other SMoE routing methods.
arXiv Detail & Related papers (2024-11-28T22:32:01Z)
Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models [57.582219834039506]
We introduce the training methodologies implemented in the development of Skywork-MoE, a high-performance mixture-of-experts (MoE) large language model (LLM) with 146 billion parameters and 16 experts. It is based on the pre-existing dense checkpoints of our Skywork-13B model.
arXiv Detail & Related papers (2024-06-03T03:58:41Z)
SEER-MoE: Sparse Expert Efficiency through Regularization for Mixture-of-Experts [49.01990048827639]
We introduce SEER-MoE, a framework for reducing both the memory footprint and compute requirements of pre-trained MoE models. The first stage involves pruning the total number of experts using a heavy-hitters counting guidance, while the second stage employs a regularization-based fine-tuning strategy to recover accuracy loss. Our empirical studies demonstrate the effectiveness of our method, resulting in a sparse MoEs model optimized for inference efficiency with minimal accuracy trade-offs.
arXiv Detail & Related papers (2024-04-07T22:13:43Z)
CompeteSMoE -- Effective Training of Sparse Mixture of Experts via Competition [52.2034494666179]
Sparse mixture of experts (SMoE) offers an appealing solution to scale up the model complexity beyond the means of increasing the network's depth or width. We propose a competition mechanism to address this fundamental challenge of representation collapse. By routing inputs only to experts with the highest neural response, we show that, under mild assumptions, competition enjoys the same convergence rate as the optimal estimator.
arXiv Detail & Related papers (2024-02-04T15:17:09Z)
Diversifying the Mixture-of-Experts Representation for Language Models with Orthogonal Optimizer [59.43462055143123]
The Mixture of Experts (MoE) has emerged as a highly successful technique in deep learning. In this study, we shed light on the homogeneous representation problem, wherein experts in the MoE fail to specialize and lack diversity. We propose an alternating training strategy that encourages each expert to update in a direction orthogonal to the subspace spanned by other experts.
arXiv Detail & Related papers (2023-10-15T07:20:28Z)

This list is automatically generated from the titles and abstracts of the papers on this site.