CompeteSMoE -- Statistically Guaranteed Mixture of Experts Training via Competition
- URL: http://arxiv.org/abs/2505.13380v1
- Date: Mon, 19 May 2025 17:24:26 GMT
- Title: CompeteSMoE -- Statistically Guaranteed Mixture of Experts Training via Competition
- Authors: Nam V. Nguyen, Huy Nguyen, Quang Pham, Van Nguyen, Savitha Ramasamy, Nhat Ho
- Abstract summary: We argue that effective SMoE training remains challenging because of the suboptimal routing process. In this work, we propose competition, a novel mechanism to route tokens to experts with the highest neural response. We develop CompeteSMoE, a simple yet effective algorithm to train large language models by deploying a router to learn the competition policy.
- Score: 33.34992335920672
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Sparse mixture of experts (SMoE) offers an appealing solution to scale up model complexity beyond the means of increasing the network's depth or width. However, we argue that effective SMoE training remains challenging because of the suboptimal routing process, in which the experts that perform the computation do not directly contribute to the routing decision. In this work, we propose competition, a novel mechanism to route tokens to experts with the highest neural response. Theoretically, we show that the competition mechanism enjoys better sample efficiency than the traditional softmax routing. Furthermore, we develop CompeteSMoE, a simple yet effective algorithm to train large language models by deploying a router to learn the competition policy, thus enjoying strong performance at low training overhead. Our extensive empirical evaluations on both visual instruction tuning and language pre-training tasks demonstrate the efficacy, robustness, and scalability of CompeteSMoE compared to state-of-the-art SMoE strategies. We have made the implementation available at: https://github.com/Fsoft-AIC/CompeteSMoE. This work is an improved version of the previous study at arXiv:2402.02526.
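As a rough illustration of the routing rule described above, the minimal PyTorch sketch below routes each token to the k experts with the largest response norm. The module and variable names are assumptions for illustration, not the authors' implementation (see the linked repository for that).

```python
# Minimal sketch of competition-based routing, assuming the "neural response"
# is measured by the l2 norm of each expert's output. Illustrative only; the
# official code is at https://github.com/Fsoft-AIC/CompeteSMoE.
import torch
import torch.nn as nn


class CompetitionRouting(nn.Module):
    """Route each token to the top-k experts with the strongest response."""

    def __init__(self, dim: int, num_experts: int, k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )
        # The paper also deploys a router that learns the competition policy;
        # its training is not shown in this sketch.
        self.router = nn.Linear(dim, num_experts)
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, dim). Every expert responds; tokens go to the k experts
        # with the largest response norm.
        responses = torch.stack([expert(x) for expert in self.experts], dim=1)  # (T, E, D)
        scores = responses.norm(dim=-1)                                         # (T, E)
        weights, idx = scores.topk(self.k, dim=-1)                              # (T, k)
        weights = torch.softmax(weights, dim=-1)
        chosen = torch.gather(
            responses, 1, idx.unsqueeze(-1).expand(-1, -1, x.size(-1))
        )                                                                       # (T, k, D)
        return (weights.unsqueeze(-1) * chosen).sum(dim=1)                      # (T, D)


out = CompetitionRouting(dim=64, num_experts=8)(torch.randn(16, 64))
```

The purpose of the learned router in the paper is to recover the competition policy without having to evaluate every expert on every token.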
Related papers
- Advancing MoE Efficiency: A Collaboration-Constrained Routing (C2R) Strategy for Better Expert Parallelism Design [36.35520569052556]
Mixture-of-Experts (MoE) has successfully scaled up models while maintaining nearly constant computing costs. We propose a novel collaboration-constrained routing (C2R) strategy to encourage more specialized expert groups. We achieve an average performance improvement of 0.51% and 0.33% on ten downstream NLP benchmarks.
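A minimal sketch of what a collaboration-constrained routing step could look like, assuming experts are statically partitioned into groups and each token's top-k choice is restricted to its best-scoring group; the grouping scheme and names are assumptions for illustration, not C2R's actual design.

```python
# Hypothetical group-restricted top-k routing: pick the best expert group per
# token, then the top-k experts inside that group. Illustrative of constraining
# which experts may be co-activated, not C2R's actual algorithm.
import torch


def group_constrained_topk(logits: torch.Tensor, num_groups: int, k: int = 2):
    """logits: (tokens, experts) router scores; experts split into equal groups."""
    tokens, num_experts = logits.shape
    group_logits = logits.view(tokens, num_groups, num_experts // num_groups)
    best_group = group_logits.sum(dim=-1).argmax(dim=-1)              # (tokens,)
    within = group_logits[torch.arange(tokens), best_group]           # (tokens, E/G)
    weights, local_idx = within.topk(k, dim=-1)
    expert_idx = best_group.unsqueeze(-1) * (num_experts // num_groups) + local_idx
    return torch.softmax(weights, dim=-1), expert_idx


weights, experts = group_constrained_topk(torch.randn(16, 8), num_groups=2, k=2)
```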
arXiv Detail & Related papers (2025-04-02T03:51:59Z) - Sparse Mixture of Experts as Unified Competitive Learning [34.20340688374905]
Sparse Mixture of Experts (SMoE) improves the efficiency of large language model training by directing input tokens to a subset of experts. Current SMoEs struggle with tasks such as the Massive Text Embedding Benchmark (MTEB). We propose Unified Competitive Learning SMoE, a novel and efficient framework designed to improve the performance of existing SMoEs.
arXiv Detail & Related papers (2025-03-29T07:15:12Z) - Expert Race: A Flexible Routing Strategy for Scaling Diffusion Transformer with Mixture of Experts [33.39800923804871]
We introduce Race-DiT, a novel MoE model for diffusion transformers with a flexible routing strategy, Expert Race. By allowing tokens and experts to compete together and select the top candidates, the model learns to dynamically assign experts to critical tokens.
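One way to read "tokens and experts compete together" is a global top-k over all token-expert scores instead of a per-token top-k, so that important tokens can claim more experts. The sketch below illustrates only that selection rule, under assumed names; the actual Race-DiT router differs in its details.

```python
# Illustrative global top-k ("race") selection over the flattened token-expert
# score matrix; names and normalization choices are assumptions, not Race-DiT's.
import torch


def global_topk_routing(scores: torch.Tensor, budget: int):
    """scores: (tokens, experts). Keep only the `budget` strongest token-expert pairs."""
    tokens, experts = scores.shape
    top_idx = scores.flatten().topk(budget).indices
    mask = torch.zeros(tokens * experts, dtype=torch.bool)
    mask[top_idx] = True
    mask = mask.view(tokens, experts)
    # Critical tokens may win several experts, while unimportant tokens win none.
    gates = torch.softmax(scores, dim=-1).masked_fill(~mask, 0.0)
    return gates, mask


gates, mask = global_topk_routing(torch.randn(16, 8), budget=32)
```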
arXiv Detail & Related papers (2025-03-20T11:45:08Z) - Accelerating MoE Model Inference with Expert Sharding [1.4733737463429546]
Mixture of experts (MoE) models achieve state-of-the-art results in language modeling but suffer from inefficient hardware utilization due to imbalanced token routing and communication overhead. We introduce MoEShard, an inference system that achieves perfect load balancing through tensor sharding of MoE experts.
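The load-balancing idea can be seen in a few lines: if every device holds a slice of every expert, per-device work no longer depends on how tokens are routed. The sketch below only checks the arithmetic of slicing a two-layer expert FFN under assumed shapes; it says nothing about MoEShard's actual system design.

```python
# Slicing a two-layer expert FFN so each shard computes a partial output;
# summing the partials recovers the full expert, up to float reordering.
# Shapes and names are assumptions for illustration, not MoEShard's code.
import torch


def shard_expert(w1: torch.Tensor, w2: torch.Tensor, num_shards: int):
    """w1: (d, h), w2: (h, d) -> per-shard column slices of w1 and row slices of w2."""
    return list(zip(w1.chunk(num_shards, dim=1), w2.chunk(num_shards, dim=0)))


d = 64
w1, w2 = torch.randn(d, 4 * d), torch.randn(4 * d, d)
x = torch.randn(16, d)
shards = shard_expert(w1, w2, num_shards=4)
partial_sum = sum(torch.relu(x @ a) @ b for a, b in shards)
full = torch.relu(x @ w1) @ w2
assert torch.allclose(partial_sum, full, atol=1e-3)  # equal up to float rounding
```

The split commutes with the activation because ReLU is elementwise, so each column block of the first layer can be activated independently of the others.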
arXiv Detail & Related papers (2025-03-11T14:15:01Z) - On the effectiveness of discrete representations in sparse mixture of experts [33.809432499123275]
We propose a new architecture dubbed Vector-Quantized Mixture of Experts (VQMoE). VQMoE is an effective solution for scaling up model capacity without increasing the computational costs. We show that VQMoE achieves a 28% improvement in routers compared to other SMoE routing methods.
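A one-function sketch of the discrete assignment step that vector-quantized routing suggests: each token is matched to its nearest codebook vector and sent to the expert tied to that code. The codebook handling and the VQ training losses of the real VQMoE are omitted; the names here are assumptions.

```python
# Nearest-codebook-entry routing, as a sketch of discrete (vector-quantized)
# expert assignment. Illustrative only; VQMoE's full architecture is richer.
import torch


def vq_route(x: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """x: (tokens, dim), codebook: (num_experts, dim) -> expert index per token."""
    return torch.cdist(x, codebook).argmin(dim=-1)


codebook = torch.randn(8, 64)   # one (learnable) code per expert, assumed here
assignments = vq_route(torch.randn(16, 64), codebook)
```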
arXiv Detail & Related papers (2024-11-28T22:32:01Z) - TeamLoRA: Boosting Low-Rank Adaptation with Expert Collaboration and Competition [61.91764883512776]
We introduce an innovative PEFT method, TeamLoRA, consisting of a collaboration and competition module for experts.
By doing so, TeamLoRA connects the experts as a "Team" with internal collaboration and competition, enabling a faster and more accurate PEFT paradigm for multi-task learning.
arXiv Detail & Related papers (2024-08-19T09:58:53Z) - A Provably Effective Method for Pruning Experts in Fine-tuned Sparse Mixture-of-Experts [49.394145046409044]
This paper provides the first provably efficient technique for pruning experts in fine-tuned MoE models.
We theoretically prove that prioritizing the pruning of the experts whose router weights change least in l2 norm from the pretrained model guarantees the preservation of test accuracy.
Although our theoretical analysis is centered on binary classification tasks with a simplified MoE architecture, our expert pruning method is verified on large vision MoE models.
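The stated pruning criterion translates into a few lines: rank experts by how little their router weights moved during fine-tuning and drop the ones that moved least. Variable names below are illustrative, not taken from the paper's code.

```python
# Rank experts by the l2 change of their router row from pretrained to
# fine-tuned weights, and prune the experts with the smallest change.
# A sketch of the stated criterion, not the paper's implementation.
import torch


def experts_to_prune(router_pre: torch.Tensor, router_ft: torch.Tensor, n_prune: int):
    """router_*: (num_experts, dim) rows of the routing layer."""
    change = (router_ft - router_pre).norm(dim=-1)   # (num_experts,)
    return change.argsort()[:n_prune]                # smallest change first


pre, ft = torch.randn(8, 64), torch.randn(8, 64)
print(experts_to_prune(pre, ft, n_prune=2))
```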
arXiv Detail & Related papers (2024-05-26T17:52:58Z) - Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM [81.18305296110853]
We investigate efficient methods for training Large Language Models (LLMs) to possess capabilities in multiple specialized domains.
Our method, named Branch-Train-MiX (BTX), starts from a seed model, which is branched to train experts in embarrassingly parallel fashion.
BTX generalizes two special cases, the Branch-Train-Merge method, which does not have the MoE finetuning stage to learn routing, and sparse upcycling, which omits the stage of training experts asynchronously.
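Schematically, the "MiX" step collects the feed-forward blocks of the independently trained domain experts as the experts of one MoE layer and attaches a fresh router to be learned during the final fine-tuning stage. The sketch below only shows that assembly, with assumed module names; it is not Branch-Train-MiX's code.

```python
# Assemble independently trained domain FFNs into one MoE layer with a new
# router; the subsequent MoE fine-tuning that learns the routing is not shown.
import torch.nn as nn


def mix_experts(domain_ffns, dim):
    """domain_ffns: FFN modules branched from a seed model and trained separately."""
    return nn.ModuleDict({
        "experts": nn.ModuleList(domain_ffns),       # the branched-and-trained experts
        "router": nn.Linear(dim, len(domain_ffns)),  # fresh router, learned after mixing
    })


dim = 64
ffns = [nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        for _ in range(3)]  # stand-ins for, e.g., math / code / general-domain experts
moe_layer = mix_experts(ffns, dim)
```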
arXiv Detail & Related papers (2024-03-12T16:54:58Z) - CompeteSMoE -- Effective Training of Sparse Mixture of Experts via Competition [52.2034494666179]
Sparse mixture of experts (SMoE) offers an appealing solution to scale up model complexity beyond the means of increasing the network's depth or width.
We propose a competition mechanism to address this fundamental challenge of representation collapse.
By routing inputs only to experts with the highest neural response, we show that, under mild assumptions, competition enjoys the same convergence rate as the optimal estimator.
arXiv Detail & Related papers (2024-02-04T15:17:09Z) - Checkmating One, by Using Many: Combining Mixture of Experts with MCTS to Improve in Chess [20.043363738256176]
This paper presents a new approach that integrates deep learning with computational chess, using both the Mixture of Experts (MoE) method and Monte-Carlo Tree Search (MCTS).
Our framework combines the MoE method with MCTS in order to align it with the strategic phases of chess, thus departing from the conventional "one-for-all" model.
Our empirical research shows a substantial improvement in playing strength, surpassing the traditional single-model framework.
arXiv Detail & Related papers (2024-01-30T09:55:14Z) - Benchmarking Robustness and Generalization in Multi-Agent Systems: A Case Study on Neural MMO [50.58083807719749]
We present the results of the second Neural MMO challenge, hosted at IJCAI 2022, which received 1600+ submissions.
This competition targets robustness and generalization in multi-agent systems.
We will open-source our benchmark including the environment wrapper, baselines, a visualization tool, and selected policies for further research.
arXiv Detail & Related papers (2023-08-30T07:16:11Z) - Continual Competitive Memory: A Neural System for Online Task-Free Lifelong Learning [91.3755431537592]
We propose a novel form of unsupervised learning, continual competitive memory (CCM).
The resulting neural system is shown to offer an effective approach for combating catastrophic forgetting in online continual classification problems.
We demonstrate that the proposed CCM system not only outperforms other competitive learning neural models but also yields performance that is competitive with several modern, state-of-the-art lifelong learning approaches.
arXiv Detail & Related papers (2021-06-24T20:12:17Z)