CompeteSMoE -- Effective Training of Sparse Mixture of Experts via
Competition
- URL: http://arxiv.org/abs/2402.02526v1
- Date: Sun, 4 Feb 2024 15:17:09 GMT
- Title: CompeteSMoE -- Effective Training of Sparse Mixture of Experts via
Competition
- Authors: Quang Pham, Giang Do, Huy Nguyen, TrungTin Nguyen, Chenghao Liu, Mina
Sartipi, Binh T. Nguyen, Savitha Ramasamy, Xiaoli Li, Steven Hoi, Nhat Ho
- Abstract summary: Sparse mixture of experts (SMoE) offers an appealing solution to scale up the model complexity beyond the means of increasing the network's depth or width.
We propose a competition mechanism to address this fundamental challenge of representation collapse.
By routing inputs only to experts with the highest neural response, we show that, under mild assumptions, competition enjoys the same convergence rate as the optimal estimator.
- Score: 52.2034494666179
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sparse mixture of experts (SMoE) offers an appealing solution to scale up the
model complexity beyond the means of increasing the network's depth or width.
However, effective training of SMoE has proven to be challenging due to the
representation collapse issue, which causes parameter redundancy and limited
representation potentials. In this work, we propose a competition mechanism to
address this fundamental challenge of representation collapse. By routing
inputs only to experts with the highest neural response, we show that, under
mild assumptions, competition enjoys the same convergence rate as the optimal
estimator. We further propose CompeteSMoE, an effective and efficient algorithm
to train large language models by deploying a simple router that predicts the
competition outcomes. Consequently, CompeteSMoE enjoys strong performance gains
from the competition routing policy while having low computation overheads. Our
extensive empirical evaluations on two transformer architectures and a wide
range of tasks demonstrate the efficacy, robustness, and scalability of
CompeteSMoE compared to state-of-the-art SMoE strategies.
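The competition mechanism described in the abstract can be illustrated with a minimal PyTorch sketch. It assumes, for illustration only, that the "neural response" is scored by the norm of each expert's output and that the lightweight router is trained to imitate the competition outcome with a mean-squared-error term; the class and variable names are hypothetical and this is not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CompetitionMoE(nn.Module):
    # Sketch of competition-based routing: every expert responds to the input,
    # the strongest responses win, and a cheap linear router is trained to
    # predict those winners so the full competition can be skipped at inference.
    def __init__(self, d_model, d_hidden, num_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])
        self.router = nn.Linear(d_model, num_experts)
        self.top_k = top_k

    def forward(self, x):
        # x: (tokens, d_model). All experts are evaluated here for clarity;
        # a real SMoE layer dispatches each token only to its winning experts.
        logits = self.router(x)                                     # (tokens, E)
        outputs = torch.stack([e(x) for e in self.experts], dim=1)  # (tokens, E, d)
        # Assumed response score: L2 norm of each expert's output.
        response = outputs.norm(dim=-1)                             # (tokens, E)
        # Auxiliary loss: the router learns to predict the competition outcome.
        router_loss = F.mse_loss(logits, response.detach())
        weights, idx = response.topk(self.top_k, dim=-1)            # winners
        weights = weights.softmax(dim=-1)
        chosen = outputs.gather(1, idx.unsqueeze(-1).expand(-1, -1, x.size(-1)))
        return (chosen * weights.unsqueeze(-1)).sum(dim=1), router_loss

At inference, the learned router logits would replace the response scores in the top-k selection, so only the selected experts need to be evaluated and the overhead of the full competition is avoided.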
Related papers
- SimSMoE: Solving Representational Collapse via Similarity Measure [34.20340688374905]
Sparse mixture of experts (SMoE) has emerged as an effective approach for scaling large language models while keeping a constant computational cost.
We present Similarity-based Sparse Mixture of Experts (SimSMoE), a novel similarity-based algorithm that guarantees a solution to the representation collapse issue.
arXiv Detail & Related papers (2024-06-22T16:10:45Z)
- SEER-MoE: Sparse Expert Efficiency through Regularization for Mixture-of-Experts [49.01990048827639]
We introduce SEER-MoE, a framework for reducing both the memory footprint and compute requirements of pre-trained MoE models.
The first stage prunes the total number of experts using heavy-hitters counting as guidance, while the second stage employs a regularization-based fine-tuning strategy to recover the accuracy lost to pruning (a count-based pruning sketch appears after this list).
Our empirical studies demonstrate the effectiveness of our method, resulting in a sparse MoE model optimized for inference efficiency with minimal accuracy trade-offs.
arXiv Detail & Related papers (2024-04-07T22:13:43Z)
- Diversifying the Mixture-of-Experts Representation for Language Models with Orthogonal Optimizer [59.43462055143123]
The Mixture of Experts (MoE) has emerged as a highly successful technique in deep learning.
In this study, we shed light on the homogeneous representation problem, wherein experts in the MoE fail to specialize and lack diversity.
We propose an alternating training strategy that encourages each expert to update in a direction orthogonal to the subspace spanned by the other experts (see the projection sketch after this list).
arXiv Detail & Related papers (2023-10-15T07:20:28Z)
- Benchmarking Robustness and Generalization in Multi-Agent Systems: A Case Study on Neural MMO [50.58083807719749]
We present the results of the second Neural MMO challenge, hosted at IJCAI 2022, which received 1600+ submissions.
This competition targets robustness and generalization in multi-agent systems.
We will open-source our benchmark including the environment wrapper, baselines, a visualization tool, and selected policies for further research.
arXiv Detail & Related papers (2023-08-30T07:16:11Z)
- Building Robust Ensembles via Margin Boosting [98.56381714748096]
In adversarial robustness, a single model does not usually have enough power to defend against all possible adversarial attacks.
We develop an algorithm for learning an ensemble with maximum margin.
We show that our algorithm not only outperforms existing ensembling techniques, but also large models trained in an end-to-end fashion.
arXiv Detail & Related papers (2022-06-07T14:55:58Z)
- A portfolio-based analysis method for competition results [0.8680676599607126]
I will describe a portfolio-based analysis method which can give complementary insights into the performance of participating solvers in a competition.
The method is demonstrated on the results of the MiniZinc Challenges and new insights gained from the portfolio viewpoint are presented.
arXiv Detail & Related papers (2022-05-30T20:20:45Z)
- Continual Competitive Memory: A Neural System for Online Task-Free Lifelong Learning [91.3755431537592]
We propose a novel form of unsupervised learning, continual competitive memory (CCM).
The resulting neural system is shown to offer an effective approach for combating catastrophic forgetting in online continual classification problems.
We demonstrate that the proposed CCM system not only outperforms other competitive learning neural models but also yields performance that is competitive with several modern, state-of-the-art lifelong learning approaches.
arXiv Detail & Related papers (2021-06-24T20:12:17Z)
- Towards robust and domain agnostic reinforcement learning competitions [12.731614722371376]
Reinforcement learning competitions have formed the basis for standard research benchmarks.
Despite this, a majority of challenges suffer from the same fundamental problems.
We present a new framework of competition design that promotes the development of algorithms that overcome these barriers.
arXiv Detail & Related papers (2021-06-07T16:15:46Z)
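As referenced in the SEER-MoE entry above, expert pruning guided by routing counts can be sketched briefly. The sketch below simply keeps the most frequently selected ("heavy hitter") experts of one MoE layer on a calibration set; the function and argument names are hypothetical and this is not the paper's implementation, which also relies on a regularization-based fine-tuning stage to recover accuracy.

import torch

@torch.no_grad()
def prune_experts_by_usage(router, calibration_tokens, keep, top_k=2):
    # calibration_tokens: (N, d_model) hidden states fed to one MoE layer's router.
    logits = router(calibration_tokens)                 # (N, num_experts)
    chosen = logits.topk(top_k, dim=-1).indices         # experts each token routes to
    counts = torch.bincount(chosen.flatten(), minlength=logits.size(-1))
    keep_ids = counts.topk(keep).indices.sort().values  # the "heavy hitters"
    return keep_ids, counts

The surviving experts (and the matching rows of the router's weight matrix) are retained; everything else is dropped before fine-tuning.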
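The orthogonal-update idea referenced in the entry on diversifying MoE representations can also be sketched: project one expert's proposed update onto the orthogonal complement of the subspace spanned by the other experts' representations. This is only an illustration of the projection step under assumed shapes, not the paper's optimizer.

import torch

def orthogonal_component(update, others):
    # update: (d,) proposed update direction for one expert.
    # others: (k, d) vectors spanning the other experts' subspace.
    q, _ = torch.linalg.qr(others.T)        # (d, k): orthonormal basis of the span
    return update - q @ (q.T @ update)      # remove the in-span component

Applied in an alternating fashion, each expert's update keeps only the component that the other experts do not already cover, which encourages specialization.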