Expert Race: A Flexible Routing Strategy for Scaling Diffusion Transformer with Mixture of Experts
- URL: http://arxiv.org/abs/2503.16057v2
- Date: Tue, 25 Mar 2025 08:56:54 GMT
- Title: Expert Race: A Flexible Routing Strategy for Scaling Diffusion Transformer with Mixture of Experts
- Authors: Yike Yuan, Ziyu Wang, Zihao Huang, Defa Zhu, Xun Zhou, Jingyi Yu, Qiyang Min
- Abstract summary: We introduce Race-DiT, a novel MoE model for diffusion transformers with a flexible routing strategy, Expert Race. By allowing tokens and experts to compete together and select the top candidates, the model learns to dynamically assign experts to critical tokens.
- Score: 33.39800923804871
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Diffusion models have emerged as a mainstream framework in visual generation. Building upon this success, the integration of Mixture of Experts (MoE) methods has shown promise in enhancing model scalability and performance. In this paper, we introduce Race-DiT, a novel MoE model for diffusion transformers with a flexible routing strategy, Expert Race. By allowing tokens and experts to compete together and select the top candidates, the model learns to dynamically assign experts to critical tokens. Additionally, we propose per-layer regularization to address challenges in shallow layer learning, and a router similarity loss to prevent mode collapse, ensuring better expert utilization. Extensive experiments on ImageNet validate the effectiveness of our approach, showcasing significant performance gains alongside promising scaling properties.
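To make the routing idea concrete, here is a minimal sketch of the global "race" selection suggested by the abstract: router scores for all token-expert pairs compete in a single top-k, instead of each token independently taking its own top-k experts. The PyTorch formulation and function names are illustrative assumptions, not the authors' implementation.

```python
# Sketch only: a global "race" top-k over token-expert scores, assuming the
# selection works as the abstract describes; names are hypothetical.
import torch

def expert_race_routing(scores: torch.Tensor, k: int) -> torch.Tensor:
    """scores: [num_tokens, num_experts] router logits.
    Returns a boolean mask with exactly k entries set, chosen by one top-k
    over all token-expert pairs so tokens and experts compete together."""
    num_tokens, num_experts = scores.shape
    flat = scores.reshape(-1)                      # flatten token-expert pairs
    winners = torch.topk(flat, k).indices          # global competition
    mask = torch.zeros_like(flat, dtype=torch.bool)
    mask[winners] = True
    return mask.reshape(num_tokens, num_experts)

def per_token_topk_routing(scores: torch.Tensor, k: int) -> torch.Tensor:
    """Baseline for comparison: every token gets exactly k experts."""
    topi = torch.topk(scores, k, dim=-1).indices
    mask = torch.zeros_like(scores, dtype=torch.bool)
    return mask.scatter_(1, topi, True)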
Related papers
- On the effectiveness of discrete representations in sparse mixture of experts [33.809432499123275]
We propose a new architecture dubbed Vector-Quantized Mixture of Experts (VQMoE).
VQMoE is an effective solution for scaling up model capacity without increasing the computational costs.
We show that VQMoE achieves a 28% improvement over other SMoE routing methods.
arXiv Detail & Related papers (2024-11-28T22:32:01Z)
- ViMoE: An Empirical Study of Designing Vision Mixture-of-Experts [71.11994027685974]
We study the potential of applying MoE to vision through a comprehensive study on image classification and semantic segmentation.
We observe that the performance is sensitive to the configuration of MoE layers, making it challenging to obtain optimal results without careful design.
We introduce a shared expert to learn and capture common knowledge, serving as an effective way to construct a stable ViMoE.
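A minimal sketch of the shared-expert idea described above, assuming one always-active expert added on top of standard top-k routed experts; the module layout and hyperparameters are illustrative, not the ViMoE implementation.

```python
import torch
import torch.nn as nn

def _ffn(dim: int) -> nn.Module:
    return nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

class SharedExpertMoE(nn.Module):
    """One always-active shared expert plus sparsely gated routed experts.
    Every expert is computed densely here for clarity, not efficiency."""
    def __init__(self, dim: int, num_experts: int = 4, k: int = 1):
        super().__init__()
        self.shared = _ffn(dim)
        self.experts = nn.ModuleList(_ffn(dim) for _ in range(num_experts))
        self.router = nn.Linear(dim, num_experts)
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: [tokens, dim]
        gates = self.router(x).softmax(dim=-1)              # [tokens, num_experts]
        topv, topi = torch.topk(gates, self.k, dim=-1)
        sparse = torch.zeros_like(gates).scatter_(1, topi, topv)
        routed = sum(sparse[:, e:e + 1] * expert(x)
                     for e, expert in enumerate(self.experts))
        return self.shared(x) + routed                       # shared path always on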
arXiv Detail & Related papers (2024-10-21T07:51:17Z)
- Unchosen Experts Can Contribute Too: Unleashing MoE Models' Power by Self-Contrast [58.98411447739218]
Mixture-of-Experts (MoE) has emerged as a prominent architecture for scaling model size while maintaining computational efficiency.
We propose Self-Contrast Mixture-of-Experts (SCMoE), a training-free strategy that utilizes unchosen experts in a self-contrast manner during inference.
Our method is conceptually simple and computationally lightweight, as it incurs only minimal additional latency compared to greedy decoding.
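The self-contrast idea above admits a very small sketch: run the same MoE model once with a strong routing configuration (e.g., top-2) and once with a weak one (e.g., fewer or lower-ranked experts), then combine the two next-token logit vectors contrastively. The combination rule and the beta weight below are assumptions for illustration, not the exact SCMoE formulation.

```python
import torch

def self_contrast_logits(strong_logits: torch.Tensor,
                         weak_logits: torch.Tensor,
                         beta: float = 0.5) -> torch.Tensor:
    """strong_logits / weak_logits: [vocab_size] next-token logits from the same
    MoE model under strong vs. weak expert activation. The weak run acts as a
    contrast signal; beta controls how strongly it is subtracted."""
    return (1.0 + beta) * strong_logits - beta * weak_logits

# usage (shapes only): next_token = self_contrast_logits(ls, lw).argmax(dim=-1)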
arXiv Detail & Related papers (2024-05-23T12:45:29Z)
- MoE-FFD: Mixture of Experts for Generalized and Parameter-Efficient Face Forgery Detection [54.545054873239295]
Deepfakes have recently raised significant trust issues and security concerns among the public.
ViT-based methods take advantage of the expressivity of transformers, achieving superior detection performance.
This work introduces Mixture-of-Experts modules for Face Forgery Detection (MoE-FFD), a generalized yet parameter-efficient ViT-based approach.
arXiv Detail & Related papers (2024-04-12T13:02:08Z)
- Multilinear Mixture of Experts: Scalable Expert Specialization through Factorization [51.98792406392873]
Mixture of Experts (MoE) provides a powerful way to decompose dense layers into smaller, modular computations.
A major challenge lies in the computational cost of scaling the number of experts high enough to achieve fine-grained specialization.
We propose the Multilinear Mixture of Experts ($\mu$MoE) layer to address this, focusing on vision models.
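As a rough illustration of why factorization makes a large number of experts affordable, the sketch below stores a bank of expert weight matrices in a shared low-rank (CP-style) factorization, so softly mixing all experts costs about as much as applying a single dense layer. It is a generic stand-in under assumed names, not the $\mu$MoE layer itself.

```python
import torch
import torch.nn as nn

class FactorizedExpertBank(nn.Module):
    """Expert matrices W_e ~= sum_r A[e, r] * outer(U[r], V[r]); mixing experts
    with per-token gate weights then reduces to mixing rank-1 components."""
    def __init__(self, num_experts: int, dim: int, rank: int = 8):
        super().__init__()
        self.A = nn.Parameter(torch.randn(num_experts, rank) / rank ** 0.5)
        self.U = nn.Parameter(torch.randn(rank, dim) / dim ** 0.5)
        self.V = nn.Parameter(torch.randn(rank, dim) / dim ** 0.5)

    def forward(self, x: torch.Tensor, gates: torch.Tensor) -> torch.Tensor:
        # x: [tokens, dim], gates: [tokens, num_experts] soft mixture weights
        coeff = gates @ self.A          # [tokens, rank]  per-token rank mixture
        proj = x @ self.V.t()           # [tokens, rank]  project input once
        return (coeff * proj) @ self.U  # [tokens, dim]   expand back once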
arXiv Detail & Related papers (2024-02-19T21:20:22Z)
- FuseMoE: Mixture-of-Experts Transformers for Fleximodal Fusion [29.130355774088205]
FuseMoE is a mixture-of-experts framework that incorporates an innovative gating function.
Designed to integrate a diverse range of modalities, FuseMoE is effective at managing scenarios with missing modalities and irregularly sampled data trajectories.
arXiv Detail & Related papers (2024-02-05T17:37:46Z)
- CompeteSMoE -- Effective Training of Sparse Mixture of Experts via Competition [52.2034494666179]
Sparse mixture of experts (SMoE) offers an appealing solution for scaling up model complexity beyond merely increasing the network's depth or width.
We propose a competition mechanism to address the fundamental challenge of representation collapse.
By routing inputs only to experts with the highest neural response, we show that, under mild assumptions, competition enjoys the same convergence rate as the optimal estimator.
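A hedged sketch of the competition mechanism described above: every expert is evaluated and each token is routed only to the experts with the largest response. Using the output norm as the "neural response" and the softmax re-weighting of winners are illustrative assumptions, and the dense evaluation is for clarity rather than efficiency.

```python
import torch
import torch.nn as nn

def competition_routing(x: torch.Tensor, experts: nn.ModuleList, k: int = 1) -> torch.Tensor:
    """x: [tokens, dim]. Route each token to the k experts whose outputs have the
    largest norm ("neural response") and return the weighted combination."""
    outs = torch.stack([e(x) for e in experts], dim=1)   # [tokens, E, dim]
    response = outs.norm(dim=-1)                         # [tokens, E]
    topv, topi = torch.topk(response, k, dim=-1)
    weights = torch.zeros_like(response).scatter_(1, topi, torch.softmax(topv, dim=-1))
    return (weights.unsqueeze(-1) * outs).sum(dim=1)     # [tokens, dim]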
arXiv Detail & Related papers (2024-02-04T15:17:09Z)
- Exploiting Inter-Layer Expert Affinity for Accelerating Mixture-of-Experts Model Inference [3.217776693788795]
We propose a lightweight optimization technique called ExFlow to substantially accelerate inference for pre-trained MoE models.
By exploiting the inter-layer expert affinity, our solution can be directly applied to pre-trained MoE models without any fine-tuning or accuracy degradation.
Our solution outperforms cutting-edge MoE implementations with 8 to 64 experts, achieving up to a 2.2x improvement in inference throughput.
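A small counting sketch of what an inter-layer expert-affinity statistic could look like: tally how often a token routed to expert i at one layer is routed to expert j at the next, so high-affinity pairs can be placed on the same device to cut all-to-all traffic. The function and tensor names are assumptions; this is not the ExFlow algorithm itself.

```python
import torch

def interlayer_affinity(routes_l: torch.Tensor, routes_next: torch.Tensor,
                        num_experts: int) -> torch.Tensor:
    """routes_l, routes_next: [num_tokens] expert indices chosen at layer l and
    layer l+1. Returns an [E, E] matrix of co-occurrence counts."""
    pair_ids = routes_l * num_experts + routes_next        # encode (i, j) pairs
    counts = torch.bincount(pair_ids, minlength=num_experts ** 2)
    return counts.reshape(num_experts, num_experts)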
arXiv Detail & Related papers (2024-01-16T14:16:47Z)
- Soft Merging of Experts with Adaptive Routing [38.962451264172856]
We introduce Soft Merging of Experts with Adaptive Routing (SMEAR).
SMEAR avoids discrete routing by using a single "merged" expert constructed via a weighted average of all of the experts' parameters.
We empirically validate that models using SMEAR outperform models that route based on metadata or learn sparse routing through gradient estimation.
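A minimal sketch of soft parameter merging in the spirit of the description above, assuming the experts are identically shaped nn.Linear modules and, for simplicity, a single gate vector shared across the batch rather than per example; it is illustrative, not the paper's implementation.

```python
import torch
import torch.nn as nn

def smear_forward(x: torch.Tensor, experts: nn.ModuleList, gates: torch.Tensor) -> torch.Tensor:
    """x: [tokens, dim]; experts: identically shaped nn.Linear modules;
    gates: [num_experts] soft routing weights summing to 1. Builds one merged
    expert by averaging parameters, then applies it once."""
    merged_weight = sum(g * e.weight for g, e in zip(gates, experts))
    merged_bias = sum(g * e.bias for g, e in zip(gates, experts))
    return torch.nn.functional.linear(x, merged_weight, merged_bias)
```

Because the merged expert is a smooth function of the gates, the router can be trained with ordinary gradients rather than the sparse-routing gradient estimators mentioned above.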
arXiv Detail & Related papers (2023-06-06T15:04:31Z)
- MoEC: Mixture of Expert Clusters [93.63738535295866]
Sparse Mixture of Experts (MoE) has received great interest due to its promising scaling capability with affordable computational overhead.
MoE converts dense layers into sparse experts, and utilizes a gated routing network to activate experts conditionally.
However, as the number of experts grows, MoE models with enormous parameter counts suffer from overfitting and sparse data allocation.
arXiv Detail & Related papers (2022-07-19T06:09:55Z)
- On the Representation Collapse of Sparse Mixture of Experts [102.83396489230375]
Sparse mixture of experts provides larger model capacity while requiring a constant computational overhead.
It employs the routing mechanism to distribute input tokens to the best-matched experts according to their hidden representations.
However, learning such a routing mechanism encourages token clustering around expert centroids, implying a trend toward representation collapse.
arXiv Detail & Related papers (2022-04-20T01:40:19Z)
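For context on the mechanism being analyzed, routing can be pictured as matching each token's hidden state against learned per-expert embeddings ("centroids"), as in the sketch below; the dot-product matching and class layout are illustrative assumptions rather than the paper's exact setup. Because routing rewards tokens for moving toward their assigned centroid, token representations tend to cluster, which is the collapse trend described above.

```python
import torch
import torch.nn as nn

class CentroidRouter(nn.Module):
    """Scores each token against learned expert embeddings ("centroids") and
    routes it to the best match."""
    def __init__(self, dim: int, num_experts: int):
        super().__init__()
        self.centroids = nn.Parameter(torch.randn(num_experts, dim) / dim ** 0.5)

    def forward(self, h: torch.Tensor):
        # h: [tokens, dim] hidden representations
        scores = h @ self.centroids.t()            # [tokens, num_experts]
        return scores.softmax(dim=-1), scores.argmax(dim=-1)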