Related papers: StableMoE: Stable Routing Strategy for Mixture of Experts

StableMoE: Stable Routing Strategy for Mixture of Experts

URL: http://arxiv.org/abs/2204.08396v1
Date: Mon, 18 Apr 2022 16:48:19 GMT
Title: StableMoE: Stable Routing Strategy for Mixture of Experts
Authors: Damai Dai, Li Dong, Shuming Ma, Bo Zheng, Zhifang Sui, Baobao Chang, Furu Wei
Abstract summary: Mixture-of-Experts (MoE) technique can scale up the model size of Transformers with an affordable computational overhead. We propose StableMoE with two training stages to address the routing fluctuation problem. Results show that StableMoE outperforms existing MoE methods in terms of both convergence speed and performance.
Score: 109.0602120199226
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The Mixture-of-Experts (MoE) technique can scale up the model size of Transformers with an affordable computational overhead. We point out that existing learning-to-route MoE methods suffer from the routing fluctuation issue, i.e., the target expert of the same input may change along with training, but only one expert will be activated for the input during inference. The routing fluctuation tends to harm sample efficiency because the same input updates different experts but only one is finally used. In this paper, we propose StableMoE with two training stages to address the routing fluctuation problem. In the first training stage, we learn a balanced and cohesive routing strategy and distill it into a lightweight router decoupled from the backbone model. In the second training stage, we utilize the distilled router to determine the token-to-expert assignment and freeze it for a stable routing strategy. We validate our method on language modeling and multilingual machine translation. The results show that StableMoE outperforms existing MoE methods in terms of both convergence speed and performance.

Related papers

Dense Backpropagation Improves Training for Sparse Mixture-of-Experts [41.08173926456885]
We present a lightweight approximation method that gives the MoE router a dense gradient update while continuing to sparsely activate its parameters. Our Default MoE outperforms standard TopK routing in a variety of settings without requiring significant computational overhead.
arXiv Detail & Related papers (2025-04-16T19:55:36Z)
Attention Is All You Need For Mixture-of-Depths Routing [5.419910566904439]
We introduce a novel attention-based routing mechanism A-MoD. A-MoD allows for more efficient training as it introduces no additional trainable parameters. It can increase the performance of the MoD model.
arXiv Detail & Related papers (2024-12-30T11:25:54Z)
LazyDiT: Lazy Learning for the Acceleration of Diffusion Transformers [79.07412045476872]
Diffusion Transformers have emerged as the preeminent models for a wide array of generative tasks. We show that performing the full of the model at each diffusion step is unnecessary, as some computations can be skipped by lazily reusing the results of previous steps. We propose a lazy learning framework that efficiently leverages cached results from earlier steps to skip redundant computations.
arXiv Detail & Related papers (2024-12-17T01:12:35Z)
ElastiFormer: Learned Redundancy Reduction in Transformer via Self-Distillation [0.6281017402518722]
ElastiFormer adapts pretrained Transformer models into an elastic counterpart with variable inference time compute. routing modules are trained using self-distillation losses to minimize the differences between the output of the pretrained-model and their elastic counterparts.
arXiv Detail & Related papers (2024-11-22T16:11:14Z)
Router-Tuning: A Simple and Effective Approach for Enabling Dynamic-Depth in Transformers [40.40923861822689]
Mixture of Depths (MoD) was introduced to dynamically adjust the computational depth by skipping less important layers. Despite its promise, current MoD approaches remain under-explored and face two main challenges. We propose Router-Tuning, a method that fine-tunes only the router on a small dataset, drastically reducing the computational overhead associated with full model training. For the second challenge, we propose MindSkip, which deploys textitAttention with Dynamic Depths. This method preserves the model's performance while significantly enhancing computational and memory efficiency.
arXiv Detail & Related papers (2024-10-17T03:23:50Z)
Layerwise Recurrent Router for Mixture-of-Experts [42.36093735411238]
Mixture-of-Experts (MoE) architecture stands out for its ability to scale model size without significantly increasing training costs. Current MoE models often display parameter inefficiency. We introduce the Layerwise Recurrent Router for Mixture-of-Experts (RMoE)
arXiv Detail & Related papers (2024-08-13T10:25:13Z)
MaskMoE: Boosting Token-Level Learning via Routing Mask in Mixture-of-Experts [38.15244333975921]
MaskMoE is capable of maintaining representation diversity while achieving more comprehensive training. Our method outperforms previous dominant Mixture-of-Experts models in terms of both perplexity (PPL) and downstream task performance.
arXiv Detail & Related papers (2024-07-13T09:22:33Z)
Sparse Backpropagation for MoE Training [118.31785160874024]
We introduce SparseMixer, a scalable gradient estimator that bridges the gap between backpropagation and sparse expert routing. Grounded in a numerical ODE framework, SparseMixer harnesses the mid-point method, a second-order ODE solver, to deliver precise gradient approximations. Applying SparseMixer to Switch Transformer on both pre-training and machine translation tasks, SparseMixer showcases considerable performance gain.
arXiv Detail & Related papers (2023-10-01T22:43:57Z)
Patch-level Routing in Mixture-of-Experts is Provably Sample-efficient for Convolutional Neural Networks [74.68583356645276]
In deep learning, mixture-of-experts (MoE) activates one or few experts (sub-networks) on a per-sample or per-token basis. We show for the first time that pMoE provably reduces the required number of training samples to achieve desirable generalization.
arXiv Detail & Related papers (2023-06-07T00:16:10Z)
SMILE: Scaling Mixture-of-Experts with Efficient Bi-level Routing [47.11171833082974]
We introduce SMILE, which exploits heterogeneous network bandwidth and splits a single-step routing into bi-level routing. Our experimental results show that the proposed method obtains a 2.5x speedup over Switch Transformer in terms of pretraining throughput on the Colossal Clean Crawled Corpus without losing any convergence speed.
arXiv Detail & Related papers (2022-12-10T03:44:16Z)
Taming Sparsely Activated Transformer with Stochastic Experts [76.0711573018493]
Sparsely activated models (SAMs) can easily scale to have outrageously large amounts of parameters without significant increase in computational cost. In this paper, we propose a new expert-based model, THOR (Transformer witH StOchastic ExpeRts) Unlike classic expert-based models, such as the Switch Transformer, experts in THOR are randomly activated for each input during training and inference.
arXiv Detail & Related papers (2021-10-08T17:15:47Z)
Hash Layers For Large Sparse Models [48.90784451703753]
We modify the feedforward layer to hash to different sets of weights depending on the current token, over all tokens in the sequence. We show that this procedure either outperforms or is competitive with learning-to-route mixture-of-expert methods.
arXiv Detail & Related papers (2021-06-08T14:54:24Z)

This list is automatically generated from the titles and abstracts of the papers in this site.