Dense Backpropagation Improves Training for Sparse Mixture-of-Experts
- URL: http://arxiv.org/abs/2504.12463v2
- Date: Fri, 18 Apr 2025 02:58:46 GMT
- Title: Dense Backpropagation Improves Training for Sparse Mixture-of-Experts
- Authors: Ashwinee Panda, Vatsal Baherwani, Zain Sarwar, Benjamin Therien, Supriyo Chakraborty, Tom Goldstein
- Abstract summary: We present a lightweight approximation method that gives the MoE router a dense gradient update while continuing to sparsely activate its parameters. Our Default MoE outperforms standard TopK routing in a variety of settings without requiring significant computational overhead.
- Score: 41.08173926456885
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Mixture of Experts (MoE) pretraining is more scalable than dense Transformer pretraining, because MoEs learn to route inputs to a sparse set of their feedforward parameters. However, this means that MoEs only receive a sparse backward update, leading to training instability and suboptimal performance. We present a lightweight approximation method that gives the MoE router a dense gradient update while continuing to sparsely activate its parameters. Our method, which we refer to as Default MoE, substitutes missing expert activations with default outputs consisting of an exponential moving average of expert outputs previously seen over the course of training. This allows the router to receive signals from every expert for each token, leading to significant improvements in training performance. Our Default MoE outperforms standard TopK routing in a variety of settings without requiring significant computational overhead. Code: https://github.com/vatsal0/default-moe.
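The abstract's description suggests a straightforward implementation. Below is a minimal sketch of that idea, assuming a standard TopK router with softmax scores; the module layout, k, and EMA decay are illustrative choices and not the authors' released code (see the repository linked above for that).

```python
# Sketch of Default MoE: selected experts are computed as usual, while every
# non-selected expert contributes a per-expert EMA of its past outputs, so the
# router receives a dense gradient at sparse-compute cost. Illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DefaultMoE(nn.Module):
    def __init__(self, d_model, d_ff, n_experts, k=2, ema_decay=0.99):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.k, self.ema_decay = k, ema_decay
        # Running per-expert "default" output, updated without gradients.
        self.register_buffer("default_out", torch.zeros(n_experts, d_model))

    def forward(self, x):                                   # x: [tokens, d_model]
        probs = F.softmax(self.router(x), dim=-1)           # [T, E]
        _, topk_idx = probs.topk(self.k, dim=-1)            # [T, k]

        # Start every (token, expert) slot at that expert's default output ...
        combined = self.default_out.unsqueeze(0).expand(x.size(0), -1, -1).clone()
        for e, expert in enumerate(self.experts):
            mask = (topk_idx == e).any(dim=-1)
            if mask.any():
                out_e = expert(x[mask])      # sparse compute: selected tokens only
                combined[mask, e] = out_e    # ... and overwrite the selected slots
                if self.training:
                    with torch.no_grad():    # EMA update of this expert's default
                        self.default_out[e].lerp_(out_e.mean(dim=0), 1 - self.ema_decay)

        # Dense combination: every expert slot (real or default) is weighted by its
        # routing probability, so the router gets gradient signal from all experts.
        return torch.einsum("te,ted->td", probs, combined)
```

Because the default outputs live in a buffer and are updated under torch.no_grad(), they add only O(E x d_model) memory and no extra expert forward passes; the dense signal reaches the router purely through the probs term in the final combination.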
Related papers
- Accelerating Mixture-of-Experts Training with Adaptive Expert Replication [1.8764600940655036]
We introduce SwiftMoE, an adaptive MoE training system.
SwiftMoE right-sizes the GPU resources allocated to each expert, on a per-iteration basis, with minimal overheads.
Compared to state-of-the-art MoE training systems, DeepSpeed and FlexMoE, SwiftMoE achieves 30.5% and 25.9% faster time-to-convergence, respectively.
arXiv Detail & Related papers (2025-04-28T15:58:55Z) - Continual Pre-training of MoEs: How robust is your router? [25.438359533860954]
MoEs benefit from improved sample efficiency at training time and achieve much stronger performance. Many closed-source and open-source frontier language models have adopted an MoE architecture. We show that MoE LLMs maintain their sample efficiency (relative to a FLOP-matched dense model) during CPT and that they can match the performance of a fully re-trained MoE at a fraction of the cost.
arXiv Detail & Related papers (2025-03-06T22:55:01Z) - Stepping Forward on the Last Mile [8.756033984943178]
We propose a series of algorithm enhancements that further reduce the memory footprint and the accuracy gap compared to backpropagation.
Our results demonstrate that on the last mile of model customization on edge devices, training with fixed-point forward gradients is a feasible and practical approach (a minimal sketch follows the list below).
arXiv Detail & Related papers (2024-11-06T16:33:21Z) - Exploring the Benefit of Activation Sparsity in Pre-training [117.25661020250658]
We study how activation properties change during pre-training.
We propose Switchable Sparse-Dense Learning (SSD).
SSD achieves comparable performance with identical model size and reduces pre-training costs.
arXiv Detail & Related papers (2024-10-04T13:53:33Z) - Sparse Backpropagation for MoE Training [118.31785160874024]
We introduce SparseMixer, a scalable gradient estimator that bridges the gap between backpropagation and sparse expert routing.
Grounded in a numerical ODE framework, SparseMixer harnesses the mid-point method, a second-order ODE solver, to deliver precise gradient approximations.
Applying SparseMixer to Switch Transformer on both pre-training and machine translation tasks, SparseMixer showcases considerable performance gain.
arXiv Detail & Related papers (2023-10-01T22:43:57Z) - Sparse MoE as the New Dropout: Scaling Dense and Self-Slimmable Transformers [107.3726071306935]
We propose a new plug-and-play training framework, SMoE-Dropout, to enable scaling transformers to better accuracy in their full capacity without collapse.
SMoE-Dropout uses a randomly initialized and fixed router network to activate experts and gradually increases the number of activated experts as training progresses (a minimal sketch follows the list below).
Our experiments demonstrate the superior performance and substantial computation savings of SMoE-Dropout, compared to dense training baselines with equivalent parameter counts.
arXiv Detail & Related papers (2023-03-02T22:12:51Z) - SMILE: Scaling Mixture-of-Experts with Efficient Bi-level Routing [47.11171833082974]
We introduce SMILE, which exploits heterogeneous network bandwidth and splits a single-step routing into bi-level routing.
Our experimental results show that the proposed method obtains a 2.5x speedup over Switch Transformer in terms of pretraining throughput on the Colossal Clean Crawled Corpus without losing any convergence speed.
arXiv Detail & Related papers (2022-12-10T03:44:16Z) - StableMoE: Stable Routing Strategy for Mixture of Experts [109.0602120199226]
The Mixture-of-Experts (MoE) technique can scale up the model size of Transformers with affordable computational overhead.
We propose StableMoE with two training stages to address the routing fluctuation problem.
Results show that StableMoE outperforms existing MoE methods in terms of both convergence speed and performance.
arXiv Detail & Related papers (2022-04-18T16:48:19Z) - MEST: Accurate and Fast Memory-Economic Sparse Training Framework on the Edge [72.16021611888165]
This paper proposes a novel Memory-Economic Sparse Training (MEST) framework targeting accurate and fast execution on edge devices.
The proposed MEST framework consists of enhancements by Elastic Mutation (EM) and Soft Memory Bound (&S).
Our results suggest that unforgettable examples can be identified in-situ even during the dynamic exploration of sparsity masks.
arXiv Detail & Related papers (2021-10-26T21:15:17Z)
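Forward-gradient sketch referenced in the "Stepping Forward on the Last Mile" entry above: a toy, full-precision illustration of training with forward-mode directional derivatives instead of backpropagation. The quadratic loss, learning rate, and use of torch.func.jvp are assumptions for illustration; the paper additionally quantizes these gradients to fixed point, which is omitted here.

```python
import torch
from torch.func import jvp

# Toy regression problem standing in for an on-device fine-tuning task.
x, y = torch.randn(128, 16), torch.randn(128, 1)
w = torch.randn(16, 1)

def loss_fn(weight):
    return ((x @ weight - y) ** 2).mean()

lr = 1e-2
for step in range(100):
    v = torch.randn_like(w)                    # random tangent direction
    # One forward pass yields the loss and the directional derivative (grad . v),
    # with no backward graph stored in memory.
    loss, dir_deriv = jvp(loss_fn, (w,), (v,))
    w = w - lr * dir_deriv * v                 # unbiased forward-gradient step
```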
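Scheduling sketch referenced in the SMoE-Dropout entry above: a frozen, randomly initialized router selects experts, and the number of activated experts grows as training progresses. The linear k schedule, layer sizes, and weighting are illustrative assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SMoEDropoutLayer(nn.Module):
    def __init__(self, d_model, d_ff, n_experts):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.router.weight.requires_grad_(False)   # random and fixed: never trained
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x, k):                         # x: [tokens, d_model]
        scores = F.softmax(self.router(x), dim=-1)   # frozen random routing scores
        topk_s, topk_i = scores.topk(k, dim=-1)
        weights = topk_s / topk_s.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for slot in range(k):
            for e, expert in enumerate(self.experts):
                mask = topk_i[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out


def k_schedule(step, total_steps, n_experts, k_min=1):
    """Gradually increase the number of activated experts over training."""
    frac = step / max(total_steps, 1)
    return min(n_experts, k_min + int(frac * (n_experts - k_min)))


# Example usage: halfway through training, roughly half the experts are active.
layer = SMoEDropoutLayer(d_model=64, d_ff=256, n_experts=8)
x = torch.randn(32, 64)
out = layer(x, k=k_schedule(step=500, total_steps=1000, n_experts=8))
```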