Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts
- URL: http://arxiv.org/abs/2408.15664v1
- Date: Wed, 28 Aug 2024 09:31:09 GMT
- Title: Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts
- Authors: Lean Wang, Huazuo Gao, Chenggang Zhao, Xu Sun, Damai Dai
- Abstract summary: An unbalanced expert load will lead to routing collapse or increased computational overhead.
We propose Loss-Free Balancing, an auxiliary-loss-free load balancing strategy.
We validate the performance of Loss-Free Balancing on MoE models with up to 3B parameters trained on up to 200B tokens.
- Score: 13.413587367600444
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: For Mixture-of-Experts (MoE) models, an unbalanced expert load will lead to routing collapse or increased computational overhead. Existing methods commonly employ an auxiliary loss to encourage load balance, but a large auxiliary loss will introduce non-negligible interference gradients into training and thus impair the model performance. In order to control load balance without producing undesired gradients during training, we propose Loss-Free Balancing, an auxiliary-loss-free load balancing strategy. Specifically, before the top-K routing decision, Loss-Free Balancing first applies an expert-wise bias to the routing scores of each expert. By dynamically updating the bias of each expert according to its recent load, Loss-Free Balancing can consistently maintain a balanced distribution of expert load. In addition, since Loss-Free Balancing does not produce any interference gradients, it also elevates the upper bound of model performance gained from MoE training. We validate the performance of Loss-Free Balancing on MoE models with up to 3B parameters trained on up to 200B tokens. Experimental results show that Loss-Free Balancing achieves both better performance and better load balance compared with traditional auxiliary-loss-controlled load balancing strategies.
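The abstract describes the routing mechanism concretely enough to sketch: an expert-wise bias is added to the routing scores only when selecting the top-K experts, and each expert's bias is nudged according to its recent load. Below is a minimal PyTorch sketch of that idea; the function names, the sign-based update rule, and the update_rate value are illustrative assumptions rather than the paper's exact implementation.

```python
import torch

def loss_free_topk_routing(scores: torch.Tensor, bias: torch.Tensor, k: int):
    """Select top-K experts per token from bias-adjusted scores.

    scores: [num_tokens, num_experts] routing scores produced by the gate.
    bias:   [num_experts] float tensor of expert-wise biases, maintained
            outside the gradient path.
    The bias only affects *which* experts are selected; the raw scores are
    still used as gating weights, so no interference gradients are introduced.
    """
    _, topk_idx = torch.topk(scores + bias, k, dim=-1)   # selection uses biased scores
    gate_weights = torch.gather(scores, -1, topk_idx)    # combination uses raw scores
    return topk_idx, gate_weights

@torch.no_grad()
def update_bias(bias: torch.Tensor, topk_idx: torch.Tensor,
                num_experts: int, update_rate: float = 1e-3) -> torch.Tensor:
    """Nudge each expert's bias toward a balanced load after a training step."""
    load = torch.bincount(topk_idx.flatten(), minlength=num_experts).float()
    error = load.mean() - load                           # > 0 for under-loaded experts
    bias += update_rate * torch.sign(error)              # raise bias of under-loaded experts
    return bias
```

In a training loop, loss_free_topk_routing would be called in the forward pass and update_bias after each optimizer step; because the bias is adjusted outside autograd, it steers routing toward balance without contributing any gradient term.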
Related papers
- Load Balancing Mixture of Experts with Similarity Preserving Routers [37.348178220494226]
Sparse Mixture of Experts (MoE) models offer a scalable and efficient architecture for training large neural networks. We introduce a novel load balancing loss that preserves token-wise relational structure. Our results show that applying our loss to the router results in 36% faster convergence and lower redundancy.
arXiv Detail & Related papers (2025-06-16T22:22:59Z) - Dual-Balancing for Physics-Informed Neural Networks [5.8096456298528745]
Physics-informed neural networks (PINNs) have emerged as a new learning paradigm for solving partial differential equations (PDEs). PINNs still suffer from poor accuracy and slow convergence due to the intractable multi-objective optimization issue. We propose a novel Dual-Balanced PINN (DB-PINN), which dynamically adjusts loss weights by integrating inter-balancing and intra-balancing.
arXiv Detail & Related papers (2025-05-16T11:00:54Z) - Capacity-Aware Inference: Mitigating the Straggler Effect in Mixture of Experts [9.393481672669564]
Under expert parallelism, the Mixture of Experts (MoE) suffers from inference inefficiencies due to imbalanced token-to-expert assignment. We propose Capacity-Aware Token Drop, which enforces expert capacity limits by discarding excess tokens from overloaded experts. We also introduce Capacity-Aware Expanded Drop, which allows tokens to include additional local experts in their candidate set before enforcing strict local capacity constraints.
arXiv Detail & Related papers (2025-03-07T01:11:39Z) - IntLoRA: Integral Low-rank Adaptation of Quantized Diffusion Models [68.55148272295916]
IntLoRA adapts quantized diffusion models with integer-type low-rank parameters to include inference efficiency during tuning. During inference, IntLoRA weights can be seamlessly merged into pre-trained weights to directly obtain quantized downstream weights without PTQ.
arXiv Detail & Related papers (2024-10-29T05:50:17Z) - Mind the Graph When Balancing Data for Fairness or Robustness [73.03155969727038]
We define conditions on the training distribution for data balancing to lead to fair or robust models.
Our results show that, in many cases, the balanced distribution does not correspond to selectively removing the undesired dependencies.
Overall, our results highlight the importance of taking the causal graph into account before performing data balancing.
arXiv Detail & Related papers (2024-06-25T10:16:19Z) - Simplifying Neural Network Training Under Class Imbalance [77.39968702907817]
Real-world datasets are often highly class-imbalanced, which can adversely impact the performance of deep learning models.
The majority of research on training neural networks under class imbalance has focused on specialized loss functions, sampling techniques, or two-stage training procedures.
We demonstrate that simply tuning existing components of standard deep learning pipelines, such as the batch size, data augmentation, and label smoothing, can achieve state-of-the-art performance without any such specialized class imbalance methods.
arXiv Detail & Related papers (2023-12-05T05:52:44Z) - Stabilizing RLHF through Advantage Model and Selective Rehearsal [57.504894664689]
Large Language Models (LLMs) have revolutionized natural language processing, yet aligning these models with human values and preferences remains a significant challenge.
This challenge is characterized by various instabilities, such as reward hacking and catastrophic forgetting.
We propose two innovations to stabilize RLHF training: 1) Advantage Model, which directly models advantage score and regulates score distributions across tasks to prevent reward hacking; and 2) Selective Rehearsal, which mitigates catastrophic forgetting by strategically selecting data for PPO training and knowledge rehearsing.
arXiv Detail & Related papers (2023-09-18T23:06:32Z) - Communication Load Balancing via Efficient Inverse Reinforcement Learning [13.052338083552863]
In this work, we tackle the communication load balancing problem from an inverse reinforcement learning (IRL) approach.
We infer a reward function from a set of demonstrations, and then learn a reinforcement learning load balancing policy with the inferred reward function.
Compared to classical RL-based solutions, the proposed solution is more general and better suited to real-world scenarios.
arXiv Detail & Related papers (2023-03-22T22:23:23Z) - Sparse MoE as the New Dropout: Scaling Dense and Self-Slimmable Transformers [107.3726071306935]
We propose a new plug-and-play training framework, SMoE-Dropout, to enable scaling transformers to better accuracy in their full capacity without collapse.
SMoE-Dropout consists of a randomly initialized and fixed router network that activates experts, and it gradually increases the number of activated experts as training progresses.
Our experiments demonstrate the superior performance and substantial computation savings of SMoE-Dropout, compared to dense training baselines with equivalent parameter counts.
arXiv Detail & Related papers (2023-03-02T22:12:51Z) - Learning to Re-weight Examples with Optimal Transport for Imbalanced Classification [74.62203971625173]
Imbalanced data pose challenges for deep learning based classification models.
One of the most widely-used approaches for tackling imbalanced data is re-weighting.
We propose a novel re-weighting method based on optimal transport (OT) from a distributional point of view.
arXiv Detail & Related papers (2022-08-05T01:23:54Z) - Phased Progressive Learning with Coupling-Regulation-Imbalance Loss for Imbalanced Classification [11.673344551762822]
Deep neural networks generally perform poorly with datasets that suffer from quantity imbalance and classification difficulty imbalance between different classes.
A phased progressive learning schedule is proposed to smoothly transfer the training emphasis from representation learning to upper-classifier training.
Our code will be open-sourced soon.
arXiv Detail & Related papers (2022-05-24T14:46:39Z) - Neural Collapse Inspired Attraction-Repulsion-Balanced Loss for Imbalanced Learning [97.81549071978789]
We propose Attraction-Repulsion-Balanced Loss (ARB-Loss) to balance the different components of the gradients.
We perform experiments on the large-scale classification and segmentation datasets and our ARB-Loss can achieve state-of-the-art performance.
arXiv Detail & Related papers (2022-04-19T08:23:23Z) - Reinforced Workload Distribution Fairness [3.7384509727711923]
This paper proposes a distributed reinforcement learning mechanism that improves the fairness of the workload distribution achieved by a load balancer, with no active load-balancer state monitoring and only limited network observations.
Preliminary results show promise in RL-based load balancing algorithms, and identify additional challenges and future research directions.
arXiv Detail & Related papers (2021-10-29T07:51:26Z) - Balance-Oriented Focal Loss with Linear Scheduling for Anchor Free Object Detection [1.69146632099647]
We propose Balance-Oriented Focal Loss, which can induce balanced learning by considering both background and foreground balance.
By improving the focal loss in terms of balancing foreground classes, our method achieves AP gains of +1.2 on MS-COCO for the anchor-free real-time detector.
arXiv Detail & Related papers (2020-12-26T15:24:03Z)
This list is automatically generated from the titles and abstracts of the papers on this site.