Sparse MoE as the New Dropout: Scaling Dense and Self-Slimmable
Transformers
- URL: http://arxiv.org/abs/2303.01610v1
- Date: Thu, 2 Mar 2023 22:12:51 GMT
- Title: Sparse MoE as the New Dropout: Scaling Dense and Self-Slimmable
Transformers
- Authors: Tianlong Chen, Zhenyu Zhang, Ajay Jaiswal, Shiwei Liu, Zhangyang Wang
- Abstract summary: We propose a new plug-and-play training framework, SMoE-Dropout, to enable scaling transformers to better accuracy in their full capacity without collapse.
SMoE-Dropout consists of a randomly initialized and fixed router network that activates experts, and it gradually increases the number of activated experts as training progresses.
Our experiments demonstrate the superior performance and substantial computation savings of SMoE-Dropout, compared to dense training baselines with equivalent parameter counts.
- Score: 107.3726071306935
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite their remarkable achievement, gigantic transformers encounter
significant drawbacks, including exorbitant computational and memory footprints
during training, as well as severe collapse evidenced by a high degree of
parameter redundancy. Sparsely-activated Mixture-of-Experts (SMoEs) have shown
promise to mitigate the issue of training efficiency, yet they are prone to (1)
redundant experts due to representational collapse; and (2) poor expert
scalability for inference and downstream fine-tuning, primarily due to
overfitting of the learned routing policy to the number of activated experts
during training. As recent research efforts are predominantly focused on
improving routing policies to encourage expert specializations, this work
focuses on exploring the overlooked scalability bottleneck of SMoEs and
leveraging it to effectively scale dense transformers. To this end, we propose
a new plug-and-play training framework, SMoE-Dropout, to enable scaling
transformers to better accuracy in their full capacity without collapse.
Specifically, SMoE-Dropout consists of a randomly initialized and fixed router
network to activate experts and gradually increases the activated expert number
as training progresses over time. Transformers trained by SMoE-Dropout
naturally exhibit a self-slimmable property subject to resource availability,
offering smooth and consistent performance boosts with an increase in activated
experts during inference or fine-tuning. Our extensive experiments demonstrate
the superior performance and substantial computation savings of SMoE-Dropout,
compared to dense training baselines with equivalent parameter counts. In
particular, our trained BERT outperforms its densely trained counterpart with
consistent improvements of {1.03%, 0.78%, 1.09%} on challenging reasoning tasks
{ASDiv-A, MAWPS, SVAMP}, respectively.
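The mechanism described in the abstract reduces to a few lines of code. The PyTorch sketch below is a minimal, assumption-laden illustration: a frozen, randomly initialized linear router scores the experts, only the top-k experts are combined per token, and a simple schedule (assumed linear here) grows k over the course of training. The names SMoEDropoutLayer and k_schedule, and the exact schedule, are illustrative choices, not the authors' implementation.

```python
# Minimal sketch of the SMoE-Dropout idea: frozen random router + growing top-k.
# Class/function names and the linear k-schedule are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SMoEDropoutLayer(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, num_experts: int):
        super().__init__()
        # Randomly initialized router that is never trained (fixed routing policy).
        self.router = nn.Linear(d_model, num_experts, bias=False)
        for p in self.router.parameters():
            p.requires_grad_(False)
        # Each expert is a small feed-forward block.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor, k: int) -> torch.Tensor:
        # x: (batch, tokens, d_model); k = number of experts activated per token.
        logits = self.router(x)                                     # (B, T, E) frozen scores
        topk_vals, topk_idx = logits.topk(k, dim=-1)
        gate = torch.zeros_like(logits)
        gate.scatter_(-1, topk_idx, F.softmax(topk_vals, dim=-1))   # sparse gating weights
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            w = gate[..., e:e + 1]                                  # (B, T, 1) weight for expert e
            out = out + w * expert(x)                               # dense compute, for clarity only
        return out


def k_schedule(step: int, total_steps: int, num_experts: int, k_min: int = 2) -> int:
    # Assumed schedule: grow the activated-expert count linearly from k_min to all experts.
    frac = min(step / max(total_steps, 1), 1.0)
    return k_min + round(frac * (num_experts - k_min))
```

During inference or fine-tuning, k can be set to any value up to the full expert count; this is the self-slimmable behavior the abstract refers to, where activating more experts yields smooth, consistent accuracy gains at higher compute.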
Related papers
- SEER-MoE: Sparse Expert Efficiency through Regularization for Mixture-of-Experts [49.01990048827639]
We introduce SEER-MoE, a framework for reducing both the memory footprint and compute requirements of pre-trained MoE models.
The first stage prunes the total number of experts using heavy-hitters counting guidance (a hedged sketch of this step appears after this related-papers list), while the second stage employs a regularization-based fine-tuning strategy to recover the accuracy loss.
Our empirical studies demonstrate the effectiveness of our method, resulting in a sparse MoE model optimized for inference efficiency with minimal accuracy trade-offs.
arXiv Detail & Related papers (2024-04-07T22:13:43Z)
- A General and Efficient Training for Transformer via Token Expansion [44.002355107931805]
Vision Transformers (ViTs) typically require an extremely large training cost.
Existing methods have attempted to accelerate the training of ViTs, yet they typically sacrifice generality or accuracy.
We propose a novel token growth scheme Token Expansion (termed ToE) to achieve consistent training acceleration for ViTs.
arXiv Detail & Related papers (2024-03-31T12:44:24Z)
- PYRA: Parallel Yielding Re-Activation for Training-Inference Efficient Task Adaptation [61.57833648734164]
We propose a novel Parallel Yielding Re-Activation (PYRA) method for training-inference efficient task adaptation.
PYRA outperforms all competing methods under both low and high compression rates.
arXiv Detail & Related papers (2024-03-14T09:06:49Z)
- Dynamic Layer Tying for Parameter-Efficient Transformers [65.268245109828]
We employ Reinforcement Learning to select layers during training and tie them together.
This facilitates weight sharing, reduces the number of trainable parameters, and also serves as an effective regularization technique.
In particular, memory consumption during training is up to an order of magnitude lower than with conventional training.
arXiv Detail & Related papers (2024-01-23T14:53:20Z)
- The Emergence of Essential Sparsity in Large Pre-trained Models: The Weights that Matter [113.35761858962522]
This paper studies induced sparse patterns across multiple large pre-trained vision and language transformers.
We propose the existence of essential sparsity, defined by a sharp dropping point beyond which performance declines much faster.
We also find that essential sparsity holds for N:M sparsity patterns as well as for modern-scale large language models.
arXiv Detail & Related papers (2023-06-06T15:49:09Z)
- Understanding the Difficulty of Training Transformers [120.99980924577787]
We show that unbalanced gradients are not the root cause of the instability of training.
We propose Admin to stabilize training in the early stage and unleash its full potential in the late stage.
arXiv Detail & Related papers (2020-04-17T13:59:07Z)
- Regularized Evolutionary Population-Based Training [11.624954122221562]
This paper presents an algorithm called Evolutionary Population-Based Training (EPBT) that interleaves the training of a DNN's weights with the metalearning of loss functions.
EPBT results in faster, more accurate learning on image classification benchmarks.
arXiv Detail & Related papers (2020-02-11T06:28:13Z)
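The SEER-MoE entry above mentions pruning experts with "heavy-hitters counting guidance". The sketch below is one plausible reading under stated assumptions: collect router scores on calibration data, count how often each expert lands in the per-token top-k, and keep only the most frequently selected experts. The function name heavy_hitter_expert_ids, the counting rule, and the keep_ratio parameter are assumptions for illustration, not details from the paper.

```python
# Hedged sketch of heavy-hitters-style expert pruning for a trained MoE layer:
# count how often each expert is selected on calibration data and keep the most
# frequently used ones. Names and the counting rule are illustrative assumptions.
import torch


@torch.no_grad()
def heavy_hitter_expert_ids(router_logits: torch.Tensor, k: int, keep_ratio: float) -> torch.Tensor:
    """router_logits: (num_tokens, num_experts) scores collected on calibration data."""
    num_experts = router_logits.shape[-1]
    topk_idx = router_logits.topk(k, dim=-1).indices                 # experts chosen per token
    counts = torch.bincount(topk_idx.reshape(-1), minlength=num_experts)
    num_keep = max(1, int(keep_ratio * num_experts))
    return counts.topk(num_keep).indices                             # ids of the retained experts
```

The surviving experts would then go through the regularization-based fine-tuning stage mentioned in the summary to recover any accuracy loss.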