Gating Dropout: Communication-efficient Regularization for Sparsely
Activated Transformers
- URL: http://arxiv.org/abs/2205.14336v1
- Date: Sat, 28 May 2022 05:12:43 GMT
- Title: Gating Dropout: Communication-efficient Regularization for Sparsely
Activated Transformers
- Authors: Rui Liu, Young Jin Kim, Alexandre Muzio, Barzan Mozafari, Hany Hassan
Awadalla
- Abstract summary: We propose Gating Dropout, which allows tokens to ignore the gating network and stay at their local machines.
Similar to traditional dropout, we also show that Gating Dropout has a regularization effect during training, resulting in improved generalization performance.
- Score: 78.77361169167149
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sparsely activated transformers, such as Mixture of Experts (MoE), have
received great interest due to their outrageous scaling capability, which
enables dramatic increases in model size without significant increases in
computational cost. To achieve this, MoE models replace the feedforward
sub-layer with a Mixture-of-Experts sub-layer in transformers and use a gating
network to route each token to its assigned experts. Since the common practice
for efficient training of such models requires distributing experts and tokens
across different machines, this routing strategy often incurs a huge
cross-machine communication cost because tokens and their assigned experts
likely reside on different machines. In this paper, we propose Gating
Dropout, which allows tokens to ignore the gating network and stay at their
local machines, thus reducing the cross-machine communication. Similar to
traditional dropout, we also show that Gating Dropout has a regularization
effect during training, resulting in improved generalization performance. We
validate the effectiveness of Gating Dropout on multilingual machine
translation tasks. Our results demonstrate that Gating Dropout improves a
state-of-the-art MoE model with faster wall-clock time convergence rates and
better BLEU scores for a variety of model sizes and datasets.
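The routing behavior described above can be made concrete with a short sketch. Below is a minimal, single-machine PyTorch illustration, not the authors' implementation: a top-1 gated MoE layer that, with some probability during training, ignores the router and sends every token to a designated "local" expert. The class name GatingDropoutMoE, the sizes, and the drop probability are illustrative assumptions; in a real expert-parallel setup the skipped step would be the cross-machine all-to-all dispatch, and expert outputs would also be scaled by the gate probability.

```python
# Illustrative sketch only: a top-1 MoE layer with a layer-level "gating dropout" branch.
# In a distributed run, skipping the router is what avoids the all-to-all token exchange.
import torch
import torch.nn as nn

class GatingDropoutMoE(nn.Module):  # hypothetical name, not from the paper
    def __init__(self, d_model=512, d_ff=1024, num_experts=4, gate_drop_p=0.3, local_expert=0):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])
        self.gate_drop_p = gate_drop_p    # assumed probability of skipping the router
        self.local_expert = local_expert  # stands in for "the expert on this machine"

    def forward(self, x):  # x: (num_tokens, d_model)
        scores = torch.softmax(self.gate(x), dim=-1)
        assigned = scores.argmax(dim=-1)  # top-1 expert per token
        if self.training and torch.rand(()) < self.gate_drop_p:
            # Gating Dropout branch: ignore the router and keep all tokens local.
            assigned = torch.full_like(assigned, self.local_expert)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = assigned == e
            if mask.any():
                out[mask] = expert(x[mask])
        # Real MoE layers also weight expert outputs by the gate probability; omitted here.
        return out

tokens = torch.randn(16, 512)
layer = GatingDropoutMoE().train()
print(layer(tokens).shape)  # torch.Size([16, 512])
```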
Related papers
- Masked Mixers for Language Generation and Retrieval [0.0]
We observe poor input representation accuracy in transformers, but find more accurate representations in masked mixers.
Applied to TinyStories, the masked mixer learns causal language tasks more efficiently than early transformer implementations.
We introduce an efficient training approach for retrieval models based on existing generative model embeddings.
arXiv Detail & Related papers (2024-09-02T22:17:18Z)
- Efficient Fine-tuning of Audio Spectrogram Transformers via Soft Mixture of Adapters [11.05223262950967]
Mixture of Experts (MoE) architectures have recently started burgeoning due to their ability to scale a model's capacity while keeping the computational cost affordable.
This paper attempts to demystify the use of MoE for parameter-efficient fine-tuning of Audio Spectrogram Transformers to audio and speech downstream tasks.
It exploits adapters as the experts and, leveraging the recent Soft MoE method, it relies on a soft assignment between the input tokens and experts to keep the computational time limited.
arXiv Detail & Related papers (2024-02-01T18:16:04Z)
- Sparse MoE as the New Dropout: Scaling Dense and Self-Slimmable Transformers [107.3726071306935]
We propose a new plug-and-play training framework, SMoE-Dropout, to enable scaling transformers to better accuracy at their full capacity without collapse.
SMoE-Dropout consists of a randomly initialized and fixed router network that activates experts, and it gradually increases the number of activated experts as training progresses (a brief sketch of this routing schedule appears at the end of this list).
Our experiments demonstrate the superior performance and substantial computation savings of SMoE-Dropout, compared to dense training baselines with equivalent parameter counts.
arXiv Detail & Related papers (2023-03-02T22:12:51Z)
- SMILE: Scaling Mixture-of-Experts with Efficient Bi-level Routing [47.11171833082974]
We introduce SMILE, which exploits heterogeneous network bandwidth and splits a single-step routing into bi-level routing.
Our experimental results show that the proposed method obtains a 2.5x speedup over Switch Transformer in terms of pretraining throughput on the Colossal Clean Crawled Corpus without losing any convergence speed.
arXiv Detail & Related papers (2022-12-10T03:44:16Z)
- AutoMoE: Heterogeneous Mixture-of-Experts with Adaptive Computation for Efficient Neural Machine Translation [104.0979785739202]
Mixture-of-Expert (MoE) models have obtained state-of-the-art performance in Neural Machine Translation (NMT) tasks.
Existing MoE models mostly consider a homogeneous design where the same number of experts of the same size are placed uniformly throughout the network.
We develop AutoMoE, a framework for designing heterogeneous MoEs under computational constraints.
arXiv Detail & Related papers (2022-10-14T05:32:17Z)
- Taming Sparsely Activated Transformer with Stochastic Experts [76.0711573018493]
Sparsely activated models (SAMs) can easily scale to have outrageously large amounts of parameters without a significant increase in computational cost.
In this paper, we propose a new expert-based model, THOR (Transformer witH StOchastic ExpeRts).
Unlike classic expert-based models, such as the Switch Transformer, experts in THOR are randomly activated for each input during training and inference.
arXiv Detail & Related papers (2021-10-08T17:15:47Z)
- Enabling On-Device Training of Speech Recognition Models with Federated Dropout [4.165917555996752]
Federated learning can be used to train machine learning models at the edge, on local data that never leaves devices.
We propose using federated dropout to reduce the size of client models while training a full-size model server-side.
arXiv Detail & Related papers (2021-10-07T17:22:40Z)
- Efficient pre-training objectives for Transformers [84.64393460397471]
We study several efficient pre-training objectives for Transformers-based models.
We prove that eliminating the MASK token and considering the whole output in the loss computation are essential choices to improve performance.
arXiv Detail & Related papers (2021-04-20T00:09:37Z)
- The Cascade Transformer: an Application for Efficient Answer Sentence Selection [116.09532365093659]
We introduce the Cascade Transformer, a technique to adapt transformer-based models into a cascade of rankers.
When compared to a state-of-the-art transformer model, our approach reduces computation by 37% with almost no impact on accuracy.
arXiv Detail & Related papers (2020-05-05T23:32:01Z)
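As referenced in the SMoE-Dropout entry above, that framework routes with a randomly initialized, frozen router and gradually activates more experts during training. A minimal sketch of that idea under assumed details; the linear top-k growth schedule, the name RandomFixedRouter, and all sizes are illustrative and not taken from that paper:

```python
# Illustrative sketch of SMoE-Dropout-style routing: a random, frozen router whose
# number of activated experts k grows with training progress (schedule assumed here).
import torch
import torch.nn as nn

class RandomFixedRouter(nn.Module):  # hypothetical helper, not from the paper
    def __init__(self, d_model=512, num_experts=8):
        super().__init__()
        self.proj = nn.Linear(d_model, num_experts, bias=False)
        for p in self.proj.parameters():
            p.requires_grad_(False)  # router stays random and fixed throughout training

    def forward(self, x, progress):  # progress in [0, 1] over the training run
        k = max(1, int(round(progress * self.proj.out_features)))
        scores = self.proj(x)
        return scores.topk(k, dim=-1).indices  # experts activated for each token

router = RandomFixedRouter()
tokens = torch.randn(4, 512)
print(router(tokens, progress=0.25).shape)  # torch.Size([4, 2])
```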