Gating Dropout: Communication-efficient Regularization for Sparsely
Activated Transformers
- URL: http://arxiv.org/abs/2205.14336v1
- Date: Sat, 28 May 2022 05:12:43 GMT
- Title: Gating Dropout: Communication-efficient Regularization for Sparsely
Activated Transformers
- Authors: Rui Liu, Young Jin Kim, Alexandre Muzio, Barzan Mozafari, Hany Hassan
Awadalla
- Abstract summary: We propose Gating Dropout, which allows tokens to ignore the gating network and stay at their local machines.
Similar to traditional dropout, we also show that Gating Dropout has a regularization effect during training, resulting in improved generalization performance.
- Score: 78.77361169167149
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sparsely activated transformers, such as Mixture of Experts (MoE), have
received great interest due to their outrageous scaling capability, which
enables dramatic increases in model size without significant increases in
computational cost. To achieve this, MoE models replace the feedforward
sub-layer with a Mixture-of-Experts sub-layer in transformers and use a gating
network to route each token to its assigned experts. Since the common practice
for efficient training of such models requires distributing experts and tokens
across different machines, this routing strategy often incurs a huge
cross-machine communication cost because tokens and their assigned experts
likely reside on different machines. In this paper, we propose Gating
Dropout, which allows tokens to ignore the gating network and stay at their
local machines, thus reducing the cross-machine communication. Similar to
traditional dropout, we also show that Gating Dropout has a regularization
effect during training, resulting in improved generalization performance. We
validate the effectiveness of Gating Dropout on multilingual machine
translation tasks. Our results demonstrate that Gating Dropout improves a
state-of-the-art MoE model with faster wall-clock time convergence rates and
better BLEU scores for a variety of model sizes and datasets.
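The routing behavior described above can be made concrete with a short sketch. Below is a minimal, single-machine PyTorch illustration, not the authors' implementation: a top-1 gated MoE layer that, with some probability during training, ignores the router and sends every token to a designated "local" expert. The class name GatingDropoutMoE, the sizes, and the drop probability are illustrative assumptions; in a real expert-parallel setup the skipped step would be the cross-machine all-to-all dispatch, and expert outputs would also be scaled by the gate probability.

```python
# Illustrative sketch only: a top-1 MoE layer with a layer-level "gating dropout" branch.
# In a distributed run, skipping the router is what avoids the all-to-all token exchange.
import torch
import torch.nn as nn

class GatingDropoutMoE(nn.Module):  # hypothetical name, not from the paper
    def __init__(self, d_model=512, d_ff=1024, num_experts=4, gate_drop_p=0.3, local_expert=0):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])
        self.gate_drop_p = gate_drop_p    # assumed probability of skipping the router
        self.local_expert = local_expert  # stands in for "the expert on this machine"

    def forward(self, x):  # x: (num_tokens, d_model)
        scores = torch.softmax(self.gate(x), dim=-1)
        assigned = scores.argmax(dim=-1)  # top-1 expert per token
        if self.training and torch.rand(()) < self.gate_drop_p:
            # Gating Dropout branch: ignore the router and keep all tokens local.
            assigned = torch.full_like(assigned, self.local_expert)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = assigned == e
            if mask.any():
                out[mask] = expert(x[mask])
        # Real MoE layers also weight expert outputs by the gate probability; omitted here.
        return out

tokens = torch.randn(16, 512)
layer = GatingDropoutMoE().train()
print(layer(tokens).shape)  # torch.Size([16, 512])
```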
Related papers
- Masked Mixers for Language Generation and Retrieval [0.0]
We observe poor input representation accuracy in transformers, but find more accurate representations in masked mixers.
Applied to TinyStories, the masked mixer learns causal language tasks more efficiently than early transformer implementations.
We introduce an efficient training approach for retrieval models based on existing generative model embeddings.
arXiv Detail & Related papers (2024-09-02T22:17:18Z)
- Efficient Fine-tuning of Audio Spectrogram Transformers via Soft Mixture of Adapters [11.05223262950967]
Mixture of Experts (MoE) architectures have recently started burgeoning due to their ability to scale a model's capacity while keeping the computational cost affordable.
This paper attempts to demystify the use of MoE for parameter-efficient fine-tuning of Audio Spectrogram Transformers to audio and speech downstream tasks.
It exploits adapters as the experts and, leveraging the recent Soft MoE method, it relies on a soft assignment between the input tokens and experts to keep the computational time limited.
arXiv Detail & Related papers (2024-02-01T18:16:04Z)
- Sparse MoE as the New Dropout: Scaling Dense and Self-Slimmable Transformers [107.3726071306935]
We propose a new plug-and-play training framework, SMoE-Dropout, to enable scaling transformers to better accuracy at their full capacity without collapse.
SMoE-Dropout consists of a randomly initialized and fixed router network that activates experts, and it gradually increases the number of activated experts as training progresses (a brief sketch of this routing schedule appears at the end of this list).
Our experiments demonstrate the superior performance and substantial computation savings of SMoE-Dropout, compared to dense training baselines with equivalent parameter counts.
arXiv Detail & Related papers (2023-03-02T22:12:51Z)
- SMILE: Scaling Mixture-of-Experts with Efficient Bi-level Routing [47.11171833082974]
We introduce SMILE, which exploits heterogeneous network bandwidth and splits a single-step routing into bi-level routing.
Our experimental results show that the proposed method obtains a 2.5x speedup over Switch Transformer in terms of pretraining throughput on the Colossal Clean Crawled Corpus without losing any convergence speed.
arXiv Detail & Related papers (2022-12-10T03:44:16Z)
- AutoMoE: Heterogeneous Mixture-of-Experts with Adaptive Computation for Efficient Neural Machine Translation [104.0979785739202]
Mixture-of-Expert (MoE) models have obtained state-of-the-art performance in Neural Machine Translation (NMT) tasks.
Existing MoE models mostly consider a homogeneous design where the same number of experts of the same size are placed uniformly throughout the network.
We develop AutoMoE, a framework for designing heterogeneous MoEs under computational constraints.
arXiv Detail & Related papers (2022-10-14T05:32:17Z)
- Taming Sparsely Activated Transformer with Stochastic Experts [76.0711573018493]
Sparsely activated models (SAMs) can easily scale to have outrageously large amounts of parameters without a significant increase in computational cost.
In this paper, we propose a new expert-based model, THOR (Transformer witH StOchastic ExpeRts).
Unlike classic expert-based models, such as the Switch Transformer, experts in THOR are randomly activated for each input during training and inference.
arXiv Detail & Related papers (2021-10-08T17:15:47Z)
- Enabling On-Device Training of Speech Recognition Models with Federated Dropout [4.165917555996752]
Federated learning can be used to train machine learning models at the edge, on local data that never leaves devices.
We propose using federated dropout to reduce the size of client models while training a full-size model server-side.
arXiv Detail & Related papers (2021-10-07T17:22:40Z)
- Efficient pre-training objectives for Transformers [84.64393460397471]
We study several efficient pre-training objectives for Transformers-based models.
We prove that eliminating the MASK token and considering the whole output in the loss computation are essential choices to improve performance.
arXiv Detail & Related papers (2021-04-20T00:09:37Z)
- The Cascade Transformer: an Application for Efficient Answer Sentence Selection [116.09532365093659]
We introduce the Cascade Transformer, a technique to adapt transformer-based models into a cascade of rankers.
When compared to a state-of-the-art transformer model, our approach reduces computation by 37% with almost no impact on accuracy.
arXiv Detail & Related papers (2020-05-05T23:32:01Z)
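As referenced in the SMoE-Dropout entry above, that framework routes with a randomly initialized, frozen router and gradually activates more experts during training. A minimal sketch of that idea under assumed details; the linear top-k growth schedule, the name RandomFixedRouter, and all sizes are illustrative and not taken from that paper:

```python
# Illustrative sketch of SMoE-Dropout-style routing: a random, frozen router whose
# number of activated experts k grows with training progress (schedule assumed here).
import torch
import torch.nn as nn

class RandomFixedRouter(nn.Module):  # hypothetical helper, not from the paper
    def __init__(self, d_model=512, num_experts=8):
        super().__init__()
        self.proj = nn.Linear(d_model, num_experts, bias=False)
        for p in self.proj.parameters():
            p.requires_grad_(False)  # router stays random and fixed throughout training

    def forward(self, x, progress):  # progress in [0, 1] over the training run
        k = max(1, int(round(progress * self.proj.out_features)))
        scores = self.proj(x)
        return scores.topk(k, dim=-1).indices  # experts activated for each token

router = RandomFixedRouter()
tokens = torch.randn(4, 512)
print(router(tokens, progress=0.25).shape)  # torch.Size([4, 2])
```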