SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention
- URL: http://arxiv.org/abs/2312.07987v3
- Date: Mon, 30 Sep 2024 21:19:29 GMT
- Title: SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention
- Authors: Róbert Csordás, Piotr Piękos, Kazuki Irie, Jürgen Schmidhuber,
- Abstract summary: SwitchHead is an effective MoE method for the attention layer.
Our novel MoE mechanism allows SwitchHead to compute up to 8 times fewer attention matrices than the standard Transformer.
SwitchHead matches the perplexity of standard models with only 44% compute and 27% memory usage.
- Score: 39.09650673080772
- License:
- Abstract: Despite many recent works on Mixture of Experts (MoEs) for resource-efficient Transformer language models, existing methods mostly focus on MoEs for feedforward layers. Previous attempts at extending MoE to the self-attention layer fail to match the performance of the parameter-matched baseline. Our novel SwitchHead is an effective MoE method for the attention layer that successfully reduces both the compute and memory requirements, achieving wall-clock speedup, while matching the language modeling performance of the baseline Transformer. Our novel MoE mechanism allows SwitchHead to compute up to 8 times fewer attention matrices than the standard Transformer. SwitchHead can also be combined with MoE feedforward layers, resulting in fully-MoE "SwitchAll" Transformers. For our 262M parameter model trained on C4, SwitchHead matches the perplexity of standard models with only 44% compute and 27% memory usage. Zero-shot experiments on downstream tasks confirm the performance of SwitchHead, e.g., achieving more than 3.5% absolute improvements on BliMP compared to the baseline with an equal compute resource.
Related papers
- Shrinking the Giant : Quasi-Weightless Transformers for Low Energy Inference [0.30104001512119216]
Building models with fast and energy-efficient inference is imperative to enable a variety of transformer-based applications.
We build on an approach for learning LUT networks directly via an Extended Finite Difference method.
This allows for a computational and energy-efficient inference solution for transformer-based models.
arXiv Detail & Related papers (2024-11-04T05:38:56Z) - Mixture-of-Modules: Reinventing Transformers as Dynamic Assemblies of Modules [96.21649779507831]
We propose a novel architecture dubbed mixture-of-modules (MoM)
MoM is motivated by an intuition that any layer, regardless of its position, can be used to compute a token.
We show that MoM provides not only a unified framework for Transformers but also a flexible and learnable approach for reducing redundancy.
arXiv Detail & Related papers (2024-07-09T08:50:18Z) - An Empirical Study of Mamba-based Language Models [69.74383762508805]
Selective state-space models (SSMs) like Mamba overcome some shortcomings of Transformers.
We present a direct comparison between 8B-context Mamba, Mamba-2, and Transformer models trained on the same datasets.
We find that the 8B Mamba-2-Hybrid exceeds the 8B Transformer on all 12 standard tasks.
arXiv Detail & Related papers (2024-06-12T05:25:15Z) - MoEUT: Mixture-of-Experts Universal Transformers [75.96744719516813]
Universal Transformers (UTs) have advantages over standard Transformers in learning compositional generalizations.
Layer-sharing drastically reduces the parameter count compared to the non-shared model with the same dimensionality.
No previous work has succeeded in proposing a shared-layer Transformer design that is competitive in parameter count-dominated tasks such as language modeling.
arXiv Detail & Related papers (2024-05-25T03:24:32Z) - SDPose: Tokenized Pose Estimation via Circulation-Guide Self-Distillation [53.675725490807615]
We introduce SDPose, a new self-distillation method for improving the performance of small transformer-based models.
SDPose-T obtains 69.7% mAP with 4.4M parameters and 1.8 GFLOPs, while SDPose-S-V2 obtains 73.5% mAP on the MSCOCO validation dataset.
arXiv Detail & Related papers (2024-04-04T15:23:14Z) - Memory-efficient Stochastic methods for Memory-based Transformers [3.360916255196531]
Memory-based transformers can require a large amount of memory and can be quite inefficient.
We propose a novel two-phase training mechanism and a novel regularization technique to improve the training efficiency of memory-based transformers.
arXiv Detail & Related papers (2023-11-14T12:37:25Z) - AutoMoE: Heterogeneous Mixture-of-Experts with Adaptive Computation for
Efficient Neural Machine Translation [104.0979785739202]
Mixture-of-Expert (MoE) models have obtained state-of-the-art performance in Neural Machine Translation (NMT) tasks.
Existing MoE models mostly consider a homogeneous design where the same number of experts of the same size are placed uniformly throughout the network.
We develop AutoMoE -- a framework for designing heterogeneous MoE's under computational constraints.
arXiv Detail & Related papers (2022-10-14T05:32:17Z) - Transformer with a Mixture of Gaussian Keys [31.91701434633319]
Multi-head attention is a driving force behind state-of-the-art transformers.
Transformer-MGK replaces redundant heads in transformers with a mixture of keys at each head.
Compared to its conventional transformer counterpart, Transformer-MGK accelerates training and inference, has fewer parameters, and requires less FLOPs to compute.
arXiv Detail & Related papers (2021-10-16T23:43:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.