Transformer with a Mixture of Gaussian Keys
- URL: http://arxiv.org/abs/2110.08678v1
- Date: Sat, 16 Oct 2021 23:43:24 GMT
- Title: Transformer with a Mixture of Gaussian Keys
- Authors: Tam Nguyen, Tan M. Nguyen, Dung Le, Khuong Nguyen, Anh Tran, Richard G. Baraniuk, Nhat Ho and Stanley J. Osher
- Abstract summary: Multi-head attention is a driving force behind state-of-the-art transformers.
Transformer-MGK replaces redundant heads in transformers with a mixture of keys at each head.
Compared to its conventional transformer counterpart, Transformer-MGK accelerates training and inference, has fewer parameters, and requires fewer FLOPs to compute.
- Score: 31.91701434633319
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multi-head attention is a driving force behind state-of-the-art transformers
which achieve remarkable performance across a variety of natural language
processing (NLP) and computer vision tasks. It has been observed that for many
applications, these attention heads learn redundant embeddings, and most of them
can be removed without degrading the performance of the model. Inspired by this
observation, we propose Transformer with a Mixture of Gaussian Keys
(Transformer-MGK), a novel transformer architecture that replaces redundant
heads in transformers with a mixture of keys at each head. These mixtures of
keys follow a Gaussian mixture model and allow each attention head to focus on
different parts of the input sequence efficiently. Compared to its conventional
transformer counterpart, Transformer-MGK accelerates training and inference,
has fewer parameters, and requires fewer FLOPs to compute, while achieving
comparable or better accuracy across tasks. Transformer-MGK can also be easily
extended for use with linear attention. We empirically demonstrate the
advantage of Transformer-MGK in a range of practical applications including
language modeling and tasks that involve very long sequences. On the
Wikitext-103 and Long Range Arena benchmarks, Transformer-MGKs with 4 heads
attain performance comparable to or better than that of baseline transformers
with 8 heads.
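To make the mechanism concrete, below is a minimal NumPy sketch of single-head attention in which each key position carries a small Gaussian mixture rather than a single key, in the spirit of Transformer-MGK. It assumes isotropic components with a shared standard deviation and per-position mixture weights; the names (mgk_attention, K_mix, pi, sigma) are illustrative, and the exact parameterization and normalization in the paper may differ.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mgk_attention(Q, K_mix, V, pi, sigma=1.0):
    """Single-head attention with a mixture of Gaussian keys (illustrative only).

    Q:     (n, d)    queries
    K_mix: (n, M, d) M Gaussian key means per key position
    V:     (n, d)    values
    pi:    (n, M)    mixture weights per position (rows sum to 1)
    sigma: shared isotropic standard deviation of each component (assumed)
    """
    # Squared distance between every query and every key cluster: (n, n, M)
    diff = Q[:, None, None, :] - K_mix[None, :, :, :]
    sq_dist = (diff ** 2).sum(axis=-1)
    # Log-density of each query under each position's Gaussian components
    # (constants that do not depend on the key position cancel after normalization)
    log_comp = -sq_dist / (2.0 * sigma ** 2)
    # Mixture log-likelihood per (query, key position): logsumexp over components
    log_mix = np.log(pi[None, :, :] + 1e-12) + log_comp
    scores = np.logaddexp.reduce(log_mix, axis=-1)  # (n, n)
    # Normalize over key positions and aggregate values, as in standard attention
    A = softmax(scores, axis=-1)
    return A @ V

# Toy usage: 5 tokens, model dimension 8, 2 Gaussian key clusters per position
rng = np.random.default_rng(0)
n, d, M = 5, 8, 2
Q, V = rng.normal(size=(n, d)), rng.normal(size=(n, d))
K_mix = rng.normal(size=(n, M, d))
pi = softmax(rng.normal(size=(n, M)))
out = mgk_attention(Q, K_mix, V, pi)  # (n, d) attended representations
```

With one component per position and uniform weights, the score reduces to a negative squared query-key distance, a close relative of scaled dot-product attention (differing only by query- and key-norm terms); adding components lets a single head cover several key "modes" per position, which is the sense in which redundant heads can be consolidated.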
Related papers
- Shrinking the Giant: Quasi-Weightless Transformers for Low Energy Inference [0.30104001512119216]
Building models with fast and energy-efficient inference is imperative to enable a variety of transformer-based applications.
We build on an approach for learning LUT networks directly via an Extended Finite Difference method.
This allows for a computational and energy-efficient inference solution for transformer-based models.
arXiv Detail & Related papers (2024-11-04T05:38:56Z)
- MoEUT: Mixture-of-Experts Universal Transformers [75.96744719516813]
Universal Transformers (UTs) have advantages over standard Transformers in learning compositional generalizations.
Layer-sharing drastically reduces the parameter count compared to the non-shared model with the same dimensionality.
No previous work has succeeded in proposing a shared-layer Transformer design that is competitive in parameter count-dominated tasks such as language modeling.
arXiv Detail & Related papers (2024-05-25T03:24:32Z)
- Do Efficient Transformers Really Save Computation? [34.15764596496696]
We focus on the capabilities and limitations of efficient Transformers, specifically the Sparse Transformer and the Linear Transformer.
Our results show that while these models are expressive enough to solve general DP tasks, contrary to expectations, they require a model size that scales with the problem size.
We identify a class of DP problems for which these models can be more efficient than the standard Transformer.
arXiv Detail & Related papers (2024-02-21T17:00:56Z)
- ClipFormer: Key-Value Clipping of Transformers on Memristive Crossbars for Write Noise Mitigation [6.853523674099236]
In-memory computing (IMC) crossbars based on Non-volatile Memories (NVMs) have emerged as a promising solution for accelerating transformers.
We find pre-trained Vision Transformers (ViTs) to be vulnerable on crossbars due to the impact of dynamically generated write noise.
We propose a new memristive crossbar platform to boost the non-ideal accuracies of pre-trained ViT models.
arXiv Detail & Related papers (2024-02-04T19:04:37Z)
- Deformable Mixer Transformer with Gating for Multi-Task Learning of Dense Prediction [126.34551436845133]
CNNs and Transformers have their own advantages, and both have been widely used for dense prediction in multi-task learning (MTL).
We present a novel MTL model that combines the merits of deformable CNN and query-based Transformer with shared gating for multi-task learning of dense prediction.
arXiv Detail & Related papers (2023-08-10T17:37:49Z)
- HiViT: Hierarchical Vision Transformer Meets Masked Image Modeling [126.89573619301953]
We propose a new design of hierarchical vision transformers named HiViT (short for Hierarchical ViT).
HiViT enjoys both high efficiency and good performance in MIM.
When running MAE on ImageNet-1K, HiViT-B reports a +0.6% accuracy gain over ViT-B and a 1.9× speed-up over Swin-B.
arXiv Detail & Related papers (2022-05-30T09:34:44Z)
- Glance-and-Gaze Vision Transformer [13.77016463781053]
We propose a new vision Transformer named Glance-and-Gaze Transformer (GG-Transformer).
It is motivated by the Glance and Gaze behavior of human beings when recognizing objects in natural scenes.
We empirically demonstrate our method achieves consistently superior performance over previous state-of-the-art Transformers.
arXiv Detail & Related papers (2021-06-04T06:13:47Z)
- Scalable Transformers for Neural Machine Translation [86.4530299266897]
Transformer has been widely adopted in Neural Machine Translation (NMT) because of its large capacity and parallel training of sequence generation.
We propose novel scalable Transformers, which naturally contain sub-Transformers of different scales with shared parameters.
A three-stage training scheme is proposed to tackle the difficulty of training the scalable Transformers.
arXiv Detail & Related papers (2021-06-04T04:04:10Z)
- Transformer-Based Deep Image Matching for Generalizable Person Re-identification [114.56752624945142]
We investigate the possibility of applying Transformers for image matching and metric learning given pairs of images.
We find that the Vision Transformer (ViT) and the vanilla Transformer with decoders are not adequate for image matching due to their lack of image-to-image attention.
We propose a new simplified decoder, which drops the full attention implementation with the softmax weighting, keeping only the query-key similarity.
arXiv Detail & Related papers (2021-05-30T05:38:33Z)
- DA-Transformer: Distance-aware Transformer [87.20061062572391]
In this paper, we propose DA-Transformer, a distance-aware Transformer that can exploit the real distance.
arXiv Detail & Related papers (2020-10-14T10:09:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences.