MABViT -- Modified Attention Block Enhances Vision Transformers
- URL: http://arxiv.org/abs/2312.01324v2
- Date: Mon, 1 Jan 2024 13:27:15 GMT
- Title: MABViT -- Modified Attention Block Enhances Vision Transformers
- Authors: Mahesh Ramesh and Aswinkumar Ramkumar
- Abstract summary: We propose a novel transformer variant that integrates non-linearity within the attention block to tackle this problem.
We implement the GLU-based activation function on the Value tensor, and this new technique surpasses the current state-of-the-art S/16 variant of Vision Transformers by 0.6% on the ImageNet-1K dataset.
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Recent studies have demonstrated the effectiveness of Gated Linear Units
(GLU) in enhancing transformer models, particularly in Large Language Models
(LLMs). Additionally, utilizing a parallel configuration within each
Transformer block rather than the conventional serialized method has been
revealed to accelerate the training of LLMs without significantly impacting
performance. However, when the MLP and attention block were run in parallel for
the image classification task, we observed a noticeable decline in performance.
We propose a novel transformer variant that integrates non-linearity within the
attention block to tackle this problem. We implemented the GLU-based activation
function on the Value tensor, and this new technique surpasses the current
state-of-the-art S/16 variant of Vision Transformers by 0.6% on the ImageNet-1K
dataset while utilizing fewer parameters. It also surpasses the B/16 variant
while using only half the parameters. Furthermore, we provide results with the
GELU activation function variant to confirm our assertions. Lastly, we showcase
that the MABViT variants exhibit greater potential when utilized in deep
transformers compared to the standard architecture.
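The core modification described above, applying a GLU-style gate to the Value tensor inside the attention block, can be illustrated with a minimal sketch. This is an assumption-laden toy version, not the authors' implementation: MABViT applies the gate within multi-head attention after the Value projection, whereas here we just show the GLU mechanics on a single projected vector.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def glu(vec):
    """Gated Linear Unit: split the input in half and gate one half
    with the sigmoid of the other, GLU(a, b) = a * sigmoid(b).
    This is what introduces non-linearity into the (normally linear)
    Value path of the attention block."""
    half = len(vec) // 2
    a, b = vec[:half], vec[half:]
    return [ai * sigmoid(bi) for ai, bi in zip(a, b)]

# Hypothetical Value vector, projected to twice the head dimension
# so that the gate halves it back to the original size:
value = [0.5, -1.0, 2.0, 0.0]
gated = glu(value)  # length 2, non-linear in the inputs
```

In the paper's GELU variant, the sigmoid gate would be replaced by a GELU on one half; the key point is only that the Value pathway is no longer purely linear, which the authors argue compensates for the non-linearity lost when the MLP and attention blocks run in parallel.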
Related papers
- Efficient Visual Transformer by Learnable Token Merging [8.905020033545643]
We propose a novel transformer block, Transformer with Learnable Token Merging (LTM), or LTM-Transformer.
LTM-Transformer is compatible with many popular and compact transformer networks.
It renders compact and efficient visual transformers with comparable or much better prediction accuracy than the original visual transformers.
arXiv Detail & Related papers (2024-07-21T17:09:19Z) - ClipFormer: Key-Value Clipping of Transformers on Memristive Crossbars
for Write Noise Mitigation [6.853523674099236]
In-memory computing (IMC) crossbars based on Non-volatile Memories (NVMs) have emerged as a promising solution for accelerating transformers.
We find pre-trained Vision Transformers (ViTs) to be vulnerable on crossbars due to the impact of dynamically generated write noise.
We propose a new memristive crossbar platform to boost the non-ideal accuracies of pre-trained ViT models.
arXiv Detail & Related papers (2024-02-04T19:04:37Z) - CageViT: Convolutional Activation Guided Efficient Vision Transformer [90.69578999760206]
This paper presents an efficient vision Transformer, called CageViT, that is guided by convolutional activation to reduce computation.
Our CageViT, unlike current Transformers, utilizes a new encoder to handle the rearranged tokens.
Experimental results demonstrate that the proposed CageViT outperforms the most recent state-of-the-art backbones by a large margin in terms of efficiency.
arXiv Detail & Related papers (2023-05-17T03:19:18Z) - The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in
Transformers [59.87030906486969]
This paper studies the curious phenomenon that activation maps in machine learning models with Transformer architectures are sparse.
We show that sparsity is a prevalent phenomenon that occurs for both natural language processing and vision tasks.
We discuss how sparsity immediately implies a way to significantly reduce the FLOP count and improve efficiency for Transformers.
arXiv Detail & Related papers (2022-10-12T15:25:19Z) - Efficient Attention-free Video Shift Transformers [56.87581500474093]
This paper tackles the problem of efficient video recognition.
Video transformers have recently dominated the efficiency (top-1 accuracy vs FLOPs) spectrum.
We extend our formulation in the video domain to construct Video Affine-Shift Transformer.
arXiv Detail & Related papers (2022-08-23T17:48:29Z) - HiViT: Hierarchical Vision Transformer Meets Masked Image Modeling [126.89573619301953]
We propose a new design of hierarchical vision transformers named HiViT (short for Hierarchical ViT).
HiViT enjoys both high efficiency and good performance in MIM.
When running MAE on ImageNet-1K, HiViT-B reports a +0.6% accuracy gain over ViT-B and a 1.9$\times$ speed-up over Swin-B.
arXiv Detail & Related papers (2022-05-30T09:34:44Z) - Towards Lightweight Transformer via Group-wise Transformation for
Vision-and-Language Tasks [126.33843752332139]
We introduce Group-wise Transformation towards a universal yet lightweight Transformer for vision-and-language tasks, termed as LW-Transformer.
We apply LW-Transformer to a set of Transformer-based networks, and quantitatively measure them on three vision-and-language tasks and six benchmark datasets.
Experimental results show that while saving a large number of parameters and computations, LW-Transformer achieves very competitive performance against the original Transformer networks for vision-and-language tasks.
arXiv Detail & Related papers (2022-04-16T11:30:26Z) - Transformer with a Mixture of Gaussian Keys [31.91701434633319]
Multi-head attention is a driving force behind state-of-the-art transformers.
Transformer-MGK replaces redundant heads in transformers with a mixture of keys at each head.
Compared to its conventional transformer counterpart, Transformer-MGK accelerates training and inference, has fewer parameters, and requires less FLOPs to compute.
arXiv Detail & Related papers (2021-10-16T23:43:24Z) - Applying the Transformer to Character-level Transduction [68.91664610425114]
The transformer has been shown to outperform recurrent neural network-based sequence-to-sequence models in various word-level NLP tasks.
We show that with a large enough batch size, the transformer does indeed outperform recurrent models for character-level tasks.
arXiv Detail & Related papers (2020-05-20T17:25:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.