Activator: GLU Activation Function as the Core Component of a Vision Transformer
- URL: http://arxiv.org/abs/2405.15953v2
- Date: Sun, 4 Aug 2024 00:25:10 GMT
- Title: Activator: GLU Activation Function as the Core Component of a Vision Transformer
- Authors: Abdullah Nazhat Abdullah, Tarkan Aydin
- Abstract summary: Transformer architecture currently represents the main driver behind many successes in a variety of tasks addressed by deep learning.
This paper investigates substituting the attention mechanism usually adopted in transformer architectures with a gated linear unit (GLU) activation incorporated within a multi-layer perceptron architecture.
- Score: 1.3812010983144802
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformer architecture currently represents the main driver behind many successes in a variety of tasks addressed by deep learning, especially the recent advances in natural language processing (NLP) culminating with large language models (LLMs). In addition, transformer architecture has attracted widespread interest from computer vision (CV) researchers and practitioners, allowing for many advancements in vision-related tasks and opening the door for multi-task and multi-modal deep learning architectures that share the same principle of operation. One drawback to these architectures is their reliance on the scaled dot-product attention mechanism with the softmax activation function, which is computationally expensive and requires large compute capabilities both for training and inference. This paper investigates substituting the attention mechanism usually adopted in transformer architectures with an architecture incorporating gated linear unit (GLU) activation within a multi-layer perceptron (MLP) structure, in conjunction with the default MLP of the traditional transformer design. A further step taken by this paper is to eliminate the second, non-gated MLP to reduce the computational cost. Experimental assessments conducted by this research show that both proposed modifications and reductions offer competitive performance relative to baseline architectures, supporting the aim of this work: establishing a more efficient yet capable alternative to the traditional attention mechanism as the core component in designing transformer architectures.
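The GLU-style gating that the abstract describes as a replacement for softmax attention can be sketched roughly as follows. This is an illustrative NumPy sketch, not the paper's implementation; the weight names and dimensions are hypothetical.

```python
import numpy as np

def glu_mlp_block(x, w_gate, w_value, w_out):
    """Illustrative GLU-style MLP block: a sigmoid gate branch is
    multiplied elementwise with a value branch, standing in for the
    softmax attention mixing step (GLU(x) = sigmoid(x W_g) * (x W_v))."""
    gate = 1.0 / (1.0 + np.exp(-(x @ w_gate)))  # sigmoid gate
    value = x @ w_value
    return (gate * value) @ w_out

# Toy shapes: 4 tokens, model width 8, hidden width 16.
rng = np.random.default_rng(0)
tokens, d_model, d_hidden = 4, 8, 16
x = rng.standard_normal((tokens, d_model))
y = glu_mlp_block(
    x,
    rng.standard_normal((d_model, d_hidden)),
    rng.standard_normal((d_model, d_hidden)),
    rng.standard_normal((d_hidden, d_model)),
)
print(y.shape)  # (4, 8): same shape as the input, like an attention block
```

Note that, unlike attention, this block mixes channels but not tokens, which is where the quadratic cost of scaled dot-product attention is avoided.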
Related papers
- Skip-Layer Attention: Bridging Abstract and Detailed Dependencies in Transformers [56.264673865476986]
This paper introduces Skip-Layer Attention (SLA) to enhance Transformer models.
SLA improves the model's ability to capture dependencies between high-level abstract features and low-level details.
Our implementation extends the Transformer's functionality by enabling queries in a given layer to interact with keys and values from both the current layer and one preceding layer.
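The skip-layer interaction described above can be sketched as attention over keys and values pooled from two layers. This is a hypothetical NumPy sketch of the idea, not the paper's code; names and shapes are illustrative.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def skip_layer_attention(q, k_cur, v_cur, k_prev, v_prev):
    """Illustrative sketch: queries attend over keys/values
    concatenated from the current layer and one preceding layer."""
    k = np.concatenate([k_cur, k_prev], axis=0)
    v = np.concatenate([v_cur, v_prev], axis=0)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

rng = np.random.default_rng(0)
tokens, d = 4, 8
out = skip_layer_attention(
    rng.standard_normal((tokens, d)),  # queries from the current layer
    rng.standard_normal((tokens, d)), rng.standard_normal((tokens, d)),
    rng.standard_normal((tokens, d)), rng.standard_normal((tokens, d)),
)
print(out.shape)  # (4, 8)
```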
arXiv Detail & Related papers (2024-06-17T07:24:38Z) - Symmetric Dot-Product Attention for Efficient Training of BERT Language Models [5.838117137253223]
We propose an alternative compatibility function for the self-attention mechanism introduced by the Transformer architecture.
When applied to the pre-training of BERT-like models, this new symmetric attention mechanism reaches a score of 79.36 on the GLUE benchmark against 78.74 for the traditional implementation.
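One simple way to realize a symmetric compatibility function, sketched below as an assumption rather than the paper's exact formulation, is to share a single projection between queries and keys, which makes the score matrix symmetric by construction.

```python
import numpy as np

def symmetric_attention_scores(x, w_shared):
    """Illustrative sketch: one shared projection replaces separate
    query/key projections, so scores[i, j] == scores[j, i]."""
    p = x @ w_shared
    return p @ p.T / np.sqrt(p.shape[-1])

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
scores = symmetric_attention_scores(x, rng.standard_normal((8, 8)))
print(np.allclose(scores, scores.T))  # True: the score matrix is symmetric
```

Sharing the projection also halves the parameter count of the query/key path, which is consistent with the efficiency motivation of the paper.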
arXiv Detail & Related papers (2024-06-10T15:24:15Z) - Cross-Architecture Transfer Learning for Linear-Cost Inference Transformers [1.1499643186017316]
We propose Cross-Architecture Transfer Learning (XATL) to improve efficiency of Transformer Language Models.
XATL significantly reduces training time by up to 2.5x and converges to a better minimum, with up to 2.6% stronger models on the LM benchmarks within the same compute budget.
arXiv Detail & Related papers (2024-04-03T12:27:36Z) - NiNformer: A Network in Network Transformer with Token Mixing as a Gating Function Generator [1.3812010983144802]
The attention mechanism was utilized in computer vision as the Vision Transformer (ViT).
It comes with the drawback of being expensive and requiring datasets of considerable size for effective optimization.
This paper introduces a new computational block as an alternative to the standard ViT block that reduces the compute burdens.
arXiv Detail & Related papers (2024-03-04T19:08:20Z) - A Comprehensive Survey on Applications of Transformers for Deep Learning Tasks [60.38369406877899]
Transformer is a deep neural network that employs a self-attention mechanism to comprehend the contextual relationships within sequential data.
Transformer models excel in handling long-range dependencies between input sequence elements and enable parallel processing.
Our survey encompasses the identification of the top five application domains for transformer-based models.
arXiv Detail & Related papers (2023-06-11T23:13:51Z) - Exploring Transformers for Behavioural Biometrics: A Case Study in Gait Recognition [0.7874708385247353]
This article intends to explore and propose novel gait biometric recognition systems based on Transformers.
Several state-of-the-art architectures (Vanilla, Informer, Autoformer, Block-Recurrent Transformer, and THAT) are considered in the experimental framework.
Experiments are carried out using two popular public databases, whuGAIT and OU-ISIR.
arXiv Detail & Related papers (2022-06-03T08:08:40Z) - Vision Transformer with Convolutions Architecture Search [72.70461709267497]
We propose an architecture search method, Vision Transformer with Convolutions Architecture Search (VTCAS).
The high-performance backbone network searched by VTCAS introduces the desirable features of convolutional neural networks into the Transformer architecture.
It enhances the robustness of the neural network for object recognition, especially in the low illumination indoor scene.
arXiv Detail & Related papers (2022-03-20T02:59:51Z) - Multi-Exit Vision Transformer for Dynamic Inference [88.17413955380262]
We propose seven different architectures for early exit branches that can be used for dynamic inference in Vision Transformer backbones.
We show that each one of our proposed architectures could prove useful in the trade-off between accuracy and speed.
arXiv Detail & Related papers (2021-06-29T09:01:13Z) - Twins: Revisiting Spatial Attention Design in Vision Transformers [81.02454258677714]
In this work, we demonstrate that a carefully devised yet simple spatial attention mechanism performs favourably against the state-of-the-art schemes.
We propose two vision transformer architectures, namely, Twins-PCPVT and Twins-SVT.
Our proposed architectures are highly-efficient and easy to implement, only involving matrix multiplications that are highly optimized in modern deep learning frameworks.
arXiv Detail & Related papers (2021-04-28T15:42:31Z) - Transformers with Competitive Ensembles of Independent Mechanisms [97.93090139318294]
We propose a new Transformer layer which divides the hidden representation and parameters into multiple mechanisms, which only exchange information through attention.
We study TIM on a large-scale BERT model, on the Image Transformer, and on speech enhancement and find evidence for semantically meaningful specialization as well as improved performance.
arXiv Detail & Related papers (2021-02-27T21:48:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.