Activator: GLU Activation Function as the Core Component of a Vision Transformer
- URL: http://arxiv.org/abs/2405.15953v3
- Date: Sat, 26 Jul 2025 21:37:44 GMT
- Title: Activator: GLU Activation Function as the Core Component of a Vision Transformer
- Authors: Abdullah Nazhat Abdullah, Tarkan Aydin
- Abstract summary: This paper investigates substituting the MLP and attention mechanism usually adopted in transformer architectures with a structure built around the gated linear unit (GLU) activation function, with the aim of reducing the computational cost. The results strongly support the aims of this work, in which the focus was to extensively utilize GLU-based MLPs, establishing a more efficient but capable alternative to the traditional MLP and the attention mechanism as the core component in the design of transformer architectures.
- Score: 1.3812010983144802
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The transformer architecture has driven many successes in a variety of tasks within the field of deep learning, in particular the recent advances in natural language processing (NLP) culminating with large language models (LLM). Adding to that success, the transformer architecture has found widespread interest from computer vision (CV) researchers and practitioners, allowing for many advancements in vision-related tasks and opening the door for multi-task and multi-modal deep learning architectures that share the same principle of operation. One drawback of these architectures is their reliance on the scaled dot-product attention mechanism with the softmax activation function, which is computationally expensive and requires large compute capabilities for both training and inference. This paper investigates substituting the MLP and attention mechanism usually adopted in transformer architectures with an architecture that incorporates a gated linear unit (GLU) activation function structure, with the aim of reducing the computational cost. The equalized experimental assessments conducted in this work show that the proposed modification, with its targeted reductions in computational complexity, offers competitive performance compared to the selected baseline architectures. The results strongly support the aims of this work, in which the focus was to extensively utilize GLU-based MLPs, establishing a more efficient but capable alternative to the traditional MLP and the attention mechanism as the core component in the design of transformer architectures.
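A gated linear unit splits a projection into a content branch and a gate branch and multiplies the two elementwise, so its cost grows linearly with the number of tokens rather than quadratically as in softmax attention. The block below is a minimal sketch of a GLU-style MLP of the kind the abstract describes; the layer sizes, the sigmoid gate, and the residual wiring are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a GLU-style MLP block (illustrative; dimensions, the
# sigmoid gate, and the residual wiring are assumptions, not the paper's code).
import torch
import torch.nn as nn

class GLUBlock(nn.Module):
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.value = nn.Linear(dim, hidden)  # content branch
        self.gate = nn.Linear(dim, hidden)   # gating branch
        self.out = nn.Linear(hidden, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim); cost is linear in the token count,
        # unlike the quadratic softmax attention the paper aims to replace.
        h = self.norm(x)
        h = self.value(h) * torch.sigmoid(self.gate(h))  # GLU: value * sigma(gate)
        return x + self.out(h)                           # residual connection

x = torch.randn(2, 196, 256)        # e.g. 14x14 image patches, 256-dim embeddings
print(GLUBlock(256, 512)(x).shape)  # torch.Size([2, 196, 256])
```

Note that this block only mixes features within each token; how the proposed architecture mixes information across tokens is not reproduced here.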
Related papers
- Small transformer architectures for task switching [2.7195102129095003]
It is non-trivial to conceive small-scale applications for which attention-based architectures outperform traditional approaches. We show that standard transformers cannot solve a basic task switching reference model. We show that transformers, long short-term memory recurrent networks (LSTM), and plain multi-layer perceptrons (MLPs) achieve similar, but only modest prediction accuracies.
arXiv Detail & Related papers (2025-08-06T14:01:05Z) - Learnable Multi-Scale Wavelet Transformer: A Novel Alternative to Self-Attention [0.0]
The Learnable Multi-Scale Wavelet Transformer (LMWT) is a novel architecture that replaces standard dot-product self-attention. We present the detailed mathematical formulation of the learnable Haar wavelet module and its integration into the transformer framework. Our results indicate that the LMWT achieves competitive performance while offering substantial computational advantages.
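As a rough illustration of replacing dot-product attention with a wavelet operation, the sketch below applies a single-level Haar decomposition along the token axis (pairwise averages and differences) with learnable per-channel sub-band gains before reconstruction; the actual LMWT is multi-scale and considerably more elaborate, so treat this as an assumption-laden toy version.

```python
import torch
import torch.nn as nn

class HaarTokenMixer(nn.Module):
    """Single-level Haar mixing along the token axis (toy sketch, not LMWT)."""
    def __init__(self, dim: int):
        super().__init__()
        # Learnable per-channel gains for the approximation / detail sub-bands.
        self.w_low = nn.Parameter(torch.ones(dim))
        self.w_high = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim) with an even number of tokens.
        even, odd = x[:, 0::2], x[:, 1::2]
        low = (even + odd) / 2.0                      # approximation coefficients
        high = (even - odd) / 2.0                     # detail coefficients
        low, high = low * self.w_low, high * self.w_high
        out = torch.empty_like(x)
        out[:, 0::2] = low + high                     # inverse Haar step
        out[:, 1::2] = low - high
        return out

print(HaarTokenMixer(64)(torch.randn(2, 16, 64)).shape)  # torch.Size([2, 16, 64])
```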
arXiv Detail & Related papers (2025-04-08T22:16:54Z) - Transforming Vision Transformer: Towards Efficient Multi-Task Asynchronous Learning [59.001091197106085]
Multi-Task Learning (MTL) for Vision Transformer aims at enhancing the model capability by tackling multiple tasks simultaneously.
Most recent works have predominantly focused on designing Mixture-of-Experts (MoE) structures and integrating Low-Rank Adaptation (LoRA) to efficiently perform multi-task learning.
We propose a novel approach dubbed Efficient Multi-Task Learning (EMTAL) by transforming a pre-trained Vision Transformer into an efficient multi-task learner.
arXiv Detail & Related papers (2025-01-12T17:41:23Z) - Cliqueformer: Model-Based Optimization with Structured Transformers [102.55764949282906]
Large neural networks excel at prediction tasks, but their application to design problems, such as protein engineering or materials discovery, requires solving offline model-based optimization (MBO) problems. We present Cliqueformer, a transformer-based architecture that learns the black-box function's structure through functional graphical models (FGM). Across various domains, including chemical and genetic design tasks, Cliqueformer demonstrates superior performance compared to existing methods.
arXiv Detail & Related papers (2024-10-17T00:35:47Z) - Skip-Layer Attention: Bridging Abstract and Detailed Dependencies in Transformers [56.264673865476986]
This paper introduces Skip-Layer Attention (SLA) to enhance Transformer models.
SLA improves the model's ability to capture dependencies between high-level abstract features and low-level details.
Our implementation extends the Transformer's functionality by enabling queries in a given layer to interact with keys and values from both the current layer and one preceding layer.
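Taken literally, this means a query in a given layer scores keys from both that layer and the preceding one. One straightforward way to realize that is to concatenate the two layers' keys and values along the token axis before the usual softmax attention, as in the hedged sketch below; the paper's exact SLA formulation may differ.

```python
import torch
import torch.nn.functional as F

def skip_layer_attention(q, k_cur, v_cur, k_prev, v_prev):
    """Queries attend over keys/values from the current AND the preceding layer.
    All tensors: (batch, heads, tokens, head_dim). Illustrative sketch only."""
    k = torch.cat([k_cur, k_prev], dim=2)  # concatenate along the token axis
    v = torch.cat([v_cur, v_prev], dim=2)
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return F.softmax(scores, dim=-1) @ v

b, h, n, d = 2, 4, 16, 32
args = [torch.randn(b, h, n, d) for _ in range(5)]
print(skip_layer_attention(*args).shape)  # torch.Size([2, 4, 16, 32])
```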
arXiv Detail & Related papers (2024-06-17T07:24:38Z) - Symmetric Dot-Product Attention for Efficient Training of BERT Language Models [5.838117137253223]
We propose an alternative compatibility function for the self-attention mechanism introduced by the Transformer architecture.
When applied to the pre-training of BERT-like models, this new symmetric attention mechanism reaches a score of 79.36 on the GLUE benchmark against 78.74 for the traditional implementation.
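One common way to make the compatibility function symmetric is to share a single projection between queries and keys, so the pre-softmax score matrix is a Gram matrix and score(i, j) = score(j, i). The sketch below assumes that formulation, which may not match the paper's exact variant.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SymmetricSelfAttention(nn.Module):
    """Self-attention with a shared query/key projection, giving a symmetric
    raw score matrix. Assumed formulation, not necessarily the paper's."""
    def __init__(self, dim: int):
        super().__init__()
        self.qk = nn.Linear(dim, dim, bias=False)  # one projection for Q and K
        self.v = nn.Linear(dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        qk = self.qk(x)                                            # (batch, tokens, dim)
        scores = qk @ qk.transpose(-2, -1) / (x.shape[-1] ** 0.5)  # symmetric Gram matrix
        return F.softmax(scores, dim=-1) @ self.v(x)

print(SymmetricSelfAttention(64)(torch.randn(2, 10, 64)).shape)  # torch.Size([2, 10, 64])
```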
arXiv Detail & Related papers (2024-06-10T15:24:15Z) - Cross-Architecture Transfer Learning for Linear-Cost Inference Transformers [1.1499643186017316]
We propose Cross-Architecture Transfer Learning (XATL) to improve the efficiency of Transformer Language Models.
XATL significantly reduces training time by up to 2.5x and converges to a better minimum, with up to 2.6% stronger models on the LM benchmarks within the same compute budget.
arXiv Detail & Related papers (2024-04-03T12:27:36Z) - NiNformer: A Network in Network Transformer with Token Mixing as a Gating Function Generator [1.3812010983144802]
The attention mechanism was adopted in computer vision in the form of the Vision Transformer (ViT).
It comes with the drawback of being computationally expensive and requiring datasets of considerable size for effective optimization.
This paper introduces a new computational block as an alternative to the standard ViT block that reduces the computational burden.
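One possible reading of "token mixing as a gating function generator" is that a token-mixing layer (acting across the token axis) produces a gate that modulates a per-token linear branch. The sketch below follows that reading as a hypothetical illustration; it is not the NiNformer block itself.

```python
import torch
import torch.nn as nn

class TokenMixGateBlock(nn.Module):
    """A token-mixing layer generates a gate for a per-token linear branch.
    Hypothetical reading of the NiNformer idea, not the paper's block."""
    def __init__(self, tokens: int, dim: int):
        super().__init__()
        self.token_mix = nn.Linear(tokens, tokens)  # mixes across the token axis
        self.value = nn.Linear(dim, dim)            # per-token channel branch

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim)
        gate = self.token_mix(x.transpose(1, 2)).transpose(1, 2)
        return x + torch.sigmoid(gate) * self.value(x)  # gated residual update

print(TokenMixGateBlock(tokens=16, dim=64)(torch.randn(2, 16, 64)).shape)
```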
arXiv Detail & Related papers (2024-03-04T19:08:20Z) - How Do Transformers Learn In-Context Beyond Simple Functions? A Case Study on Learning with Representations [98.7450564309923]
This paper takes initial steps on understanding in-context learning (ICL) in more complex scenarios, by studying learning with representations.
We construct synthetic in-context learning problems with a compositional structure, where the label depends on the input through a possibly complex but fixed representation function.
We show theoretically the existence of transformers that approximately implement such algorithms with mild depth and size.
arXiv Detail & Related papers (2023-10-16T17:40:49Z) - A Comprehensive Survey on Applications of Transformers for Deep Learning Tasks [60.38369406877899]
The Transformer is a deep neural network that employs a self-attention mechanism to comprehend the contextual relationships within sequential data.
Transformer models excel at handling long-range dependencies between input sequence elements and enable parallel processing.
Our survey encompasses the identification of the top five application domains for transformer-based models.
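For reference, the core operation the survey (and the main paper above) refers to is scaled dot-product self-attention, shown here in minimal single-head form; note the score matrix that makes its cost quadratic in the number of tokens.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_self_attention(q, k, v):
    """softmax(Q K^T / sqrt(d)) V, single head, no masking or projections.
    q, k, v: (batch, tokens, dim)."""
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)  # (batch, n, n): quadratic in n
    return F.softmax(scores, dim=-1) @ v

x = torch.randn(2, 8, 32)
print(scaled_dot_product_self_attention(x, x, x).shape)  # torch.Size([2, 8, 32])
```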
arXiv Detail & Related papers (2023-06-11T23:13:51Z) - Exploring Transformers for Behavioural Biometrics: A Case Study in Gait Recognition [0.7874708385247353]
This article intends to explore and propose novel gait biometric recognition systems based on Transformers.
Several state-of-the-art architectures (Vanilla, Informer, Autoformer, Block-Recurrent Transformer, and THAT) are considered in the experimental framework.
Experiments are carried out using the two popular public databases whuGAIT and OU-ISIR.
arXiv Detail & Related papers (2022-06-03T08:08:40Z) - Vision Transformer with Convolutions Architecture Search [72.70461709267497]
We propose an architecture search method, Vision Transformer with Convolutions Architecture Search (VTCAS).
The high-performance backbone network searched by VTCAS introduces the desirable features of convolutional neural networks into the Transformer architecture.
It enhances the robustness of the neural network for object recognition, especially in low-illumination indoor scenes.
arXiv Detail & Related papers (2022-03-20T02:59:51Z) - Multi-Exit Vision Transformer for Dynamic Inference [88.17413955380262]
We propose seven different architectures for early exit branches that can be used for dynamic inference in Vision Transformer backbones.
We show that each one of our proposed architectures could prove useful in the trade-off between accuracy and speed.
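An early-exit branch is essentially a lightweight classifier attached to an intermediate block, with inference stopping once a branch is confident enough. The sketch below shows only this generic pattern with a confidence threshold; the seven branch designs proposed in the paper are not reproduced.

```python
import torch
import torch.nn as nn

class MultiExitBackbone(nn.Module):
    """Generic early-exit pattern: a head after every block, and inference
    stops at the first sufficiently confident head (sketch only)."""
    def __init__(self, blocks: nn.ModuleList, dim: int, num_classes: int):
        super().__init__()
        self.blocks = blocks
        self.heads = nn.ModuleList(nn.Linear(dim, num_classes) for _ in blocks)

    def forward(self, x: torch.Tensor, threshold: float = 0.9):
        for block, head in zip(self.blocks, self.heads):
            x = block(x)
            logits = head(x.mean(dim=1))                   # pool tokens, then classify
            conf = logits.softmax(dim=-1).max(dim=-1).values
            if bool((conf > threshold).all()):             # early exit for the whole batch
                return logits
        return logits                                      # fall through to the last head

blocks = nn.ModuleList(nn.Sequential(nn.Linear(64, 64), nn.GELU()) for _ in range(4))
print(MultiExitBackbone(blocks, dim=64, num_classes=10)(torch.randn(2, 16, 64)).shape)
```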
arXiv Detail & Related papers (2021-06-29T09:01:13Z) - GroupBERT: Enhanced Transformer Architecture with Efficient Grouped Structures [57.46093180685175]
We demonstrate a set of modifications to the structure of a Transformer layer, producing a more efficient architecture.
We add a convolutional module to complement the self-attention module, decoupling the learning of local and global interactions.
We apply the resulting architecture to language representation learning and demonstrate its superior performance compared to BERT models of different scales.
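A rough sketch of the convolutional module is a depthwise 1-D convolution over the token axis added as an extra residual sub-layer; the kernel size, normalization placement, and the grouped matrix multiplications GroupBERT also employs are assumptions or omissions here.

```python
import torch
import torch.nn as nn

class LocalConvModule(nn.Module):
    """Depthwise 1-D convolution over the token axis: a local-interaction
    sub-layer of the kind GroupBERT places next to self-attention (sketch)."""
    def __init__(self, dim: int, kernel_size: int = 7):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.conv = nn.Conv1d(dim, dim, kernel_size,
                              padding=kernel_size // 2, groups=dim)  # depthwise

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim); Conv1d expects (batch, channels, tokens).
        h = self.norm(x).transpose(1, 2)
        return x + self.conv(h).transpose(1, 2)  # residual local mixing

print(LocalConvModule(64)(torch.randn(2, 16, 64)).shape)  # torch.Size([2, 16, 64])
```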
arXiv Detail & Related papers (2021-06-10T15:41:53Z) - Twins: Revisiting Spatial Attention Design in Vision Transformers [81.02454258677714]
In this work, we demonstrate that a carefully devised yet simple spatial attention mechanism performs favourably against the state-of-the-art schemes.
We propose two vision transformer architectures, namely, Twins-PCPVT and Twins-SVT.
Our proposed architectures are highly efficient and easy to implement, only involving matrix multiplications that are highly optimized in modern deep learning frameworks.
arXiv Detail & Related papers (2021-04-28T15:42:31Z) - Transformers with Competitive Ensembles of Independent Mechanisms [97.93090139318294]
We propose a new Transformer layer which divides the hidden representation and parameters into multiple mechanisms, which only exchange information through attention.
We study this architecture, Transformers with Independent Mechanisms (TIM), on a large-scale BERT model, on the Image Transformer, and on speech enhancement, and find evidence for semantically meaningful specialization as well as improved performance.
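Reading the description directly, the hidden state is partitioned into several "mechanisms", each with its own feed-forward parameters, and the mechanisms exchange information only through attention. The sketch below shows that partition plus a simple attention step across mechanisms within each token; TIM's competition and full attention scheme are not reproduced, so treat this as an assumption-laden toy.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MechanismSplitFFN(nn.Module):
    """Hidden state split into independent 'mechanisms', each with its own FFN,
    exchanging information via attention across mechanisms (toy sketch of TIM)."""
    def __init__(self, dim: int, n_mech: int):
        super().__init__()
        assert dim % n_mech == 0
        self.n_mech, self.d = n_mech, dim // n_mech
        self.ffns = nn.ModuleList(
            nn.Sequential(nn.Linear(self.d, 2 * self.d), nn.GELU(),
                          nn.Linear(2 * self.d, self.d)) for _ in range(n_mech))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim) -> (batch, tokens, n_mech, d)
        parts = x.view(*x.shape[:2], self.n_mech, self.d)
        parts = torch.stack([f(parts[..., i, :]) for i, f in enumerate(self.ffns)], dim=2)
        scores = parts @ parts.transpose(-2, -1) / (self.d ** 0.5)  # attention across mechanisms
        parts = F.softmax(scores, dim=-1) @ parts
        return parts.reshape(*x.shape)

print(MechanismSplitFFN(dim=64, n_mech=4)(torch.randn(2, 16, 64)).shape)
```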
arXiv Detail & Related papers (2021-02-27T21:48:46Z)