Related papers: SwiftFormer: Efficient Additive Attention for Transformer-based Real-time Mobile Vision Applications

SwiftFormer: Efficient Additive Attention for Transformer-based Real-time Mobile Vision Applications

URL: http://arxiv.org/abs/2303.15446v2
Date: Tue, 25 Jul 2023 19:56:00 GMT
Title: SwiftFormer: Efficient Additive Attention for Transformer-based Real-time Mobile Vision Applications
Authors: Abdelrahman Shaker, Muhammad Maaz, Hanoona Rasheed, Salman Khan, Ming-Hsuan Yang, Fahad Shahbaz Khan
Abstract summary: We introduce a novel efficient additive attention mechanism that effectively replaces the quadratic matrix multiplication operations with linear element-wise multiplications. We build a series of models called "SwiftFormer" which achieves state-of-the-art performance in terms of both accuracy and mobile inference speed. Our small variant achieves 78.5% top-1 ImageNet-1K accuracy with only 0.8 ms latency on iPhone 14, which is more accurate and 2x faster compared to MobileViT-v2.
Score: 98.90623605283564
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Self-attention has become a defacto choice for capturing global context in various vision applications. However, its quadratic computational complexity with respect to image resolution limits its use in real-time applications, especially for deployment on resource-constrained mobile devices. Although hybrid approaches have been proposed to combine the advantages of convolutions and self-attention for a better speed-accuracy trade-off, the expensive matrix multiplication operations in self-attention remain a bottleneck. In this work, we introduce a novel efficient additive attention mechanism that effectively replaces the quadratic matrix multiplication operations with linear element-wise multiplications. Our design shows that the key-value interaction can be replaced with a linear layer without sacrificing any accuracy. Unlike previous state-of-the-art methods, our efficient formulation of self-attention enables its usage at all stages of the network. Using our proposed efficient additive attention, we build a series of models called "SwiftFormer" which achieves state-of-the-art performance in terms of both accuracy and mobile inference speed. Our small variant achieves 78.5% top-1 ImageNet-1K accuracy with only 0.8 ms latency on iPhone 14, which is more accurate and 2x faster compared to MobileViT-v2. Code: https://github.com/Amshaker/SwiftFormer

Related papers

S2AFormer: Strip Self-Attention for Efficient Vision Transformer [37.930090368513355]
Vision Transformer (ViT) has made significant advancements in computer vision.<n>Recent methods have combined the strengths of convolutions and self-attention to achieve better trade-offs.<n>We propose S2AFormer, an efficient Vision Transformer architecture featuring novel Strip Self-Attention (SSA)
arXiv Detail & Related papers (2025-05-28T10:17:23Z)
iFormer: Integrating ConvNet and Transformer for Mobile Application [0.6798775532273751]
iFormer integrates the fast local representation capacity of convolution with the efficient global modeling ability of self-attention. We conduct comprehensive experiments demonstrating that iFormer outperforms existing lightweight networks across various tasks.
arXiv Detail & Related papers (2025-01-26T02:34:58Z)
CARE Transformer: Mobile-Friendly Linear Visual Transformer via Decoupled Dual Interaction [77.8576094863446]
We propose a new detextbfCoupled dutextbfAl-interactive lineatextbfR atttextbfEntion (CARE) mechanism. We first propose an asymmetrical feature decoupling strategy that asymmetrically decouples the learning process for local inductive bias and long-range dependencies. By adopting a decoupled learning way and fully exploiting complementarity across features, our method can achieve both high efficiency and accuracy.
arXiv Detail & Related papers (2024-11-25T07:56:13Z)
CAS-ViT: Convolutional Additive Self-attention Vision Transformers for Efficient Mobile Applications [73.80247057590519]
Vision Transformers (ViTs) mark a revolutionary advance in neural networks with their token mixer's powerful global context capability. We introduce CAS-ViT: Convolutional Additive Self-attention Vision Transformers, to achieve a balance between efficiency and performance in mobile applications. Our model achieves 83.0%/84.1% top-1 with only 12M/21M parameters on ImageNet-1K.
arXiv Detail & Related papers (2024-08-07T11:33:46Z)
CageViT: Convolutional Activation Guided Efficient Vision Transformer [90.69578999760206]
This paper presents an efficient vision Transformer, called CageViT, that is guided by convolutional activation to reduce computation. Our CageViT, unlike current Transformers, utilizes a new encoder to handle the rearranged tokens. Experimental results demonstrate that the proposed CageViT outperforms the most recent state-of-the-art backbones by a large margin in terms of efficiency.
arXiv Detail & Related papers (2023-05-17T03:19:18Z)
EfficientViT: Memory Efficient Vision Transformer with Cascaded Group Attention [44.148667664413004]
We propose a family of high-speed vision transformers named EfficientViT. We find that the speed of existing transformer models is commonly bounded by memory inefficient operations. To address this, we present a cascaded group attention module feeding attention heads with different splits.
arXiv Detail & Related papers (2023-05-11T17:59:41Z)
FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization [14.707312504365376]
We introduce FastViT, a hybrid vision transformer architecture that obtains the state-of-the-art latency-accuracy trade-off. We show that our model is 3.5x faster than CMT, 4.9x faster than EfficientNet, and 1.9x faster than ConvNeXt on a mobile device for the same accuracy on the ImageNet dataset.
arXiv Detail & Related papers (2023-03-24T17:58:32Z)
RTFormer: Efficient Design for Real-Time Semantic Segmentation with Transformer [63.25665813125223]
We propose RTFormer, an efficient dual-resolution transformer for real-time semantic segmenation. It achieves better trade-off between performance and efficiency than CNN-based models. Experiments on mainstream benchmarks demonstrate the effectiveness of our proposed RTFormer.
arXiv Detail & Related papers (2022-10-13T16:03:53Z)
EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for Mobile Vision Applications [68.35683849098105]
We introduce split depth-wise transpose attention (SDTA) encoder that splits input tensors into multiple channel groups. Our EdgeNeXt model with 1.3M parameters achieves 71.2% top-1 accuracy on ImageNet-1K. Our EdgeNeXt model with 5.6M parameters achieves 79.4% top-1 accuracy on ImageNet-1K.
arXiv Detail & Related papers (2022-06-21T17:59:56Z)
MobileOne: An Improved One millisecond Mobile Backbone [14.041480018494394]
We analyze different metrics by deploying several mobile-friendly networks on a mobile device. We design an efficient backbone MobileOne, with variants achieving an inference time under 1 ms on an iPhone12. We show that MobileOne achieves state-of-the-art performance within the efficient architectures while being many times faster on mobile.
arXiv Detail & Related papers (2022-06-08T17:55:11Z)
Separable Self-attention for Mobile Vision Transformers [34.32399598443582]
This paper introduces a separable self-attention method with linear complexity, i.e. $O(k)$. The improved model, MobileViTv2, is state-of-the-art on several mobile vision tasks, including ImageNet object classification and MS-COCO object detection.
arXiv Detail & Related papers (2022-06-06T15:31:35Z)
Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers [115.90778814368703]
Our objective is language-based search of large-scale image and video datasets. For this task, the approach that consists of independently mapping text and vision to a joint embedding space, a.k.a. dual encoders, is attractive as retrieval scales. An alternative approach of using vision-text transformers with cross-attention gives considerable improvements in accuracy over the joint embeddings.
arXiv Detail & Related papers (2021-03-30T17:57:08Z)

This list is automatically generated from the titles and abstracts of the papers in this site.