Dual Path Transformer with Partition Attention
- URL: http://arxiv.org/abs/2305.14768v1
- Date: Wed, 24 May 2023 06:17:53 GMT
- Title: Dual Path Transformer with Partition Attention
- Authors: Zhengkai Jiang and Liang Liu and Jiangning Zhang and Yabiao Wang and
Mingang Chen and Chengjie Wang
- Abstract summary: We present a novel attention mechanism, called dual attention, which is both efficient and effective.
We evaluate the effectiveness of our model on several computer vision tasks, including image classification on ImageNet, object detection on COCO, and semantic segmentation on Cityscapes.
The proposed DualFormer-XS achieves 81.5% top-1 accuracy on ImageNet, outperforming the recent state-of-the-art MPViT-XS by 0.6% top-1 accuracy with much higher throughput.
- Score: 26.718318398951933
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper introduces a novel attention mechanism, called dual attention,
which is both efficient and effective. The dual attention mechanism consists of
two parallel components: local attention generated by Convolutional Neural
Networks (CNNs) and long-range attention generated by Vision Transformers
(ViTs). To address the high computational complexity and memory footprint of
vanilla Multi-Head Self-Attention (MHSA), we introduce a novel Multi-Head
Partition-wise Attention (MHPA) mechanism. The partition-wise attention
approach models both intra-partition and inter-partition attention
simultaneously. Building on the dual attention block and partition-wise
attention mechanism, we present a hierarchical vision backbone called
DualFormer. We evaluate the effectiveness of our model on several computer
vision tasks, including image classification on ImageNet, object detection on
COCO, and semantic segmentation on Cityscapes. Specifically, the proposed
DualFormer-XS achieves 81.5% top-1 accuracy on ImageNet, outperforming the
recent state-of-the-art MPViT-XS by 0.6% top-1 accuracy with much higher
throughput.
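This listing ships no code, so the following is a minimal, assumption-laden sketch of what partition-wise attention could look like: tokens on an H x W grid are split into non-overlapping P x P partitions, attention runs within each partition (intra-partition), and a second attention pass over per-partition mean summaries models inter-partition interaction before being broadcast back. Every name here (`PartitionAttention`, the mean-summary choice, the broadcast add) is an illustrative assumption, not the authors' implementation.

```python
import torch
import torch.nn as nn

class PartitionAttention(nn.Module):
    """Illustrative sketch of partition-wise attention (not the authors' code)."""

    def __init__(self, dim, num_heads=4, partition=7):
        super().__init__()
        self.partition = partition
        self.intra = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.inter = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):                       # x: (B, H, W, C), H and W divisible by P
        B, H, W, C = x.shape
        P = self.partition
        # Split the grid into non-overlapping P x P partitions.
        win = x.view(B, H // P, P, W // P, P, C).permute(0, 1, 3, 2, 4, 5)
        win = win.reshape(-1, P * P, C)
        win, _ = self.intra(win, win, win)      # intra-partition attention

        # Summarise each partition by its mean token, attend across partitions,
        # and broadcast the result back onto every token of that partition.
        n = (H // P) * (W // P)
        summary = win.mean(dim=1).view(B, n, C)
        summary, _ = self.inter(summary, summary, summary)
        win = win.view(B, n, P * P, C) + summary.unsqueeze(2)

        # Restore the (B, H, W, C) layout.
        win = win.view(B, H // P, W // P, P, P, C).permute(0, 1, 3, 2, 4, 5)
        return win.reshape(B, H, W, C)
```

Structured this way, the largest attention map is max((P*P)^2, n^2) entries rather than (H*W)^2, which is where the claimed savings over vanilla MHSA would come from.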
Related papers
- iiANET: Inception Inspired Attention Hybrid Network for efficient Long-Range Dependency [0.0]
We introduce iiANET (Inception Inspired Attention Network), an efficient hybrid model designed to capture long-range dependencies in complex images.
The fundamental building block, iiABlock, integrates global 2D-MHSA (Multi-Head Self-Attention) with Registers, MBConv2 (MobileNetV2-based convolution), and dilated convolution in parallel.
We serially integrate an ECANET (Efficient Channel Attention Network) at the end of each iiABlock to calibrate channel-wise attention for enhanced model performance.
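ECA itself is a small, well-known module; as a point of reference, a minimal PyTorch-style sketch of ECA-flavoured channel recalibration might look like the block below. The kernel size is fixed at 3 here for brevity, whereas ECA derives it adaptively from the channel count; this is not iiANET's code.

```python
import torch
import torch.nn as nn

class ECALayer(nn.Module):
    """Minimal sketch of ECA-style channel recalibration (illustrative only)."""

    def __init__(self, k_size=3):
        super().__init__()
        # 1D convolution captures local cross-channel interaction without
        # the dimensionality reduction used by squeeze-and-excitation blocks.
        self.conv = nn.Conv1d(1, 1, kernel_size=k_size, padding=k_size // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                          # x: (B, C, H, W)
        y = x.mean(dim=(2, 3))                     # global average pool -> (B, C)
        y = self.conv(y.unsqueeze(1)).squeeze(1)   # 1D conv across the channel axis
        return x * self.sigmoid(y)[:, :, None, None]

# Example: recalibrate a 64-channel feature map.
out = ECALayer()(torch.randn(2, 64, 32, 32))
```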
arXiv Detail & Related papers (2024-07-10T12:39:02Z)
- NiNformer: A Network in Network Transformer with Token Mixing as a Gating Function Generator [1.3812010983144802]
The attention mechanism was adopted in computer vision in the form of the Vision Transformer (ViT).
However, it is computationally expensive and requires datasets of considerable size for effective optimization.
This paper introduces a new computational block as an alternative to the standard ViT block that reduces the compute burden.
arXiv Detail & Related papers (2024-03-04T19:08:20Z)
- Bilateral Network with Residual U-blocks and Dual-Guided Attention for Real-time Semantic Segmentation [18.393208069320362]
We design a new fusion mechanism for the two-branch architecture that is guided by attention computation.
Specifically, our proposed Dual-Guided Attention (DGA) module replaces some multi-scale transformations.
Experiments on the Cityscapes and CamVid datasets show the effectiveness of our method.
arXiv Detail & Related papers (2023-10-31T09:20:59Z)
- Joint Spatial-Temporal and Appearance Modeling with Transformer for Multiple Object Tracking [59.79252390626194]
We propose a novel solution named TransSTAM, which leverages Transformer to model both the appearance features of each object and the spatial-temporal relationships among objects.
The proposed method is evaluated on multiple public benchmarks including MOT16, MOT17, and MOT20, and it achieves a clear performance improvement in both IDF1 and HOTA.
arXiv Detail & Related papers (2022-05-31T01:19:18Z)
- Vision Transformer with Convolutions Architecture Search [72.70461709267497]
We propose an architecture search method, Vision Transformer with Convolutions Architecture Search (VTCAS).
The high-performance backbone network searched by VTCAS introduces the desirable features of convolutional neural networks into the Transformer architecture.
It enhances the robustness of the neural network for object recognition, especially in low-illumination indoor scenes.
arXiv Detail & Related papers (2022-03-20T02:59:51Z)
- ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond [76.35955924137986]
We propose a Vision Transformer Advanced by Exploring intrinsic inductive bias (IB) from convolutions, i.e., ViTAE.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
We obtain state-of-the-art classification performance, i.e., 88.5% Top-1 accuracy on the ImageNet validation set and the best 91.2% Top-1 accuracy on the ImageNet real validation set.
arXiv Detail & Related papers (2022-02-21T10:40:05Z)
- BOAT: Bilateral Local Attention Vision Transformer [70.32810772368151]
Early Vision Transformers such as ViT and DeiT adopt global self-attention, which is computationally expensive when the number of patches is large.
Recent Vision Transformers adopt local self-attention mechanisms, where self-attention is computed within local windows.
We propose a Bilateral lOcal Attention vision Transformer (BOAT), which integrates feature-space local attention with image-space local attention.
arXiv Detail & Related papers (2022-01-31T07:09:50Z)
- Couplformer: Rethinking Vision Transformer with Coupling Attention Map [7.789667260916264]
The Transformer model has demonstrated its outstanding performance in the computer vision domain.
We propose a novel memory-efficient attention mechanism named Couplformer, which decouples the attention map into two sub-matrices.
Experiments show that Couplformer reduces memory consumption by 28% compared with the regular Transformer.
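The precise decoupling Couplformer uses is not spelled out in this summary. Purely as an illustration of the general memory-saving idea of replacing one full N x N attention map with two much smaller ones, the sketch below uses an axial-style factorisation over the height and width axes of the token grid; it is a hedged stand-in, not the paper's formulation.

```python
import torch
import torch.nn as nn

class AxialFactorizedAttention(nn.Module):
    """Axial-style stand-in for a factorised attention map (illustrative only).

    Instead of one (H*W) x (H*W) map, attention runs along rows and then along
    columns, so the largest map held in memory is max(H, W)^2, not (H*W)^2.
    """

    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.row_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.col_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):                                    # x: (B, H, W, C)
        B, H, W, C = x.shape
        rows = x.reshape(B * H, W, C)                        # attend within each row
        rows, _ = self.row_attn(rows, rows, rows)
        x = rows.view(B, H, W, C)

        cols = x.permute(0, 2, 1, 3).reshape(B * W, H, C)    # attend within each column
        cols, _ = self.col_attn(cols, cols, cols)
        return cols.view(B, W, H, C).permute(0, 2, 1, 3)     # back to (B, H, W, C)
```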
arXiv Detail & Related papers (2021-12-10T10:05:35Z)
- PSViT: Better Vision Transformer via Token Pooling and Attention Sharing [114.8051035856023]
We propose PSViT, a ViT with token Pooling and attention Sharing, to reduce redundancy.
Experimental results show that the proposed scheme can achieve up to 6.6% accuracy improvement in ImageNet classification.
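As a rough illustration of the token-pooling half of this idea (attention sharing is not shown), the snippet below pools a token sequence laid out on an assumed square grid so that later layers see fewer tokens; the helper name `pool_tokens` is hypothetical, not PSViT's code.

```python
import torch
import torch.nn.functional as F

def pool_tokens(x, grid_hw, stride=2):
    """Average-pool a token sequence laid out on a 2D grid (illustrative sketch).

    x: (B, N, C) tokens with N = H * W; returns (B, N // stride**2, C).
    """
    B, N, C = x.shape
    H, W = grid_hw
    grid = x.transpose(1, 2).reshape(B, C, H, W)          # back to a feature map
    grid = F.avg_pool2d(grid, kernel_size=stride, stride=stride)
    return grid.flatten(2).transpose(1, 2)                # (B, H*W // stride^2, C)

# Example: quartering the token count of a 14 x 14 grid of 96-d tokens.
tokens = torch.randn(2, 14 * 14, 96)
print(pool_tokens(tokens, (14, 14)).shape)                # torch.Size([2, 49, 96])
```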
arXiv Detail & Related papers (2021-08-07T11:30:54Z)
- Long-Short Transformer: Efficient Transformers for Language and Vision [97.2850205384295]
Long-Short Transformer (Transformer-LS) is an efficient self-attention mechanism for modeling long sequences with linear complexity for both language and vision tasks.
It aggregates a novel long-range attention with dynamic projection to model distant correlations and a short-term attention to capture fine-grained local correlations.
Our method outperforms the state-of-the-art models on multiple tasks in language and vision domains, including the Long Range Arena benchmark, autoregressive language modeling, and ImageNet classification.
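A hedged sketch of the long-range half of this recipe: keys and values are compressed to a small number of landmark tokens before attention, so the score matrix is N x r rather than N x N. For simplicity the projection here is a fixed Linformer-style linear map, whereas Transformer-LS computes it dynamically from the input and pairs it with a separate short-term windowed branch, both of which are omitted.

```python
import torch
import torch.nn as nn

class LowRankLongAttention(nn.Module):
    """Illustrative low-rank long-range attention (fixed projection, not Transformer-LS)."""

    def __init__(self, dim, seq_len, rank=64, num_heads=4):
        super().__init__()
        self.proj = nn.Linear(seq_len, rank, bias=False)   # compress along the sequence axis
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):                                  # x: (B, N, C) with N == seq_len
        kv = self.proj(x.transpose(1, 2)).transpose(1, 2)  # (B, rank, C) landmark tokens
        out, _ = self.attn(x, kv, kv)                      # queries keep full length N
        return out
```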
arXiv Detail & Related papers (2021-07-05T18:00:14Z)
- DMSANet: Dual Multi Scale Attention Network [0.0]
We propose a new attention module that not only achieves the best performance but also has fewer parameters than most existing models.
Our attention module can easily be integrated with other convolutional neural networks because of its lightweight nature.
arXiv Detail & Related papers (2021-06-13T10:31:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.