MAFormer: A Transformer Network with Multi-scale Attention Fusion for
Visual Recognition
- URL: http://arxiv.org/abs/2209.01620v1
- Date: Wed, 31 Aug 2022 06:29:27 GMT
- Title: MAFormer: A Transformer Network with Multi-scale Attention Fusion for
Visual Recognition
- Authors: Yunhao Wang, Huixin Sun, Xiaodi Wang, Bin Zhang, Chao Li, Ying Xin,
Baochang Zhang, Errui Ding, Shumin Han
- Abstract summary: We introduce Multi-scale Attention Fusion into transformer (MAFormer).
MAFormer explores local aggregation and global feature extraction in a dual-stream framework for visual recognition.
Our MAFormer achieves state-of-the-art performance on common vision tasks.
- Score: 45.68567088645708
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision Transformer and its variants have demonstrated great potential in
various computer vision tasks. However, conventional vision transformers often focus
on global dependencies at a coarse level and therefore struggle to learn both global
relationships and fine-grained, token-level representations. In
this paper, we introduce Multi-scale Attention Fusion into transformer
(MAFormer), which explores local aggregation and global feature extraction in a
dual-stream framework for visual recognition. We develop a simple but effective
module to explore the full potential of transformers for visual representation
by learning fine-grained and coarse-grained features at a token level and
dynamically fusing them. Our Multi-scale Attention Fusion (MAF) block consists
of: i) a local window attention branch that learns short-range interactions
within windows, aggregating fine-grained local features; ii) global feature
extraction through a novel Global Learning with Down-sampling (GLD) operation
to efficiently capture long-range context information within the whole image;
iii) a fusion module that self-explores the integration of both features via
attention. Our MAFormer achieves state-of-the-art performance on common vision
tasks. In particular, MAFormer-L achieves 85.9% Top-1 accuracy on ImageNet,
surpassing CSWin-B and LV-ViT-L by 1.7% and 0.6%, respectively. On MSCOCO,
MAFormer outperforms the prior art CSWin by 1.7% mAP on object detection
and 1.4% on instance segmentation with a comparable parameter count,
demonstrating the potential to be a general backbone network.
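A minimal PyTorch-style sketch of the dual-stream block described above is given below. Since the abstract does not spell out the GLD operation or the fusion module, the pooled key/value cross-attention and the per-token softmax gating used here are illustrative assumptions rather than the authors' exact design; the class name MAFBlockSketch and the hyper-parameters (window size 7, pooling ratio 4) are likewise hypothetical, and residual connections, normalization, and MLP sub-layers are omitted for brevity.

```python
# Illustrative sketch of a dual-stream local/global block in the spirit of MAF.
# The GLD down-sampling and the attention-based fusion are assumptions, not the
# published design.
import torch
import torch.nn as nn


class MAFBlockSketch(nn.Module):
    def __init__(self, dim=96, num_heads=4, window=7, pool=4):
        super().__init__()
        self.window = window
        # i) local window attention branch: short-range, fine-grained features
        self.local_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # ii) global branch: full-resolution queries attend to a down-sampled map
        #     (one plausible reading of "Global Learning with Down-sampling")
        self.pool = nn.AvgPool2d(pool, pool)
        self.global_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # iii) fusion: per-token softmax weights over the two streams
        #      (a simplified stand-in for the paper's attention-based fusion)
        self.fuse = nn.Linear(2 * dim, 2)

    def forward(self, x, H, W):
        # x: (B, N, C) tokens with N == H * W; H and W must be divisible by window
        B, N, C = x.shape
        feat = x.transpose(1, 2).reshape(B, C, H, W)

        # local branch: partition into non-overlapping windows, attend within each
        w = self.window
        win = feat.reshape(B, C, H // w, w, W // w, w)
        win = win.permute(0, 2, 4, 3, 5, 1).reshape(-1, w * w, C)
        local, _ = self.local_attn(win, win, win)
        local = local.reshape(B, H // w, W // w, w, w, C)
        local = local.permute(0, 1, 3, 2, 4, 5).reshape(B, N, C)

        # global branch: cross-attention from all tokens to a coarse, pooled set
        coarse = self.pool(feat).flatten(2).transpose(1, 2)  # (B, N / pool^2, C)
        glob, _ = self.global_attn(x, coarse, coarse)

        # fusion: dynamically weight the fine-grained and coarse-grained streams
        gate = torch.softmax(self.fuse(torch.cat([local, glob], dim=-1)), dim=-1)
        return gate[..., :1] * local + gate[..., 1:] * glob


# usage: 224x224 image, patch size 16 -> 14x14 tokens (window=7 divides 14)
tokens = torch.randn(2, 14 * 14, 96)
out = MAFBlockSketch()(tokens, 14, 14)
print(out.shape)  # torch.Size([2, 196, 96])
```

In this reading, the local branch aggregates fine-grained features inside non-overlapping windows, the global branch keeps long-range attention cheap by attending from every token to a small set of pooled tokens, and the gate mixes the two streams per token.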
Related papers
- Revisiting the Integration of Convolution and Attention for Vision Backbone [59.50256661158862]
Convolutions and multi-head self-attentions (MHSAs) are typically considered alternatives to each other for building vision backbones.
We propose in this work to use MHSAs and Convs in parallel at different granularity levels instead.
We empirically verify the potential of the proposed integration scheme, named GLMix: by offloading the burden of fine-grained features to lightweight Convs, it is sufficient to use MHSAs in a few semantic slots.
arXiv Detail & Related papers (2024-11-21T18:59:08Z) - Brain-Inspired Stepwise Patch Merging for Vision Transformers [6.108377966393714]
We propose a novel technique called Stepwise Patch Merging (SPM), which enhances the subsequent attention mechanism's ability to 'see' better.
Extensive experiments conducted on benchmark datasets, including ImageNet-1K, COCO, and ADE20K, demonstrate that SPM significantly improves the performance of various models.
arXiv Detail & Related papers (2024-09-11T03:04:46Z) - INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model [71.50973774576431]
We propose a novel MLLM, INF-LLaVA, designed for effective high-resolution image perception.
First, we introduce a Dual-perspective Cropping Module (DCM), which ensures that each sub-image contains continuous details from a local perspective.
Second, we introduce a Dual-perspective Enhancement Module (DEM) to enable the mutual enhancement of global and local features.
arXiv Detail & Related papers (2024-07-23T06:02:30Z) - CTRL-F: Pairing Convolution with Transformer for Image Classification via Multi-Level Feature Cross-Attention and Representation Learning Fusion [0.0]
We present a novel lightweight hybrid network that pairs Convolution with Transformers.
We fuse the local responses acquired from the convolution path with the global responses acquired from the MFCA module.
Experiments demonstrate that our variants achieve state-of-the-art performance, whether trained from scratch on large-scale data or in a low-data regime.
arXiv Detail & Related papers (2024-07-09T08:47:13Z) - Local-to-Global Cross-Modal Attention-Aware Fusion for HSI-X Semantic Segmentation [19.461033552684576]
We propose a Local-to-Global Cross-modal Attention-aware Fusion (LoGoCAF) framework for HSI-X classification.
LoGoCAF adopts a pixel-to-pixel two-branch semantic segmentation architecture to learn information from HSI and X modalities.
arXiv Detail & Related papers (2024-06-25T16:12:20Z) - MulT: An End-to-End Multitask Learning Transformer [66.52419626048115]
We propose an end-to-end Multitask Learning Transformer framework, named MulT, to simultaneously learn multiple high-level vision tasks.
Our framework encodes the input image into a shared representation and makes predictions for each vision task using task-specific transformer-based decoder heads.
arXiv Detail & Related papers (2022-05-17T13:03:18Z) - ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for
Image Recognition and Beyond [76.35955924137986]
We propose a Vision Transformer Advanced by Exploring intrinsic IB from convolutions, i.e., ViTAE.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
We obtain state-of-the-art classification performance, i.e., 88.5% Top-1 accuracy on the ImageNet validation set and the best 91.2% Top-1 accuracy on the ImageNet Real validation set.
arXiv Detail & Related papers (2022-02-21T10:40:05Z) - Conformer: Local Features Coupling Global Representations for Visual
Recognition [72.9550481476101]
We propose a hybrid network structure, termed Conformer, to take advantage of convolutional operations and self-attention mechanisms for enhanced representation learning.
Experiments show that Conformer, under comparable parameter complexity, outperforms the visual transformer (DeiT-B) by 2.3% on ImageNet.
arXiv Detail & Related papers (2021-05-09T10:00:03Z)