Aggregating Global Features into Local Vision Transformer
- URL: http://arxiv.org/abs/2201.12903v1
- Date: Sun, 30 Jan 2022 19:57:35 GMT
- Title: Aggregating Global Features into Local Vision Transformer
- Authors: Krushi Patel, Andres M. Bur, Fengjun Li, Guanghui Wang
- Abstract summary: Local Transformer-based classification models have recently achieved promising results with relatively low computational costs.
This work investigates the outcome of applying a global attention-based module named multi-resolution overlapped attention (MOA) in the local window-based transformer after each stage.
The proposed MOA employs slightly larger, overlapped patches for the keys to enable neighborhood pixel information transmission, which leads to a significant performance gain.
- Score: 20.174762373916415
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Local Transformer-based classification models have recently achieved
promising results with relatively low computational costs. However, the effect
of aggregating spatial global information in local Transformer-based
architectures is not clear. This work investigates the outcome of applying a
global attention-based module named multi-resolution overlapped attention (MOA)
in the local window-based transformer after each stage. The proposed MOA
employs slightly larger, overlapped patches for the keys to enable
neighborhood pixel information transmission, which leads to a significant
performance gain. In addition, we thoroughly investigate the effect of the
dimension of essential architecture components through extensive experiments
and discover an optimal architecture design. Extensive experimental results on
the CIFAR-10, CIFAR-100, and ImageNet-1K datasets demonstrate that the proposed
approach outperforms previous vision Transformers with a comparatively smaller
number of parameters.
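To make the mechanism concrete, below is a minimal PyTorch sketch of the overlapped-key idea behind MOA: queries come from non-overlapping windows, while keys and values come from slightly larger, overlapping patches on the same grid, so each window can read its neighbors' pixels. The class name, window/overlap/head sizes, and single-layer structure are illustrative assumptions, not the authors' code; the paper applies MOA once after each stage, and the official design may differ in detail.

```python
# Hedged sketch of MOA-style overlapped-key window attention (not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class OverlappedKeyAttention(nn.Module):
    def __init__(self, dim, window=7, overlap=2, heads=4):
        super().__init__()
        assert dim % heads == 0
        self.window, self.overlap, self.heads = window, overlap, heads
        self.scale = (dim // heads) ** -0.5
        self.to_q = nn.Linear(dim, dim)
        self.to_kv = nn.Linear(dim, 2 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                      # x: (B, C, H, W); H, W divisible by window
        B, C, H, W = x.shape
        w, o, h = self.window, self.overlap, self.heads
        k = w + 2 * o                           # enlarged key/value patch size
        n = (H // w) * (W // w)                 # number of windows
        # Non-overlapping query windows -> (B*n, w*w, C)
        q = F.unfold(x, w, stride=w).transpose(1, 2).reshape(B * n, C, w * w).transpose(1, 2)
        # Overlapping key/value patches centered on the same grid -> (B*n, k*k, C)
        kv = F.unfold(x, k, stride=w, padding=o).transpose(1, 2).reshape(B * n, C, k * k).transpose(1, 2)
        q = self.to_q(q)
        key, val = self.to_kv(kv).chunk(2, dim=-1)
        split = lambda t: t.reshape(B * n, -1, h, C // h).transpose(1, 2)
        q, key, val = split(q), split(key), split(val)
        # Each query window attends over its enlarged (overlapping) neighborhood
        attn = (q @ key.transpose(-2, -1) * self.scale).softmax(dim=-1)
        out = (attn @ val).transpose(1, 2).reshape(B * n, w * w, C)
        out = self.proj(out)
        # Fold the non-overlapping windows back to (B, C, H, W)
        out = out.transpose(1, 2).reshape(B, n, C * w * w).transpose(1, 2)
        return F.fold(out, (H, W), w, stride=w)

# Example: y = OverlappedKeyAttention(dim=96)(torch.randn(2, 96, 28, 28))  # y: (2, 96, 28, 28)
```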
Related papers
- HiFiSeg: High-Frequency Information Enhanced Polyp Segmentation with Global-Local Vision Transformer [5.96521715927858]
HiFiSeg is a novel network for colon polyp segmentation that enhances high-frequency information processing.
GLIM employs a parallel structure to fuse global and local information at multiple scales, effectively capturing fine-grained features.
SAM selectively integrates boundary details from low-level features with semantic information from high-level features, significantly improving the model's ability to accurately detect and segment polyps.
arXiv Detail & Related papers (2024-10-03T14:36:22Z)
- Brain-Inspired Stepwise Patch Merging for Vision Transformers [6.108377966393714]
We propose a novel technique called Stepwise Patch Merging (SPM), which enhances the subsequent attention mechanism's ability to 'see' better.
Extensive experiments conducted on benchmark datasets, including ImageNet-1K, COCO, and ADE20K, demonstrate that SPM significantly improves the performance of various models.
arXiv Detail & Related papers (2024-09-11T03:04:46Z)
- Transformer-based Context Condensation for Boosting Feature Pyramids in Object Detection [77.50110439560152]
Current object detectors typically have a feature pyramid (FP) module for multi-level feature fusion (MFF).
We propose a novel and efficient context modeling mechanism that can help existing FPs deliver better MFF results.
In particular, we introduce a novel insight that comprehensive contexts can be decomposed and condensed into two types of representations for higher efficiency.
arXiv Detail & Related papers (2022-07-14T01:45:03Z)
- SUMD: Super U-shaped Matrix Decomposition Convolutional neural network for Image denoising [0.0]
We introduce the matrix decomposition module(MD) in the network to establish the global context feature.
Inspired by the design of multi-stage progressive restoration of U-shaped architecture, we further integrate the MD module into the multi-branches.
Our model(SUMD) can produce comparable visual quality and accuracy results with Transformer-based methods.
arXiv Detail & Related papers (2022-04-11T04:38:34Z)
- Rich CNN-Transformer Feature Aggregation Networks for Super-Resolution [50.10987776141901]
Recent vision transformers along with self-attention have achieved promising results on various computer vision tasks.
We introduce an effective hybrid architecture for super-resolution (SR) tasks, which leverages local features from CNNs and long-range dependencies captured by transformers.
Our proposed method achieves state-of-the-art SR results on numerous benchmark datasets.
arXiv Detail & Related papers (2022-03-15T06:52:25Z)
- CSformer: Bridging Convolution and Transformer for Compressive Sensing [65.22377493627687]
This paper proposes a hybrid framework that integrates the advantages of leveraging detailed spatial information from CNN and the global context provided by transformer for enhanced representation learning.
The proposed approach is an end-to-end compressive image sensing method, composed of adaptive sampling and recovery.
The experimental results demonstrate the effectiveness of the dedicated transformer-based architecture for compressive sensing.
arXiv Detail & Related papers (2021-12-31T04:37:11Z)
- UniNet: Unified Architecture Search with Convolution, Transformer, and MLP [62.401161377258234]
In this paper, we propose to jointly search the optimal combination of convolution, transformer, and MLP for building a series of all-operator network architectures.
We identify that the widely-used strided convolution or pooling based down-sampling modules become the performance bottlenecks when operators are combined to form a network.
To better tackle the global context captured by the transformer and MLP operators, we propose two novel context-aware down-sampling modules.
arXiv Detail & Related papers (2021-10-08T11:09:40Z)
- PnP-DETR: Towards Efficient Visual Analysis with Transformers [146.55679348493587]
Recently, DETR pioneered solving vision tasks with transformers: it directly translates the image feature map into the object detection result.
Applied to the recent transformer-based image recognition model ViT, the approach shows a consistent efficiency gain.
arXiv Detail & Related papers (2021-09-15T01:10:30Z)
- Unifying Global-Local Representations in Salient Object Detection with Transformer [55.23033277636774]
We introduce a new attention-based encoder, the vision transformer, into salient object detection.
With a global view even in very shallow layers, the transformer encoder preserves more local representations.
Our method significantly outperforms other FCN-based and transformer-based methods on five benchmarks.
arXiv Detail & Related papers (2021-08-05T17:51:32Z)