ResT: An Efficient Transformer for Visual Recognition
- URL: http://arxiv.org/abs/2105.13677v2
- Date: Mon, 31 May 2021 13:16:31 GMT
- Title: ResT: An Efficient Transformer for Visual Recognition
- Authors: Qinglong Zhang and Yubin Yang
- Abstract summary: This paper presents an efficient multi-scale vision Transformer, called ResT, that capably serves as a general-purpose backbone for image recognition.
We show that the proposed ResT outperforms recent state-of-the-art backbones by a large margin, demonstrating its potential as a strong backbone.
- Score: 5.807423409327807
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents an efficient multi-scale vision Transformer, called ResT,
that capably serves as a general-purpose backbone for image recognition. Unlike
existing Transformer methods, which employ standard Transformer blocks to
tackle raw images at a fixed resolution, ResT has several advantages:
(1) A memory-efficient multi-head self-attention is built, which compresses the
memory with a simple depth-wise convolution and projects the interaction across
the attention-head dimension while keeping the diversity of the multiple heads;
(2) Position encoding is constructed as spatial attention, which is more
flexible and can handle input images of arbitrary size without interpolation
or fine-tuning; (3) Instead of straightforward tokenization at the beginning
of each stage, we design the patch embedding as a stack of overlapping
convolution operations with stride on the 2D-reshaped token map. We
comprehensively validate ResT on image classification and downstream tasks.
Experimental results show that the proposed ResT outperforms recent
state-of-the-art backbones by a large margin, demonstrating the potential of
ResT as a strong backbone. The code and models will be made publicly available
at https://github.com/wofmanaf/ResT.
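
Below is a minimal PyTorch sketch of the three components named in the abstract: a memory-efficient multi-head self-attention whose keys and values come from a token map compressed by a strided depth-wise convolution, with a 1x1 convolution mixing information across the head dimension; a position encoding realized as spatial attention; and an overlapping strided-convolution patch embedding. Kernel sizes, normalization layers, and module names here are illustrative assumptions, not the authors' implementation (see the repository above for the official code).

```python
# Illustrative sketch only: layer choices are assumptions, not the official ResT code.
import torch
import torch.nn as nn


class EfficientMSA(nn.Module):
    """Multi-head self-attention with spatially compressed keys/values."""

    def __init__(self, dim, num_heads=8, sr_ratio=2):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.q = nn.Linear(dim, dim)
        # Strided depth-wise convolution compresses the (H, W) token map
        # before keys/values are formed, shrinking the attention memory.
        self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio,
                            groups=dim)
        self.norm = nn.LayerNorm(dim)
        self.kv = nn.Linear(dim, 2 * dim)
        # 1x1 convolution over the head dimension: lets attention heads
        # interact while keeping separate heads (the "diversity" of multi-heads).
        self.head_mix = nn.Conv2d(num_heads, num_heads, kernel_size=1)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, H, W):
        B, N, C = x.shape  # N == H * W
        q = self.q(x).reshape(B, N, self.num_heads, self.head_dim).transpose(1, 2)

        x_ = x.transpose(1, 2).reshape(B, C, H, W)
        x_ = self.sr(x_).reshape(B, C, -1).transpose(1, 2)  # (B, N / s^2, C)
        x_ = self.norm(x_)
        k, v = self.kv(x_).chunk(2, dim=-1)
        k = k.reshape(B, -1, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.reshape(B, -1, self.num_heads, self.head_dim).transpose(1, 2)

        attn = (q @ k.transpose(-2, -1)) * self.scale  # (B, heads, N, N / s^2)
        attn = self.head_mix(attn)                     # cross-head projection
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)


class PatchEmbedWithPA(nn.Module):
    """Overlapping strided convolution embedding + spatial-attention positions."""

    def __init__(self, in_ch, dim, stride=2):
        super().__init__()
        # Overlapping convolution (kernel > stride) on the 2D token map;
        # the full model stacks several such convolutions per stage.
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=3, stride=stride, padding=1)
        # Position encoding as spatial attention: a depth-wise convolution plus
        # a sigmoid gate, so any input resolution works without interpolating
        # or fine-tuning a positional table.
        self.pa = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, x):
        x = self.proj(x)                   # (B, dim, H', W')
        x = x * torch.sigmoid(self.pa(x))  # pixel-wise positional gate
        B, C, H, W = x.shape
        return x.flatten(2).transpose(1, 2), H, W  # tokens plus spatial size


if __name__ == "__main__":
    embed = PatchEmbedWithPA(3, 96, stride=2)
    attn = EfficientMSA(96, num_heads=4, sr_ratio=2)
    tokens, H, W = embed(torch.randn(1, 3, 64, 64))
    print(attn(tokens, H, W).shape)  # torch.Size([1, 1024, 96])
```

With sr_ratio = 2 the attention map shrinks from N x N to N x N/4, which is where the memory saving described in point (1) comes from.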
Related papers
- A Contrastive Learning Scheme with Transformer Innate Patches [4.588028371034407]
We present Contrastive Transformer, a contrastive learning scheme that uses the Transformer's innate patches.
The scheme performs supervised patch-level contrastive learning, selecting the patches based on the ground truth mask.
The scheme applies to all vision-transformer architectures, is easy to implement, and introduces minimal additional memory footprint.
arXiv Detail & Related papers (2023-03-26T20:19:28Z) - Accurate Image Restoration with Attention Retractable Transformer [50.05204240159985]
We propose Attention Retractable Transformer (ART) for image restoration.
ART presents both dense and sparse attention modules in the network.
We conduct extensive experiments on image super-resolution, denoising, and JPEG compression artifact reduction tasks.
arXiv Detail & Related papers (2022-10-04T07:35:01Z) - HIPA: Hierarchical Patch Transformer for Single Image Super Resolution [62.7081074931892]
This paper presents HIPA, a novel Transformer architecture that progressively recovers the high resolution image using a hierarchical patch partition.
We build a cascaded model that processes the input image in multiple stages, starting with tokens of small patch size and gradually merging them up to the full resolution.
Such a hierarchical patch mechanism not only explicitly enables feature aggregation at multiple resolutions but also adaptively learns patch-aware features for different image regions.
arXiv Detail & Related papers (2022-03-19T05:09:34Z) - Restormer: Efficient Transformer for High-Resolution Image Restoration [118.9617735769827]
Convolutional neural networks (CNNs) perform well at learning generalizable image priors from large-scale data.
Transformers have shown significant performance gains on natural language and high-level vision tasks.
Our model, named Restoration Transformer (Restormer), achieves state-of-the-art results on several image restoration tasks.
arXiv Detail & Related papers (2021-11-18T18:59:10Z) - PnP-DETR: Towards Efficient Visual Analysis with Transformers [146.55679348493587]
Recently, DETR pioneered solving vision tasks with transformers; it directly translates the image feature map into the object detection result.
The approach also shows consistent efficiency gains on the recent transformer-based image recognition model ViT.
arXiv Detail & Related papers (2021-09-15T01:10:30Z) - PPT Fusion: Pyramid Patch Transformer for a Case Study in Image Fusion [37.993611194758195]
We propose a Patch Pyramid Transformer (PPT) to address the issue of extracting semantic information from an image.
The experimental results demonstrate its superior performance against the state-of-the-art fusion approaches.
arXiv Detail & Related papers (2021-07-29T13:57:45Z) - XCiT: Cross-Covariance Image Transformers [73.33400159139708]
We propose a "transposed" version of self-attention that operates across feature channels rather than tokens.
The resulting cross-covariance attention (XCA) has linear complexity in the number of tokens, and allows efficient processing of high-resolution images.
arXiv Detail & Related papers (2021-06-17T17:33:35Z) - Less is More: Pay Less Attention in Vision Transformers [61.05787583247392]
The Less attention vIsion Transformer (LIT) builds upon the fact that convolutions, fully connected layers, and self-attention have almost equivalent mathematical expressions for processing image patch sequences.
The proposed LIT achieves promising performance on image recognition tasks, including image classification, object detection and instance segmentation.
arXiv Detail & Related papers (2021-05-29T05:26:07Z) - CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification [17.709880544501758]
We propose a dual-branch transformer to combine image patches of different sizes to produce stronger image features.
Our approach processes small-patch and large-patch tokens with two separate branches of different computational complexity.
Our proposed cross-attention requires only linear computational and memory complexity, rather than quadratic.
arXiv Detail & Related papers (2021-03-27T13:03:17Z) - TransReID: Transformer-based Object Re-Identification [20.02035310635418]
The Vision Transformer (ViT), a pure transformer-based model, is applied to the object re-identification (ReID) task.
With several adaptations, a strong baseline ViT-BoT is constructed with ViT as the backbone.
We propose a pure-transformer framework dubbed TransReID, which is the first work to use a pure Transformer for ReID research.
arXiv Detail & Related papers (2021-02-08T17:33:59Z)
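
As referenced in the XCiT entry above, here is a minimal sketch of cross-covariance attention under the usual qkv-projection layout. The learnable per-head temperature and the normalization details are assumptions for illustration, not the authors' exact code.

```python
# Illustrative sketch of cross-covariance attention (XCA): the attention map is
# computed across feature channels, not tokens, so its size is d x d per head.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossCovarianceAttention(nn.Module):
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        # Learnable per-head temperature (assumed): the softmax now runs over
        # normalized channel correlations rather than token similarities.
        self.temperature = nn.Parameter(torch.ones(num_heads, 1, 1))

    def forward(self, x):
        B, N, C = x.shape
        head_dim = C // self.num_heads
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, head_dim)
        q, k, v = qkv.permute(2, 0, 3, 4, 1)  # each: (B, heads, head_dim, N)

        # L2-normalize along the token dimension, then build a d x d
        # cross-covariance ("transposed") attention map per head.
        q = F.normalize(q, dim=-1)
        k = F.normalize(k, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.temperature  # (B, heads, d, d)
        attn = attn.softmax(dim=-1)

        out = (attn @ v).permute(0, 3, 1, 2).reshape(B, N, C)
        return self.proj(out)


# Toy usage: 196 tokens of dimension 128.
x = torch.randn(2, 196, 128)
print(CrossCovarianceAttention(128, num_heads=4)(x).shape)  # torch.Size([2, 196, 128])
```

Because the attention map is head_dim x head_dim per head regardless of the token count N, compute and memory grow linearly with N, which is the property the XCiT entry highlights.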