Aggregated Pyramid Vision Transformer: Split-transform-merge Strategy
for Image Recognition without Convolutions
- URL: http://arxiv.org/abs/2203.00960v1
- Date: Wed, 2 Mar 2022 09:14:28 GMT
- Title: Aggregated Pyramid Vision Transformer: Split-transform-merge Strategy
for Image Recognition without Convolutions
- Authors: Rui-Yang Ju, Ting-Yu Lin, Jen-Shiun Chiang, Jia-Hao Jian, Yu-Shian
Lin, and Liu-Rui-Yi Huang
- Abstract summary: This work builds on the Vision Transformer, combines it with a pyramid architecture, and uses the split-transform-merge strategy to propose a group encoder, naming the network architecture the Aggregated Pyramid Vision Transformer (APVT).
We perform image classification tasks on the CIFAR-10 dataset and object detection tasks on the COCO 2017 dataset.
- Score: 1.1032962642000486
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Following the success of the Transformer in natural language
processing, its encoder-decoder structure and attention mechanism have been
applied to computer vision. Recently, state-of-the-art convolutional neural
networks for several computer vision tasks (image classification, object
detection, semantic segmentation, etc.) have adopted concepts from the
Transformer, which indicates that the Transformer is a promising direction for
image recognition. Since the Vision Transformer was proposed, a growing number
of works have used self-attention to replace convolutional layers entirely.
This work builds on the Vision Transformer, combines it with a pyramid
architecture, and applies the split-transform-merge strategy to propose a
group encoder; we name the resulting network architecture the Aggregated
Pyramid Vision Transformer (APVT). We perform image classification on the
CIFAR-10 dataset and object detection on the COCO 2017 dataset. Compared with
other architectures that use a Transformer as the backbone, APVT achieves
excellent results while reducing the computational cost. We hope this strategy
can serve as a reference for future Transformer research in computer vision.
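The abstract describes the design only at a high level. As a rough, non-authoritative illustration of what a split-transform-merge ("group") Transformer encoder block could look like, the PyTorch sketch below splits the token channels into groups, transforms each group with a lightweight encoder layer, and merges the results. All module names, dimensions, and the exact placement of the split and merge are assumptions made for illustration, not the authors' APVT implementation.

```python
# Illustrative sketch only: a split-transform-merge ("group") Transformer
# encoder block in the spirit of the abstract. Names, dimensions, and the
# split/merge placement are assumptions, not the APVT reference code.
import torch
import torch.nn as nn


class GroupEncoderBlock(nn.Module):
    """Split tokens along the channel dimension, transform each group with a
    small Transformer encoder layer, then merge the groups back together."""

    def __init__(self, dim: int = 256, groups: int = 4, num_heads: int = 2):
        super().__init__()
        assert dim % groups == 0, "dim must be divisible by the number of groups"
        self.groups = groups
        group_dim = dim // groups
        # One lightweight encoder layer per group (the "transform" step).
        self.branches = nn.ModuleList(
            nn.TransformerEncoderLayer(
                d_model=group_dim,
                nhead=num_heads,
                dim_feedforward=2 * group_dim,
                batch_first=True,
            )
            for _ in range(groups)
        )
        # Linear projection applied after concatenation (the "merge" step).
        self.merge = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_tokens, dim)
        chunks = x.chunk(self.groups, dim=-1)                  # split
        outs = [b(c) for b, c in zip(self.branches, chunks)]   # transform
        return self.merge(torch.cat(outs, dim=-1))             # merge


if __name__ == "__main__":
    tokens = torch.randn(2, 64, 256)       # e.g. an 8x8 grid of 256-d tokens
    block = GroupEncoderBlock(dim=256, groups=4)
    print(block(tokens).shape)             # torch.Size([2, 64, 256])
```

Running attention over narrow groups rather than the full channel width is one plausible way such a design could lower the computational cost claimed in the abstract; in a pyramid architecture, blocks of this kind would be stacked in stages that operate on progressively downsampled token grids.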
Related papers
- MDS-ViTNet: Improving saliency prediction for Eye-Tracking with Vision Transformer [0.0]
We present MDS-ViTNet (Multi Decoder Saliency by Vision Transformer Network) for enhancing visual saliency prediction or eye-tracking.
This approach holds significant potential for diverse fields, including marketing, medicine, robotics, and retail.
Our trained model achieves state-of-the-art results across several benchmarks.
arXiv Detail & Related papers (2024-05-29T20:28:04Z)
- Vision Transformer with Progressive Sampling [73.60630716500154]
We propose an iterative and progressive sampling strategy to locate discriminative regions.
When trained from scratch on ImageNet, PS-ViT performs 3.8% higher than the vanilla ViT in terms of top-1 accuracy.
arXiv Detail & Related papers (2021-08-03T18:04:31Z)
- PPT Fusion: Pyramid Patch Transformer for a Case Study in Image Fusion [37.993611194758195]
We propose a Patch Pyramid Transformer (PPT) to address the issue of extracting semantic information from an image.
The experimental results demonstrate its superior performance against the state-of-the-art fusion approaches.
arXiv Detail & Related papers (2021-07-29T13:57:45Z)
- Fully Transformer Networks for Semantic Image Segmentation [26.037770622551882]
We explore a novel framework for semantic image segmentation, the encoder-decoder based Fully Transformer Networks (FTN).
We propose a Pyramid Group Transformer (PGT) as the encoder to progressively learn hierarchical features while reducing the computational complexity of the standard Vision Transformer (ViT).
Then, we propose a Feature Pyramid Transformer (FPT) to fuse semantic-level and spatial-level information from multiple levels of the PGT encoder for semantic image segmentation.
arXiv Detail & Related papers (2021-06-08T05:15:28Z)
- Transformer-Based Deep Image Matching for Generalizable Person Re-identification [114.56752624945142]
We investigate the possibility of applying Transformers for image matching and metric learning given pairs of images.
We find that the Vision Transformer (ViT) and the vanilla Transformer with decoders are not adequate for image matching due to their lack of image-to-image attention.
We propose a new simplified decoder, which drops the full attention implementation with the softmax weighting, keeping only the query-key similarity.
arXiv Detail & Related papers (2021-05-30T05:38:33Z)
- Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation [63.46694853953092]
Swin-Unet is a Unet-like pure Transformer for medical image segmentation.
Tokenized image patches are fed into the Transformer-based U-shaped Encoder-Decoder architecture.
arXiv Detail & Related papers (2021-05-12T09:30:26Z)
- Transformers in Vision: A Survey [101.07348618962111]
Transformers enable modeling long dependencies between input sequence elements and support parallel processing of sequences.
Transformers require minimal inductive biases for their design and are naturally suited as set-functions.
This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline.
arXiv Detail & Related papers (2021-01-04T18:57:24Z)
- Transformer for Image Quality Assessment [14.975436239088312]
We propose an architecture that uses a shallow Transformer encoder on top of a feature map extracted by convolutional neural networks (CNNs).
Adaptive positional embedding is employed in the Transformer encoder to handle images with arbitrary resolutions.
We have found that the proposed TRIQ architecture achieves outstanding performance.
arXiv Detail & Related papers (2020-12-30T18:43:11Z)
- A Survey on Visual Transformer [126.56860258176324]
Transformer is a type of deep neural network mainly based on the self-attention mechanism.
In this paper, we review these vision transformer models by categorizing them in different tasks and analyzing their advantages and disadvantages.
arXiv Detail & Related papers (2020-12-23T09:37:54Z)
- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale [112.94212299087653]
Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
arXiv Detail & Related papers (2020-10-22T17:55:59Z)
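For context on the last entry: the core ViT step of treating an image as a sequence of 16x16 patch "words" can be sketched in a few lines of PyTorch. The sizes and module below are illustrative defaults only, not the reference implementation of any paper listed above.

```python
# Minimal sketch of ViT-style patch embedding ("an image is worth 16x16 words").
# Sizes are illustrative assumptions; this is not the reference implementation.
import torch
import torch.nn as nn


class PatchEmbed(nn.Module):
    def __init__(self, img_size: int = 224, patch_size: int = 16,
                 in_chans: int = 3, embed_dim: int = 768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution cuts the image into non-overlapping patches
        # and linearly projects each patch to an embedding vector.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(
            torch.zeros(1, self.num_patches + 1, embed_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 3, 224, 224) -> (batch, 197, 768) token sequence
        x = self.proj(x).flatten(2).transpose(1, 2)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        return torch.cat([cls, x], dim=1) + self.pos_embed


if __name__ == "__main__":
    imgs = torch.randn(2, 3, 224, 224)
    print(PatchEmbed()(imgs).shape)    # torch.Size([2, 197, 768])
```

Each patch becomes one token, a class token is prepended, and learned position embeddings are added before the sequence enters the Transformer encoder.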
This list is automatically generated from the titles and abstracts of the papers in this site.