Patch Is Not All You Need
- URL: http://arxiv.org/abs/2308.10729v1
- Date: Mon, 21 Aug 2023 13:54:00 GMT
- Title: Patch Is Not All You Need
- Authors: Changzhen Li, Jie Zhang, Yang Wei, Zhilong Ji, Jinfeng Bai, Shiguang
Shan
- Abstract summary: We propose a novel Pattern Transformer to adaptively convert images to pattern sequences for Transformer input.
We employ a Convolutional Neural Network to extract various patterns from the input image.
We achieve state-of-the-art performance on CIFAR-10 and CIFAR-100 and competitive results on ImageNet.
- Score: 57.290256181083016
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision Transformers have achieved great success in computer vision,
delivering exceptional performance across various tasks. However, their
reliance on sequential input forces images to be manually partitioned into
patch sequences, which disrupts the image's intrinsic structural and
semantic continuity. To address this, we propose a novel Pattern Transformer
(Patternformer) to adaptively convert images to pattern sequences for
Transformer input. Specifically, we employ a Convolutional Neural Network to
extract various patterns from the input image, with each channel representing a
unique pattern that is fed into the succeeding Transformer as a visual token.
By enabling the network to optimize these patterns, each pattern concentrates
on its local region of interest, thereby preserving its intrinsic structural
and semantic information. Using only a vanilla ResNet and Transformer, we
achieve state-of-the-art performance on CIFAR-10 and CIFAR-100 and
competitive results on ImageNet.
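The channel-as-token idea is straightforward to prototype. The following is a minimal, hypothetical sketch, not the authors' code: a small convolutional stem stands in for the vanilla ResNet, every output channel is flattened into a single token, and the token sequence feeds a standard Transformer encoder. All module names and sizes are illustrative.

```python
import torch
import torch.nn as nn

class PatternTokens(nn.Module):
    """Hypothetical sketch: each CNN output channel becomes one Transformer token."""
    def __init__(self, num_patterns=64, spatial=16, depth=4, heads=4, classes=10):
        super().__init__()
        token_dim = spatial * spatial  # one flattened channel = one token
        # Small conv stem whose output channels act as learned "patterns"
        # (the paper uses a vanilla ResNet; this stem is a stand-in).
        self.cnn = nn.Sequential(
            nn.Conv2d(3, num_patterns, kernel_size=7, stride=2, padding=3),
            nn.BatchNorm2d(num_patterns),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(spatial),  # fixed spatial size keeps token_dim constant
        )
        layer = nn.TransformerEncoderLayer(d_model=token_dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(token_dim, classes)

    def forward(self, x):                      # x: (B, 3, H, W)
        maps = self.cnn(x)                     # (B, C, s, s), one "pattern" per channel
        tokens = maps.flatten(2)               # (B, C, s*s): channel -> token
        encoded = self.encoder(tokens)         # (B, C, s*s)
        return self.head(encoded.mean(dim=1))  # pool over patterns, then classify
```

Under this reading, the sequence length equals the number of pattern channels rather than the number of spatial patches, so each token can cover an irregular, learned region of the image.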
Related papers
- Vision Transformers with Mixed-Resolution Tokenization [34.18534105043819]
Vision Transformer models process input images by dividing them into a spatially regular grid of equal-size patches.
We introduce a novel image tokenization scheme, replacing the standard uniform grid with a mixed-resolution sequence of tokens.
Using the Quadtree algorithm and a novel saliency scorer, we construct a patch mosaic where low-saliency areas of the image are processed in low resolution.
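A rough sketch of the quadtree step is below; per-region pixel variance stands in for the paper's learned saliency scorer (an assumption), and all names and thresholds are illustrative.

```python
import torch
import torch.nn.functional as F

def quadtree_patches(img, patch=16, max_depth=3, thresh=0.02):
    """Hypothetical mixed-resolution tokenizer: high-variance regions are split
    into finer quadrants; low-variance regions become single coarse tokens."""
    _, H, W = img.shape               # assumes a square image, H == W
    out = []

    def visit(y, x, size, depth):
        region = img[:, y:y + size, x:x + size]
        # Saliency proxy: raw pixel variance (the paper learns its scorer).
        if depth < max_depth and size > patch and region.var() > thresh:
            half = size // 2
            for dy in (0, half):
                for dx in (0, half):
                    visit(y + dy, x + dx, half, depth + 1)
        else:
            # Resample every region to the shared token resolution, so a large
            # low-saliency region is effectively processed at low resolution.
            tok = F.interpolate(region.unsqueeze(0), size=(patch, patch),
                                mode="bilinear", align_corners=False)
            out.append(tok.squeeze(0))

    visit(0, 0, H, 0)
    return torch.stack(out)           # (num_tokens, C, patch, patch), count varies
```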
arXiv Detail & Related papers (2023-04-01T10:39:46Z)
- Vision Transformer with Progressive Sampling [73.60630716500154]
We propose an iterative and progressive sampling strategy to locate discriminative regions.
When trained from scratch on ImageNet, PS-ViT achieves 3.8% higher top-1 accuracy than the vanilla ViT.
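The sampling loop might be sketched as follows; the offset predictor, step size, and grid size are assumptions rather than PS-ViT's exact architecture, which also interleaves Transformer layers between iterations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProgressiveSampler(nn.Module):
    """Hypothetical sketch of progressive sampling: tokens repeatedly predict
    offsets that move their sampling points toward discriminative regions."""
    def __init__(self, dim=64, steps=4, grid=7):
        super().__init__()
        self.steps = steps
        self.offset = nn.Linear(dim, 2)   # per-token (dx, dy) in normalized coords
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, grid),
                                torch.linspace(-1, 1, grid), indexing="ij")
        # Start from a regular grid of normalized sampling points.
        self.register_buffer("init_pts", torch.stack([xs, ys], -1).view(1, -1, 2))

    def forward(self, feat):              # feat: (B, dim, H, W) feature map
        pts = self.init_pts.expand(feat.size(0), -1, -1)        # (B, N, 2)
        for _ in range(self.steps):
            # Sample token features at the current points.
            tokens = F.grid_sample(feat, pts.unsqueeze(2),
                                   align_corners=False)
            tokens = tokens.squeeze(-1).transpose(1, 2)         # (B, N, dim)
            # Each token nudges its own sampling location.
            pts = (pts + 0.1 * torch.tanh(self.offset(tokens))).clamp(-1, 1)
        return tokens, pts
```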
arXiv Detail & Related papers (2021-08-03T18:04:31Z)
- PPT Fusion: Pyramid Patch Transformer for a Case Study in Image Fusion [37.993611194758195]
We propose a Patch Pyramid Transformer (PPT) to address the issue of extracting semantic information from an image.
Experimental results demonstrate its superior performance over state-of-the-art fusion approaches.
arXiv Detail & Related papers (2021-07-29T13:57:45Z)
- Transformer-Based Deep Image Matching for Generalizable Person Re-identification [114.56752624945142]
We investigate the possibility of applying Transformers for image matching and metric learning given pairs of images.
We find that the Vision Transformer (ViT) and the vanilla Transformer with decoders are not adequate for image matching due to their lack of image-to-image attention.
We propose a new simplified decoder that drops the softmax-weighted full attention and keeps only the query-key similarity.
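The gist of such a decoder fits in a few lines. Below is a hedged sketch with assumed shapes and an assumed max-pooling into a scalar score; it is not the paper's exact matching head.

```python
import torch

def qk_similarity_score(query_tokens, key_tokens):
    """Hypothetical sketch of a softmax-free matching head: keep the raw
    scaled query-key similarity instead of softmax-weighted attention output.
    query_tokens: (B, Nq, D) from image A; key_tokens: (B, Nk, D) from image B."""
    d = query_tokens.size(-1)
    # Scaled dot product, as in attention, but with no softmax and no values.
    sim = query_tokens @ key_tokens.transpose(1, 2) / d ** 0.5  # (B, Nq, Nk)
    # One matching score per pair: best match per query, averaged over queries.
    return sim.max(dim=-1).values.mean(dim=-1)                  # (B,)
```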
arXiv Detail & Related papers (2021-05-30T05:38:33Z)
- Less is More: Pay Less Attention in Vision Transformers [61.05787583247392]
The Less attention vIsion Transformer (LIT) builds upon the observation that convolutions, fully-connected layers, and self-attention have almost equivalent mathematical expressions for processing image patch sequences.
The proposed LIT achieves promising performance on image recognition tasks, including image classification, object detection and instance segmentation.
arXiv Detail & Related papers (2021-05-29T05:26:07Z)
- CvT: Introducing Convolutions to Vision Transformers [44.74550305869089]
The Convolutional vision Transformer (CvT) improves on the Vision Transformer (ViT) in both performance and efficiency.
The new architecture introduces convolutions into ViT to yield the best of both designs.
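One concrete form this takes is a convolutional projection inside attention: Q, K, and V are produced by depthwise convolutions over the spatially reshaped token map, so local structure informs the attention inputs. The sketch below is illustrative; sizes and strides are assumptions, not the official implementation.

```python
import torch
import torch.nn as nn

class ConvProjection(nn.Module):
    """Hypothetical sketch of a CvT-style convolutional projection for Q/K/V."""
    def __init__(self, dim=96):
        super().__init__()
        def dwconv(stride):
            return nn.Sequential(
                nn.Conv2d(dim, dim, 3, stride=stride, padding=1, groups=dim),
                nn.BatchNorm2d(dim),
            )
        self.q = dwconv(1)
        self.k = dwconv(2)   # striding K/V shrinks the attention cost
        self.v = dwconv(2)

    def forward(self, tokens, h, w):   # tokens: (B, h*w, dim)
        x = tokens.transpose(1, 2).reshape(tokens.size(0), -1, h, w)
        flat = lambda t: t.flatten(2).transpose(1, 2)   # back to (B, N', dim)
        return flat(self.q(x)), flat(self.k(x)), flat(self.v(x))
```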
arXiv Detail & Related papers (2021-03-29T17:58:22Z)
- CPTR: Full Transformer Network for Image Captioning [15.869556479220984]
CaPtion TransformeR (CPTR) takes sequentialized raw images as input to a Transformer.
Compared to the "CNN+Transformer" design paradigm, our model can capture global context at every encoder layer from the beginning.
arXiv Detail & Related papers (2021-01-26T14:29:52Z)
- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale [112.94212299087653]
Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
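For contrast with the pattern tokens above, this standard tokenization reduces to a single strided convolution; a minimal sketch with illustrative sizes:

```python
import torch
import torch.nn as nn

# Minimal sketch of ViT-style patch embedding: a 16x16 strided convolution
# turns an image into a sequence of fixed-grid patch tokens.
patch_embed = nn.Conv2d(3, 768, kernel_size=16, stride=16)

img = torch.randn(1, 3, 224, 224)
tokens = patch_embed(img).flatten(2).transpose(1, 2)  # (1, 196, 768)
```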
arXiv Detail & Related papers (2020-10-22T17:55:59Z)
- Visual Transformers: Token-based Image Representation and Processing for Computer Vision [67.55770209540306]
Visual Transformer (VT) operates in a semantic token space, judiciously attending to different image parts based on context.
Using an advanced training recipe, our VTs significantly outperform their convolutional counterparts.
For semantic segmentation on LIP and COCO-stuff, VT-based feature pyramid networks (FPN) achieve 0.35 points higher mIoU while reducing the FPN module's FLOPs by 6.5x.
arXiv Detail & Related papers (2020-06-05T20:49:49Z)