Scalable Visual Transformers with Hierarchical Pooling
- URL: http://arxiv.org/abs/2103.10619v1
- Date: Fri, 19 Mar 2021 03:55:58 GMT
- Title: Scalable Visual Transformers with Hierarchical Pooling
- Authors: Zizheng Pan, Bohan Zhuang, Jing Liu, Haoyu He, Jianfei Cai
- Abstract summary: We propose a Hierarchical Visual Transformer (HVT) which progressively pools visual tokens to shrink the sequence length.
The reduced sequence length makes it possible to scale depth, width, resolution, and patch size without introducing extra computational complexity.
Our HVT outperforms the competitive baselines on ImageNet and CIFAR-100 datasets.
- Score: 61.05787583247392
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The recently proposed Vision Transformers (ViT), built on pure attention,
have achieved promising performance on image recognition tasks such as image
classification. However, current ViT models maintain a full-length patch sequence
throughout inference, which is redundant and lacks hierarchical representation. To
this end, we propose a Hierarchical Visual Transformer (HVT) which progressively
pools visual tokens to shrink the sequence length and hence reduces the
computational cost, analogous to feature-map downsampling in Convolutional Neural
Networks (CNNs). Thanks to the reduced sequence length, we can increase model
capacity by scaling depth, width, resolution, or patch size without introducing
extra computational complexity. Moreover, we empirically find
that the average pooled visual tokens contain more discriminative information
than the single class token. To demonstrate the improved scalability of our
HVT, we conduct extensive experiments on the image classification task. With
comparable FLOPs, our HVT outperforms the competitive baselines on ImageNet and
CIFAR-100 datasets.
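As a concrete illustration of the mechanism described in the abstract, below is a minimal PyTorch-style sketch that pools the token sequence between transformer stages and classifies from the average-pooled tokens. All module names, depths, widths, and patch sizes are illustrative assumptions, not the authors' released implementation.
```python
# A minimal sketch (an assumption, not the authors' code): the token sequence
# is downsampled with 1D max pooling between transformer stages, and the
# classifier reads the average of the remaining tokens instead of a class token.
import torch
import torch.nn as nn


class PooledViTStage(nn.Module):
    """A stack of transformer encoder layers followed by token pooling."""

    def __init__(self, dim, depth, num_heads=8, pool_stride=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        # Pooling along the sequence axis halves the number of tokens,
        # analogous to feature-map downsampling in a CNN.
        self.pool = nn.MaxPool1d(kernel_size=pool_stride, stride=pool_stride)

    def forward(self, x):                                     # x: (B, N, C)
        x = self.blocks(x)
        return self.pool(x.transpose(1, 2)).transpose(1, 2)   # (B, N // stride, C)


class HierarchicalPoolingClassifier(nn.Module):
    """Patch embedding -> pooled transformer stages -> average-pooled head."""

    def __init__(self, dim=384, num_classes=1000, num_stages=3):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        self.stages = nn.ModuleList([PooledViTStage(dim, depth=4) for _ in range(num_stages)])
        self.head = nn.Linear(dim, num_classes)

    def forward(self, images):
        x = self.patch_embed(images).flatten(2).transpose(1, 2)  # (B, N, C)
        for stage in self.stages:
            x = stage(x)  # the sequence length shrinks after every stage
        # Classify from the average of the remaining tokens, not a class token.
        return self.head(x.mean(dim=1))


if __name__ == "__main__":
    logits = HierarchicalPoolingClassifier()(torch.randn(2, 3, 224, 224))
    print(logits.shape)  # torch.Size([2, 1000])
```
Because each stage halves the sequence length, later (and wider or deeper) stages operate on far fewer tokens, which is what allows capacity to grow without extra FLOPs.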
Related papers
- VisionGRU: A Linear-Complexity RNN Model for Efficient Image Analysis [8.10783983193165]
Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) are two dominant models for image analysis.
This paper introduces VisionGRU, a novel RNN-based architecture designed for efficient image classification.
arXiv Detail & Related papers (2024-12-24T05:27:11Z) - SAG-ViT: A Scale-Aware, High-Fidelity Patching Approach with Graph Attention for Vision Transformers [0.0]
Vision Transformers (ViTs) have redefined image classification by leveraging self-attention to capture complex patterns and long-range dependencies between image patches.
A key challenge for ViTs is efficiently incorporating multi-scale feature representations, a capability that convolutional neural networks (CNNs) gain inherently through their hierarchical structure.
We propose SAG-ViT, a Scale-Aware Graph Attention ViT that integrates the multi-scale feature capabilities of CNNs, the representational power of ViTs, and graph-attended patching to enable richer contextual representation.
arXiv Detail & Related papers (2024-11-14T13:15:27Z) - Grid Jigsaw Representation with CLIP: A New Perspective on Image Clustering [33.05984601411495]
We propose a new perspective on image clustering, the pretrain-based Grid Jigsaw Representation (pGJR).
Inspired by human jigsaw puzzle processing, we modify the traditional jigsaw learning to gain a more sequential and incremental understanding of image structure.
Our experiments demonstrate that using the pretrained model as a feature extractor can accelerate the convergence of clustering.
arXiv Detail & Related papers (2023-10-27T03:07:05Z) - Skip-Attention: Improving Vision Transformers by Paying Less Attention [55.47058516775423]
Vision Transformers (ViTs) use expensive self-attention operations in every layer.
We propose SkipAt, a method to reuse self-attention from preceding layers to approximate attention at one or more subsequent layers.
We show the effectiveness of our method in image classification and self-supervised learning on ImageNet-1K, semantic segmentation on ADE20K, image denoising on SIDD, and video denoising on DAVIS.
arXiv Detail & Related papers (2023-01-05T18:59:52Z) - ClusTR: Exploring Efficient Self-attention via Clustering for Vision Transformers [70.76313507550684]
We propose a content-based sparse attention method, as an alternative to dense self-attention.
Specifically, we cluster and then aggregate key and value tokens, as a content-based method of reducing the total token count.
The resulting clustered-token sequence retains the semantic diversity of the original signal, but can be processed at a lower computational cost (a rough sketch of this idea appears after this list).
arXiv Detail & Related papers (2022-08-28T04:18:27Z) - PSViT: Better Vision Transformer via Token Pooling and Attention Sharing [114.8051035856023]
We propose PSViT, a ViT with token Pooling and attention Sharing, to reduce redundancy.
Experimental results show that the proposed scheme can achieve up to 6.6% accuracy improvement in ImageNet classification.
arXiv Detail & Related papers (2021-08-07T11:30:54Z) - Less is More: Pay Less Attention in Vision Transformers [61.05787583247392]
The Less attention vIsion Transformer (LIT) builds upon the observation that convolutions, fully-connected layers, and self-attention have almost equivalent mathematical expressions for processing image patch sequences.
The proposed LIT achieves promising performance on image recognition tasks, including image classification, object detection and instance segmentation.
arXiv Detail & Related papers (2021-05-29T05:26:07Z) - Visual Transformers: Token-based Image Representation and Processing for Computer Vision [67.55770209540306]
The Visual Transformer (VT) operates in a semantic token space, judiciously attending to different image parts based on context.
Using an advanced training recipe, our VTs significantly outperform their convolutional counterparts.
For semantic segmentation on LIP and COCO-stuff, VT-based feature pyramid networks (FPN) achieve 0.35 points higher mIoU while reducing the FPN module's FLOPs by 6.5x.
arXiv Detail & Related papers (2020-06-05T20:49:49Z)
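As referenced in the ClusTR entry above, the following Python sketch is one plausible reading of "cluster and then aggregate key and value tokens": a few k-means steps group the keys, values are averaged under the same assignments, and queries attend to the resulting centroids. The function name `clustered_attention` and parameters such as `num_clusters` are illustrative assumptions, not the ClusTR authors' implementation.
```python
# A rough sketch (an assumption for illustration): queries attend to k-means
# centroids of the keys/values instead of every token, shrinking attention cost.
import torch
import torch.nn.functional as F


def clustered_attention(q, k, v, num_clusters=64, iters=3):
    """q, k, v: (batch, seq_len, dim). Returns a (batch, seq_len, dim) output."""
    B, N, C = k.shape
    # Initialise centroids from a strided subset of the keys.
    idx = torch.linspace(0, N - 1, num_clusters, device=k.device).long()
    centroids = k[:, idx, :].clone()

    for _ in range(iters):  # a few k-means steps over the key tokens
        assign = torch.cdist(k, centroids).argmin(dim=-1)        # (B, N)
        one_hot = F.one_hot(assign, num_clusters).float()        # (B, N, M)
        counts = one_hot.sum(dim=1).clamp(min=1).unsqueeze(-1)   # (B, M, 1)
        centroids = one_hot.transpose(1, 2) @ k / counts         # (B, M, C)

    # Aggregate values under the final assignments, then run dense attention
    # against the much shorter clustered sequence.
    v_clustered = one_hot.transpose(1, 2) @ v / counts           # (B, M, C)
    attn = (q @ centroids.transpose(1, 2)) / C ** 0.5            # (B, N, M)
    return attn.softmax(dim=-1) @ v_clustered                    # (B, N, C)


if __name__ == "__main__":
    q = k = v = torch.randn(2, 196, 64)
    print(clustered_attention(q, k, v).shape)  # torch.Size([2, 196, 64])
```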