Related papers: Scalable Visual Transformers with Hierarchical Pooling

Scalable Visual Transformers with Hierarchical Pooling

URL: http://arxiv.org/abs/2103.10619v1
Date: Fri, 19 Mar 2021 03:55:58 GMT
Title: Scalable Visual Transformers with Hierarchical Pooling
Authors: Zizheng Pan, Bohan Zhuang, Jing Liu, Haoyu He, Jianfei Cai
Abstract summary: We propose a Hierarchical Visual Transformer (HVT) which progressively pools visual tokens to shrink the sequence length. It brings a great benefit by scaling dimensions of depth/width/resolution/patch size without introducing extra computational complexity. Our HVT outperforms the competitive baselines on ImageNet and CIFAR-100 datasets.
Score: 61.05787583247392
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The recently proposed Visual image Transformers (ViT) with pure attention have achieved promising performance on image recognition tasks, such as image classification. However, the routine of the current ViT model is to maintain a full-length patch sequence during inference, which is redundant and lacks hierarchical representation. To this end, we propose a Hierarchical Visual Transformer (HVT) which progressively pools visual tokens to shrink the sequence length and hence reduces the computational cost, analogous to the feature maps downsampling in Convolutional Neural Networks (CNNs). It brings a great benefit that we can increase the model capacity by scaling dimensions of depth/width/resolution/patch size without introducing extra computational complexity due to the reduced sequence length. Moreover, we empirically find that the average pooled visual tokens contain more discriminative information than the single class token. To demonstrate the improved scalability of our HVT, we conduct extensive experiments on the image classification task. With comparable FLOPs, our HVT outperforms the competitive baselines on ImageNet and CIFAR-100 datasets.

Related papers

VisionGRU: A Linear-Complexity RNN Model for Efficient Image Analysis [8.10783983193165]
Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) are two dominant models for image analysis. This paper introduces VisionGRU, a novel RNN-based architecture designed for efficient image classification.
arXiv Detail & Related papers (2024-12-24T05:27:11Z)
SAG-ViT: A Scale-Aware, High-Fidelity Patching Approach with Graph Attention for Vision Transformers [0.0]
We introduce the Scale-Aware Graph Attention Vision Transformer (SAG-ViT), a novel framework that addresses this challenge by integrating multi-scale features. Using EfficientNet as a backbone, the model extracts multi-scale feature maps, which are divided into patches to preserve semantic information. The SAG-ViT is evaluated on benchmark datasets, demonstrating its effectiveness in enhancing image classification performance.
arXiv Detail & Related papers (2024-11-14T13:15:27Z)
Depth-Wise Convolutions in Vision Transformers for Efficient Training on Small Datasets [11.95214938154427]
Vision Transformer (ViT) captures global information by dividing images into patches. ViT lacks inductive bias during image or video dataset training. We present a lightweight Depth-Wise Convolution module as a shortcut in ViT models.
arXiv Detail & Related papers (2024-07-28T04:23:40Z)
Grid Jigsaw Representation with CLIP: A New Perspective on Image Clustering [33.05984601411495]
We propose a new perspective on image clustering, the pretrain-based Grid Jigsaw Representation (pGJR) Inspired by human jigsaw puzzle processing, we modify the traditional jigsaw learning to gain a more sequential and incremental understanding of image structure. Our experiments demonstrate that using the pretrained model as a feature extractor can accelerate the convergence of clustering.
arXiv Detail & Related papers (2023-10-27T03:07:05Z)
Skip-Attention: Improving Vision Transformers by Paying Less Attention [55.47058516775423]
Vision computation transformers (ViTs) use expensive self-attention operations in every layer. We propose SkipAt, a method to reuse self-attention from preceding layers to approximate attention at one or more subsequent layers. We show the effectiveness of our method in image classification and self-supervised learning on ImageNet-1K, semantic segmentation on ADE20K, image denoising on SIDD, and video denoising on DAVIS.
arXiv Detail & Related papers (2023-01-05T18:59:52Z)
ClusTR: Exploring Efficient Self-attention via Clustering for Vision Transformers [70.76313507550684]
We propose a content-based sparse attention method, as an alternative to dense self-attention. Specifically, we cluster and then aggregate key and value tokens, as a content-based method of reducing the total token count. The resulting clustered-token sequence retains the semantic diversity of the original signal, but can be processed at a lower computational cost.
arXiv Detail & Related papers (2022-08-28T04:18:27Z)
PSViT: Better Vision Transformer via Token Pooling and Attention Sharing [114.8051035856023]
We propose a PSViT: a ViT with token Pooling and attention Sharing to reduce the redundancy. Experimental results show that the proposed scheme can achieve up to 6.6% accuracy improvement in ImageNet classification.
arXiv Detail & Related papers (2021-08-07T11:30:54Z)
Less is More: Pay Less Attention in Vision Transformers [61.05787583247392]
Less attention vIsion Transformer builds upon the fact that convolutions, fully-connected layers, and self-attentions have almost equivalent mathematical expressions for processing image patch sequences. The proposed LIT achieves promising performance on image recognition tasks, including image classification, object detection and instance segmentation.
arXiv Detail & Related papers (2021-05-29T05:26:07Z)
CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification [17.709880544501758]
We propose a dual-branch transformer to combine image patches of different sizes to produce stronger image features. Our approach processes small-patch and large-patch tokens with two separate branches of different computational complexity. Our proposed cross-attention only requires linear time for both computational and memory complexity instead of quadratic time otherwise.
arXiv Detail & Related papers (2021-03-27T13:03:17Z)
Visual Transformers: Token-based Image Representation and Processing for Computer Vision [67.55770209540306]
Visual Transformer ( VT) operates in a semantic token space, judiciously attending to different image parts based on context. Using an advanced training recipe, our VTs significantly outperform their convolutional counterparts. For semantic segmentation on LIP and COCO-stuff, VT-based feature pyramid networks (FPN) achieve 0.35 points higher mIoU while reducing the FPN module's FLOPs by 6.5x.
arXiv Detail & Related papers (2020-06-05T20:49:49Z)

This list is automatically generated from the titles and abstracts of the papers in this site.