HIRI-ViT: Scaling Vision Transformer with High Resolution Inputs
- URL: http://arxiv.org/abs/2403.11999v1
- Date: Mon, 18 Mar 2024 17:34:29 GMT
- Title: HIRI-ViT: Scaling Vision Transformer with High Resolution Inputs
- Authors: Ting Yao, Yehao Li, Yingwei Pan, Tao Mei
- Abstract summary: Hybrid deep models of Vision Transformer (ViT) and Convolutional Neural Network (CNN) have emerged as a powerful class of backbones for vision tasks.
We present a new hybrid backbone with HIgh-Resolution Inputs (namely HIRI-ViT), that upgrades prevalent four-stage ViT to five-stage ViT tailored for high-resolution inputs.
HIRI-ViT achieves the best published Top-1 accuracy to date of 84.3% on ImageNet with 448$\times$448 inputs, an absolute improvement of 0.9% over the 83.4% of iFormer-S with 224$\times$224 inputs.
- Score: 102.4965532024391
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The hybrid deep models of Vision Transformer (ViT) and Convolutional Neural Network (CNN) have emerged as a powerful class of backbones for vision tasks. Scaling up the input resolution of such hybrid backbones naturally strengthens model capacity, but inevitably incurs a heavy computational cost that scales quadratically. Instead, we present a new hybrid backbone with HIgh-Resolution Inputs (namely HIRI-ViT), which upgrades the prevalent four-stage ViT to a five-stage ViT tailored for high-resolution inputs. HIRI-ViT is built upon the seminal idea of decomposing the typical CNN operations into two parallel CNN branches in a cost-efficient manner. One high-resolution branch directly takes the primary high-resolution features as inputs, but uses fewer convolution operations. The other low-resolution branch first performs down-sampling and then utilizes more convolution operations over such low-resolution features. Experiments on both the recognition task (ImageNet-1K dataset) and dense prediction tasks (COCO and ADE20K datasets) demonstrate the superiority of HIRI-ViT. More remarkably, under comparable computational cost ($\sim$5.0 GFLOPs), HIRI-ViT achieves the best published Top-1 accuracy to date of 84.3% on ImageNet with 448$\times$448 inputs, an absolute improvement of 0.9% over the 83.4% of iFormer-S with 224$\times$224 inputs.
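Below is a minimal, illustrative PyTorch sketch of the two-branch idea described in the abstract: a high-resolution branch that keeps its convolution budget small runs in parallel with a low-resolution branch that down-samples first and then applies more convolutions, and the two outputs are fused. The module names, channel sizes, and fusion-by-addition step are assumptions made for illustration, not the authors' exact HIRI-ViT implementation.

```python
# Illustrative sketch only: not the official HIRI-ViT code.
import torch
import torch.nn as nn

class TwoBranchBlock(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, stride: int = 2):
        super().__init__()
        # High-resolution branch: operates directly on full-resolution
        # features but keeps the convolution budget small (one strided 3x3).
        self.hi_res = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1),
            nn.BatchNorm2d(out_ch),
        )
        # Low-resolution branch: down-sample first, then spend more
        # convolutions on the cheaper low-resolution features.
        self.lo_res = nn.Sequential(
            nn.AvgPool2d(kernel_size=stride, stride=stride),  # down-sampling
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.GELU(),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
        )
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Both branches end at the same reduced resolution, so they can be
        # fused by simple addition (an assumed fusion choice).
        return self.act(self.hi_res(x) + self.lo_res(x))

if __name__ == "__main__":
    block = TwoBranchBlock(in_ch=3, out_ch=64)
    out = block(torch.randn(1, 3, 448, 448))
    print(out.shape)  # torch.Size([1, 64, 224, 224])
```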
Related papers
- FlowDCN: Exploring DCN-like Architectures for Fast Image Generation with Arbitrary Resolution [33.07779971446476]
We propose FlowDCN, a purely convolution-based generative model that can efficiently generate high-quality images at arbitrary resolutions.
FlowDCN achieves the state-of-the-art 4.30 sFID on the $256\times256$ ImageNet benchmark and comparable resolution extrapolation results.
We believe FlowDCN offers a promising solution to scalable and flexible image synthesis.
arXiv Detail & Related papers (2024-10-30T02:48:50Z) - ViTAR: Vision Transformer with Any Resolution [80.95324692984903]
Vision Transformers experience a performance decline when processing resolutions different from those seen during training.
We introduce fuzzy positional encoding in the Vision Transformer to provide consistent positional awareness across multiple resolutions; an illustrative sketch of this idea appears after the related-papers list below.
Our resulting model, ViTAR, demonstrates impressive adaptability, achieving 83.3% top-1 accuracy at a 1120x1120 resolution and 80.4% accuracy at a 4032x4032 resolution.
arXiv Detail & Related papers (2024-03-27T08:53:13Z) - Q-ViT: Accurate and Fully Quantized Low-bit Vision Transformer [56.87383229709899]
We develop an information rectification module (IRM) and a distribution guided distillation scheme for fully quantized vision transformers (Q-ViT)
Our method achieves a much better performance than the prior arts.
arXiv Detail & Related papers (2022-10-13T04:00:29Z) - Next-ViT: Next Generation Vision Transformer for Efficient Deployment in Realistic Industrial Scenarios [19.94294348122248]
Most vision Transformers (ViTs) cannot perform as efficiently as convolutional neural networks (CNNs) in realistic industrial deployment scenarios.
We propose a next generation vision Transformer for efficient deployment in realistic industrial scenarios, namely Next-ViT.
Next-ViT dominates both CNNs and ViTs in terms of the latency/accuracy trade-off.
arXiv Detail & Related papers (2022-07-12T12:50:34Z) - SepViT: Separable Vision Transformer [20.403430632658946]
Vision Transformers often rely on extensive computational costs to achieve high performance, which is burdensome to deploy on resource-constrained devices.
We draw lessons from depthwise separable convolution and follow its design philosophy to build an efficient Transformer backbone, i.e., the Separable Vision Transformer, abbreviated as SepViT.
SepViT helps to carry out the local-global information interaction within and among the windows in sequential order via a depthwise separable self-attention.
arXiv Detail & Related papers (2022-03-29T09:20:01Z) - HRViT: Multi-Scale High-Resolution Vision Transformer [19.751569057142806]
Vision transformers (ViTs) have attracted much attention for their superior performance on computer vision tasks.
We present an efficient integration of high-resolution multi-branch architectures with vision transformers, dubbed HRViT.
The proposed HRViT achieves 50.20% mIoU on ADE20K and 83.16% mIoU on Cityscapes for semantic segmentation tasks.
arXiv Detail & Related papers (2021-11-01T19:49:52Z) - HRFormer: High-Resolution Transformer for Dense Prediction [99.6060997466614]
We present a High-Resolution Transformer (HRFormer) that learns high-resolution representations for dense prediction tasks.
We take advantage of the multi-resolution parallel design introduced in high-resolution convolutional networks (HRNet).
We demonstrate the effectiveness of the High-Resolution Transformer on both human pose estimation and semantic segmentation tasks.
arXiv Detail & Related papers (2021-10-18T15:37:58Z) - CMT: Convolutional Neural Networks Meet Vision Transformers [68.10025999594883]
Vision transformers have been successfully applied to image recognition tasks due to their ability to capture long-range dependencies within an image.
There are still gaps in both performance and computational cost between transformers and existing convolutional neural networks (CNNs).
We propose a new transformer-based hybrid network that takes advantage of transformers to capture long-range dependencies and of CNNs to model local features.
In particular, our CMT-S achieves 83.5% top-1 accuracy on ImageNet, while using 14x and 2x fewer FLOPs than the existing DeiT and EfficientNet, respectively.
arXiv Detail & Related papers (2021-07-13T17:47:19Z) - Improved Transformer for High-Resolution GANs [69.42469272015481]
We introduce two key ingredients to Transformer to address this challenge.
We show in the experiments that the proposed HiT achieves state-of-the-art FID scores of 31.87 and 2.95 on unconditional ImageNet $128\times 128$ and FFHQ $256\times 256$, respectively.
arXiv Detail & Related papers (2021-06-14T17:39:49Z)
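As a rough illustration of the fuzzy positional encoding idea mentioned in the ViTAR entry above, the following PyTorch sketch jitters token coordinates during training before bilinearly sampling a learnable positional-embedding grid, so the model does not overfit to one exact token layout. The class name, grid size, jitter range, and interpolation scheme are assumptions for illustration and not ViTAR's exact formulation.

```python
# Illustrative sketch only: not the official ViTAR code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FuzzyPositionalEncoding(nn.Module):
    def __init__(self, dim: int, grid_size: int = 14):
        super().__init__()
        # Learnable positional embedding defined on a fixed reference grid.
        self.pos_table = nn.Parameter(torch.zeros(1, dim, grid_size, grid_size))
        nn.init.trunc_normal_(self.pos_table, std=0.02)

    def forward(self, h: int, w: int) -> torch.Tensor:
        # Normalized coordinates of the h x w token grid in [-1, 1].
        ys = torch.linspace(-1.0, 1.0, h)
        xs = torch.linspace(-1.0, 1.0, w)
        grid = torch.stack(torch.meshgrid(ys, xs, indexing="ij"), dim=-1)  # (h, w, 2)
        if self.training:
            # "Fuzzy" part: jitter each coordinate by up to half a grid cell.
            jitter = (torch.rand_like(grid) - 0.5) * (2.0 / max(h, w))
            grid = (grid + jitter).clamp(-1.0, 1.0)
        # Bilinearly sample the reference table at the (possibly jittered)
        # coordinates; grid_sample expects (x, y) order in the last dimension.
        grid = grid.flip(-1).unsqueeze(0)  # (1, h, w, 2)
        pos = F.grid_sample(self.pos_table, grid, mode="bilinear", align_corners=True)
        return pos  # (1, dim, h, w), to be added to patch embeddings elsewhere
```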