HRViT: Multi-Scale High-Resolution Vision Transformer
- URL: http://arxiv.org/abs/2111.01236v1
- Date: Mon, 1 Nov 2021 19:49:52 GMT
- Title: HRViT: Multi-Scale High-Resolution Vision Transformer
- Authors: Jiaqi Gu, Hyoukjun Kwon, Dilin Wang, Wei Ye, Meng Li, Yu-Hsin Chen,
Liangzhen Lai, Vikas Chandra, David Z. Pan
- Abstract summary: Vision transformers (ViTs) have attracted much attention for their superior performance on computer vision tasks.
We present an efficient integration of high-resolution multi-branch architectures with vision transformers, dubbed HRViT.
The proposed HRViT achieves 50.20% mIoU on ADE20K and 83.16% mIoU on Cityscapes for semantic segmentation tasks.
- Score: 19.751569057142806
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision transformers (ViTs) have attracted much attention for their superior
performance on computer vision tasks. To address their limitations of
single-scale low-resolution representations, prior work adapts ViTs to
high-resolution dense prediction tasks with hierarchical architectures to
generate pyramid features. However, multi-scale representation learning is
still under-explored on ViTs, given their classification-like sequential
topology. To enhance ViTs with more capability to learn semantically-rich and
spatially-precise multi-scale representations, in this work, we present an
efficient integration of high-resolution multi-branch architectures with vision
transformers, dubbed HRViT, pushing the Pareto front of dense prediction tasks
to a new level. We explore heterogeneous branch design, reduce the redundancy
in linear layers, and augment the model nonlinearity to balance the model
performance and hardware efficiency. The proposed HRViT achieves 50.20% mIoU on
ADE20K and 83.16% mIoU on Cityscapes for semantic segmentation tasks,
surpassing state-of-the-art MiT and CSWin with an average of +1.78 mIoU
improvement, 28% parameter reduction, and 21% FLOPs reduction, demonstrating
the potential of HRViT as strong vision backbones.
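To make the high-resolution multi-branch idea concrete, below is a minimal PyTorch sketch (not the authors' code) of one HRViT-style stage: two parallel branches kept at different resolutions, each processed by a transformer block, followed by a cross-resolution fusion step. All names and hyperparameters here (BranchBlock, CrossResolutionFusion, the example widths and head counts) are illustrative assumptions; the paper's actual blocks add further refinements such as heterogeneous branch designs, reduced linear-layer redundancy, and augmented nonlinearity.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BranchBlock(nn.Module):
    """One transformer block applied to a single-resolution branch."""

    def __init__(self, dim, num_heads=4, mlp_ratio=4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )

    def forward(self, x):                      # x: (B, H*W, C)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out
        return x + self.mlp(self.norm2(x))


class CrossResolutionFusion(nn.Module):
    """Exchange information between a high-res and a low-res branch."""

    def __init__(self, dim_hi, dim_lo):
        super().__init__()
        self.lo_to_hi = nn.Linear(dim_lo, dim_hi)   # channel projections
        self.hi_to_lo = nn.Linear(dim_hi, dim_lo)

    def forward(self, x_hi, x_lo, hw_hi, hw_lo):
        B = x_hi.shape[0]
        # low-res -> high-res: project channels, then upsample spatially
        lo = self.lo_to_hi(x_lo).transpose(1, 2).reshape(B, -1, *hw_lo)
        lo = F.interpolate(lo, size=hw_hi, mode="bilinear", align_corners=False)
        lo = lo.flatten(2).transpose(1, 2)
        # high-res -> low-res: project channels, then downsample spatially
        hi = self.hi_to_lo(x_hi).transpose(1, 2).reshape(B, -1, *hw_hi)
        hi = F.adaptive_avg_pool2d(hi, hw_lo).flatten(2).transpose(1, 2)
        return x_hi + lo, x_lo + hi


# Toy usage: a narrow high-resolution branch and a wide low-resolution branch.
x_hi = torch.randn(2, 56 * 56, 32)
x_lo = torch.randn(2, 28 * 28, 64)
x_hi, x_lo = BranchBlock(32)(x_hi), BranchBlock(64)(x_lo)
x_hi, x_lo = CrossResolutionFusion(32, 64)(x_hi, x_lo, (56, 56), (28, 28))
```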
Related papers
- LaVin-DiT: Large Vision Diffusion Transformer [99.98106406059333]
LaVin-DiT is a scalable and unified foundation model designed to tackle over 20 computer vision tasks in a generative framework.
We introduce key innovations to optimize generative performance for vision tasks.
The model is scaled from 0.1B to 3.4B parameters, demonstrating substantial scalability and state-of-the-art performance across diverse vision tasks.
arXiv Detail & Related papers (2024-11-18T12:05:27Z)
- Navigating Efficiency in MobileViT through Gaussian Process on Global Architecture Factors [11.030156344387732]
We leverage the Gaussian process to explore the relationship between performance and global architecture factors of MobileViT.
We present design principles, built around a "magic 4D cube" of the global architecture factors, that minimize model size and computational cost while achieving higher model accuracy.
Experimental results show that our formula significantly outperforms CNNs and mobile ViTs across diverse datasets.
arXiv Detail & Related papers (2024-06-07T10:41:24Z)
- HSViT: Horizontally Scalable Vision Transformer [16.46308352393693]
Vision Transformer (ViT) requires pre-training on large-scale datasets to perform well.
This paper introduces a novel horizontally scalable vision transformer (HSViT) scheme.
HSViT achieves up to 10% higher top-1 accuracy than state-of-the-art schemes on small datasets.
arXiv Detail & Related papers (2024-04-08T04:53:29Z)
- HIRI-ViT: Scaling Vision Transformer with High Resolution Inputs [102.4965532024391]
Hybrid deep models of Vision Transformer (ViT) and Convolutional Neural Network (CNN) have emerged as a powerful class of backbones for vision tasks.
We present a new hybrid backbone with HIgh-Resolution Inputs (namely HIRI-ViT), which upgrades the prevalent four-stage ViT to a five-stage ViT tailored for high-resolution inputs.
HIRI-ViT achieves the best published Top-1 accuracy to date of 84.3% on ImageNet with 448×448 inputs, an absolute 0.9% improvement over the 83.4% of iFormer-S with 224×224 inputs.
arXiv Detail & Related papers (2024-03-18T17:34:29Z)
- ViT-CoMer: Vision Transformer with Convolutional Multi-scale Feature Interaction for Dense Predictions [4.554319452683839]
Vision Transformer (ViT) has achieved significant success in computer vision, but does not perform well in dense prediction tasks.
We present a plain, pre-training-free, and feature-enhanced ViT backbone with Convolutional Multi-scale feature interaction, named ViT-CoMer.
We propose a simple and efficient CNN-Transformer bidirectional fusion interaction module that performs multi-scale fusion across hierarchical features.
arXiv Detail & Related papers (2024-03-12T07:59:41Z)
- ViR: Towards Efficient Vision Retention Backbones [97.93707844681893]
We propose a new class of computer vision models, dubbed Vision Retention Networks (ViR).
ViR has dual parallel and recurrent formulations, which strike an optimal balance between fast inference and parallel training with competitive performance.
We have validated the effectiveness of ViR through extensive experiments with different dataset sizes and various image resolutions.
arXiv Detail & Related papers (2023-10-30T16:55:50Z)
- Hierarchical Side-Tuning for Vision Transformers [33.536948382414316]
Fine-tuning pre-trained Vision Transformers (ViTs) has showcased significant promise in enhancing visual recognition tasks.
Parameter-efficient transfer learning (PETL) has shown potential for achieving high performance with fewer parameter updates than full fine-tuning.
This paper introduces Hierarchical Side-Tuning (HST), an innovative PETL method facilitating the transfer of ViT models to diverse downstream tasks.
arXiv Detail & Related papers (2023-10-09T04:16:35Z)
- Edge-MoE: Memory-Efficient Multi-Task Vision Transformer Architecture with Task-level Sparsity via Mixture-of-Experts [60.1586169973792]
M³ViT is the latest multi-task ViT model that introduces mixture-of-experts (MoE).
MoE achieves better accuracy and over 80% reduction in computation, but leaves challenges for efficient deployment on FPGA.
Our work, dubbed Edge-MoE, solves the challenges to introduce the first end-to-end FPGA accelerator for multi-task ViT with a collection of architectural innovations.
arXiv Detail & Related papers (2023-05-30T02:24:03Z)
- GOHSP: A Unified Framework of Graph and Optimization-based Heterogeneous Structured Pruning for Vision Transformer [76.2625311630021]
Vision transformers (ViTs) have shown very impressive empirical performance in various computer vision tasks, but this comes at the cost of large model sizes and heavy computation.
To mitigate this challenge, structured pruning is a promising solution to compress model size and enable practical efficiency.
We propose GOHSP, a unified framework of Graph and Optimization-based Structured Pruning for ViT models.
arXiv Detail & Related papers (2023-01-13T00:40:24Z)
- Grafting Vision Transformers [42.71480918208436]
Vision Transformers (ViTs) have recently become the state-of-the-art across many computer vision tasks.
GrafT considers global dependencies and multi-scale information throughout the network.
It has the flexibility of branching out at arbitrary depths and shares most of the parameters and computations of the backbone.
arXiv Detail & Related papers (2022-10-28T07:07:13Z)
- Efficient Self-supervised Vision Transformers for Representation Learning [86.57557009109411]
We show that multi-stage architectures with sparse self-attentions can significantly reduce modeling complexity.
We propose a new pre-training task of region matching which allows the model to capture fine-grained region dependencies.
Our results show that, by combining the two techniques, EsViT achieves 81.3% top-1 accuracy on the ImageNet linear probe evaluation.
arXiv Detail & Related papers (2021-06-17T19:57:33Z)