TopFormer: Token Pyramid Transformer for Mobile Semantic Segmentation
- URL: http://arxiv.org/abs/2204.05525v1
- Date: Tue, 12 Apr 2022 04:51:42 GMT
- Title: TopFormer: Token Pyramid Transformer for Mobile Semantic Segmentation
- Authors: Wenqiang Zhang, Zilong Huang, Guozhong Luo, Tao Chen, Xinggang Wang,
Wenyu Liu, Gang Yu, Chunhua Shen
- Abstract summary: We present a mobile-friendly architecture named Token Pyramid Vision Transformer (TopFormer).
The proposed TopFormer takes tokens from various scales as input to produce scale-aware semantic features, which are then injected into the corresponding tokens to augment the representation.
On the ADE20K dataset, TopFormer achieves 5% higher accuracy in mIoU than MobileNetV3 with lower latency on an ARM-based mobile device.
- Score: 111.8342799044698
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Although vision transformers (ViTs) have achieved great success in computer
vision, the heavy computational cost hampers their applications to dense
prediction tasks such as semantic segmentation on mobile devices. In this
paper, we present a mobile-friendly architecture named \textbf{To}ken
\textbf{P}yramid Vision Trans\textbf{former} (\textbf{TopFormer}). The proposed
\textbf{TopFormer} takes Tokens from various scales as input to produce
scale-aware semantic features, which are then injected into the corresponding
tokens to augment the representation. Experimental results demonstrate that our
method significantly outperforms CNN- and ViT-based networks across several
semantic segmentation datasets and achieves a good trade-off between accuracy
and latency. On the ADE20K dataset, TopFormer achieves 5\% higher accuracy in
mIoU than MobileNetV3 with lower latency on an ARM-based mobile device.
Furthermore, the tiny version of TopFormer achieves real-time inference on an
ARM-based mobile device with competitive results. The code and models are
available at: https://github.com/hustvl/TopFormer
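To make the scale-aware semantics-injection idea above concrete, here is a minimal PyTorch sketch under our own assumptions (the module name SemanticsInjection, the channel counts, and the use of a stock nn.TransformerEncoder are illustrative and not taken from the official repository): multi-scale features are average-pooled to the coarsest resolution, concatenated into tokens, passed through a small transformer, and the resulting semantics are split and added back into the feature map of each scale.

```python
# Minimal sketch of a TopFormer-style "inject global semantics back into
# multi-scale tokens" module; names and shapes are illustrative assumptions,
# not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticsInjection(nn.Module):
    def __init__(self, channels_per_scale=(32, 64, 128), num_heads=4):
        super().__init__()
        self.channels = channels_per_scale
        total = sum(channels_per_scale)
        # A lightweight transformer encoder acts on the pooled (small) token set.
        layer = nn.TransformerEncoderLayer(d_model=total, nhead=num_heads,
                                           dim_feedforward=2 * total,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, feats):
        # feats: list of feature maps [B, C_i, H_i, W_i] from different stages.
        b = feats[0].shape[0]
        target = feats[-1].shape[-2:]          # pool everything to the coarsest scale
        pooled = [F.adaptive_avg_pool2d(f, target) for f in feats]
        tokens = torch.cat(pooled, dim=1)      # [B, sum(C_i), h, w]
        h, w = tokens.shape[-2:]
        tokens = tokens.flatten(2).transpose(1, 2)    # [B, h*w, sum(C_i)]
        semantics = self.encoder(tokens)              # scale-aware global semantics
        semantics = semantics.transpose(1, 2).reshape(b, -1, h, w)
        # Split per scale and inject (add) back into the corresponding tokens.
        outs, start = [], 0
        for f, c in zip(feats, self.channels):
            sem = semantics[:, start:start + c]
            start += c
            sem = F.interpolate(sem, size=f.shape[-2:],
                                mode='bilinear', align_corners=False)
            outs.append(f + sem)
        return outs
```

A toy call such as SemanticsInjection()([torch.randn(1, 32, 64, 64), torch.randn(1, 64, 32, 32), torch.randn(1, 128, 16, 16)]) pools everything to the 16x16 grid, so the transformer only ever sees 256 tokens regardless of input resolution, which is where the mobile-friendly attention cost comes from in this sketch.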
Related papers
- CAS-ViT: Convolutional Additive Self-attention Vision Transformers for Efficient Mobile Applications [59.193626019860226]
Vision Transformers (ViTs) mark a revolutionary advance in neural networks, with token mixers that provide a powerful global context modeling capability.
We introduce CAS-ViT: Convolutional Additive Self-attention Vision Transformers.
We show that CAS-ViT achieves a competitive performance when compared to other state-of-the-art backbones.
arXiv Detail & Related papers (2024-08-07T11:33:46Z) - Vision Transformer with Sparse Scan Prior [57.37893387775829]
Inspired by the human eye's sparse scanning mechanism, we propose a Sparse Scan Self-Attention (S³A) mechanism.
This mechanism predefines a series of Anchors of Interest for each token and employs local attention to efficiently model the spatial information around these anchors.
Building on S³A, we introduce the Sparse Scan Vision Transformer.
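As a loose illustration of the anchor-based sparsity described above (not the authors' S³A implementation; the class name, the strided anchor grid, and the use of nn.MultiheadAttention are our assumptions), one reading is that every query token attends only to a small, predefined set of anchor tokens rather than to all tokens.

```python
# Loose sketch of a "sparse anchor" attention pattern: each query attends only
# to keys/values sampled at a strided grid of anchor positions (a stand-in for
# the Anchors of Interest above), not to the full token set.
import torch
import torch.nn as nn

class SparseAnchorAttention(nn.Module):
    def __init__(self, dim=64, num_heads=4, anchor_stride=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.stride = anchor_stride

    def forward(self, x):
        # x: [B, H, W, C] token map
        b, h, w, c = x.shape
        q = x.reshape(b, h * w, c)
        # Anchors: every `stride`-th token in both directions.
        anchors = x[:, ::self.stride, ::self.stride, :].reshape(b, -1, c)
        # Queries: all tokens; keys/values: the sparse anchors only.
        out, _ = self.attn(q, anchors, anchors)
        return out.reshape(b, h, w, c)
```

With anchor_stride=4 the key/value set shrinks by a factor of 16, which is the flavor of saving a sparse-scan prior is after; the real S³A additionally models a local window around each anchor.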
arXiv Detail & Related papers (2024-05-22T04:34:36Z) - AiluRus: A Scalable ViT Framework for Dense Prediction [95.1313839257891]
Vision transformers (ViTs) have emerged as a prevalent architecture for vision tasks owing to their impressive performance.
We propose to apply adaptive resolution for different regions in the image according to their importance.
We evaluate our proposed method on three different datasets and observe promising performance.
arXiv Detail & Related papers (2023-11-02T12:48:43Z) - PP-MobileSeg: Explore the Fast and Accurate Semantic Segmentation Model
on Mobile Devices [4.784867435788648]
PP-MobileSeg is a semantic segmentation model that achieves state-of-the-art performance on mobile devices.
The Valid Interpolate Module (VIM) reduces model latency by only interpolating the classes present in the final prediction.
Experiments show that PP-MobileSeg achieves a superior tradeoff between accuracy, model size, and latency compared to other methods.
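A simplified rendering of that valid-interpolation idea, assuming a standard logits-then-upsample segmentation head (the function name and shapes are hypothetical, not the official PP-MobileSeg code): find the classes that occur in the coarse prediction, and only upsample those channels to full resolution.

```python
# Sketch of "only interpolate the classes that actually appear"; a simplified
# rendering of the VIM idea, not the official implementation. For simplicity
# this takes the union of classes over the whole batch (exact for batch size 1).
import torch
import torch.nn.functional as F

def valid_interpolate(logits, out_size):
    # logits: [B, num_classes, h, w] low-resolution class scores.
    coarse_pred = logits.argmax(dim=1)                  # [B, h, w]
    present = torch.unique(coarse_pred)                 # classes seen at low resolution
    # Upsample only the channels of the present classes.
    up = F.interpolate(logits[:, present], size=out_size,
                       mode='bilinear', align_corners=False)
    # Argmax over the reduced channel set, then map back to original class ids.
    return present[up.argmax(dim=1)]                    # [B, H, W] full-resolution labels

# Example: ADE20K has 150 classes, but typically only a handful appear per
# image, so far fewer channels need to be interpolated to full resolution.
pred = valid_interpolate(torch.randn(1, 150, 64, 64), out_size=(512, 512))
```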
arXiv Detail & Related papers (2023-04-11T11:43:10Z) - SeaFormer++: Squeeze-enhanced Axial Transformer for Mobile Visual Recognition [29.522565659389183]
We introduce a new method squeeze-enhanced Axial Transformer (SeaFormer) for mobile visual recognition.
We beat both mobile-friendly rivals and Transformer-based counterparts with better performance and lower latency, without bells and whistles.
arXiv Detail & Related papers (2023-01-30T18:34:16Z) - RTFormer: Efficient Design for Real-Time Semantic Segmentation with
Transformer [63.25665813125223]
We propose RTFormer, an efficient dual-resolution transformer for real-time semantic segmentation.
It achieves a better trade-off between performance and efficiency than CNN-based models.
Experiments on mainstream benchmarks demonstrate the effectiveness of our proposed RTFormer.
arXiv Detail & Related papers (2022-10-13T16:03:53Z) - MobileOne: An Improved One millisecond Mobile Backbone [14.041480018494394]
We analyze different metrics by deploying several mobile-friendly networks on a mobile device.
We design an efficient backbone, MobileOne, with variants achieving an inference time under 1 ms on an iPhone 12.
We show that MobileOne achieves state-of-the-art performance within the efficient architectures while being many times faster on mobile.
arXiv Detail & Related papers (2022-06-08T17:55:11Z) - Separable Self-attention for Mobile Vision Transformers [34.32399598443582]
This paper introduces a separable self-attention method with linear complexity, i.e. $O(k)$.
The improved model, MobileViTv2, is state-of-the-art on several mobile vision tasks, including ImageNet object classification and MS-COCO object detection.
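The separable self-attention mentioned above can be sketched as follows; this is our own minimal rendering of a linear-complexity attention in which a single learned scoring vector replaces the k x k token-to-token interaction (layer names and choices are assumptions, not the MobileViTv2 source).

```python
# Minimal sketch of a separable (linear-complexity) self-attention: one learned
# latent scores every token, the scores build a single global context vector,
# and that vector modulates the values. Cost is O(k) in the token count k.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SeparableSelfAttention(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.to_scores = nn.Linear(dim, 1)   # one scalar score per token, not k x k
        self.to_key = nn.Linear(dim, dim)
        self.to_value = nn.Linear(dim, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        # x: [B, k, dim] tokens
        scores = F.softmax(self.to_scores(x), dim=1)                    # [B, k, 1]
        context = (scores * self.to_key(x)).sum(dim=1, keepdim=True)    # [B, 1, dim]
        out = F.relu(self.to_value(x)) * context   # broadcast global context to all tokens
        return self.proj(out)
```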
arXiv Detail & Related papers (2022-06-06T15:31:35Z) - MobileDets: Searching for Object Detection Architectures for Mobile
Accelerators [61.30355783955777]
Inverted bottleneck layers have been the predominant building blocks in state-of-the-art object detection models on mobile devices.
Regular convolutions are a potent component to boost the latency-accuracy trade-off for object detection on accelerators.
We obtain a family of object detection models, MobileDets, that achieve state-of-the-art results across mobile accelerators.
arXiv Detail & Related papers (2020-04-30T00:21:30Z)
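As a rough illustration of the two building blocks contrasted in the MobileDets entry above (illustrative PyTorch only, not the searched architectures), an inverted bottleneck expands channels, applies a depthwise 3x3, and projects back, whereas a "regular convolution" block fuses the expansion and depthwise steps into a single dense 3x3, which accelerators often execute more efficiently.

```python
# Sketch contrasting a depthwise inverted bottleneck with a fused block that
# uses one regular (dense) 3x3 convolution; block names and hyperparameters
# are illustrative assumptions.
import torch.nn as nn

def inverted_bottleneck(cin, cout, expand=4, stride=1):
    mid = cin * expand
    return nn.Sequential(
        nn.Conv2d(cin, mid, 1, bias=False), nn.BatchNorm2d(mid), nn.ReLU6(inplace=True),
        nn.Conv2d(mid, mid, 3, stride, 1, groups=mid, bias=False),   # depthwise 3x3
        nn.BatchNorm2d(mid), nn.ReLU6(inplace=True),
        nn.Conv2d(mid, cout, 1, bias=False), nn.BatchNorm2d(cout),   # linear projection
    )

def fused_conv_block(cin, cout, expand=4, stride=1):
    mid = cin * expand
    return nn.Sequential(
        nn.Conv2d(cin, mid, 3, stride, 1, bias=False),               # regular (dense) 3x3
        nn.BatchNorm2d(mid), nn.ReLU6(inplace=True),
        nn.Conv2d(mid, cout, 1, bias=False), nn.BatchNorm2d(cout),
    )
```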
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.