TopFormer: Token Pyramid Transformer for Mobile Semantic Segmentation
- URL: http://arxiv.org/abs/2204.05525v1
- Date: Tue, 12 Apr 2022 04:51:42 GMT
- Title: TopFormer: Token Pyramid Transformer for Mobile Semantic Segmentation
- Authors: Wenqiang Zhang, Zilong Huang, Guozhong Luo, Tao Chen, Xinggang Wang,
Wenyu Liu, Gang Yu, Chunhua Shen
- Abstract summary: We present a mobile-friendly architecture named Token Pyramid Vision Transformer (TopFormer).
The proposed TopFormer takes tokens from various scales as input to produce scale-aware semantic features, which are then injected into the corresponding tokens to augment the representation.
On the ADE20K dataset, TopFormer achieves 5% higher accuracy in mIoU than MobileNetV3 with lower latency on an ARM-based mobile device.
- Score: 111.8342799044698
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Although vision transformers (ViTs) have achieved great success in computer
vision, the heavy computational cost hampers their applications to dense
prediction tasks such as semantic segmentation on mobile devices. In this
paper, we present a mobile-friendly architecture named \textbf{To}ken
\textbf{P}yramid Vision Trans\textbf{former} (\textbf{TopFormer}). The proposed
\textbf{TopFormer} takes Tokens from various scales as input to produce
scale-aware semantic features, which are then injected into the corresponding
tokens to augment the representation. Experimental results demonstrate that our
method significantly outperforms CNN- and ViT-based networks across several
semantic segmentation datasets and achieves a good trade-off between accuracy
and latency. On the ADE20K dataset, TopFormer achieves 5\% higher accuracy in
mIoU than MobileNetV3 with lower latency on an ARM-based mobile device.
Furthermore, the tiny version of TopFormer achieves real-time inference on an
ARM-based mobile device with competitive results. The code and models are
available at: https://github.com/hustvl/TopFormer
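To make the scale-aware semantics-injection idea above concrete, here is a minimal PyTorch sketch under our own assumptions (the module name SemanticsInjection, the channel counts, and the use of a stock nn.TransformerEncoder are illustrative and not taken from the official repository): multi-scale features are average-pooled to the coarsest resolution, concatenated into tokens, passed through a small transformer, and the resulting semantics are split and added back into the feature map of each scale.

```python
# Minimal sketch of a TopFormer-style "inject global semantics back into
# multi-scale tokens" module; names and shapes are illustrative assumptions,
# not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticsInjection(nn.Module):
    def __init__(self, channels_per_scale=(32, 64, 128), num_heads=4):
        super().__init__()
        self.channels = channels_per_scale
        total = sum(channels_per_scale)
        # A lightweight transformer encoder acts on the pooled (small) token set.
        layer = nn.TransformerEncoderLayer(d_model=total, nhead=num_heads,
                                           dim_feedforward=2 * total,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, feats):
        # feats: list of feature maps [B, C_i, H_i, W_i] from different stages.
        b = feats[0].shape[0]
        target = feats[-1].shape[-2:]          # pool everything to the coarsest scale
        pooled = [F.adaptive_avg_pool2d(f, target) for f in feats]
        tokens = torch.cat(pooled, dim=1)      # [B, sum(C_i), h, w]
        h, w = tokens.shape[-2:]
        tokens = tokens.flatten(2).transpose(1, 2)    # [B, h*w, sum(C_i)]
        semantics = self.encoder(tokens)              # scale-aware global semantics
        semantics = semantics.transpose(1, 2).reshape(b, -1, h, w)
        # Split per scale and inject (add) back into the corresponding tokens.
        outs, start = [], 0
        for f, c in zip(feats, self.channels):
            sem = semantics[:, start:start + c]
            start += c
            sem = F.interpolate(sem, size=f.shape[-2:],
                                mode='bilinear', align_corners=False)
            outs.append(f + sem)
        return outs
```

A toy call such as SemanticsInjection()([torch.randn(1, 32, 64, 64), torch.randn(1, 64, 32, 32), torch.randn(1, 128, 16, 16)]) pools everything to the 16x16 grid, so the transformer only ever sees 256 tokens regardless of input resolution, which is where the mobile-friendly attention cost comes from in this sketch.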
Related papers
- CAS-ViT: Convolutional Additive Self-attention Vision Transformers for Efficient Mobile Applications [59.193626019860226]
Vision Transformers (ViTs) mark a revolutionary advance in neural networks, with token mixers that provide a powerful global context modeling capability.
We introduce CAS-ViT: Convolutional Additive Self-attention Vision Transformers.
We show that CAS-ViT achieves a competitive performance when compared to other state-of-the-art backbones.
arXiv Detail & Related papers (2024-08-07T11:33:46Z) - Vision Transformer with Sparse Scan Prior [57.37893387775829]
Inspired by the human eye's sparse scanning mechanism, we propose a Sparse Scan Self-Attention (S³A) mechanism.
This mechanism predefines a series of Anchors of Interest for each token and employs local attention to efficiently model the spatial information around these anchors.
Building on S³A, we introduce the Sparse Scan Vision Transformer.
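As a loose illustration of the anchor-based sparsity described above (not the authors' S³A implementation; the class name, the strided anchor grid, and the use of nn.MultiheadAttention are our assumptions), one reading is that every query token attends only to a small, predefined set of anchor tokens rather than to all tokens.

```python
# Loose sketch of a "sparse anchor" attention pattern: each query attends only
# to keys/values sampled at a strided grid of anchor positions (a stand-in for
# the Anchors of Interest above), not to the full token set.
import torch
import torch.nn as nn

class SparseAnchorAttention(nn.Module):
    def __init__(self, dim=64, num_heads=4, anchor_stride=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.stride = anchor_stride

    def forward(self, x):
        # x: [B, H, W, C] token map
        b, h, w, c = x.shape
        q = x.reshape(b, h * w, c)
        # Anchors: every `stride`-th token in both directions.
        anchors = x[:, ::self.stride, ::self.stride, :].reshape(b, -1, c)
        # Queries: all tokens; keys/values: the sparse anchors only.
        out, _ = self.attn(q, anchors, anchors)
        return out.reshape(b, h, w, c)
```

With anchor_stride=4 the key/value set shrinks by a factor of 16, which is the flavor of saving a sparse-scan prior is after; the real S³A additionally models a local window around each anchor.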
arXiv Detail & Related papers (2024-05-22T04:34:36Z) - AiluRus: A Scalable ViT Framework for Dense Prediction [95.1313839257891]
Vision transformers (ViTs) have emerged as a prevalent architecture for vision tasks owing to their impressive performance.
We propose to apply adaptive resolution for different regions in the image according to their importance.
We evaluate our proposed method on three different datasets and observe promising performance.
arXiv Detail & Related papers (2023-11-02T12:48:43Z) - PP-MobileSeg: Explore the Fast and Accurate Semantic Segmentation Model
on Mobile Devices [4.784867435788648]
PP-MobileSeg is a semantic segmentation model that achieves state-of-the-art performance on mobile devices.
The Valid Interpolate Module (VIM) reduces model latency by only interpolating the classes present in the final prediction.
Experiments show that PP-MobileSeg achieves a superior tradeoff between accuracy, model size, and latency compared to other methods.
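A simplified rendering of that valid-interpolation idea, assuming a standard logits-then-upsample segmentation head (the function name and shapes are hypothetical, not the official PP-MobileSeg code): find the classes that occur in the coarse prediction, and only upsample those channels to full resolution.

```python
# Sketch of "only interpolate the classes that actually appear"; a simplified
# rendering of the VIM idea, not the official implementation. For simplicity
# this takes the union of classes over the whole batch (exact for batch size 1).
import torch
import torch.nn.functional as F

def valid_interpolate(logits, out_size):
    # logits: [B, num_classes, h, w] low-resolution class scores.
    coarse_pred = logits.argmax(dim=1)                  # [B, h, w]
    present = torch.unique(coarse_pred)                 # classes seen at low resolution
    # Upsample only the channels of the present classes.
    up = F.interpolate(logits[:, present], size=out_size,
                       mode='bilinear', align_corners=False)
    # Argmax over the reduced channel set, then map back to original class ids.
    return present[up.argmax(dim=1)]                    # [B, H, W] full-resolution labels

# Example: ADE20K has 150 classes, but typically only a handful appear per
# image, so far fewer channels need to be interpolated to full resolution.
pred = valid_interpolate(torch.randn(1, 150, 64, 64), out_size=(512, 512))
```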
arXiv Detail & Related papers (2023-04-11T11:43:10Z) - SeaFormer++: Squeeze-enhanced Axial Transformer for Mobile Visual Recognition [29.522565659389183]
We introduce a new method squeeze-enhanced Axial Transformer (SeaFormer) for mobile visual recognition.
We beat both mobile-friendly rivals and Transformer-based counterparts with better performance and lower latency, without bells and whistles.
arXiv Detail & Related papers (2023-01-30T18:34:16Z) - RTFormer: Efficient Design for Real-Time Semantic Segmentation with
Transformer [63.25665813125223]
We propose RTFormer, an efficient dual-resolution transformer for real-time semantic segmentation.
It achieves a better trade-off between performance and efficiency than CNN-based models.
Experiments on mainstream benchmarks demonstrate the effectiveness of our proposed RTFormer.
arXiv Detail & Related papers (2022-10-13T16:03:53Z) - MobileOne: An Improved One millisecond Mobile Backbone [14.041480018494394]
We analyze different metrics by deploying several mobile-friendly networks on a mobile device.
We design an efficient backbone, MobileOne, with variants achieving an inference time under 1 ms on an iPhone 12.
We show that MobileOne achieves state-of-the-art performance within the efficient architectures while being many times faster on mobile.
arXiv Detail & Related papers (2022-06-08T17:55:11Z) - Separable Self-attention for Mobile Vision Transformers [34.32399598443582]
This paper introduces a separable self-attention method with linear complexity, i.e. $O(k)$.
The improved model, MobileViTv2, is state-of-the-art on several mobile vision tasks, including ImageNet object classification and MS-COCO object detection.
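The separable self-attention mentioned above can be sketched as follows; this is our own minimal rendering of a linear-complexity attention in which a single learned scoring vector replaces the k x k token-to-token interaction (layer names and choices are assumptions, not the MobileViTv2 source).

```python
# Minimal sketch of a separable (linear-complexity) self-attention: one learned
# latent scores every token, the scores build a single global context vector,
# and that vector modulates the values. Cost is O(k) in the token count k.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SeparableSelfAttention(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.to_scores = nn.Linear(dim, 1)   # one scalar score per token, not k x k
        self.to_key = nn.Linear(dim, dim)
        self.to_value = nn.Linear(dim, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        # x: [B, k, dim] tokens
        scores = F.softmax(self.to_scores(x), dim=1)                    # [B, k, 1]
        context = (scores * self.to_key(x)).sum(dim=1, keepdim=True)    # [B, 1, dim]
        out = F.relu(self.to_value(x)) * context   # broadcast global context to all tokens
        return self.proj(out)
```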
arXiv Detail & Related papers (2022-06-06T15:31:35Z) - MobileDets: Searching for Object Detection Architectures for Mobile
Accelerators [61.30355783955777]
Inverted bottleneck layers have been the predominant building blocks in state-of-the-art object detection models on mobile devices.
Regular convolutions are a potent component to boost the latency-accuracy trade-off for object detection on accelerators.
We obtain a family of object detection models, MobileDets, that achieve state-of-the-art results across mobile accelerators.
arXiv Detail & Related papers (2020-04-30T00:21:30Z)
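As a rough illustration of the two building blocks contrasted in the MobileDets entry above (illustrative PyTorch only, not the searched architectures), an inverted bottleneck expands channels, applies a depthwise 3x3, and projects back, whereas a "regular convolution" block fuses the expansion and depthwise steps into a single dense 3x3, which accelerators often execute more efficiently.

```python
# Sketch contrasting a depthwise inverted bottleneck with a fused block that
# uses one regular (dense) 3x3 convolution; block names and hyperparameters
# are illustrative assumptions.
import torch.nn as nn

def inverted_bottleneck(cin, cout, expand=4, stride=1):
    mid = cin * expand
    return nn.Sequential(
        nn.Conv2d(cin, mid, 1, bias=False), nn.BatchNorm2d(mid), nn.ReLU6(inplace=True),
        nn.Conv2d(mid, mid, 3, stride, 1, groups=mid, bias=False),   # depthwise 3x3
        nn.BatchNorm2d(mid), nn.ReLU6(inplace=True),
        nn.Conv2d(mid, cout, 1, bias=False), nn.BatchNorm2d(cout),   # linear projection
    )

def fused_conv_block(cin, cout, expand=4, stride=1):
    mid = cin * expand
    return nn.Sequential(
        nn.Conv2d(cin, mid, 3, stride, 1, bias=False),               # regular (dense) 3x3
        nn.BatchNorm2d(mid), nn.ReLU6(inplace=True),
        nn.Conv2d(mid, cout, 1, bias=False), nn.BatchNorm2d(cout),
    )
```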
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.