PP-MobileSeg: Explore the Fast and Accurate Semantic Segmentation Model on Mobile Devices
- URL: http://arxiv.org/abs/2304.05152v1
- Date: Tue, 11 Apr 2023 11:43:10 GMT
- Title: PP-MobileSeg: Explore the Fast and Accurate Semantic Segmentation Model on Mobile Devices
- Authors: Shiyu Tang, Ting Sun, Juncai Peng, Guowei Chen, Yuying Hao, Manhui Lin, Zhihong Xiao, Jiangbin You, Yi Liu
- Abstract summary: PP-MobileSeg is a semantic segmentation model that achieves state-of-the-art performance on mobile devices.
VIM reduces model latency by interpolating only the classes present in the final prediction.
Experiments show that PP-MobileSeg achieves a superior tradeoff between accuracy, model size, and latency compared to other methods.
- Score: 4.784867435788648
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The success of transformers in computer vision has led to several
attempts to adapt them for mobile devices, but their performance remains
unsatisfactory in some real-world applications. To address this issue, we
propose PP-MobileSeg, a semantic segmentation model that achieves
state-of-the-art performance on mobile devices. PP-MobileSeg comprises three
novel parts: the StrideFormer backbone, the Aggregated Attention Module (AAM),
and the Valid Interpolate Module (VIM). The four-stage StrideFormer backbone is
built with MV3 blocks and strided SEA attention, and it extracts rich semantic
and detailed features with minimal parameter overhead. The AAM first filters
the detailed features through semantic feature ensemble voting and then
combines them with the semantic features to enhance the semantic information.
Furthermore, we propose VIM to upsample the downsampled features to the
resolution of the input image. VIM significantly reduces model latency by
interpolating only the classes present in the final prediction, since this
final upsampling step is the largest contributor to overall model latency.
Extensive experiments show that PP-MobileSeg achieves a superior tradeoff
between accuracy, model size, and latency compared to other methods. On the
ADE20K dataset, PP-MobileSeg achieves 1.57% higher mIoU than SeaFormer-Base
with 32.9% fewer parameters and 42.3% faster inference on the Qualcomm
Snapdragon 855. Source code is available at
https://github.com/PaddlePaddle/PaddleSeg/tree/release/2.8.
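To make the VIM idea concrete, below is a minimal sketch of valid
interpolation. It is written in PyTorch-style Python rather than the paper's
PaddlePaddle, and the function name, tensor shapes, and argmax placement are
illustrative assumptions, not code from the PaddleSeg release.

```python
import torch
import torch.nn.functional as F

def valid_interpolate(logits: torch.Tensor, out_size) -> torch.Tensor:
    """Sketch of VIM-style upsampling (assumed interface, not PaddleSeg's).

    Only the class channels that appear in the low-resolution argmax are
    interpolated; channels absent from the coarse prediction cannot win
    the final argmax, so upsampling them would be wasted work.
    """
    # logits: (N, C, h, w) low-resolution class scores
    coarse = logits.argmax(dim=1)             # (N, h, w) coarse label map
    present = torch.unique(coarse)            # class ids that actually appear
    valid = logits[:, present]                # (N, k, h, w) with k <= C
    up = F.interpolate(valid, size=out_size,
                       mode='bilinear', align_corners=False)
    # Argmax over the reduced channel set, then map back to class ids.
    return present[up.argmax(dim=1)]          # (N, H, W) full-size label map
```

Since a typical image contains only a small fraction of ADE20K's 150 classes,
upsampling k << C channels shrinks what the abstract identifies as the largest
contributor to overall model latency.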
Related papers
- SDPose: Tokenized Pose Estimation via Circulation-Guide Self-Distillation [53.675725490807615]
We introduce SDPose, a new self-distillation method for improving the performance of small transformer-based models.
SDPose-T obtains 69.7% mAP with 4.4M parameters and 1.8 GFLOPs, while SDPose-S-V2 obtains 73.5% mAP on the MSCOCO validation dataset.
arXiv Detail & Related papers (2024-04-04T15:23:14Z)
- Group Multi-View Transformer for 3D Shape Analysis with Spatial Encoding [81.1943823985213]
In recent years, the results of view-based 3D shape recognition methods have saturated, and models with excellent performance cannot be deployed on memory-limited devices.
We introduce a compression method based on knowledge distillation for this field, which largely reduces the number of parameters while preserving model performance as much as possible.
Specifically, to enhance the capabilities of smaller models, we design a high-performing large model called the Group Multi-view Vision Transformer (GMViT).
The large model GMViT achieves excellent 3D classification and retrieval results on the benchmark datasets ModelNet, ShapeNetCore55, and MCB.
arXiv Detail & Related papers (2023-12-27T08:52:41Z)
- Mobile V-MoEs: Scaling Down Vision Transformers via Sparse Mixture-of-Experts [55.282613372420805]
We explore the use of sparse MoEs to scale down Vision Transformers (ViTs) and make them more attractive for resource-constrained vision applications.
We propose a simplified and mobile-friendly MoE design where entire images rather than individual patches are routed to the experts.
We empirically show that our sparse Mobile Vision MoEs (V-MoEs) can achieve a better trade-off between performance and efficiency than the corresponding dense ViTs.
arXiv Detail & Related papers (2023-09-08T14:24:10Z) - MobileInst: Video Instance Segmentation on the Mobile [39.144494585640714]
MobileInst is a lightweight and mobile-friendly framework for video instance segmentation on mobile devices.
MobileInst exploits simple yet effective kernel reuse and kernel association to track objects for video instance segmentation.
We conduct experiments on COCO and YouTube-VIS datasets to demonstrate the superiority of MobileInst.
arXiv Detail & Related papers (2023-03-30T17:59:02Z)
- MobileOne: An Improved One millisecond Mobile Backbone [14.041480018494394]
We analyze different metrics by deploying several mobile-friendly networks on a mobile device.
We design an efficient backbone, MobileOne, with variants achieving an inference time under 1 ms on an iPhone 12.
We show that MobileOne achieves state-of-the-art performance within the efficient architectures while being many times faster on mobile.
arXiv Detail & Related papers (2022-06-08T17:55:11Z)
- TopFormer: Token Pyramid Transformer for Mobile Semantic Segmentation [111.8342799044698]
We present a mobile-friendly architecture named Token Pyramid Vision Transformer (TopFormer).
The proposed TopFormer takes Tokens from various scales as input to produce scale-aware semantic features, which are then injected into the corresponding tokens to augment the representation.
On the ADE20K dataset, TopFormer achieves 5% higher mIoU than MobileNetV3 with lower latency on an ARM-based mobile device.
arXiv Detail & Related papers (2022-04-12T04:51:42Z)
- YOLO-ReT: Towards High Accuracy Real-time Object Detection on Edge GPUs [14.85882314822983]
In order to map deep neural network (DNN) based object detection models to edge devices, one typically needs to compress such models significantly.
In this paper, we propose a novel edge GPU friendly module for multi-scale feature interaction.
We also propose a novel transfer learning backbone adoption inspired by the changing translational information flow across various tasks.
arXiv Detail & Related papers (2021-10-26T14:02:59Z)
- When Liebig's Barrel Meets Facial Landmark Detection: A Practical Model [87.25037167380522]
We propose a model that is accurate, robust, efficient, generalizable, and end-to-end trainable.
To achieve better accuracy, we propose two lightweight modules.
DQInit dynamically initializes the decoder queries from the inputs, enabling the model to achieve accuracy as good as models with multiple decoder layers.
QAMem is designed to enhance the discriminative ability of queries on low-resolution feature maps by assigning separate memory values to each query rather than a shared one.
arXiv Detail & Related papers (2021-05-27T13:51:42Z)
- MobileDets: Searching for Object Detection Architectures for Mobile Accelerators [61.30355783955777]
Inverted bottleneck layers have been the predominant building blocks in state-of-the-art object detection models on mobile devices.
Regular convolutions, however, are a potent component for boosting the latency-accuracy trade-off of object detection on accelerators.
We obtain a family of object detection models, MobileDets, that achieve state-of-the-art results across mobile accelerators.
arXiv Detail & Related papers (2020-04-30T00:21:30Z)