MobileInst: Video Instance Segmentation on the Mobile
- URL: http://arxiv.org/abs/2303.17594v2
- Date: Mon, 18 Dec 2023 08:46:19 GMT
- Title: MobileInst: Video Instance Segmentation on the Mobile
- Authors: Renhong Zhang, Tianheng Cheng, Shusheng Yang, Haoyi Jiang, Shuai
Zhang, Jiancheng Lyu, Xin Li, Xiaowen Ying, Dashan Gao, Wenyu Liu, Xinggang
Wang
- Abstract summary: MobileInst is a lightweight and mobile-friendly framework for video instance segmentation on mobile devices.
MobileInst exploits simple yet effective kernel reuse and kernel association to track objects for video instance segmentation.
We conduct experiments on COCO and YouTube-VIS datasets to demonstrate the superiority of MobileInst.
- Score: 39.144494585640714
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Video instance segmentation on mobile devices is an important yet very
challenging edge AI problem. It mainly suffers from (1) heavy computation and
memory costs for frame-by-frame pixel-level instance perception and (2)
complicated heuristics for tracking objects. To address those issues, we
present MobileInst, a lightweight and mobile-friendly framework for video
instance segmentation on mobile devices. Firstly, MobileInst adopts a mobile
vision transformer to extract multi-level semantic features and presents an
efficient query-based dual-transformer instance decoder for mask kernels and a
semantic-enhanced mask decoder to generate instance segmentation per frame.
Secondly, MobileInst exploits simple yet effective kernel reuse and kernel
association to track objects for video instance segmentation. Further, we
propose temporal query passing to enhance the tracking ability for kernels. We
conduct experiments on COCO and YouTube-VIS datasets to demonstrate the
superiority of MobileInst and evaluate the inference latency on a single CPU
core of the Snapdragon 778G mobile platform, without any other acceleration techniques.
On the COCO dataset, MobileInst achieves 31.2 mask AP at 433 ms latency on the mobile
CPU, reducing latency by 50% compared to the previous SOTA. For video
instance segmentation, MobileInst achieves 35.0 AP on YouTube-VIS 2019 and 30.1
AP on YouTube-VIS 2021. Code will be available to facilitate real-world
applications and future research.
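Concretely, the kernel-reuse idea can be pictured as dynamic per-instance kernels that are dotted with shared mask features, with tracking reduced to matching kernels across frames. The following is a minimal PyTorch sketch of that scheme, not the paper's actual decoder; the function names and the greedy cosine-similarity matching are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def masks_from_kernels(kernels, mask_features):
    """Each query's kernel acts as a dynamic 1x1 conv over shared
    per-pixel mask features, yielding one mask per instance."""
    # kernels: (N, C), mask_features: (C, H, W) -> masks: (N, H, W)
    return torch.einsum("nc,chw->nhw", kernels, mask_features).sigmoid()

def associate_kernels(prev_kernels, cur_kernels):
    """Match current-frame kernels to previous-frame kernels by cosine
    similarity: reused kernels stay close in embedding space, so the
    best match per kernel serves as a track assignment."""
    sim = F.cosine_similarity(cur_kernels[:, None, :],
                              prev_kernels[None, :, :], dim=-1)  # (N_cur, N_prev)
    return sim.argmax(dim=1)  # best previous kernel per current kernel

# Toy usage: 10 instance queries with 32-dim kernels on a 64x64 feature map.
prev = torch.randn(10, 32)
cur = prev + 0.05 * torch.randn(10, 32)   # slightly drifted "next frame" kernels
feats = torch.randn(32, 64, 64)
masks = masks_from_kernels(cur, feats)    # (10, 64, 64) soft masks
track_ids = associate_kernels(prev, cur)  # (10,) indices into previous frame
```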
Related papers
- PP-MobileSeg: Explore the Fast and Accurate Semantic Segmentation Model on Mobile Devices [4.784867435788648]
PP-MobileSeg is a semantic segmentation model that achieves state-of-the-art performance on mobile devices.
Its Valid Interpolate Module (VIM) reduces model latency by interpolating only the classes present in the final prediction.
Experiments show that PP-MobileSeg achieves a superior tradeoff between accuracy, model size, and latency compared to other methods.
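The class-filtered upsampling idea can be sketched in a few lines of PyTorch; the function below is a hypothetical reconstruction of the concept, not PP-MobileSeg's actual VIM implementation.

```python
import torch
import torch.nn.functional as F

def valid_interpolate(logits, out_size):
    """Upsample only the class channels that appear in the low-resolution
    argmax prediction, instead of interpolating all num_classes channels."""
    # logits: (1, num_classes, h, w) low-resolution class scores
    present = logits.argmax(dim=1).unique()            # classes actually predicted
    up = F.interpolate(logits[:, present], size=out_size,
                       mode="bilinear", align_corners=False)
    return present[up.argmax(dim=1)]                   # map back to class IDs

logits = torch.randn(1, 150, 64, 64)                   # e.g. 150 ADE20K classes
label_map = valid_interpolate(logits, (512, 512))      # (1, 512, 512) labels
```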
arXiv Detail & Related papers (2023-04-11T11:43:10Z)
- You Only Segment Once: Towards Real-Time Panoptic Segmentation [68.91492389185744]
YOSO is a real-time panoptic segmentation framework.
YOSO predicts masks via dynamic convolutions between panoptic kernels and image feature maps.
YOSO achieves 46.4 PQ, 45.6 FPS on COCO; 52.5 PQ, 22.6 FPS on Cityscapes; 38.0 PQ, 35.4 FPS on ADE20K.
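As a rough PyTorch sketch of that mechanism (shapes and names are assumptions), dynamic convolution amounts to treating each panoptic kernel as a 1x1 convolution over a shared feature map:

```python
import torch
import torch.nn.functional as F

# Dynamic convolution between panoptic kernels and a shared feature map:
# each of K kernels is a 1x1 conv, so things and stuff are segmented by
# one mechanism, and each pixel is assigned to its highest-scoring kernel.
features = torch.randn(1, 256, 128, 128)               # (B, C, H, W)
kernels = torch.randn(100, 256)                        # K panoptic kernels

mask_logits = F.conv2d(features, kernels[:, :, None, None])  # (1, K, 128, 128)
panoptic = mask_logits.argmax(dim=1)                   # (1, 128, 128) segment IDs
```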
arXiv Detail & Related papers (2023-03-26T07:55:35Z)
- InstMove: Instance Motion for Object-centric Video Segmentation [70.16915119724757]
In this work, we study the instance-level motion and present InstMove, which stands for Instance Motion for Object-centric Video Segmentation.
In comparison to pixel-wise motion, InstMove mainly relies on instance-level motion information that is free from image feature embeddings.
With only a few lines of code, InstMove can be integrated into current SOTA methods for three different video segmentation tasks.
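A toy PyTorch stand-in for such a mask-only motion model might look as follows; the architecture is an illustrative assumption, not InstMove's actual network.

```python
import torch
import torch.nn as nn

class MaskMotionModel(nn.Module):
    """Toy stand-in for an instance-level motion model: given one instance's
    masks over the last T frames, predict its next-frame mask. It operates
    purely on masks, so it needs no image feature embeddings."""
    def __init__(self, t: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(t, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1),
        )

    def forward(self, past_masks):                # (B, T, H, W) binary masks
        return self.net(past_masks).sigmoid()     # (B, 1, H, W) predicted mask

pred = MaskMotionModel(t=3)(torch.rand(2, 3, 64, 64))
```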
arXiv Detail & Related papers (2023-03-14T17:58:44Z)
- SeaFormer++: Squeeze-enhanced Axial Transformer for Mobile Visual Recognition [29.522565659389183]
We introduce a new method squeeze-enhanced Axial Transformer (SeaFormer) for mobile visual recognition.
We beat both the mobile-friendly rivals and Transformer-based counterparts, with better performance and lower latency, without bells and whistles.
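A loose PyTorch sketch of the axial-attention-with-squeeze idea follows; the paper's actual block differs in its projections and detail-enhancement branch, so treat this as a conceptual approximation.

```python
import torch
import torch.nn as nn

class AxialSqueezeAttention(nn.Module):
    """Loose sketch: squeeze the map into a row sequence and a column
    sequence, run self-attention on each (cost O(H^2 + W^2) rather than
    O((HW)^2)), and broadcast the axial contexts back over the map."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.row_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.col_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                       # (B, C, H, W)
        rows = x.mean(dim=3).transpose(1, 2)    # (B, H, C): squeeze out width
        cols = x.mean(dim=2).transpose(1, 2)    # (B, W, C): squeeze out height
        r, _ = self.row_attn(rows, rows, rows)
        c, _ = self.col_attn(cols, cols, cols)
        return (x + r.transpose(1, 2).unsqueeze(-1)    # (B, C, H, 1)
                  + c.transpose(1, 2).unsqueeze(-2))   # (B, C, 1, W)

y = AxialSqueezeAttention(64)(torch.randn(1, 64, 32, 32))
```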
arXiv Detail & Related papers (2023-01-30T18:34:16Z)
- It Takes Two: Masked Appearance-Motion Modeling for Self-supervised Video Transformer Pre-training [76.69480467101143]
Self-supervised video transformer pre-training has recently benefited from the mask-and-predict pipeline.
We explicitly investigate motion cues in videos as extra prediction target and propose our Masked Appearance-Motion Modeling framework.
Our method learns generalized video representations and achieves 82.3% on Kinetics-400, 71.3% on Something-Something V2, 91.5% on UCF101, and 62.5% on HMDB51.
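In sketch form, the framework reconstructs two targets from masked video tokens. In the snippet below, simple frame differences stand in for the paper's motion target, which is an assumption on our part; the encoder and prediction heads are omitted.

```python
import torch

def mam_targets(video):
    """Build two reconstruction targets for masked video pre-training:
    raw frames (appearance) and frame differences (a simple motion cue,
    standing in for the paper's motion target)."""
    # video: (B, T, C, H, W)
    return video, video[:, 1:] - video[:, :-1]

clip = torch.randn(2, 8, 3, 32, 32)              # toy video clip
appearance, motion = mam_targets(clip)           # both predicted for masked
                                                 # patches during pre-training
```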
arXiv Detail & Related papers (2022-10-11T08:05:18Z)
- MobileOne: An Improved One millisecond Mobile Backbone [14.041480018494394]
We analyze different metrics by deploying several mobile-friendly networks on a mobile device.
We design an efficient backbone, MobileOne, with variants achieving an inference time under 1 ms on an iPhone 12.
We show that MobileOne achieves state-of-the-art performance within the efficient architectures while being many times faster on mobile.
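MobileOne's speed rests on train-time over-parameterization that is folded into plain convolutions at inference. The snippet below sketches the standard reparameterization step of folding a parallel 1x1 branch into a 3x3 conv; BatchNorm folding and MobileOne's extra branches are omitted for brevity.

```python
import torch
import torch.nn.functional as F

def fuse_branches(w3, b3, w1, b1):
    """RepVGG-style reparameterization: fold a parallel 1x1 branch into a
    3x3 conv by zero-padding the 1x1 kernel to 3x3 and summing weights and
    biases, so the multi-branch block becomes a single conv at inference."""
    return w3 + F.pad(w1, [1, 1, 1, 1]), b3 + b1   # pad (O, I, 1, 1) -> (O, I, 3, 3)

w3, b3 = torch.randn(16, 8, 3, 3), torch.randn(16)
w1, b1 = torch.randn(16, 8, 1, 1), torch.randn(16)
w, b = fuse_branches(w3, b3, w1, b1)

x = torch.randn(1, 8, 32, 32)
y_branches = F.conv2d(x, w3, b3, padding=1) + F.conv2d(x, w1, b1)
y_fused = F.conv2d(x, w, b, padding=1)
assert torch.allclose(y_branches, y_fused, atol=1e-5)   # identical outputs
```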
arXiv Detail & Related papers (2022-06-08T17:55:11Z)
- Separable Self-attention for Mobile Vision Transformers [34.32399598443582]
This paper introduces a separable self-attention method with linear complexity, i.e. $O(k)$.
The improved model, MobileViTv2, is state-of-the-art on several mobile vision tasks, including ImageNet object classification and MS-COCO object detection.
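In the spirit of that design, the sketch below replaces the $N \times N$ attention map with a single learned context vector, giving cost linear in the token count; details may differ from MobileViTv2's exact block.

```python
import torch
import torch.nn as nn

class SeparableSelfAttention(nn.Module):
    """Attention with a single learned context vector instead of an NxN
    attention map, so the cost is linear in the number of tokens."""
    def __init__(self, dim: int):
        super().__init__()
        self.to_scores = nn.Linear(dim, 1)      # one context score per token
        self.to_key = nn.Linear(dim, dim)
        self.to_value = nn.Linear(dim, dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, x):                       # (B, N, dim)
        scores = self.to_scores(x).softmax(dim=1)                     # (B, N, 1)
        context = (scores * self.to_key(x)).sum(dim=1, keepdim=True)  # (B, 1, dim)
        return self.out(torch.relu(self.to_value(x)) * context)       # broadcast

y = SeparableSelfAttention(64)(torch.randn(2, 196, 64))  # 196 tokens, O(N) cost
```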
arXiv Detail & Related papers (2022-06-06T15:31:35Z)
- TopFormer: Token Pyramid Transformer for Mobile Semantic Segmentation [111.8342799044698]
We present a mobile-friendly architecture named Token Pyramid Vision Transformer (TopFormer).
The proposed TopFormer takes tokens from various scales as input to produce scale-aware semantic features, which are then injected into the corresponding tokens to augment the representation.
On the ADE20K dataset, TopFormer achieves 5% higher accuracy in mIoU than MobileNetV3 with lower latency on an ARM-based mobile device.
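A rough sketch of such a semantics-injection step in PyTorch; the gating and projection layers are assumptions, not TopFormer's exact module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticsInjection(nn.Module):
    """Rough sketch: upsample the global semantic features, use them to
    gate the high-resolution local tokens, and add them back residually."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Conv2d(dim, dim, 1)
        self.proj = nn.Conv2d(dim, dim, 1)

    def forward(self, local, global_sem):       # (B, C, H, W), (B, C, h, w)
        g = F.interpolate(global_sem, size=local.shape[-2:],
                          mode="bilinear", align_corners=False)
        return local * torch.sigmoid(self.gate(g)) + self.proj(g)

out = SemanticsInjection(64)(torch.randn(1, 64, 64, 64),   # local tokens
                             torch.randn(1, 64, 8, 8))     # global semantics
```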
arXiv Detail & Related papers (2022-04-12T04:51:42Z)
- MobileDets: Searching for Object Detection Architectures for Mobile Accelerators [61.30355783955777]
Inverted bottleneck layers have been the predominant building blocks in state-of-the-art object detection models on mobile devices.
However, regular convolutions are a potent component for boosting the latency-accuracy trade-off of object detection on accelerators.
We obtain a family of object detection models, MobileDets, that achieve state-of-the-art results across mobile accelerators.
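The contrast can be sketched as two interchangeable blocks: the usual depthwise inverted bottleneck versus an expansion done with a dense 3x3 convolution, which a MobileDets-style search can prefer on accelerators. The block definitions below are illustrative, not the searched architectures.

```python
import torch
import torch.nn as nn

def inverted_bottleneck(c_in, c_out, expand=6):
    """The usual mobile block: 1x1 expand, 3x3 depthwise, 1x1 project."""
    mid = c_in * expand
    return nn.Sequential(
        nn.Conv2d(c_in, mid, 1), nn.ReLU6(),
        nn.Conv2d(mid, mid, 3, padding=1, groups=mid), nn.ReLU6(),  # depthwise
        nn.Conv2d(mid, c_out, 1),
    )

def fused_conv_block(c_in, c_out, expand=6):
    """Expansion via a dense 3x3 conv instead of 1x1 + depthwise: more
    FLOPs, but often lower latency on DSP/EdgeTPU-style accelerators."""
    mid = c_in * expand
    return nn.Sequential(
        nn.Conv2d(c_in, mid, 3, padding=1), nn.ReLU6(),
        nn.Conv2d(mid, c_out, 1),
    )

x = torch.randn(1, 32, 56, 56)
y_ibn = inverted_bottleneck(32, 32)(x)      # same I/O shape either way
y_fused = fused_conv_block(32, 32)(x)
```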
arXiv Detail & Related papers (2020-04-30T00:21:30Z)