Mask Propagation for Efficient Video Semantic Segmentation
- URL: http://arxiv.org/abs/2310.18954v1
- Date: Sun, 29 Oct 2023 09:55:28 GMT
- Title: Mask Propagation for Efficient Video Semantic Segmentation
- Authors: Yuetian Weng, Mingfei Han, Haoyu He, Mingjie Li, Lina Yao, Xiaojun
Chang, Bohan Zhuang
- Abstract summary: Video Semantic Segmentation (VSS) involves assigning a semantic label to each pixel in a video sequence.
We propose an efficient mask propagation framework for VSS, called MPVSS.
Our framework reduces FLOPs by up to 4x compared to the per-frame Mask2Former baseline, with only up to 2% mIoU degradation on the Cityscapes validation set.
- Score: 63.09523058489429
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video Semantic Segmentation (VSS) involves assigning a semantic label to each
pixel in a video sequence. Prior work in this field has demonstrated promising
results by extending image semantic segmentation models to exploit temporal
relationships across video frames; however, these approaches often incur
significant computational costs. In this paper, we propose an efficient mask
propagation framework for VSS, called MPVSS. Our approach first employs a
strong query-based image segmentor on sparse key frames to generate accurate
binary masks and class predictions. We then design a flow estimation module
utilizing the learned queries to generate a set of segment-aware flow maps,
each associated with a mask prediction from the key frame. Finally, the
mask-flow pairs are warped to serve as the mask predictions for the non-key
frames. By reusing predictions from key frames, we circumvent the need to
process a large volume of video frames individually with resource-intensive
segmentors, alleviating temporal redundancy and significantly reducing
computational costs. Extensive experiments on VSPW and Cityscapes demonstrate
that our mask propagation framework achieves SOTA accuracy and efficiency
trade-offs. For instance, our best model with Swin-L backbone outperforms the
SOTA MRCFA using MiT-B5 by 4.0% mIoU, requiring only 26% FLOPs on the VSPW
dataset. Moreover, our framework reduces up to 4x FLOPs compared to the
per-frame Mask2Former baseline with only up to 2% mIoU degradation on the
Cityscapes validation set. Code is available at
https://github.com/ziplab/MPVSS.
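To make the propagation step concrete, the sketch below shows how key-frame mask logits could be warped to a non-key frame using per-segment flow maps, as described in the abstract. It is a minimal, hypothetical sketch in PyTorch (function and tensor names are assumptions, not the released MPVSS code), using backward warping via grid_sample.

```python
# Minimal sketch of flow-based mask propagation for non-key frames.
# Hypothetical names and shapes; not the released MPVSS implementation.
import torch
import torch.nn.functional as F


def warp_masks(key_masks: torch.Tensor, flows: torch.Tensor) -> torch.Tensor:
    """Warp key-frame mask logits to a non-key frame via per-segment flow.

    key_masks: (N, H, W) mask logits from the key frame, one per segment query.
    flows:     (N, 2, H, W) segment-aware flow maps in pixels (x, y offsets),
               pointing from the non-key frame back to the key frame.
    Returns:   (N, H, W) warped mask logits for the non-key frame.
    """
    _, h, w = key_masks.shape
    # Base sampling grid in pixel coordinates.
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=torch.float32),
        torch.arange(w, dtype=torch.float32),
        indexing="ij",
    )
    base = torch.stack((xs, ys), dim=0)            # (2, H, W)
    coords = base.unsqueeze(0) + flows             # (N, 2, H, W)
    # Normalise pixel coordinates to [-1, 1] as required by grid_sample.
    gx = 2.0 * coords[:, 0] / (w - 1) - 1.0
    gy = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)           # (N, H, W, 2)
    warped = F.grid_sample(
        key_masks.unsqueeze(1), grid, mode="bilinear", align_corners=True
    )
    return warped.squeeze(1)


# Toy usage: 5 segment masks on a 64x64 frame, each shifted by 2 pixels.
masks = torch.randn(5, 64, 64)
flow = torch.full((5, 2, 64, 64), 2.0)
propagated = warp_masks(masks, flow)
print(propagated.shape)  # torch.Size([5, 64, 64])
```

In the full framework, the warped masks would be paired with the key frame's class predictions to form the non-key frame's segmentation, so only sparse key frames pass through the heavy query-based segmentor.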
Related papers
- Bridge the Points: Graph-based Few-shot Segment Anything Semantically [79.1519244940518]
Recent advancements in pre-training techniques have enhanced the capabilities of vision foundation models.
Recent studies extend SAM to Few-shot Semantic Segmentation (FSS).
We propose a simple yet effective approach based on graph analysis.
arXiv Detail & Related papers (2024-10-09T15:02:28Z) - DFormer: Diffusion-guided Transformer for Universal Image Segmentation [86.73405604947459]
The proposed DFormer views universal image segmentation task as a denoising process using a diffusion model.
At inference, our DFormer directly predicts the masks and corresponding categories from a set of randomly-generated masks.
Our DFormer outperforms the recent diffusion-based panoptic segmentation method Pix2Seq-D with a gain of 3.6% on MS COCO val 2017 set.
arXiv Detail & Related papers (2023-06-06T06:33:32Z) - One-Shot Video Inpainting [5.7120338754738835]
We propose a unified pipeline for one-shot video inpainting (OSVI)
By jointly learning mask prediction and video completion in an end-to-end manner, the results can be optimal for the entire task.
Our method is more reliable because the predicted masks can be used as the network's internal guidance.
arXiv Detail & Related papers (2023-02-28T07:30:36Z) - Masked Contrastive Pre-Training for Efficient Video-Text Retrieval [37.05164804180039]
We present a simple yet effective end-to-end Video-language Pre-training (VidLP) framework, Masked Contrastive Video-language Pretraining (MAC)
Our MAC aims to reduce video representation's spatial and temporal redundancy in the VidLP model.
Coupling these designs enables efficient end-to-end pre-training: it reduces FLOPs by 60%, accelerates pre-training by 3x, and improves performance.
arXiv Detail & Related papers (2022-12-02T05:44:23Z) - ConvMAE: Masked Convolution Meets Masked Autoencoders [65.15953258300958]
Masked auto-encoding for feature pretraining and multi-scale hybrid convolution-transformer architectures can further unleash the potential of ViT.
Our ConvMAE framework demonstrates that multi-scale hybrid convolution-transformer architectures can learn more discriminative representations via the masked auto-encoding scheme.
Based on our pretrained ConvMAE models, ConvMAE-Base improves ImageNet-1K finetuning accuracy by 1.4% compared with MAE-Base.
arXiv Detail & Related papers (2022-05-08T15:12:19Z) - Efficient Video Object Segmentation with Compressed Video [36.192735485675286]
We propose an efficient framework for semi-supervised video object segmentation by exploiting the temporal redundancy of the video.
Our method performs inference on selected key frames and makes predictions for other frames via propagation based on motion vectors and residuals from the compressed video bitstream.
Using STM with top-k filtering as our base model, we achieve highly competitive results on DAVIS16 and YouTube-VOS, with substantial speedups of up to 4.9x and little loss in accuracy.
arXiv Detail & Related papers (2021-07-26T12:57:04Z) - Spatiotemporal Graph Neural Network based Mask Reconstruction for Video
Object Segmentation [70.97625552643493]
This paper addresses the task of segmenting class-agnostic objects in a semi-supervised setting.
We propose a novel graph neural network (TG-Net) which captures the local contexts by utilizing all proposals.
arXiv Detail & Related papers (2020-12-10T07:57:44Z) - SipMask: Spatial Information Preservation for Fast Image and Video
Instance Segmentation [149.242230059447]
We propose a fast single-stage instance segmentation method called SipMask.
It preserves instance-specific spatial information by separating the mask prediction of an instance into different sub-regions of a detected bounding-box.
In terms of real-time capabilities, SipMask outperforms YOLACT with an absolute gain of 3.0% (mask AP) under similar settings.
arXiv Detail & Related papers (2020-07-29T12:21:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.