Spatial Feature Calibration and Temporal Fusion for Effective One-stage
Video Instance Segmentation
- URL: http://arxiv.org/abs/2104.05606v1
- Date: Tue, 6 Apr 2021 09:26:58 GMT
- Title: Spatial Feature Calibration and Temporal Fusion for Effective One-stage
Video Instance Segmentation
- Authors: Minghan Li, Shuai Li, Lida Li and Lei Zhang
- Abstract summary: We propose a one-stage video instance segmentation framework by spatial calibration and temporal fusion, namely STMask.
Experiments on the YouTube-VIS valid set show that the proposed STMask with a ResNet-50/-101 backbone obtains 33.5%/36.8% mask AP, while achieving 28.6/23.4 FPS on video instance segmentation.
- Score: 16.692219644392253
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Modern one-stage video instance segmentation networks suffer from two
limitations. First, convolutional features are neither aligned with anchor
boxes nor with ground-truth bounding boxes, reducing the mask sensitivity to
spatial location. Second, a video is directly divided into individual frames
for frame-level instance segmentation, ignoring the temporal correlation
between adjacent frames. To address these issues, we propose a simple yet
effective one-stage video instance segmentation framework by spatial
calibration and temporal fusion, namely STMask. To ensure spatial feature
calibration with ground-truth bounding boxes, we first predict regressed
bounding boxes around ground-truth bounding boxes, and extract features from
them for frame-level instance segmentation. To further explore temporal
correlation among video frames, we aggregate a temporal fusion module to infer
instance masks from each frame to its adjacent frames, which helps our
framework handle challenging videos with motion blur, partial occlusion
and unusual object-to-camera poses. Experiments on the YouTube-VIS valid set
show that the proposed STMask with a ResNet-50/-101 backbone obtains
33.5%/36.8% mask AP, while achieving 28.6/23.4 FPS on video instance
segmentation. The code is released online at https://github.com/MinghanLi/STMask.
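The adjacent-frame aggregation idea behind the temporal fusion module can be illustrated with a minimal pure-Python sketch. This is an assumption-laden toy, not STMask's actual module: the real framework fuses convolutional feature maps inside the network, while here each frame is reduced to a flat feature vector and the fusion is a simple weighted average with illustrative weights `w_self` and `w_adj` (hypothetical names, not from the paper). Boundary frames reuse the only neighbour they have.

```python
def temporal_fuse(frame_feats, w_self=0.6, w_adj=0.2):
    """Toy adjacent-frame fusion: blend each frame's feature vector
    with its previous and next frames' vectors.

    frame_feats: list of T frames, each a list of floats.
    Returns a list of T fused feature vectors of the same length.
    """
    T = len(frame_feats)
    fused = []
    for t in range(T):
        # Clamp indices so the first/last frame reuse themselves as a neighbour.
        prev_f = frame_feats[max(t - 1, 0)]
        next_f = frame_feats[min(t + 1, T - 1)]
        fused.append([
            w_self * x + w_adj * (p + n)
            for x, p, n in zip(frame_feats[t], prev_f, next_f)
        ])
    return fused
```

With the default weights, a three-frame sequence `[[0.0], [1.0], [2.0]]` fuses to roughly `[[0.2], [1.0], [1.8]]`: each frame is pulled toward its neighbours, which is the intuition for why such fusion helps under motion blur or partial occlusion, where a single frame's evidence is unreliable.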
Related papers
- You Can Ground Earlier than See: An Effective and Efficient Pipeline for
Temporal Sentence Grounding in Compressed Videos [56.676761067861236]
Given an untrimmed video, temporal sentence grounding aims to locate a target moment semantically according to a sentence query.
Previous respectable works have made decent success, but they only focus on high-level visual features extracted from decoded frames.
We propose a new setting, compressed-domain TSG, which directly utilizes compressed videos rather than fully-decompressed frames as the visual input.
arXiv Detail & Related papers (2023-03-14T12:53:27Z) - Consistent Video Instance Segmentation with Inter-Frame Recurrent
Attention [23.72098615213679]
Video instance segmentation aims at predicting object segmentation masks for each frame, as well as associating the instances across multiple frames.
Recent end-to-end video instance segmentation methods are capable of performing object segmentation and instance association together in a direct parallel sequence decoding/prediction framework.
We propose a consistent end-to-end video instance segmentation framework with Inter-Frame Recurrent Attention to model both the temporal instance consistency for adjacent frames and the global temporal context.
arXiv Detail & Related papers (2022-06-14T17:22:55Z) - Tag-Based Attention Guided Bottom-Up Approach for Video Instance
Segmentation [83.13610762450703]
Video instance segmentation is a fundamental computer vision task that deals with segmenting and tracking object instances across a video sequence.
We introduce a simple end-to-end trainable bottom-up approach to achieve instance mask predictions at pixel-level granularity, instead of the typical region-proposal-based approach.
Our method provides competitive results on the YouTube-VIS and DAVIS-19 datasets, and has minimal run-time compared to other contemporary state-of-the-art methods.
arXiv Detail & Related papers (2022-04-22T15:32:46Z) - One-stage Video Instance Segmentation: From Frame-in Frame-out to
Clip-in Clip-out [15.082477136581153]
We propose a clip-in clip-out (CiCo) framework to exploit temporal information in video clips.
The CiCo strategy is free of inter-frame alignment, and can be easily embedded into existing FiFo-based VIS approaches.
Two new one-stage VIS models achieve 37.7.3%, 35.2/35.4% and 17.2/1% mask AP.
arXiv Detail & Related papers (2022-03-12T12:23:21Z) - Hybrid Tracker with Pixel and Instance for Video Panoptic Segmentation [50.62685357414904]
Video Panoptic Segmentation (VPS) aims to generate coherent panoptic segmentation and track the identities of all pixels across video frames.
We present HybridTracker, a lightweight and joint tracking model attempting to eliminate the limitations of the single tracker.
Comprehensive experiments show that HybridTracker achieves superior performance to state-of-the-art methods on the Cityscapes-VPS and VIPER datasets.
arXiv Detail & Related papers (2022-03-02T16:21:55Z) - Object Propagation via Inter-Frame Attentions for Temporally Stable
Video Instance Segmentation [51.68840525174265]
Video instance segmentation aims to detect, segment, and track objects in a video.
Current approaches extend image-level segmentation algorithms to the temporal domain.
We propose a video instance segmentation method that alleviates the problem due to missing detections.
arXiv Detail & Related papers (2021-11-15T04:15:57Z) - Video Instance Segmentation by Instance Flow Assembly [23.001856276175506]
Bottom-up methods dealing with box-free features can offer accurate spatial correlations across frames.
We propose our framework equipped with a temporal context fusion module to better encode inter-frame correlations.
Experiments demonstrate that the proposed method outperforms state-of-the-art online methods (taking image-level input) on the challenging YouTube-VIS dataset.
arXiv Detail & Related papers (2021-10-20T14:49:28Z) - Contextual Guided Segmentation Framework for Semi-supervised Video
Instance Segmentation [20.174393465900156]
We propose Contextual Guided (CGS) framework for video instance segmentation in three passes.
In the first pass, i.e., preview segmentation, we propose Instance Re-Identification Flow to estimate main properties of each instance.
In the second pass, i.e., contextual segmentation, we introduce multiple contextual segmentation schemes.
Experiments conducted on the DAVIS Test-Challenge dataset demonstrate the effectiveness of our proposed framework.
arXiv Detail & Related papers (2021-06-07T04:16:50Z) - Weakly Supervised Instance Segmentation for Videos with Temporal Mask
Consistency [28.352140544936198]
Weakly supervised instance segmentation reduces the cost of annotations required to train models.
We show that these issues can be better addressed by training with weakly labeled videos instead of images.
We are the first to explore the use of these video signals to tackle weakly supervised instance segmentation.
arXiv Detail & Related papers (2021-03-23T23:20:46Z) - Generating Masks from Boxes by Mining Spatio-Temporal Consistencies in
Videos [159.02703673838639]
We introduce a method for generating segmentation masks from per-frame bounding box annotations in videos.
We use our resulting accurate masks for weakly supervised training of video object segmentation (VOS) networks.
The additional data provides substantially better generalization performance leading to state-of-the-art results in both the VOS and more challenging tracking domain.
arXiv Detail & Related papers (2021-01-06T18:56:24Z) - SipMask: Spatial Information Preservation for Fast Image and Video
Instance Segmentation [149.242230059447]
We propose a fast single-stage instance segmentation method called SipMask.
It preserves instance-specific spatial information by separating mask prediction of an instance to different sub-regions of a detected bounding-box.
In terms of real-time capabilities, SipMask outperforms YOLACT with an absolute gain of 3.0% (mask AP) under similar settings.
arXiv Detail & Related papers (2020-07-29T12:21:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.