One-stage Video Instance Segmentation: From Frame-in Frame-out to
Clip-in Clip-out
- URL: http://arxiv.org/abs/2203.06421v1
- Date: Sat, 12 Mar 2022 12:23:21 GMT
- Title: One-stage Video Instance Segmentation: From Frame-in Frame-out to
Clip-in Clip-out
- Authors: Minghan Li and Lei Zhang
- Abstract summary: We propose a clip-in clip-out (CiCo) framework to exploit temporal information in video clips.
The CiCo strategy is free of inter-frame alignment, and can be easily embedded into existing FiFo based VIS approaches.
The two new one-stage VIS models achieve 37.1/37.3%, 35.2/35.4% and 17.2/18.0% mask AP.
- Score: 15.082477136581153
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many video instance segmentation (VIS) methods partition a video sequence
into individual frames to detect and segment objects frame by frame. However,
such a frame-in frame-out (FiFo) pipeline is ineffective at exploiting the
temporal information. Based on the fact that adjacent frames in a short clip
are highly coherent in content, we propose to extend the one-stage FiFo
framework to a clip-in clip-out (CiCo) one, which performs VIS clip by clip.
Specifically, we stack FPN features of all frames in a short video clip to
build a spatio-temporal feature cube, and replace the 2D conv layers in the
prediction heads and the mask branch with 3D conv layers, forming clip-level
prediction heads (CPH) and clip-level mask heads (CMH). Then the clip-level
masks of an instance can be generated by feeding its box-level predictions from
CPH and clip-level features from CMH into a small fully convolutional network.
A clip-level segmentation loss is proposed to ensure that the generated
instance masks are temporally coherent in the clip. The proposed CiCo strategy
is free of inter-frame alignment, and can be easily embedded into existing FiFo
based VIS approaches. To validate the generality and effectiveness of our CiCo
strategy, we apply it to two representative FiFo methods, Yolact
\cite{bolya2019yolact} and CondInst \cite{tian2020conditional}, resulting in
two new one-stage VIS models, namely CiCo-Yolact and CiCo-CondInst, which
achieve 37.1/37.3\%, 35.2/35.4\% and 17.2/18.0\% mask AP using the ResNet50
backbone, and 41.8/41.4\%, 38.0/38.9\% and 18.0/18.2\% mask AP using the Swin
Transformer tiny backbone on YouTube-VIS 2019, 2021 and OVIS valid sets,
respectively, recording new state-of-the-arts. Code and video demos of CiCo can
be found at \url{https://github.com/MinghanLi/CiCo}.
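A minimal sketch of the idea described in the abstract, assuming a PyTorch implementation: per-frame FPN features of one pyramid level are stacked into a spatio-temporal feature cube, and the 2D conv layers of the prediction heads and mask branch are replaced by 3D conv layers, giving a clip-level prediction head (CPH) and a clip-level mask head (CMH). The channel sizes, numbers of conv layers, and clip length below are illustrative assumptions, not the authors' configuration.

# Hypothetical sketch (not the authors' code): stack per-frame FPN features
# into a (B, C, T, H, W) spatio-temporal cube and apply 3D-conv clip-level heads.
import torch
import torch.nn as nn


class ClipPredictionHead(nn.Module):
    """Clip-level prediction head (CPH): 3D convs over (T, H, W) instead of per-frame 2D convs."""

    def __init__(self, in_channels=256, num_classes=40, num_convs=2):
        super().__init__()
        tower = []
        for _ in range(num_convs):
            tower += [nn.Conv3d(in_channels, in_channels, kernel_size=3, padding=1),
                      nn.ReLU(inplace=True)]
        self.tower = nn.Sequential(*tower)
        self.cls_logits = nn.Conv3d(in_channels, num_classes, kernel_size=3, padding=1)
        self.box_reg = nn.Conv3d(in_channels, 4, kernel_size=3, padding=1)

    def forward(self, cube):                  # cube: (B, C, T, H, W)
        feats = self.tower(cube)
        return self.cls_logits(feats), self.box_reg(feats)


class ClipMaskHead(nn.Module):
    """Clip-level mask head (CMH): produces clip-level mask features for the whole clip."""

    def __init__(self, in_channels=256, mask_dim=32, num_convs=3):
        super().__init__()
        layers = []
        for _ in range(num_convs):
            layers += [nn.Conv3d(in_channels, in_channels, kernel_size=3, padding=1),
                       nn.ReLU(inplace=True)]
        layers.append(nn.Conv3d(in_channels, mask_dim, kernel_size=1))
        self.net = nn.Sequential(*layers)

    def forward(self, cube):                  # cube: (B, C, T, H, W)
        return self.net(cube)                 # (B, mask_dim, T, H, W)


def build_feature_cube(fpn_feats_per_frame):
    """Stack T per-frame FPN feature maps (each (B, C, H, W)) into a (B, C, T, H, W) cube."""
    return torch.stack(fpn_feats_per_frame, dim=2)


if __name__ == "__main__":
    T = 4                                     # assumed clip length
    frames = [torch.randn(1, 256, 24, 40) for _ in range(T)]
    cube = build_feature_cube(frames)         # (1, 256, 4, 24, 40)
    cls_logits, box_reg = ClipPredictionHead()(cube)
    mask_feats = ClipMaskHead()(cube)
    print(cls_logits.shape, box_reg.shape, mask_feats.shape)

In the full CiCo method, the box-level predictions from the CPH and the clip-level features from the CMH are further fed into a small fully convolutional network to produce per-instance clip masks, supervised by the clip-level segmentation loss; those components are omitted from this sketch.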
Related papers
- CLIP-VIS: Adapting CLIP for Open-Vocabulary Video Instance Segmentation [44.450243388665776]
We propose a simple encoder-decoder network, called CLIP-VIS, to adapt CLIP for open-vocabulary video instance segmentation.
Our CLIP-VIS adopts frozen CLIP and introduces three modules, including class-agnostic mask generation, temporal topK-enhanced matching, and weighted open-vocabulary classification.
arXiv Detail & Related papers (2024-03-19T05:27:04Z) - Mask Propagation for Efficient Video Semantic Segmentation [63.09523058489429]
Video Semantic Segmentation (VSS) involves assigning a semantic label to each pixel in a video sequence.
We propose an efficient mask propagation framework for VSS, called MPVSS.
Our framework reduces up to 4x FLOPs compared to the per-frame Mask2Former baseline with only up to 2% mIoU degradation on the Cityscapes validation set.
arXiv Detail & Related papers (2023-10-29T09:55:28Z) - Multi-grained Temporal Prototype Learning for Few-shot Video Object
Segmentation [156.4142424784322]
Few-Shot Video Object Segmentation (FSVOS) aims to segment objects in a query video with the same category defined by a few annotated support images.
We propose to leverage multi-grained temporal guidance information for handling the temporal correlation nature of video data.
Our proposed video IPMT model significantly outperforms previous models on two benchmark datasets.
arXiv Detail & Related papers (2023-09-20T09:16:34Z) - Video-kMaX: A Simple Unified Approach for Online and Near-Online Video
Panoptic Segmentation [104.27219170531059]
Video Panoptic Segmentation (VPS) aims to achieve comprehensive pixel-level scene understanding by segmenting all pixels and associating objects in a video.
Current solutions can be categorized into online and near-online approaches.
We propose a unified approach for online and near-online VPS.
arXiv Detail & Related papers (2023-04-10T16:17:25Z) - Per-Clip Video Object Segmentation [110.08925274049409]
Recently, memory-based approaches have shown promising results on semi-supervised video object segmentation.
We treat video object segmentation as clip-wise mask propagation.
We propose a new method tailored for the per-clip inference.
arXiv Detail & Related papers (2022-08-03T09:02:29Z) - OCSampler: Compressing Videos to One Clip with Single-step Sampling [82.0417131211353]
We propose a framework named OCSampler to explore a compact yet effective video representation with one short clip.
Our basic motivation is that efficient video recognition lies in processing a whole sequence at once rather than picking up frames sequentially.
arXiv Detail & Related papers (2022-01-12T09:50:38Z) - Spatial Feature Calibration and Temporal Fusion for Effective One-stage
Video Instance Segmentation [16.692219644392253]
We propose a one-stage video instance segmentation framework by spatial calibration and temporal fusion, namely STMask.
Experiments on the YouTube-VIS valid set show that the proposed STMask with ResNet-50/-101 backbone obtains 33.5%/36.8% mask AP, while achieving 28.6/23.4 FPS on video instance segmentation.
arXiv Detail & Related papers (2021-04-06T09:26:58Z) - Video Instance Segmentation with a Propose-Reduce Paradigm [68.59137660342326]
Video instance segmentation (VIS) aims to segment and associate all instances of predefined classes for each frame in videos.
Prior methods usually obtain segmentation for a frame or clip first, and then merge the incomplete results by tracking or matching.
We propose a new paradigm -- Propose-Reduce, to generate complete sequences for input videos by a single step.
arXiv Detail & Related papers (2021-03-25T10:58:36Z) - GCF-Net: Gated Clip Fusion Network for Video Action Recognition [11.945392734711056]
We introduce the Gated Clip Fusion Network (GCF-Net) for video action recognition.
GCF-Net explicitly models the inter-dependencies between video clips to strengthen the receptive field of local clip descriptors.
On a large benchmark dataset (Kinetics-600), the proposed GCF-Net elevates the accuracy of existing action classifiers by 11.49%.
arXiv Detail & Related papers (2021-02-02T03:51:55Z)