STC: Spatio-Temporal Contrastive Learning for Video Instance
Segmentation
- URL: http://arxiv.org/abs/2202.03747v1
- Date: Tue, 8 Feb 2022 09:34:26 GMT
- Title: STC: Spatio-Temporal Contrastive Learning for Video Instance
Segmentation
- Authors: Zhengkai Jiang, Zhangxuan Gu, Jinlong Peng, Hang Zhou, Liang Liu,
Yabiao Wang, Ying Tai, Chengjie Wang, Liqing Zhang
- Abstract summary: Video Instance Segmentation (VIS) is a task that simultaneously requires classification, segmentation, and instance association in a video.
Recent VIS approaches rely on sophisticated pipelines to achieve this goal, including RoI-related operations or 3D convolutions.
We present a simple and efficient single-stage VIS framework based on the instance segmentation method CondInst.
- Score: 47.28515170195206
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video Instance Segmentation (VIS) is a task that simultaneously requires
classification, segmentation, and instance association in a video. Recent VIS
approaches rely on sophisticated pipelines to achieve this goal, including
RoI-related operations or 3D convolutions. In contrast, we present a simple and
efficient single-stage VIS framework based on the instance segmentation method
CondInst by adding an extra tracking head. To improve instance association
accuracy, a novel bi-directional spatio-temporal contrastive learning strategy
for tracking embedding across frames is proposed. Moreover, an instance-wise
temporal consistency scheme is utilized to produce temporally coherent results.
Experiments conducted on the YouTube-VIS-2019, YouTube-VIS-2021, and OVIS-2021
datasets validate the effectiveness and efficiency of the proposed method. We
hope the proposed framework can serve as a simple and strong alternative for
many other instance-level video association tasks. Code will be made available.
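
To make the contrastive idea concrete, below is a minimal sketch of a bi-directional contrastive loss over tracking embeddings of matched instances in two adjacent frames, in the spirit of the abstract. It is illustrative only: the function name, the InfoNCE-style formulation, the temperature value, and the assumption that instances are already matched row-by-row are ours, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def bidirectional_contrastive_loss(emb_a: torch.Tensor,
                                   emb_b: torch.Tensor,
                                   temperature: float = 0.1) -> torch.Tensor:
    """Hypothetical bi-directional contrastive loss for tracking embeddings.

    emb_a, emb_b: (N, D) embeddings of the same N instances in two adjacent
    frames, where row i of each tensor belongs to instance i (an assumption
    for this sketch; the paper's actual pairing scheme may differ).
    """
    emb_a = F.normalize(emb_a, dim=1)
    emb_b = F.normalize(emb_b, dim=1)
    # (N, N) cosine-similarity matrix; the diagonal holds positive pairs.
    logits = emb_a @ emb_b.t() / temperature
    targets = torch.arange(emb_a.size(0), device=emb_a.device)
    # Contrast in both temporal directions: frame a -> b and frame b -> a.
    loss_ab = F.cross_entropy(logits, targets)
    loss_ba = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_ab + loss_ba)

# Example: 5 instances with 256-dim tracking embeddings in two frames.
loss = bidirectional_contrastive_loss(torch.randn(5, 256), torch.randn(5, 256))
```

The symmetric two-term sum is one plausible reading of "bi-directional": each frame's embeddings must retrieve their counterparts in the other frame, not just one way.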
Related papers
- DVIS++: Improved Decoupled Framework for Universal Video Segmentation [30.703276476607545]
By integrating CLIP with DVIS++, we present OV-DVIS++, the first open-vocabulary universal video segmentation framework.
arXiv Detail & Related papers (2023-12-20T03:01:33Z) - DVIS: Decoupled Video Instance Segmentation Framework [15.571072365208872]
Video instance segmentation (VIS) is a critical task with diverse applications, including autonomous driving and video editing.
Existing methods often underperform on complex and long videos in real world, primarily due to two factors.
We propose a decoupling strategy for VIS by dividing it into three independent sub-tasks: segmentation, tracking, and refinement.
arXiv Detail & Related papers (2023-06-06T05:24:15Z) - InsPro: Propagating Instance Query and Proposal for Online Video
Instance Segmentation [41.85216306978024]
Video instance segmentation (VIS) aims at segmenting and tracking objects in videos.
Prior methods generate frame-level or clip-level object instances first and then associate them by either additional tracking heads or complex instance matching algorithms.
In this paper, we design a simple, fast and yet effective query-based framework for online VIS.
arXiv Detail & Related papers (2023-01-05T02:41:20Z) - Improving Video Instance Segmentation via Temporal Pyramid Routing [61.10753640148878]
Video Instance Segmentation (VIS) is a new and inherently multi-task problem, which aims to detect, segment, and track each instance in a video sequence.
We propose a Temporal Pyramid Routing (TPR) strategy to conditionally align and conduct pixel-level aggregation from a feature pyramid pair of two adjacent frames.
Our approach is a plug-and-play module and can be easily applied to existing instance segmentation methods.
arXiv Detail & Related papers (2021-07-28T03:57:12Z) - Crossover Learning for Fast Online Video Instance Segmentation [53.5613957875507]
We present a novel crossover learning scheme that uses the instance feature in the current frame to pixel-wisely localize the same instance in other frames.
To our knowledge, the resulting method, CrossVIS, achieves state-of-the-art performance among all online VIS methods and shows a decent trade-off between latency and accuracy.
arXiv Detail & Related papers (2021-04-13T06:47:40Z) - Video Instance Segmentation with a Propose-Reduce Paradigm [68.59137660342326]
Video instance segmentation (VIS) aims to segment and associate all instances of predefined classes for each frame in videos.
Prior methods usually obtain segmentation for a frame or clip first, and then merge the incomplete results by tracking or matching.
We propose a new paradigm -- Propose-Reduce, to generate complete sequences for input videos by a single step.
arXiv Detail & Related papers (2021-03-25T10:58:36Z) - End-to-End Video Instance Segmentation with Transformers [84.17794705045333]
Video instance segmentation (VIS) is the task that requires simultaneously classifying, segmenting and tracking object instances of interest in video.
Here, we propose a new video instance segmentation framework built upon Transformers, termed VisTR, which views the VIS task as a direct end-to-end parallel sequence decoding/prediction problem.
For the first time, we demonstrate a much simpler and faster video instance segmentation framework built upon Transformers, achieving competitive accuracy.
arXiv Detail & Related papers (2020-11-30T02:03:50Z) - Fast Video Object Segmentation With Temporal Aggregation Network and
Dynamic Template Matching [67.02962970820505]
We introduce "tracking-by-detection" into Video Object (VOS)
We propose a new temporal aggregation network and a novel dynamic time-evolving template matching mechanism to achieve significantly improved performance.
We achieve new state-of-the-art performance on the DAVIS benchmark in both speed and accuracy, without complicated bells and whistles, running at 0.14 seconds per frame with a J&F measure of 75.9%.
arXiv Detail & Related papers (2020-07-11T05:44:16Z)