DVIS: Decoupled Video Instance Segmentation Framework
- URL: http://arxiv.org/abs/2306.03413v3
- Date: Fri, 14 Jul 2023 08:46:08 GMT
- Title: DVIS: Decoupled Video Instance Segmentation Framework
- Authors: Tao Zhang, Xingye Tian, Yu Wu, Shunping Ji, Xuebo Wang, Yuan Zhang,
Pengfei Wan
- Abstract summary: Video instance segmentation (VIS) is a critical task with diverse applications, including autonomous driving and video editing.
Existing methods often underperform on complex and long videos in real world, primarily due to two factors.
We propose a decoupling strategy for VIS by dividing it into three independent sub-tasks: segmentation, tracking, and refinement.
- Score: 15.571072365208872
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video instance segmentation (VIS) is a critical task with diverse
applications, including autonomous driving and video editing. Existing methods
often underperform on complex and long videos in real world, primarily due to
two factors. Firstly, offline methods are limited by the tightly-coupled
modeling paradigm, which treats all frames equally and disregards the
interdependencies between adjacent frames. Consequently, this leads to the
introduction of excessive noise during long-term temporal alignment. Secondly,
online methods suffer from inadequate utilization of temporal information. To
tackle these challenges, we propose a decoupling strategy for VIS by dividing
it into three independent sub-tasks: segmentation, tracking, and refinement.
The efficacy of the decoupling strategy relies on two crucial elements: 1)
attaining precise long-term alignment outcomes via frame-by-frame association
during tracking, and 2) the effective utilization of temporal information
predicated on the aforementioned accurate alignment outcomes during refinement.
We introduce a novel referring tracker and temporal refiner to construct the
\textbf{D}ecoupled \textbf{VIS} framework (\textbf{DVIS}). DVIS achieves new
SOTA performance in both VIS and VPS, surpassing the current SOTA methods by
7.3 AP and 9.6 VPQ on the OVIS and VIPSeg datasets, which are the most
challenging and realistic benchmarks. Moreover, thanks to the decoupling
strategy, the referring tracker and temporal refiner are super light-weight
(only 1.69\% of the segmenter FLOPs), allowing for efficient training and
inference on a single GPU with 11G memory. The code is available at
\href{https://github.com/zhang-tao-whu/DVIS}{https://github.com/zhang-tao-whu/DVIS}.
Related papers
- DVIS++: Improved Decoupled Framework for Universal Video Segmentation [30.703276476607545]
We present OV-DVIS++, the first open-vocabulary universal video segmentation framework.
By integrating CLIP with DVIS++, we present OV-DVIS++, the first open-vocabulary universal video segmentation framework.
arXiv Detail & Related papers (2023-12-20T03:01:33Z) - TCOVIS: Temporally Consistent Online Video Instance Segmentation [98.29026693059444]
We propose a novel online method for video instance segmentation called TCOVIS.
The core of our method consists of a global instance assignment strategy and a video-temporal enhancement module.
We evaluate our method on four VIS benchmarks and achieve state-of-the-art performance on all benchmarks without bells-and-whistles.
arXiv Detail & Related papers (2023-09-21T07:59:15Z) - TAPIR: Tracking Any Point with per-frame Initialization and temporal
Refinement [64.11385310305612]
We present a novel model for Tracking Any Point (TAP) that effectively tracks any queried point on any physical surface throughout a video sequence.
Our approach employs two stages: (1) a matching stage, which independently locates a suitable candidate point match for the query point on every other frame, and (2) a refinement stage, which updates both the trajectory and query features based on local correlations.
The resulting model surpasses all baseline methods by a significant margin on the TAP-Vid benchmark, as demonstrated by an approximate 20% absolute average Jaccard (AJ) improvement on DAVIS.
arXiv Detail & Related papers (2023-06-14T17:07:51Z) - Implicit Temporal Modeling with Learnable Alignment for Video
Recognition [95.82093301212964]
We propose a novel Implicit Learnable Alignment (ILA) method, which minimizes the temporal modeling effort while achieving incredibly high performance.
ILA achieves a top-1 accuracy of 88.7% on Kinetics-400 with much fewer FLOPs compared with Swin-L and ViViT-H.
arXiv Detail & Related papers (2023-04-20T17:11:01Z) - Efficient Video Instance Segmentation via Tracklet Query and Proposal [62.897552852894854]
Video Instance aims to simultaneously classify, segment, and track multiple object instances in videos.
Most clip-level methods are neither end-to-end learnable nor real-time.
This paper proposes EfficientVIS, a fully end-to-end framework with efficient training and inference.
arXiv Detail & Related papers (2022-03-03T17:00:11Z) - STC: Spatio-Temporal Contrastive Learning for Video Instance
Segmentation [47.28515170195206]
Video Instance (VIS) is a task that simultaneously requires classification, segmentation, and instance association in a video.
Recent VIS approaches rely on sophisticated pipelines to achieve this goal, including RoI-related operations or 3D convolutions.
We present a simple and efficient single-stage VIS framework based on the instance segmentation method ConInst.
arXiv Detail & Related papers (2022-02-08T09:34:26Z) - Improving Video Instance Segmentation via Temporal Pyramid Routing [61.10753640148878]
Video Instance (VIS) is a new and inherently multi-task problem, which aims to detect, segment and track each instance in a video sequence.
We propose a Temporal Pyramid Routing (TPR) strategy to conditionally align and conduct pixel-level aggregation from a feature pyramid pair of two adjacent frames.
Our approach is a plug-and-play module and can be easily applied to existing instance segmentation methods.
arXiv Detail & Related papers (2021-07-28T03:57:12Z) - 1st Place Solution for YouTubeVOS Challenge 2021:Video Instance
Segmentation [0.39146761527401414]
Video Instance (VIS) is a multi-task problem performing detection, segmentation, and tracking simultaneously.
We propose two modules, named Temporally Correlated Instance (TCIS) and Bidirectional Tracking (BiTrack)
By combining these techniques with a bag of tricks, the network performance is significantly boosted compared to the baseline.
arXiv Detail & Related papers (2021-06-12T00:20:38Z) - Crossover Learning for Fast Online Video Instance Segmentation [53.5613957875507]
We present a novel crossover learning scheme that uses the instance feature in the current frame to pixel-wisely localize the same instance in other frames.
To our knowledge, CrossVIS achieves state-of-the-art performance among all online VIS methods and shows a decent trade-off between latency and accuracy.
arXiv Detail & Related papers (2021-04-13T06:47:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.