DVIS++: Improved Decoupled Framework for Universal Video Segmentation
- URL: http://arxiv.org/abs/2312.13305v1
- Date: Wed, 20 Dec 2023 03:01:33 GMT
- Title: DVIS++: Improved Decoupled Framework for Universal Video Segmentation
- Authors: Tao Zhang and Xingye Tian and Yikang Zhou and Shunping Ji and Xuebo
Wang and Xin Tao and Yuan Zhang and Pengfei Wan and Zhongyuan Wang and Yu Wu
- Abstract summary: By integrating CLIP with DVIS++, we present OV-DVIS++, the first open-vocabulary universal video segmentation framework.
- Score: 30.703276476607545
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present the \textbf{D}ecoupled \textbf{VI}deo \textbf{S}egmentation (DVIS)
framework, a novel approach for the challenging task of universal video
segmentation, including video instance segmentation (VIS), video semantic
segmentation (VSS), and video panoptic segmentation (VPS). Unlike previous
methods that model video segmentation in an end-to-end manner, our approach
decouples video segmentation into three cascaded sub-tasks: segmentation,
tracking, and refinement. This decoupling design allows for simpler and more
effective modeling of the spatio-temporal representations of objects,
especially in complex scenes and long videos. Accordingly, we introduce two
novel components: the referring tracker and the temporal refiner. These
components track objects frame by frame and model spatio-temporal
representations based on pre-aligned features. To improve the tracking
capability of DVIS, we propose a denoising training strategy and introduce
contrastive learning, resulting in a more robust framework named DVIS++.
Furthermore, we evaluate DVIS++ in various settings, including open vocabulary
and using a frozen pre-trained backbone. By integrating CLIP with DVIS++, we
present OV-DVIS++, the first open-vocabulary universal video segmentation
framework. We conduct extensive experiments on six mainstream benchmarks,
including the VIS, VSS, and VPS datasets. Using a unified architecture, DVIS++
significantly outperforms state-of-the-art specialized methods on these
benchmarks in both close- and open-vocabulary settings.
Code:~\url{https://github.com/zhang-tao-whu/DVIS_Plus}.
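A minimal sketch of the decoupled pipeline described in the abstract: a frame-level segmenter produces object queries, a referring tracker aligns them with the previous frame, and a temporal refiner models the whole clip of pre-aligned queries. It assumes PyTorch; the module names, dimensions, and attention blocks below are illustrative assumptions, not the authors' released implementation.
```python
import torch
import torch.nn as nn


class ReferringTracker(nn.Module):
    """Aligns the current frame's object queries with the previous frame's
    queries so that object identities stay consistent across frames."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * 4), nn.ReLU(), nn.Linear(dim * 4, dim))

    def forward(self, prev_queries: torch.Tensor, cur_queries: torch.Tensor) -> torch.Tensor:
        # prev_queries, cur_queries: (batch, num_objects, dim)
        aligned, _ = self.cross_attn(prev_queries, cur_queries, cur_queries)
        return aligned + self.ffn(aligned)


class TemporalRefiner(nn.Module):
    """Models spatio-temporal context over the clip of identity-aligned queries."""

    def __init__(self, dim: int = 256, num_heads: int = 8, num_layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.temporal_encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, query_seq: torch.Tensor) -> torch.Tensor:
        # query_seq: (batch * num_objects, num_frames, dim); attention runs over time.
        return self.temporal_encoder(query_seq)


def decoupled_video_segmentation(frames, segmenter, tracker, refiner):
    """Run the three cascaded sub-tasks: per-frame segmentation, frame-by-frame
    tracking, then temporal refinement. `segmenter` stands in for any query-based
    image segmenter returning (queries, masks) for a frame."""
    per_frame_queries, per_frame_masks = [], []
    prev = None
    for frame in frames:                      # 1) segmentation (per frame)
        queries, masks = segmenter(frame)     #    queries: (B, N, D)
        if prev is not None:                  # 2) tracking (align with previous frame)
            queries = tracker(prev, queries)
        prev = queries
        per_frame_queries.append(queries)
        per_frame_masks.append(masks)

    # 3) refinement: stack the aligned queries along time and refine jointly.
    q = torch.stack(per_frame_queries, dim=2)              # (B, N, T, D)
    b, n, t, d = q.shape
    refined = refiner(q.reshape(b * n, t, d)).reshape(b, n, t, d)
    return refined, per_frame_masks
```
Running the tracker frame by frame keeps object identities aligned before refinement, which is what allows the refiner to attend over time on a per-object basis.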
Related papers
- UVIS: Unsupervised Video Instance Segmentation [65.46196594721545]
Video instance segmentation requires classifying, segmenting, and tracking every object across video frames.
We propose UVIS, a novel Unsupervised Video Instance Segmentation framework that can perform video instance segmentation without any video annotations or dense label-based pretraining.
Our framework consists of three essential steps: frame-level pseudo-label generation, transformer-based VIS model training, and query-based tracking.
arXiv Detail & Related papers (2024-06-11T03:05:50Z) - Tracking Anything with Decoupled Video Segmentation [87.07258378407289]
We develop a decoupled video segmentation approach (DEVA).
It is composed of task-specific image-level segmentation and class/task-agnostic bi-directional temporal propagation.
We show that this decoupled formulation compares favorably to end-to-end approaches in several data-scarce tasks.
arXiv Detail & Related papers (2023-09-07T17:59:41Z) - DVIS: Decoupled Video Instance Segmentation Framework [15.571072365208872]
Video instance segmentation (VIS) is a critical task with diverse applications, including autonomous driving and video editing.
Existing methods often underperform on complex and long videos in the real world, primarily due to two factors.
We propose a decoupling strategy for VIS by dividing it into three independent sub-tasks: segmentation, tracking, and refinement.
arXiv Detail & Related papers (2023-06-06T05:24:15Z) - Towards Open-Vocabulary Video Instance Segmentation [61.469232166803465]
Video Instance Segmentation (VIS) aims at segmenting and categorizing objects in videos from a closed set of training categories.
We introduce the novel task of Open-Vocabulary Video Instance Segmentation, which aims to simultaneously segment, track, and classify objects in videos from open-set categories.
To benchmark Open-Vocabulary VIS, we collect a Large-Vocabulary Video Instance Segmentation dataset (LV-VIS) that contains well-annotated objects from 1,196 diverse categories.
arXiv Detail & Related papers (2023-04-04T11:25:23Z) - Efficient Video Instance Segmentation via Tracklet Query and Proposal [62.897552852894854]
Video Instance Segmentation aims to simultaneously classify, segment, and track multiple object instances in videos.
Most clip-level methods are neither end-to-end learnable nor real-time.
This paper proposes EfficientVIS, a fully end-to-end framework with efficient training and inference.
arXiv Detail & Related papers (2022-03-03T17:00:11Z) - STC: Spatio-Temporal Contrastive Learning for Video Instance Segmentation [47.28515170195206]
Video Instance Segmentation (VIS) is a task that simultaneously requires classification, segmentation, and instance association in a video.
Recent VIS approaches rely on sophisticated pipelines to achieve this goal, including RoI-related operations or 3D convolutions.
We present a simple and efficient single-stage VIS framework based on the instance segmentation method CondInst.
arXiv Detail & Related papers (2022-02-08T09:34:26Z) - End-to-End Video Instance Segmentation with Transformers [84.17794705045333]
Video instance segmentation (VIS) is the task that requires simultaneously classifying, segmenting and tracking object instances of interest in video.
Here, we propose a new video instance segmentation framework built upon Transformers, termed VisTR, which views the VIS task as a direct end-to-end parallel sequence decoding/prediction problem.
For the first time, we demonstrate a much simpler and faster video instance segmentation framework built upon Transformers, achieving competitive accuracy.
arXiv Detail & Related papers (2020-11-30T02:03:50Z) - Video Panoptic Segmentation [117.08520543864054]
We propose and explore a new video extension of the panoptic segmentation task, called video panoptic segmentation.
To invigorate research on this new task, we present two types of video panoptic datasets.
We propose a novel video panoptic segmentation network (VPSNet) which jointly predicts object classes, bounding boxes, masks, instance id tracking, and semantic segmentation in video frames.
arXiv Detail & Related papers (2020-06-19T19:35:47Z)
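The video panoptic segmentation entry above describes a network that jointly predicts instance and semantic outputs per frame. A minimal sketch of how such per-frame predictions could be fused into a single panoptic label map follows; the greedy score-ordered pasting, the 0.5 overlap threshold, and the id convention are illustrative assumptions, not the fusion rule of VPSNet or DVIS++.
```python
import numpy as np


def fuse_panoptic(instance_masks, instance_scores, semantic_map,
                  stuff_class_ids, overlap_thresh=0.5):
    """Greedily paste instance masks from highest score down, then fill the
    remaining pixels with 'stuff' classes from the semantic map.

    instance_masks: (N, H, W) bool, instance_scores: (N,) float,
    semantic_map: (H, W) int class ids (stuff ids assumed positive).
    Returns an (H, W) int32 map where positive ids are thing instances and
    negative ids encode stuff classes. In a video setting, the per-instance
    ids would instead come from the tracker so they stay consistent over time.
    """
    h, w = semantic_map.shape
    panoptic = np.zeros((h, w), dtype=np.int32)            # 0 = unassigned
    occupied = np.zeros((h, w), dtype=bool)

    next_id = 1
    for idx in np.argsort(-np.asarray(instance_scores)):   # best instances first
        free = instance_masks[idx] & ~occupied              # pixels not yet claimed
        if free.sum() < overlap_thresh * instance_masks[idx].sum():
            continue                                         # mostly hidden by better instances
        panoptic[free] = next_id
        occupied |= free
        next_id += 1

    for cls in stuff_class_ids:                              # fill remaining stuff regions
        region = (semantic_map == cls) & ~occupied
        panoptic[region] = -int(cls)
    return panoptic
```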