Towards Open-Vocabulary Video Instance Segmentation
- URL: http://arxiv.org/abs/2304.01715v2
- Date: Sun, 6 Aug 2023 20:08:58 GMT
- Title: Towards Open-Vocabulary Video Instance Segmentation
- Authors: Haochen Wang, Cilin Yan, Shuai Wang, Xiaolong Jiang, Xu Tang, Yao Hu,
Weidi Xie, Efstratios Gavves
- Abstract summary: Video Instance Segmentation (VIS) aims at segmenting and categorizing objects in videos from a closed set of training categories.
We introduce the novel task of Open-Vocabulary Video Instance Segmentation, which aims to simultaneously segment, track, and classify objects in videos from open-set categories.
To benchmark Open-Vocabulary VIS, we collect a Large-Vocabulary Video Instance Segmentation dataset (LV-VIS), which contains well-annotated objects from 1,196 diverse categories.
- Score: 61.469232166803465
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video Instance Segmentation (VIS) aims at segmenting and categorizing objects
in videos from a closed set of training categories, lacking the generalization
ability to handle novel categories in real-world videos. To address this
limitation, we make the following three contributions. First, we introduce the
novel task of Open-Vocabulary Video Instance Segmentation, which aims to
simultaneously segment, track, and classify objects in videos from open-set
categories, including novel categories unseen during training. Second, to
benchmark Open-Vocabulary VIS, we collect a Large-Vocabulary Video Instance
Segmentation dataset (LV-VIS), which contains well-annotated objects from 1,196
diverse categories, significantly surpassing the category size of existing
datasets by more than one order of magnitude. Third, we propose OV2Seg, an
efficient Memory-Induced Transformer architecture and the first to achieve
Open-Vocabulary VIS in an end-to-end manner with near real-time inference
speed. Extensive experiments on LV-VIS and four existing VIS datasets
demonstrate the strong zero-shot generalization ability of OV2Seg on novel
categories. The dataset and code are released at
https://github.com/haochenheheda/LVVIS.
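For intuition, here is a minimal, hypothetical sketch of the open-vocabulary classification step that OV2Seg-style models rely on: per-instance query embeddings from a transformer decoder are scored against text embeddings of arbitrary category names (e.g., from a CLIP-style text encoder), so novel categories can be handled at inference time simply by extending the category list. All function and variable names below are illustrative assumptions, not the paper's actual code.

```python
import torch
import torch.nn.functional as F

def classify_queries(query_embeds: torch.Tensor,
                     text_embeds: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Score instance query embeddings against category text embeddings.

    query_embeds: (num_queries, dim) per-instance embeddings from the decoder.
    text_embeds:  (num_categories, dim) text embeddings of category names; the
                  category list may include classes unseen during training.
    Returns:      (num_queries, num_categories) classification logits.
    """
    q = F.normalize(query_embeds, dim=-1)      # unit-normalize query embeddings
    t = F.normalize(text_embeds, dim=-1)       # unit-normalize text embeddings
    return q @ t.T / temperature               # scaled cosine similarity

# Toy usage: 5 instance queries scored against 3 open-vocabulary categories.
logits = classify_queries(torch.randn(5, 256), torch.randn(3, 256))
probs = logits.softmax(dim=-1)                 # per-query category distribution
print(probs.shape)                             # torch.Size([5, 3])
```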
Related papers
- ReferEverything: Towards Segmenting Everything We Can Speak of in Videos [42.88584315033116]
We present REM, a framework for segmenting concepts in video that can be described through natural language.
Our method capitalizes on visual representations learned by video diffusion models on Internet-scale datasets.
arXiv Detail & Related papers (2024-10-30T17:59:26Z)
- Unified Embedding Alignment for Open-Vocabulary Video Instance Segmentation [28.360157186395686]
Open-Vocabulary Video Instance Segmentation (VIS) is attracting increasing attention due to its ability to segment and track arbitrary objects.
We propose a novel Open-Vocabulary VIS baseline called OVFormer.
OVFormer utilizes a lightweight module for unified embedding alignment between query embeddings and CLIP image embeddings.
Unlike previous image-based training methods, we conduct video-based model training and deploy a semi-online inference scheme to fully mine the temporal consistency in the video.
arXiv Detail & Related papers (2024-07-10T07:30:51Z)
- UVIS: Unsupervised Video Instance Segmentation [65.46196594721545]
Video instance segmentation requires classifying, segmenting, and tracking every object across video frames.
We propose UVIS, a novel Unsupervised Video Instance Segmentation framework that can perform video instance segmentation without any video annotations or dense label-based pretraining.
Our framework consists of three essential steps: frame-level pseudo-label generation, transformer-based VIS model training, and query-based tracking (a hypothetical sketch of this last step appears at the end of this list).
arXiv Detail & Related papers (2024-06-11T03:05:50Z)
- DVIS++: Improved Decoupled Framework for Universal Video Segmentation [30.703276476607545]
By integrating CLIP with DVIS++, we present OV-DVIS++, the first open-vocabulary universal video segmentation framework.
arXiv Detail & Related papers (2023-12-20T03:01:33Z)
- Tracking Anything with Decoupled Video Segmentation [87.07258378407289]
We develop a decoupled video segmentation approach (DEVA), composed of task-specific image-level segmentation and class/task-agnostic bi-directional temporal propagation.
We show that this decoupled formulation compares favorably to end-to-end approaches in several data-scarce tasks.
arXiv Detail & Related papers (2023-09-07T17:59:41Z)
- Efficient Video Instance Segmentation via Tracklet Query and Proposal [62.897552852894854]
Video Instance Segmentation aims to simultaneously classify, segment, and track multiple object instances in videos.
Most clip-level methods are neither end-to-end learnable nor real-time.
This paper proposes EfficientVIS, a fully end-to-end framework with efficient training and inference.
arXiv Detail & Related papers (2022-03-03T17:00:11Z)
- Video Instance Segmentation with a Propose-Reduce Paradigm [68.59137660342326]
Video instance segmentation (VIS) aims to segment and associate all instances of predefined classes for each frame in videos.
Prior methods usually obtain segmentation for a frame or clip first, and then merge the incomplete results by tracking or matching.
We propose a new paradigm, Propose-Reduce, to generate complete sequences for input videos in a single step.
arXiv Detail & Related papers (2021-03-25T10:58:36Z)
- Video Panoptic Segmentation [117.08520543864054]
We propose and explore a new video extension of panoptic segmentation, called video panoptic segmentation.
To invigorate research on this new task, we present two types of video panoptic datasets.
We propose a novel video panoptic segmentation network (VPSNet) which jointly predicts object classes, bounding boxes, masks, instance id tracking, and semantic segmentation in video frames.
arXiv Detail & Related papers (2020-06-19T19:35:47Z)
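As flagged in the UVIS entry above, here is a minimal, hypothetical sketch of query-based tracking: instance query embeddings from consecutive frames are associated by Hungarian matching on a cosine-distance cost matrix. This is an assumption-based illustration of the general technique, not the implementation from any paper listed here; all names are made up for the example.

```python
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def match_queries(prev_embeds: torch.Tensor, curr_embeds: torch.Tensor):
    """Associate instance queries across two consecutive frames.

    prev_embeds: (num_prev, dim) query embeddings from the previous frame.
    curr_embeds: (num_curr, dim) query embeddings from the current frame.
    Returns:     list of (prev_index, curr_index) pairs giving the matching.
    """
    prev = F.normalize(prev_embeds, dim=-1)
    curr = F.normalize(curr_embeds, dim=-1)
    cost = 1.0 - prev @ curr.T                         # cosine distance matrix
    rows, cols = linear_sum_assignment(cost.numpy())   # Hungarian matching
    return list(zip(rows.tolist(), cols.tolist()))

# Toy usage: propagate identities of 4 instances from one frame to the next.
pairs = match_queries(torch.randn(4, 256), torch.randn(4, 256))
print(pairs)  # e.g. [(0, 2), (1, 0), (2, 3), (3, 1)]
```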
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.