Video Instance Segmentation in an Open-World
- URL: http://arxiv.org/abs/2304.01200v1
- Date: Mon, 3 Apr 2023 17:59:52 GMT
- Title: Video Instance Segmentation in an Open-World
- Authors: Omkar Thawakar, Sanath Narayan, Hisham Cholakkal, Rao Muhammad Anwer,
Salman Khan, Jorma Laaksonen, Mubarak Shah, Fahad Shahbaz Khan
- Abstract summary: Video instance segmentation (VIS) approaches generally follow a closed-world assumption.
We propose the first open-world VIS approach, named OW-VISFormer, that introduces a novel feature enrichment mechanism.
Our OW-VISFormer performs favorably against a solid baseline in the OW-VIS setting.
- Score: 112.02667959850436
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Existing video instance segmentation (VIS) approaches generally follow a
closed-world assumption, where only seen category instances are identified and
spatio-temporally segmented at inference. Open-world formulation relaxes the
closed-world static-learning assumption as follows: (a) first, it distinguishes
a set of known categories as well as labels an unknown object as 'unknown' and
then (b) it incrementally learns the class of an unknown as and when the
corresponding semantic labels become available. We propose the first open-world
VIS approach, named OW-VISFormer, that introduces a novel feature enrichment
mechanism and a spatio-temporal objectness (STO) module. The feature enrichment
mechanism based on a light-weight auxiliary network aims at accurate
pixel-level (unknown) object delineation from the background as well as
distinguishing category-specific known semantic classes. The STO module strives
to generate instance-level pseudo-labels by enhancing the foreground
activations through a contrastive loss. Moreover, we also introduce an
extensive experimental protocol to measure the characteristics of OW-VIS. Our
OW-VISFormer performs favorably against a solid baseline in the OW-VIS setting.
Further, we evaluate our contributions in the standard fully-supervised VIS
setting by integrating them into the recent SeqFormer, achieving an absolute
gain of 1.6% AP on the YouTube-VIS 2019 validation set. Lastly, we show the
generalizability of our contributions to the open-world object detection (OWOD)
setting, outperforming the best existing OWOD method in the literature. Code,
models along with OW-VIS splits are available at
https://github.com/OmkarThawakar/OWVISFormer.
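The STO module's contrastive foreground enhancement can be pictured with a short sketch. The PyTorch function below is a minimal illustration under our own assumptions (the function name, the hinge-margin formulation, and the tensor shapes are hypothetical, not the authors' released implementation): it pushes objectness activations up on pseudo-labeled foreground pixels, down elsewhere, and enforces a margin between the two.

```python
# Minimal PyTorch sketch of a contrastive foreground-enhancement loss in
# the spirit of the STO module described above. The function name, the
# hinge-margin formulation, and the tensor shapes are illustrative
# assumptions, not the authors' released implementation.
import torch
import torch.nn.functional as F

def foreground_contrastive_loss(objectness: torch.Tensor,
                                fg_mask: torch.Tensor,
                                margin: float = 0.5) -> torch.Tensor:
    """objectness: (B, H, W) activation map in [0, 1].
    fg_mask: (B, H, W) binary pseudo-label mask (1 = foreground).
    Assumes both foreground and background pixels are present."""
    fg = objectness[fg_mask.bool()]    # activations on foreground pixels
    bg = objectness[~fg_mask.bool()]   # activations on background pixels
    # Push foreground responses toward 1, background responses toward 0,
    # and enforce a hinge margin between their mean activations.
    separation = F.relu(margin - (fg.mean() - bg.mean()))
    return (1.0 - fg).mean() + bg.mean() + separation

# Example usage with random tensors:
obj = torch.rand(2, 32, 32)                    # predicted objectness map
mask = (torch.rand(2, 32, 32) > 0.5).float()   # pseudo-label mask
loss = foreground_contrastive_loss(obj, mask)
```

In training, a term like this would be combined with the usual segmentation losses so that instance-level pseudo-labels for unknown objects can be drawn from the enhanced foreground activations.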
Related papers
- UVIS: Unsupervised Video Instance Segmentation [65.46196594721545]
Video instance segmentation requires classifying, segmenting, and tracking every object across video frames.
We propose UVIS, a novel Unsupervised Video Instance Segmentation framework that can perform video instance segmentation without any video annotations or dense label-based pretraining.
Our framework consists of three essential steps: frame-level pseudo-label generation, transformer-based VIS model training, and query-based tracking.
arXiv Detail & Related papers (2024-06-11T03:05:50Z) - DVIS++: Improved Decoupled Framework for Universal Video Segmentation [30.703276476607545]
By integrating CLIP with DVIS++, we present OV-DVIS++, the first open-vocabulary universal video segmentation framework.
arXiv Detail & Related papers (2023-12-20T03:01:33Z) - OpenVIS: Open-vocabulary Video Instance Segmentation [24.860711503327323]
Open-vocabulary Video Instance Segmentation (OpenVIS) can simultaneously detect, segment, and track arbitrary object categories in a video.
We propose InstFormer, a framework that achieves powerful open-vocabulary capabilities through lightweight fine-tuning with limited-category data.
arXiv Detail & Related papers (2023-05-26T11:25:59Z) - Towards Open-Vocabulary Video Instance Segmentation [61.469232166803465]
Video Instance Segmentation aims at segmenting and categorizing objects in videos from a closed set of training categories.
We introduce the novel task of Open-Vocabulary Video Instance Segmentation, which aims to simultaneously segment, track, and classify objects in videos from open-set categories.
To benchmark Open-Vocabulary VIS, we collect a Large-Vocabulary Video Instance Segmentation dataset (LV-VIS) that contains well-annotated objects from 1,196 diverse categories.
arXiv Detail & Related papers (2023-04-04T11:25:23Z) - ElC-OIS: Ellipsoidal Clustering for Open-World Instance Segmentation on
LiDAR Data [13.978966783993146]
Open-world Instance Segmentation (OIS) is a challenging task that aims to accurately segment every object instance appearing in the current observation.
This is important for safety-critical applications such as robust autonomous navigation.
We present a flexible and effective OIS framework for LiDAR point clouds that can accurately segment both known and unknown instances.
arXiv Detail & Related papers (2023-03-08T03:22:11Z) - A Generalized Framework for Video Instance Segmentation [49.41441806931224]
The handling of long videos with complex and occluded sequences has emerged as a new challenge in the video instance segmentation (VIS) community.
We propose a Generalized framework for VIS, namely GenVIS, that achieves state-of-the-art performance on challenging benchmarks.
We evaluate our approach on popular VIS benchmarks, achieving state-of-the-art results on YouTube-VIS 2019/2021/2022 and Occluded VIS (OVIS).
arXiv Detail & Related papers (2022-11-16T11:17:19Z) - Spatio-temporal Relation Modeling for Few-shot Action Recognition [100.3999454780478]
We propose a few-shot action recognition framework, STRM, which enhances class-specific feature discriminability while simultaneously learning higher-order temporal representations.
Our approach achieves an absolute gain of 3.5% in classification accuracy, as compared to the best existing method in the literature.
arXiv Detail & Related papers (2021-12-09T18:59:14Z) - Novel Class Discovery in Semantic Segmentation [104.30729847367104]
We introduce a new setting of Novel Class Discovery in Semantic Segmentation (NCDSS).
It aims at segmenting unlabeled images containing new classes given prior knowledge from a labeled set of disjoint classes.
In NCDSS, we need to distinguish the objects and background, and to handle the existence of multiple classes within an image.
We propose the Entropy-based Uncertainty Modeling and Self-training (EUMS) framework to overcome noisy pseudo-labels.
arXiv Detail & Related papers (2021-12-03T13:31:59Z)