OpenVIS: Open-vocabulary Video Instance Segmentation
- URL: http://arxiv.org/abs/2305.16835v3
- Date: Sat, 17 Aug 2024 09:30:31 GMT
- Title: OpenVIS: Open-vocabulary Video Instance Segmentation
- Authors: Pinxue Guo, Tony Huang, Peiyang He, Xuefeng Liu, Tianjun Xiao, Zhaoyu Chen, Wenqiang Zhang
- Abstract summary: Open-vocabulary Video Instance Segmentation (OpenVIS) can simultaneously detect, segment, and track arbitrary object categories in a video.
We propose InstFormer, a framework that achieves powerful open-vocabulary capabilities through lightweight fine-tuning with limited-category data.
- Score: 24.860711503327323
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Open-vocabulary Video Instance Segmentation (OpenVIS) can simultaneously detect, segment, and track arbitrary object categories in a video, without being constrained to categories seen during training. In this work, we propose InstFormer, a carefully designed framework for the OpenVIS task that achieves powerful open-vocabulary capabilities through lightweight fine-tuning with limited-category data. InstFormer begins with the open-world mask proposal network, which is encouraged by the contrastive instance margin loss to propose class-agnostic masks for all potential instances. Next, we introduce InstCLIP, adapted from pre-trained CLIP with Instance Guidance Attention, which encodes open-vocabulary instance tokens efficiently. These instance tokens not only enable open-vocabulary classification but also offer strong universal tracking capabilities. Furthermore, to prevent the tracking module from being constrained by the training data with limited categories, we propose the universal rollout association, which transforms the tracking problem into predicting the next frame's instance tracking token. Experimental results demonstrate that the proposed InstFormer achieves state-of-the-art capabilities on a comprehensive OpenVIS evaluation benchmark, while also achieving competitive performance on the fully supervised VIS task.
Related papers
- What is Point Supervision Worth in Video Instance Segmentation? [119.71921319637748]
Video instance segmentation (VIS) is a challenging vision task that aims to detect, segment, and track objects in videos.
We reduce the human annotations to only one point for each object in a video frame during training, and obtain high-quality mask predictions close to fully supervised models.
Comprehensive experiments on three VIS benchmarks demonstrate competitive performance of the proposed framework, nearly matching fully supervised methods.
arXiv Detail & Related papers (2024-04-01T17:38:25Z)
- Video Instance Segmentation in an Open-World [112.02667959850436]
Video instance segmentation (VIS) approaches generally follow a closed-world assumption.
We propose the first open-world VIS approach, named OW-VISFormer, that introduces a novel feature enrichment mechanism.
Our OW-VISFormer performs favorably against a solid baseline in the OW-VIS setting.
arXiv Detail & Related papers (2023-04-03T17:59:52Z)
- Global Knowledge Calibration for Fast Open-Vocabulary Segmentation [124.74256749281625]
We introduce a text diversification strategy that generates a set of synonyms for each training category.
We also employ a text-guided knowledge distillation method to preserve the generalizable knowledge of CLIP.
Our proposed model achieves robust generalization performance across various datasets.
arXiv Detail & Related papers (2023-03-16T09:51:41Z)
- A Simple Framework for Open-Vocabulary Segmentation and Detection [85.21641508535679]
We present OpenSeeD, a simple Open-vocabulary Segmentation and Detection framework that jointly learns from different segmentation and detection datasets.
We first introduce a pre-trained text encoder to encode all the visual concepts in two tasks and learn a common semantic space for them.
After pre-training, our model exhibits competitive or stronger zero-shot transferability for both segmentation and detection.
arXiv Detail & Related papers (2023-03-14T17:58:34Z)
- Towards Universal Vision-language Omni-supervised Segmentation [72.31277932442988]
We present Vision-Language Omni-Supervised (VLOSS) to treat open-world segmentation tasks as proposal classification.
We incorporate omni-supervised data (i.e., panoptic segmentation data, object detection data, and image-text pair data) into training, thus enriching the open-world segmentation ability.
With fewer parameters, our VLOSS with Swin-Tiny surpasses MaskCLIP by 2% in terms of mask AP on the LVIS v1 dataset.
arXiv Detail & Related papers (2023-03-12T02:57:53Z)
- Solve the Puzzle of Instance Segmentation in Videos: A Weakly Supervised Framework with Spatio-Temporal Collaboration [13.284951215948052]
We present a novel weakly supervised framework with Spatio-Temporal Collaboration for instance Segmentation in videos.
Our method achieves strong performance and even outperforms fully supervised TrackR-CNN and MaskTrack R-CNN.
arXiv Detail & Related papers (2022-12-15T02:44:13Z)
- Exemplar-Based Open-Set Panoptic Segmentation Network [79.99748041746592]
We extend panoptic segmentation to the open-world and introduce an open-set panoptic segmentation (OPS) task.
We investigate the practical challenges of the task and construct a benchmark on top of an existing dataset, COCO.
We propose a novel exemplar-based open-set panoptic segmentation network (EOPSN) inspired by exemplar theory.
arXiv Detail & Related papers (2021-05-18T07:59:21Z)
- Salient Instance Segmentation with Region and Box-level Annotations [3.1458035003538884]
The new generation of saliency detection provides a strong theoretical and technical basis for video surveillance.
Because existing datasets are limited in scale and mask annotation is costly, additional sources of supervision are urgently needed to train a well-performing salient instance model.
We propose a novel salient instance segmentation framework that uses inexact supervision without resorting to laborious labeling.
arXiv Detail & Related papers (2020-08-19T03:43:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.