CLIP-VIS: Adapting CLIP for Open-Vocabulary Video Instance Segmentation
- URL: http://arxiv.org/abs/2403.12455v2
- Date: Sat, 8 Jun 2024 00:59:41 GMT
- Title: CLIP-VIS: Adapting CLIP for Open-Vocabulary Video Instance Segmentation
- Authors: Wenqi Zhu, Jiale Cao, Jin Xie, Shuangming Yang, Yanwei Pang
- Abstract summary: We propose a simple encoder-decoder network, called CLIP-VIS, to adapt CLIP for open-vocabulary video instance segmentation.
Our CLIP-VIS adopts a frozen CLIP image encoder and introduces three modules: class-agnostic mask generation, temporal topK-enhanced matching, and weighted open-vocabulary classification.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Open-vocabulary video instance segmentation strives to segment and track instances belonging to an open set of categories in a video. The vision-language model Contrastive Language-Image Pre-training (CLIP) has shown robust zero-shot classification ability in image-level open-vocabulary tasks. In this paper, we propose a simple encoder-decoder network, called CLIP-VIS, to adapt CLIP for open-vocabulary video instance segmentation. Our CLIP-VIS adopts a frozen CLIP image encoder and introduces three modules: class-agnostic mask generation, temporal topK-enhanced matching, and weighted open-vocabulary classification. Given a set of initial queries, class-agnostic mask generation employs a transformer decoder to predict query masks together with corresponding object scores and mask IoU scores. Temporal topK-enhanced matching then associates queries across frames using the K best-matched frames. Finally, weighted open-vocabulary classification first generates query visual features with mask pooling, and then performs weighted classification using the object scores and mask IoU scores. Our CLIP-VIS does not require annotations of instance categories or identities. Experiments on various video instance segmentation datasets demonstrate the effectiveness of the proposed method, especially on novel categories. When using ConvNeXt-B as the backbone, CLIP-VIS achieves AP and APn scores of 32.2% and 40.2% on the validation set of the LV-VIS dataset, outperforming OV2Seg by 11.1% and 23.9%, respectively. We will release the source code and models at https://github.com/zwq456/CLIP-VIS.git.
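The weighted open-vocabulary classification module is described concretely enough to sketch. The snippet below is a minimal PyTorch-style illustration of that step as summarized in the abstract: query visual features are obtained by mask pooling over the frozen CLIP feature map, compared against CLIP text embeddings of the category names, and the resulting class scores are weighted by the predicted object and mask IoU scores. Function names, tensor shapes, and the exact weighting form are assumptions for illustration, not the authors' released implementation.

```python
# Hypothetical sketch of the weighted open-vocabulary classification step.
# Shapes and the multiplicative weighting are assumptions, not the official code.
import torch
import torch.nn.functional as F


def mask_pool(image_feats: torch.Tensor, masks: torch.Tensor) -> torch.Tensor:
    """Average frozen CLIP image features inside each predicted query mask.

    image_feats: (C, H, W) CLIP feature map for one frame.
    masks:       (Q, H, W) soft query masks in [0, 1].
    returns:     (Q, C) per-query visual features.
    """
    weights = masks.flatten(1)                                # (Q, H*W)
    feats = image_feats.flatten(1).t()                        # (H*W, C)
    pooled = weights @ feats                                  # (Q, C)
    return pooled / weights.sum(dim=1, keepdim=True).clamp(min=1e-6)


def weighted_open_vocab_scores(
    image_feats: torch.Tensor,   # (C, H, W) frozen CLIP image features
    masks: torch.Tensor,         # (Q, H, W) predicted query masks
    text_embeds: torch.Tensor,   # (K, C) CLIP text embeddings of K category names
    obj_scores: torch.Tensor,    # (Q,) class-agnostic object scores
    iou_scores: torch.Tensor,    # (Q,) predicted mask IoU scores
    temperature: float = 0.01,
) -> torch.Tensor:
    """Classify each query against an open vocabulary, weighted by mask quality."""
    query_feats = F.normalize(mask_pool(image_feats, masks), dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    # Cosine-similarity classification against the category text embeddings.
    logits = query_feats @ text_embeds.t() / temperature      # (Q, K)
    # Down-weight queries that are unlikely to be objects or have poor masks.
    weight = (obj_scores * iou_scores).unsqueeze(-1)          # (Q, 1)
    return logits.softmax(dim=-1) * weight
```

In the full method, per-frame scores like these would additionally be aggregated across the frames selected by the temporal topK-enhanced matching module before producing video-level category predictions.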
Related papers
- OpenVIS: Open-vocabulary Video Instance Segmentation [26.107369797422145]
Open-vocabulary Video Instance Segmentation (OpenVIS) can simultaneously detect, segment, and track arbitrary object categories in a video.
We propose an OpenVIS framework called InstFormer that achieves strong open-vocabulary capability.
arXiv Detail & Related papers (2023-05-26T11:25:59Z)
- Towards Open-Vocabulary Video Instance Segmentation [61.469232166803465]
Video Instance Segmentation aims at segmenting and categorizing objects in videos from a closed set of training categories.
We introduce the novel task of Open-Vocabulary Video Instance Segmentation, which aims to simultaneously segment, track, and classify objects in videos from open-set categories.
To benchmark Open-Vocabulary VIS, we collect a Large-Vocabulary Video Instance dataset (LV-VIS), that contains well-annotated objects from 1,196 diverse categories.
arXiv Detail & Related papers (2023-04-04T11:25:23Z)
- Unified Mask Embedding and Correspondence Learning for Self-Supervised Video Segmentation [76.40565872257709]
We develop a unified framework which simultaneously models cross-frame dense correspondence for locally discriminative feature learning.
It is able to directly learn to perform mask-guided sequential segmentation from unlabeled videos.
Our algorithm sets the state of the art on two standard benchmarks (i.e., DAVIS17 and YouTube-VOS).
arXiv Detail & Related papers (2023-03-17T16:23:36Z)
- GridCLIP: One-Stage Object Detection by Grid-Level CLIP Representation Learning [55.77244064907146]
One-stage detector GridCLIP learns grid-level representations to adapt to the intrinsic principle of one-stage detection learning.
Experiments show that the learned CLIP-based grid-level representations boost the performance of undersampled (infrequent and novel) categories.
arXiv Detail & Related papers (2023-03-16T12:06:02Z)
- Side Adapter Network for Open-Vocabulary Semantic Segmentation [69.18441687386733]
This paper presents a new framework for open-vocabulary semantic segmentation with a pre-trained vision-language model, named Side Adapter Network (SAN).
A side network is attached to a frozen CLIP model with two branches: one for predicting mask proposals, and the other for predicting attention bias.
Our approach significantly outperforms other counterparts, with up to 18 times fewer trainable parameters and 19 times faster inference speed.
arXiv Detail & Related papers (2023-02-23T18:58:28Z)
- Open-Vocabulary Universal Image Segmentation with MaskCLIP [24.74805434602145]
We tackle an emerging computer vision task, open-vocabulary universal image segmentation.
We first build a baseline method by directly adopting pre-trained CLIP models.
We then develop MaskCLIP, a Transformer-based approach with a MaskCLIP Visual Encoder.
arXiv Detail & Related papers (2022-08-18T17:55:37Z)
- One-stage Video Instance Segmentation: From Frame-in Frame-out to Clip-in Clip-out [15.082477136581153]
We propose a clip-in clip-out (CiCo) framework to exploit temporal information in video clips.
The CiCo strategy is free of inter-frame alignment and can be easily embedded into existing FiFo-based VIS approaches.
Two new one-stage VIS models achieve 37.7.3%, 35.2/35.4% and 17.2/1% mask AP.
arXiv Detail & Related papers (2022-03-12T12:23:21Z)
- Video Panoptic Segmentation [117.08520543864054]
We propose and explore a new video extension of this task, called video panoptic segmentation.
To invigorate research on this new task, we present two types of video panoptic datasets.
We propose a novel video panoptic segmentation network (VPSNet) which jointly predicts object classes, bounding boxes, masks, instance id tracking, and semantic segmentation in video frames.
arXiv Detail & Related papers (2020-06-19T19:35:47Z)