CLIP-VIS: Adapting CLIP for Open-Vocabulary Video Instance Segmentation
- URL: http://arxiv.org/abs/2403.12455v3
- Date: Tue, 08 Oct 2024 10:21:31 GMT
- Title: CLIP-VIS: Adapting CLIP for Open-Vocabulary Video Instance Segmentation
- Authors: Wenqi Zhu, Jiale Cao, Jin Xie, Shuangming Yang, Yanwei Pang
- Abstract summary: We propose a simple encoder-decoder network, called CLIP-VIS, to adapt CLIP for open-vocabulary video instance segmentation.
Our CLIP-VIS adopts frozen CLIP and introduces three modules, including class-agnostic mask generation, temporal topK-enhanced matching, and weighted open-vocabulary classification.
- Score: 44.450243388665776
- Abstract: Open-vocabulary video instance segmentation strives to segment and track instances belonging to an open set of categories in videos. The vision-language model Contrastive Language-Image Pre-training (CLIP) has shown robust zero-shot classification ability in image-level open-vocabulary tasks. In this paper, we propose a simple encoder-decoder network, called CLIP-VIS, to adapt CLIP for open-vocabulary video instance segmentation. Our CLIP-VIS adopts frozen CLIP and introduces three modules: class-agnostic mask generation, temporal topK-enhanced matching, and weighted open-vocabulary classification. Given a set of initial queries, class-agnostic mask generation introduces a pixel decoder and a transformer decoder on top of the CLIP pre-trained image encoder to predict query masks along with corresponding object scores and mask IoU scores. Then, temporal topK-enhanced matching performs query matching across frames using the K best-matched frames. Finally, weighted open-vocabulary classification first employs mask pooling to generate query visual features from the CLIP pre-trained image encoder, and then performs weighted classification using the object scores and mask IoU scores. Our CLIP-VIS does not require annotations of instance categories and identities. Experiments on various video instance segmentation datasets demonstrate the effectiveness of our proposed method, especially for novel categories. When using ConvNeXt-B as the backbone, CLIP-VIS achieves AP and APn scores of 32.2% and 40.2% on the validation set of the LV-VIS dataset, outperforming OV2Seg by 11.1% and 23.9%, respectively. We will release the source code and models at https://github.com/zwq456/CLIP-VIS.git.
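To make the weighted open-vocabulary classification step above concrete, below is a minimal PyTorch sketch of mask pooling over frozen CLIP features followed by re-weighting with the object and mask IoU scores. All tensor names, shapes, and the temperature value are illustrative assumptions, not the paper's released implementation.

```python
import torch
import torch.nn.functional as F


def weighted_open_vocab_classification(clip_feats, query_masks, text_embeds,
                                        obj_scores, iou_scores, temperature=0.01):
    """Hedged sketch of weighted open-vocabulary classification.

    clip_feats:  (B, C, H, W) frozen CLIP image-encoder feature map (assumed shape)
    query_masks: (B, Q, H, W) class-agnostic mask logits per query
    text_embeds: (K, C)       CLIP text embeddings of the category names
    obj_scores:  (B, Q)       predicted object scores
    iou_scores:  (B, Q)       predicted mask IoU scores
    """
    # 1) Mask pooling: average CLIP features inside each predicted mask.
    masks = query_masks.sigmoid()                               # (B, Q, H, W)
    denom = masks.flatten(2).sum(-1).clamp(min=1e-6)            # (B, Q)
    pooled = torch.einsum("bqhw,bchw->bqc", masks, clip_feats)  # (B, Q, C)
    pooled = pooled / denom.unsqueeze(-1)

    # 2) Zero-shot classification against the CLIP text embeddings.
    pooled = F.normalize(pooled, dim=-1)
    text = F.normalize(text_embeds, dim=-1)
    probs = (pooled @ text.t() / temperature).softmax(dim=-1)   # (B, Q, K)

    # 3) Weight the class probabilities by object and mask IoU scores,
    #    so low-quality masks contribute low classification confidence.
    weight = (obj_scores * iou_scores).unsqueeze(-1)            # (B, Q, 1)
    return probs * weight


if __name__ == "__main__":
    # Toy shapes: 2 frames, 10 queries, 512-d CLIP space, 5 candidate categories.
    B, Q, C, H, W, K = 2, 10, 512, 16, 16, 5
    scores = weighted_open_vocab_classification(
        torch.randn(B, C, H, W), torch.randn(B, Q, H, W),
        torch.randn(K, C), torch.rand(B, Q), torch.rand(B, Q))
    print(scores.shape)  # torch.Size([2, 10, 5])
```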
Related papers
- Towards Open-Vocabulary Semantic Segmentation Without Semantic Labels [53.8817160001038]
We propose a novel method, PixelCLIP, to adapt the CLIP image encoder for pixel-level understanding.
To address the challenges of leveraging masks without semantic labels, we devise an online clustering algorithm.
PixelCLIP shows significant performance improvements over CLIP and competitive results compared to caption-supervised methods.
arXiv Detail & Related papers (2024-09-30T01:13:03Z) - Unified Mask Embedding and Correspondence Learning for Self-Supervised Video Segmentation [76.40565872257709]
We develop a unified framework which simultaneously models cross-frame dense correspondence for locally discriminative feature learning.
It is able to directly learn to perform mask-guided sequential segmentation from unlabeled videos.
Our algorithm sets the state of the art on two standard benchmarks (i.e., DAVIS17 and YouTube-VOS).
arXiv Detail & Related papers (2023-03-17T16:23:36Z) - GridCLIP: One-Stage Object Detection by Grid-Level CLIP Representation Learning [55.77244064907146]
One-stage detector GridCLIP learns grid-level representations to adapt to the intrinsic principle of one-stage detection learning.
Experiments show that the learned CLIP-based grid-level representations boost the performance of undersampled (infrequent and novel) categories.
arXiv Detail & Related papers (2023-03-16T12:06:02Z) - Side Adapter Network for Open-Vocabulary Semantic Segmentation [69.18441687386733]
This paper presents Side Adapter Network (SAN), a new framework for open-vocabulary semantic segmentation built on a pre-trained vision-language model.
A side network is attached to a frozen CLIP model with two branches: one for predicting mask proposals, and the other for predicting attention bias.
Our approach significantly outperforms other counterparts, with up to 18 times fewer trainable parameters and 19 times faster inference speed.
arXiv Detail & Related papers (2023-02-23T18:58:28Z) - CLIP is Also an Efficient Segmenter: A Text-Driven Approach for Weakly Supervised Semantic Segmentation [19.208559353954833]
This paper explores the potential of Contrastive Language-Image Pre-training models (CLIP) to localize different categories with only image-level labels.
To efficiently generate high-quality segmentation masks from CLIP, we propose a novel WSSS framework called CLIP-ES.
arXiv Detail & Related papers (2022-12-16T06:23:59Z) - Open-Vocabulary Universal Image Segmentation with MaskCLIP [24.74805434602145]
We tackle an emerging computer vision task, open-vocabulary universal image segmentation.
We first build a baseline method by directly adopting pre-trained CLIP models.
We then develop MaskCLIP, a Transformer-based approach with a MaskCLIP Visual Encoder.
arXiv Detail & Related papers (2022-08-18T17:55:37Z) - One-stage Video Instance Segmentation: From Frame-in Frame-out to Clip-in Clip-out [15.082477136581153]
We propose a clip-in clip-out (CiCo) framework to exploit temporal information in video clips.
The CiCo strategy is free of inter-frame alignment and can be easily embedded into existing FiFo-based VIS approaches.
The two new one-stage VIS models achieve 37.1/37.3%, 35.2/35.4% and 17.2/18.0% mask AP.
arXiv Detail & Related papers (2022-03-12T12:23:21Z) - A Simple Baseline for Zero-shot Semantic Segmentation with Pre-trained Vision-language Model [61.58071099082296]
It is unclear how to make zero-shot recognition work well on broader vision problems, such as object detection and semantic segmentation.
In this paper, we target zero-shot semantic segmentation by building on an off-the-shelf pre-trained vision-language model, i.e., CLIP.
Our experimental results show that this simple framework surpasses the previous state of the art by a large margin.
arXiv Detail & Related papers (2021-12-29T18:56:18Z)