Unified Embedding Alignment for Open-Vocabulary Video Instance Segmentation
- URL: http://arxiv.org/abs/2407.07427v2
- Date: Fri, 12 Jul 2024 02:49:45 GMT
- Title: Unified Embedding Alignment for Open-Vocabulary Video Instance Segmentation
- Authors: Hao Fang, Peng Wu, Yawei Li, Xinxin Zhang, Xiankai Lu
- Abstract summary: Open-Vocabulary Video Instance Segmentation (VIS) is attracting increasing attention due to its ability to segment and track arbitrary objects.
We propose a novel Open-Vocabulary VIS baseline called OVFormer.
OVFormer utilizes a lightweight module for unified embedding alignment between query embeddings and CLIP image embeddings.
Unlike previous image-based training methods, we conduct video-based model training and deploy a semi-online inference scheme to fully mine the temporal consistency in the video.
- Score: 28.360157186395686
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Open-Vocabulary Video Instance Segmentation (VIS) is attracting increasing attention due to its ability to segment and track arbitrary objects. However, the recent Open-Vocabulary VIS attempts obtained unsatisfactory results, especially in terms of generalization ability of novel categories. We discover that the domain gap between the VLM features (e.g., CLIP) and the instance queries and the underutilization of temporal consistency are two central causes. To mitigate these issues, we design and train a novel Open-Vocabulary VIS baseline called OVFormer. OVFormer utilizes a lightweight module for unified embedding alignment between query embeddings and CLIP image embeddings to remedy the domain gap. Unlike previous image-based training methods, we conduct video-based model training and deploy a semi-online inference scheme to fully mine the temporal consistency in the video. Without bells and whistles, OVFormer achieves 21.9 mAP with a ResNet-50 backbone on LV-VIS, exceeding the previous state-of-the-art performance by 7.7. Extensive experiments on some Close-Vocabulary VIS datasets also demonstrate the strong zero-shot generalization ability of OVFormer (+ 7.6 mAP on YouTube-VIS 2019, + 3.9 mAP on OVIS). Code is available at https://github.com/fanghaook/OVFormer.
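The abstract gives no implementation details, so the following is only a minimal sketch of the stated idea: project the VIS decoder's instance query embeddings into CLIP's embedding space, fuse them with CLIP image embeddings, and classify them against CLIP text embeddings of arbitrary category names. The module, parameter names, dimensions, and the MLP design below are hypothetical and not taken from the OVFormer code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingAlignmentSketch(nn.Module):
    """Hypothetical lightweight alignment head: maps instance query embeddings
    into the CLIP embedding space so they can be scored against CLIP text
    embeddings of arbitrary category prompts."""

    def __init__(self, query_dim=256, clip_dim=512):
        super().__init__()
        # A small MLP stands in for the paper's "lightweight module".
        self.align = nn.Sequential(
            nn.Linear(query_dim, clip_dim),
            nn.ReLU(inplace=True),
            nn.Linear(clip_dim, clip_dim),
        )
        # Learnable temperature, initialized near CLIP's log(100) scale.
        self.logit_scale = nn.Parameter(torch.tensor(4.6))

    def forward(self, query_embed, clip_image_embed, clip_text_embed):
        # query_embed:      (B, Q, query_dim)  instance queries from the VIS decoder
        # clip_image_embed: (B, Q, clip_dim)   CLIP image features pooled per instance
        # clip_text_embed:  (K, clip_dim)      CLIP text embeddings of K category prompts
        aligned = self.align(query_embed)
        # Fuse aligned queries with CLIP image embeddings (a simple sum here,
        # purely illustrative of bridging the query/CLIP domain gap).
        fused = F.normalize(aligned + clip_image_embed, dim=-1)
        text = F.normalize(clip_text_embed, dim=-1)
        # Open-vocabulary class logits: cosine similarity against text embeddings.
        logits = self.logit_scale.exp() * fused @ text.t()  # (B, Q, K)
        return logits, fused
```

Under the semi-online inference scheme the abstract mentions, such a head would presumably be run clip by clip, with instances linked across adjacent clips by the similarity of their embeddings; this, too, is an assumption rather than the paper's documented procedure.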
Related papers
- UVIS: Unsupervised Video Instance Segmentation [65.46196594721545]
Video instance segmentation requires classifying, segmenting, and tracking every object across video frames.
We propose UVIS, a novel Unsupervised Video Instance Segmentation framework that can perform video instance segmentation without any video annotations or dense label-based pretraining.
Our framework consists of three essential steps: frame-level pseudo-label generation, transformer-based VIS model training, and query-based tracking.
arXiv Detail & Related papers (2024-06-11T03:05:50Z) - DVIS++: Improved Decoupled Framework for Universal Video Segmentation [30.703276476607545]
By integrating CLIP with DVIS++, we present OV-DVIS++, the first open-vocabulary universal video segmentation framework.
arXiv Detail & Related papers (2023-12-20T03:01:33Z) - Towards Open-Vocabulary Video Instance Segmentation [61.469232166803465]
Video Instance Segmentation aims at segmenting and categorizing objects in videos from a closed set of training categories.
We introduce the novel task of Open-Vocabulary Video Instance Segmentation, which aims to simultaneously segment, track, and classify objects in videos from open-set categories.
To benchmark Open-Vocabulary VIS, we collect a Large-Vocabulary Video Instance Segmentation dataset (LV-VIS) that contains well-annotated objects from 1,196 diverse categories.
arXiv Detail & Related papers (2023-04-04T11:25:23Z) - Video Instance Segmentation in an Open-World [112.02667959850436]
Video instance segmentation (VIS) approaches generally follow a closed-world assumption.
We propose the first open-world VIS approach, named OW-VISFormer, that introduces a novel feature enrichment mechanism.
Our OW-VISFormer performs favorably against a solid baseline in the OW-VIS setting.
arXiv Detail & Related papers (2023-04-03T17:59:52Z) - A Generalized Framework for Video Instance Segmentation [49.41441806931224]
The handling of long videos with complex and occluded sequences has emerged as a new challenge in the video instance segmentation (VIS) community.
We propose a Generalized framework for VIS, namely GenVIS, that achieves state-of-the-art performance on challenging benchmarks.
We evaluate our approach on popular VIS benchmarks, achieving state-of-the-art results on YouTube-VIS 2019/2021/2022 and Occluded VIS (OVIS).
arXiv Detail & Related papers (2022-11-16T11:17:19Z) - Video Instance Segmentation via Multi-scale Spatio-temporal Split Attention Transformer [77.95612004326055]
Video instance segmentation (VIS) approaches typically utilize either single-scale spatio-temporal features or per-frame multi-scale features during the attention computation.
We propose a transformer-based VIS framework, named MS-STS VIS, that comprises a novel multi-scale spatio-temporal split (MS-STS) attention module in the encoder.
The MS-STS module effectively captures spatio-temporal feature relationships at multiple scales across frames in a video.
arXiv Detail & Related papers (2022-03-24T17:59:20Z) - Deformable VisTR: Spatio temporal deformable attention for video instance segmentation [79.76273774737555]
Video instance segmentation (VIS) task requires segmenting, classifying, and tracking object instances over all frames in a clip.
Recently, VisTR has been proposed as an end-to-end transformer-based VIS framework, demonstrating state-of-the-art performance.
We propose Deformable VisTR, leveraging a spatio-temporal deformable attention module that only attends to a small fixed set of key spatio-temporal sampling points (see the sketch after this list).
arXiv Detail & Related papers (2022-03-12T02:27:14Z)
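The Deformable VisTR summary above describes attention over a small fixed set of sampling points rather than the full spatio-temporal feature map. A minimal single-frame, single-head sketch of that sampling idea follows; it is not the paper's actual implementation, and all names, shapes, and the offset scale are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableAttentionSketch(nn.Module):
    """Minimal single-head sketch: each query attends to a small fixed number
    of predicted sampling points instead of the whole feature map."""

    def __init__(self, dim=256, num_points=4):
        super().__init__()
        self.num_points = num_points
        # Predict per-query sampling offsets (x, y) and attention weights.
        self.offset_proj = nn.Linear(dim, num_points * 2)
        self.weight_proj = nn.Linear(dim, num_points)
        self.value_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, queries, ref_points, feat):
        # queries: (B, Q, C), ref_points: (B, Q, 2) in [0, 1], feat: (B, C, H, W)
        B, Q, C = queries.shape
        # Project values channel-wise, keeping the spatial layout.
        value = self.value_proj(feat.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        offsets = self.offset_proj(queries).reshape(B, Q, self.num_points, 2)
        weights = self.weight_proj(queries).softmax(dim=-1)            # (B, Q, P)
        # Sampling locations in normalized [-1, 1] coordinates for grid_sample.
        locs = (ref_points.unsqueeze(2) + 0.05 * offsets).clamp(0, 1) * 2 - 1
        sampled = F.grid_sample(value, locs, align_corners=False)      # (B, C, Q, P)
        # Weighted sum over the few sampled points only.
        out = (sampled * weights.unsqueeze(1)).sum(dim=-1).transpose(1, 2)  # (B, Q, C)
        return self.out_proj(out)
```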