OpenVIS: Open-vocabulary Video Instance Segmentation
- URL: http://arxiv.org/abs/2305.16835v2
- Date: Sun, 10 Mar 2024 08:23:58 GMT
- Title: OpenVIS: Open-vocabulary Video Instance Segmentation
- Authors: Pinxue Guo, Tony Huang, Peiyang He, Xuefeng Liu, Tianjun Xiao, Zhaoyu
Chen, Wenqiang Zhang
- Abstract summary: Open-vocabulary Video Instance Segmentation (OpenVIS) can simultaneously detect, segment, and track arbitrary object categories in a video.
We propose an OpenVIS framework called InstFormer that achieves powerful open-vocabulary capability.
- Score: 26.107369797422145
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Open-vocabulary Video Instance Segmentation (OpenVIS) can simultaneously
detect, segment, and track arbitrary object categories in a video, without
being constrained to categories seen during training. In this work, we propose
an OpenVIS framework called InstFormer that achieves powerful open vocabulary
capability through lightweight fine-tuning on a limited-category labeled
dataset. Specifically, InstFormer consists of three steps: a) Open-world Mask
Proposal: we utilize a query-based transformer, which is encouraged to propose
all potential object instances, to obtain class-agnostic instance masks; b)
Open-vocabulary Instance Representation and Classification: we propose
InstCLIP, adapted from pre-trained CLIP with Instance Guidance Attention.
InstCLIP generates an instance token representing each
open-vocabulary instance. These instance tokens not only enable open-vocabulary
classification for multiple instances with a single CLIP forward pass but have
also been proven effective for subsequent open-vocabulary instance tracking. c)
Rollout Association: we introduce a class-agnostic rollout tracker to predict
rollout tokens from the tracking tokens of previous frames to enable
open-vocabulary instance association across frames in the video. The
experimental results demonstrate that the proposed InstFormer achieves
state-of-the-art performance on a comprehensive OpenVIS evaluation benchmark,
while also achieving competitive performance on the fully supervised VIS task.
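Step b) of the abstract claims that instance tokens let a single CLIP forward pass classify many instances at once. The Instance Guidance Attention that produces those tokens is not reproduced here; below is a minimal sketch, assuming the instance tokens are already extracted, of how N instance tokens can be scored against K open-vocabulary category embeddings in one batched operation. All names and shapes are illustrative.

```python
# Sketch of the open-vocabulary classification step: score every proposed
# instance against arbitrary category text embeddings in one matrix product.
# `instance_tokens` stands in for the output of InstCLIP's Instance Guidance
# Attention, which is not reproduced here.
import torch
import torch.nn.functional as F

def classify_instances(instance_tokens: torch.Tensor,
                       text_embeddings: torch.Tensor,
                       temperature: float = 0.01) -> torch.Tensor:
    """instance_tokens: (N, D), one token per class-agnostic mask proposal.
    text_embeddings: (K, D), CLIP text embeddings of K category prompts.
    Returns (N, K) per-instance class probabilities."""
    img = F.normalize(instance_tokens, dim=-1)
    txt = F.normalize(text_embeddings, dim=-1)
    logits = img @ txt.t() / temperature  # scaled cosine similarities
    return logits.softmax(dim=-1)

# Toy usage: 5 instances, 3 open-vocabulary categories, embedding dim 512.
probs = classify_instances(torch.randn(5, 512), torch.randn(3, 512))
print(probs.shape)  # torch.Size([5, 3])
```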
Related papers
- What is Point Supervision Worth in Video Instance Segmentation? [119.71921319637748]
Video instance segmentation (VIS) is a challenging vision task that aims to detect, segment, and track objects in videos.
We reduce human annotation to a single point for each object in a video frame during training, and obtain high-quality mask predictions close to fully supervised models (a generic sketch of this supervision scheme follows the entry).
Comprehensive experiments on three VIS benchmarks demonstrate competitive performance of the proposed framework, nearly matching fully supervised methods.
arXiv Detail & Related papers (2024-04-01T17:38:25Z)
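The entry above replaces full mask labels with one annotated point per object. As a hedged illustration of the general idea, not the paper's exact objective, the sketch below supervises predicted instance masks only at the annotated points, assuming instances do not overlap at those points.

```python
# Generic sketch of point supervision for instance masks: each object carries
# a single annotated (y, x) point instead of a full mask, and the loss is
# applied only at those locations. Illustration only, not the paper's loss.
import torch
import torch.nn.functional as F

def point_supervision_loss(pred_masks: torch.Tensor,
                           points: torch.Tensor) -> torch.Tensor:
    """pred_masks: (N, H, W) per-instance mask logits.
    points: (N, 2) long tensor, one (y, x) point per instance.
    Mask i is pushed toward 1 at its own point and toward 0 at the
    points annotated for the other instances."""
    n = pred_masks.shape[0]
    ys, xs = points[:, 0], points[:, 1]
    logits = pred_masks[:, ys, xs]                   # (N, N): mask i at point j
    target = torch.eye(n, device=pred_masks.device)  # own point -> 1, others -> 0
    return F.binary_cross_entropy_with_logits(logits, target)

# Toy usage: 4 instances on a 64x64 frame.
loss = point_supervision_loss(torch.randn(4, 64, 64),
                              torch.randint(0, 64, (4, 2)))
print(loss.item())
```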
- Video Instance Segmentation in an Open-World [112.02667959850436]
Video instance segmentation (VIS) approaches generally follow a closed-world assumption.
We propose the first open-world VIS approach, named OW-VISFormer, that introduces a novel feature enrichment mechanism.
Our OW-VISFormer performs favorably against a solid baseline in the OW-VIS setting.
arXiv Detail & Related papers (2023-04-03T17:59:52Z)
- Global Knowledge Calibration for Fast Open-Vocabulary Segmentation [124.74256749281625]
We introduce a text diversification strategy that generates a set of synonyms for each training category (a prompt-ensembling sketch follows this entry).
We also employ a text-guided knowledge distillation method to preserve the generalizable knowledge of CLIP.
Our proposed model achieves robust generalization performance across various datasets.
arXiv Detail & Related papers (2023-03-16T09:51:41Z)
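A minimal sketch of the text diversification idea from the entry above: each training category is expanded into synonyms, every synonym is wrapped in a prompt and encoded, and the normalized embeddings are averaged into one class embedding. The synonym lists and `fake_encoder` are illustrative stand-ins, not the paper's resources.

```python
# Sketch of synonym-based text diversification for open-vocabulary
# classification: ensemble the embeddings of several synonyms per category.
import torch
import torch.nn.functional as F

SYNONYMS = {
    "sofa": ["sofa", "couch", "settee"],
    "car": ["car", "automobile", "sedan"],
}

def class_embedding(category: str, encode_text) -> torch.Tensor:
    prompts = [f"a photo of a {s}" for s in SYNONYMS.get(category, [category])]
    embs = torch.stack([encode_text(p) for p in prompts])  # (S, D)
    embs = F.normalize(embs, dim=-1)
    return F.normalize(embs.mean(dim=0), dim=-1)           # ensembled class embedding

def fake_encoder(prompt: str, dim: int = 512) -> torch.Tensor:
    """Deterministic stand-in for a frozen CLIP text encoder."""
    g = torch.Generator().manual_seed(hash(prompt) % (2**31))
    return torch.randn(dim, generator=g)

emb = class_embedding("sofa", fake_encoder)
print(emb.shape)  # torch.Size([512])
```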
- A Simple Framework for Open-Vocabulary Segmentation and Detection [85.21641508535679]
We present OpenSeeD, a simple open-vocabulary segmentation and detection framework that jointly learns from different segmentation and detection datasets.
We first introduce a pre-trained text encoder to encode all the visual concepts in the two tasks and learn a common semantic space for them.
After pre-training, our model exhibits competitive or stronger zero-shot transferability for both segmentation and detection.
arXiv Detail & Related papers (2023-03-14T17:58:34Z)
- Towards Universal Vision-language Omni-supervised Segmentation [72.31277932442988]
We present Vision-Language Omni-Supervised Segmentation (VLOSS), which treats open-world segmentation tasks as proposal classification.
We leverage omni-supervised data (i.e., panoptic segmentation, object detection, and image-text pairs) in training, thus enriching the open-world segmentation ability (a loss-dispatch sketch follows this entry).
With fewer parameters, our VLOSS with Swin-Tiny surpasses MaskCLIP by 2% mask AP on the LVIS v1 dataset.
arXiv Detail & Related papers (2023-03-12T02:57:53Z)
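A hedged sketch of the omni-supervised setup described in the VLOSS entry above: one model trains on heterogeneous data sources, and each sample contributes only the loss its annotation type supports. The FakeModel, batch format, and loss choices are assumptions for illustration, not the VLOSS implementation.

```python
# Omni-supervision sketch: dispatch a per-sample loss by annotation source.
import torch
import torch.nn.functional as F

def mask_loss(pred, sample):          # panoptic segmentation supervision
    return F.binary_cross_entropy_with_logits(pred["masks"], sample["masks"])

def detection_loss(pred, sample):     # box supervision
    return F.l1_loss(pred["boxes"], sample["boxes"])

def contrastive_loss(pred, sample):   # image-text pair supervision
    img = F.normalize(pred["img_emb"], dim=-1)
    txt = F.normalize(sample["txt_emb"], dim=-1)
    logits = img @ txt.t() / 0.07
    return F.cross_entropy(logits, torch.arange(len(img)))

LOSS_BY_SOURCE = {"panoptic": mask_loss,
                  "detection": detection_loss,
                  "image_text": contrastive_loss}

class FakeModel:
    def __call__(self, sample):       # placeholder predictions
        return {"masks": torch.zeros(2, 8, 8), "boxes": torch.zeros(2, 4),
                "img_emb": torch.randn(2, 16)}

def omni_supervised_step(model, batch):
    losses = [LOSS_BY_SOURCE[s["source"]](model(s), s) for s in batch]
    return torch.stack(losses).mean()

batch = [{"source": "panoptic", "masks": torch.rand(2, 8, 8).round()},
         {"source": "detection", "boxes": torch.rand(2, 4)},
         {"source": "image_text", "txt_emb": torch.randn(2, 16)}]
print(omni_supervised_step(FakeModel(), batch))
```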
- Solve the Puzzle of Instance Segmentation in Videos: A Weakly Supervised Framework with Spatio-Temporal Collaboration [13.284951215948052]
We present a novel weakly supervised framework with Spatio-Temporal Collaboration for instance Segmentation in videos.
Our method achieves strong performance and even outperforms fully supervised TrackR-CNN and MaskTrack R-CNN.
arXiv Detail & Related papers (2022-12-15T02:44:13Z)
- Exemplar-Based Open-Set Panoptic Segmentation Network [79.99748041746592]
We extend panoptic segmentation to the open-world and introduce an open-set panoptic segmentation (OPS) task.
We investigate the practical challenges of the task and construct a benchmark on top of an existing dataset, COCO.
We propose a novel exemplar-based open-set panoptic segmentation network (EOPSN) inspired by exemplar theory (a clustering sketch of the exemplar idea follows this entry).
arXiv Detail & Related papers (2021-05-18T07:59:21Z)
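The EOPSN entry above builds pseudo-classes for unknown objects from exemplars. The following is a minimal clustering sketch of that idea, assuming embeddings of unknown-region proposals are available; EOPSN's actual exemplar mining is more involved, and all names here are illustrative.

```python
# Minimal exemplar sketch: cluster unknown-proposal embeddings with k-means;
# each center acts as an exemplar defining a new pseudo-class.
import torch

def mine_exemplars(unknown_embs: torch.Tensor, k: int, iters: int = 10):
    """unknown_embs: (M, D) embeddings of unknown-region proposals.
    Returns (k, D) exemplar centers and (M,) pseudo-class assignments."""
    centers = unknown_embs[torch.randperm(len(unknown_embs))[:k]].clone()
    for _ in range(iters):
        assign = torch.cdist(unknown_embs, centers).argmin(dim=1)  # (M,)
        for j in range(k):
            members = unknown_embs[assign == j]
            if len(members):                  # keep old center if cluster empty
                centers[j] = members.mean(dim=0)
    return centers, assign

# Toy usage: 200 unknown proposals with 128-dim embeddings, 5 pseudo-classes.
centers, pseudo_labels = mine_exemplars(torch.randn(200, 128), k=5)
print(centers.shape, pseudo_labels.shape)
```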
- Salient Instance Segmentation with Region and Box-level Annotations [3.1458035003538884]
A new generation of saliency detection methods provides a strong theoretical and technical basis for video surveillance.
Due to the limited scale of existing datasets and the high cost of mask annotation, additional sources of supervision are needed to train a well-performing salient instance model.
We propose a novel salient instance segmentation framework that uses inexact supervision, avoiding laborious labeling.
arXiv Detail & Related papers (2020-08-19T03:43:45Z)