PM-VIS: High-Performance Box-Supervised Video Instance Segmentation
- URL: http://arxiv.org/abs/2404.13863v1
- Date: Mon, 22 Apr 2024 04:25:02 GMT
- Title: PM-VIS: High-Performance Box-Supervised Video Instance Segmentation
- Authors: Zhangjing Yang, Dun Liu, Wensheng Cheng, Jinqiao Wang, Yi Wu
- Abstract summary: Box-supervised Video Instance Segmentation (VIS) methods have emerged as a viable solution to mitigate the labor-intensive annotation process.
We introduce a novel approach that aims at harnessing instance box annotations to generate high-quality instance pseudo masks.
Our PM-VIS model, trained with high-quality pseudo mask annotations, demonstrates strong ability in instance mask prediction.
- Score: 30.453433078039133
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Labeling pixel-wise object masks in videos is a resource-intensive and laborious process. Box-supervised Video Instance Segmentation (VIS) methods have emerged as a viable solution to mitigate the labor-intensive annotation process. In practical applications, the two-step approach is not only more flexible but also exhibits a higher recognition accuracy. Inspired by the recent success of Segment Anything Model (SAM), we introduce a novel approach that aims at harnessing instance box annotations from multiple perspectives to generate high-quality instance pseudo masks, thus enriching the information contained in instance annotations. We leverage ground-truth boxes to create three types of pseudo masks using the HQ-SAM model, the box-supervised VIS model (IDOL-BoxInst), and the VOS model (DeAOT) separately, along with three corresponding optimization mechanisms. Additionally, we introduce two ground-truth data filtering methods, assisted by high-quality pseudo masks, to further enhance the training dataset quality and improve the performance of fully supervised VIS methods. To fully capitalize on the obtained high-quality pseudo masks, we introduce a novel algorithm, PM-VIS, to integrate mask losses into IDOL-BoxInst. Our PM-VIS model, trained with high-quality pseudo mask annotations, demonstrates strong ability in instance mask prediction, achieving state-of-the-art performance on the YouTube-VIS 2019, YouTube-VIS 2021, and OVIS validation sets, notably narrowing the gap between box-supervised and fully supervised VIS methods.
Related papers
- Triple Point Masking [49.39218611030084]
Existing 3D mask learning methods encounter performance bottlenecks under limited data.
We introduce a triple point masking scheme, named TPM, which serves as a scalable framework for pre-training of masked autoencoders.
Extensive experiments show that the four baselines equipped with the proposed TPM achieve comprehensive performance improvements on various downstream tasks.
arXiv Detail & Related papers (2024-09-26T05:33:30Z) - Semantic Refocused Tuning for Open-Vocabulary Panoptic Segmentation [42.020470627552136]
Open-vocabulary panoptic segmentation is an emerging task aiming to accurately segment the image into semantically meaningful masks.
Mask classification is the main performance bottleneck for open-vocab panoptic segmentation.
We propose Semantic Refocused Tuning, a novel framework that greatly enhances open-vocab panoptic segmentation.
arXiv Detail & Related papers (2024-09-24T17:50:28Z) - PosSAM: Panoptic Open-vocabulary Segment Anything [58.72494640363136]
PosSAM is an open-vocabulary panoptic segmentation model that unifies the strengths of the Segment Anything Model (SAM) with the vision-native CLIP model in an end-to-end framework.
We introduce a Mask-Aware Selective Ensembling (MASE) algorithm that adaptively enhances the quality of generated masks and boosts the performance of open-vocabulary classification during inference for each image.
arXiv Detail & Related papers (2024-03-14T17:55:03Z) - Generalizable Entity Grounding via Assistance of Large Language Model [77.07759442298666]
We propose a novel approach to densely ground visual entities from a long caption.
We leverage a large multimodal model to extract semantic nouns, a class-agnostic segmentation model to generate entity-level segmentation, and a multi-modal feature fusion module to associate each semantic noun with its corresponding segmentation mask.
arXiv Detail & Related papers (2024-02-04T16:06:05Z) - RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation [53.4319652364256]
This paper presents the RefSAM model, which explores the potential of SAM for referring video object segmentation.
Our proposed approach adapts the original SAM model to enhance cross-modality learning by employing a lightweight Cross-Modal MLP.
We employ a parameter-efficient tuning strategy to align and fuse the language and vision features effectively.
arXiv Detail & Related papers (2023-07-03T13:21:58Z) - UVOSAM: A Mask-free Paradigm for Unsupervised Video Object Segmentation via Segment Anything Model [5.632631449489529]
Segment Anything Model (SAM) introduces a new prompt-driven paradigm for image segmentation, offering new possibilities.
We propose UVOSAM, a mask-free paradigm for UVOS that utilizes the STD-Net tracker.
STD-Net incorporates a spatial-temporal decoupled deformable attention mechanism to establish an effective correlation between intra- and inter-frame features.
arXiv Detail & Related papers (2023-05-22T03:03:29Z) - BoxVIS: Video Instance Segmentation with Box Annotations [15.082477136581153]
We adapt the state-of-the-art pixel-supervised VIS models to a box-supervised VIS baseline and observe slight performance degradation.
We propose a box-center guided spatial-temporal pairwise affinity loss to predict instance masks for better spatial and temporal consistency.
It exhibits comparable instance mask prediction performance and better generalization ability than state-of-the-art pixel-supervised VIS models by using only 16% of their annotation time and cost.
arXiv Detail & Related papers (2023-03-26T04:04:58Z) - Solve the Puzzle of Instance Segmentation in Videos: A Weakly Supervised Framework with Spatio-Temporal Collaboration [13.284951215948052]
We present a novel weakly supervised framework with Spatio-Temporal Collaboration for instance Segmentation in videos.
Our method achieves strong performance and even outperforms fully supervised TrackR-CNN and MaskTrack R-CNN.
arXiv Detail & Related papers (2022-12-15T02:44:13Z) - Video Mask Transfiner for High-Quality Video Instance Segmentation [102.50936366583106]
Video Mask Transfiner (VMT) is capable of leveraging fine-grained high-resolution features thanks to a highly efficient video transformer structure.
Based on our VMT architecture, we design an automated annotation refinement approach by iterative training and self-correction.
We compare VMT with the most recent state-of-the-art methods on HQ-YTVIS, as well as on the YouTube-VIS, OVIS and BDD100K MOTS benchmarks.
arXiv Detail & Related papers (2022-07-28T11:13:37Z) - MIST: Multiple Instance Self-Training Framework for Video Anomaly Detection [76.80153360498797]
We develop a multiple instance self-training framework (MIST) to efficiently refine task-specific discriminative representations.
MIST is composed of 1) a multiple instance pseudo label generator, which adapts a sparse continuous sampling strategy to produce more reliable clip-level pseudo labels, and 2) a self-guided attention boosted feature encoder.
Our method performs comparably to or even better than existing supervised and weakly supervised methods, specifically obtaining a frame-level AUC of 94.83% on ShanghaiTech.
arXiv Detail & Related papers (2021-04-04T15:47:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.