BoxVIS: Video Instance Segmentation with Box Annotations
- URL: http://arxiv.org/abs/2303.14618v2
- Date: Wed, 12 Jul 2023 10:44:51 GMT
- Title: BoxVIS: Video Instance Segmentation with Box Annotations
- Authors: Minghan Li and Lei Zhang
- Abstract summary: We adapt the state-of-the-art pixel-supervised VIS models to a box-supervised VIS baseline and observe slight performance degradation.
We propose a box-center guided spatial-temporal pairwise affinity loss to predict instance masks for better spatial and temporal consistency.
It exhibits comparable instance mask prediction performance and better generalization ability than state-of-the-art pixel-supervised VIS models while using only 16% of their annotation time and cost.
- Score: 15.082477136581153
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: It is expensive and labour-intensive to label the pixel-wise object masks in
a video. As a result, the amount of pixel-wise annotations in existing video
instance segmentation (VIS) datasets is small, limiting the generalization
capability of trained VIS models. An alternative but much cheaper solution is
to use bounding boxes to label instances in videos. Inspired by the recent
success of box-supervised image instance segmentation, we adapt the
state-of-the-art pixel-supervised VIS models to a box-supervised VIS (BoxVIS)
baseline, and observe slight performance degradation. We consequently propose
to improve the BoxVIS performance from two aspects. First, we propose a
box-center guided spatial-temporal pairwise affinity (STPA) loss to predict
instance masks for better spatial and temporal consistency. Second, we collect
a larger scale box-annotated VIS dataset (BVISD) by consolidating the videos
from current VIS benchmarks and converting images from the COCO dataset to
short pseudo video clips. With the proposed BVISD and the STPA loss, our
trained BoxVIS model achieves 43.2% and 29.0% mask AP on the YouTube-VIS 2021
and OVIS validation sets, respectively. It exhibits comparable instance mask
prediction performance and better generalization ability than state-of-the-art
pixel-supervised VIS models while using only 16% of their annotation time and
cost. Code and data can be found at https://github.com/MinghanLi/BoxVIS.
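The box-center guided STPA loss builds on the pairwise-affinity idea from box-supervised image instance segmentation (e.g., BoxInst): neighboring pixels with similar colors are encouraged to take the same mask label, and BoxVIS applies this constraint both within and across frames. As a rough illustration, the sketch below shows the spatial half of such a pairwise affinity term in PyTorch; the function name, tensor shapes, color-similarity measure, and threshold are illustrative assumptions, not the paper's actual implementation.

```python
import torch

def pairwise_affinity_loss(mask_logits, image, sim_thresh=0.3, dilation=2):
    """Simplified spatial pairwise-affinity term for box-supervised masks.

    mask_logits: (N, H, W) predicted mask logits for N instances in one frame.
    image:       (3, H, W) normalized RGB frame used to measure color similarity.
    Neighboring pixels with similar colors are pushed to share the same mask
    label; a temporal version would pair pixels across adjacent frames instead.
    """
    prob = mask_logits.sigmoid()                      # (N, H, W)

    # Build (pixel, right-neighbor) and (pixel, down-neighbor) pairs.
    def shifted_pairs(x, d):
        return [(x[..., :, :-d], x[..., :, d:]),      # horizontal pairs
                (x[..., :-d, :], x[..., d:, :])]      # vertical pairs

    loss = mask_logits.new_zeros(())
    count = mask_logits.new_zeros(())
    for (p_a, p_b), (im_a, im_b) in zip(shifted_pairs(prob, dilation),
                                        shifted_pairs(image, dilation)):
        # Color similarity of each pixel pair; 1.0 means identical colors.
        sim = torch.exp(-(im_a - im_b).norm(dim=0))   # (H, W-d) or (H-d, W)
        edge = (sim >= sim_thresh).float()

        # Probability that the two pixels of a pair receive the same label.
        same = p_a * p_b + (1.0 - p_a) * (1.0 - p_b)
        log_same = torch.log(same.clamp(min=1e-6))

        loss = loss - (edge * log_same).sum()
        count = count + edge.sum() * p_a.shape[0]

    return loss / count.clamp(min=1.0)
```

The actual STPA loss additionally guides the affinity graph with predicted box centers and links pixel pairs across neighboring frames for temporal consistency, which this spatial-only sketch omits.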
Related papers
- PM-VIS+: High-Performance Video Instance Segmentation without Video Annotation [15.9587266448337]
Video instance segmentation requires detecting, segmenting, and tracking objects in videos.
This paper introduces a method that eliminates video annotations by utilizing image datasets.
arXiv Detail & Related papers (2024-06-28T05:22:39Z) - UVIS: Unsupervised Video Instance Segmentation [65.46196594721545]
Video instance segmentation requires classifying, segmenting, and tracking every object across video frames.
We propose UVIS, a novel unsupervised Video Instance Segmentation framework that can perform video instance segmentation without any video annotations or dense label-based pretraining.
Our framework consists of three essential steps: frame-level pseudo-label generation, transformer-based VIS model training, and query-based tracking.
arXiv Detail & Related papers (2024-06-11T03:05:50Z) - PM-VIS: High-Performance Box-Supervised Video Instance Segmentation [30.453433078039133]
Box-supervised Video Instance Segmentation (VIS) methods have emerged as a viable solution to mitigate the labor-intensive annotation process.
We introduce a novel approach that aims at harnessing instance box annotations to generate high-quality instance pseudo masks.
Our PM-VIS model, trained with high-quality pseudo mask annotations, demonstrates strong ability in instance mask prediction.
arXiv Detail & Related papers (2024-04-22T04:25:02Z) - Mask-Free Video Instance Segmentation [102.50936366583106]
Video masks are tedious and expensive to annotate, limiting the scale and diversity of existing VIS datasets.
We propose MaskFreeVIS, achieving highly competitive VIS performance, while only using bounding box annotations for the object state.
Our TK-Loss finds one-to-many matches across frames, through an efficient patch-matching step followed by a K-nearest neighbor selection.
arXiv Detail & Related papers (2023-03-28T11:48:07Z) - MinVIS: A Minimal Video Instance Segmentation Framework without Video-based Training [84.81566912372328]
MinVIS is a minimal video instance segmentation framework.
It achieves state-of-the-art VIS performance with neither video-based architectures nor training procedures.
arXiv Detail & Related papers (2022-08-03T17:50:42Z) - Video Mask Transfiner for High-Quality Video Instance Segmentation [102.50936366583106]
Video Mask Transfiner (VMT) is capable of leveraging fine-grained high-resolution features thanks to a highly efficient video transformer structure.
Based on our VMT architecture, we design an automated annotation refinement approach by iterative training and self-correction.
We compare VMT with the most recent state-of-the-art methods on the HQ-YTVIS, as well as the YouTube-VIS, OVIS and BDD100K MOTS benchmarks.
arXiv Detail & Related papers (2022-07-28T11:13:37Z) - DeVIS: Making Deformable Transformers Work for Video Instance Segmentation [4.3012765978447565]
Video Instance Segmentation (VIS) jointly tackles multi-object detection, tracking, and segmentation in video sequences.
Transformers have recently allowed the entire VIS task to be cast as a single set-prediction problem.
Deformable attention provides a more efficient alternative, but its application to the temporal domain or the segmentation task has not yet been explored.
arXiv Detail & Related papers (2022-07-22T14:27:45Z) - Crossover Learning for Fast Online Video Instance Segmentation [53.5613957875507]
We present a novel crossover learning scheme that uses the instance feature in the current frame to pixel-wisely localize the same instance in other frames.
To our knowledge, CrossVIS achieves state-of-the-art performance among all online VIS methods and shows a decent trade-off between latency and accuracy.
arXiv Detail & Related papers (2021-04-13T06:47:40Z) - Video Panoptic Segmentation [117.08520543864054]
We propose and explore a new video extension of this task, called video panoptic segmentation.
To invigorate research on this new task, we present two types of video panoptic datasets.
We propose a novel video panoptic segmentation network (VPSNet) which jointly predicts object classes, bounding boxes, masks, instance id tracking, and semantic segmentation in video frames.
arXiv Detail & Related papers (2020-06-19T19:35:47Z)