Mask-Free Video Instance Segmentation
- URL: http://arxiv.org/abs/2303.15904v1
- Date: Tue, 28 Mar 2023 11:48:07 GMT
- Title: Mask-Free Video Instance Segmentation
- Authors: Lei Ke, Martin Danelljan, Henghui Ding, Yu-Wing Tai, Chi-Keung Tang,
Fisher Yu
- Abstract summary: Video masks are tedious and expensive to annotate, limiting the scale and diversity of existing VIS datasets.
We propose MaskFreeVIS, achieving highly competitive VIS performance, while only using bounding box annotations for the object state.
Our TK-Loss finds one-to-many matches across frames, through an efficient patch-matching step followed by a K-nearest neighbor selection.
- Score: 102.50936366583106
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The recent advancement in Video Instance Segmentation (VIS) has largely been
driven by the use of deeper and increasingly data-hungry transformer-based
models. However, video masks are tedious and expensive to annotate, limiting
the scale and diversity of existing VIS datasets. In this work, we aim to
remove the mask-annotation requirement. We propose MaskFreeVIS, achieving
highly competitive VIS performance, while only using bounding box annotations
for the object state. We leverage the rich temporal mask consistency
constraints in videos by introducing the Temporal KNN-patch Loss (TK-Loss),
providing strong mask supervision without any labels. Our TK-Loss finds
one-to-many matches across frames, through an efficient patch-matching step
followed by a K-nearest neighbor selection. A consistency loss is then enforced
on the found matches. Our mask-free objective is simple to implement, has no
trainable parameters, is computationally efficient, yet outperforms baselines
employing, e.g., state-of-the-art optical flow to enforce temporal mask
consistency. We validate MaskFreeVIS on the YouTube-VIS 2019/2021, OVIS and
BDD100K MOTS benchmarks. The results clearly demonstrate the efficacy of our
method by drastically narrowing the gap between fully and weakly-supervised VIS
performance. Our code and trained models are available at
https://github.com/SysCV/MaskFreeVis.
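As a rough, unofficial illustration of the TK-Loss described above, the sketch below builds one-to-many patch matches between two frames, keeps the K nearest neighbours per source patch, and penalizes mask disagreement across the retained matches. The descriptor choice, the exponential match weighting, and all names are assumptions made here for illustration; the authors' actual implementation is in the linked repository.

```python
import torch

def tk_loss_sketch(mask_t, mask_t1, feat_t, feat_t1, k=5):
    """Illustrative Temporal KNN-patch consistency loss (not the official code).

    mask_t, mask_t1: (N,) predicted mask probabilities for frames t and t+1,
                     flattened over spatial locations.
    feat_t, feat_t1: (N, C) per-location patch descriptors (e.g. raw RGB patches).
    """
    # 1) Patch matching: pairwise distances between all patch descriptors.
    d = torch.cdist(feat_t, feat_t1)             # (N, N)

    # 2) K-nearest-neighbour selection: one-to-many matches per source patch.
    dist, idx = d.topk(k, dim=1, largest=False)  # (N, k) smallest distances
    w = torch.exp(-dist)                         # assumed down-weighting of weak matches

    # 3) Consistency loss: matched locations should agree on the mask value.
    matched = mask_t1[idx]                       # (N, k)
    return (w * (mask_t.unsqueeze(1) - matched).abs()).mean()
```

Consistent with the abstract, such an objective has no trainable parameters, so it can be added as an extra term to any box-supervised VIS training loss.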
Related papers
- DropMAE: Masked Autoencoders with Spatial-Attention Dropout for Tracking
Tasks [76.24996889649744]
We study masked autoencoder (MAE) pretraining on videos for matching-based downstream tasks, including visual object tracking (VOT) and video object segmentation (VOS).
We propose DropMAE, which adaptively performs spatial-attention dropout in the frame reconstruction to facilitate temporal correspondence learning in videos.
Our model sets new state-of-the-art performance on 8 out of 9 highly competitive video tracking and segmentation datasets.
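A loose sketch of how the spatial-attention dropout above might operate; this is our assumed reading, not the authors' code. A random subset of within-frame (spatial) attention links is suppressed during frame reconstruction, pushing the model toward cross-frame, temporal matches.

```python
import torch

def drop_spatial_attention(attn_logits, same_frame, p=0.3):
    """Suppress a random fraction of within-frame attention links (illustrative).

    attn_logits: (B, heads, Q, K) raw attention scores over tokens of two frames.
    same_frame:  (Q, K) boolean map, True where query and key tokens belong to
                 the same frame (spatial rather than temporal attention).
    """
    drop = (torch.rand_like(attn_logits) < p) & same_frame
    # Dropped links are excluded before the softmax over keys.
    return attn_logits.masked_fill(drop, float('-inf'))
```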
arXiv Detail & Related papers (2023-04-02T16:40:42Z)
- BoxVIS: Video Instance Segmentation with Box Annotations [15.082477136581153]
We adapt the state-of-the-art pixel-supervised VIS models to a box-supervised VIS baseline and observe slight performance degradation.
We propose a box-center guided spatial-temporal pairwise affinity loss to predict instance masks for better spatial and temporal consistency.
It achieves comparable instance mask prediction performance and better generalization than state-of-the-art pixel-supervised VIS models, while using only 16% of their annotation time and cost.
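The pairwise affinity idea can be pictured with the colour-similarity sketch below: neighbouring pixels with similar colour are pushed toward the same mask label. The neighbourhood, threshold, and loss form are illustrative assumptions on our part; BoxVIS additionally guides the affinity with box centres and extends it across time.

```python
import torch

def pairwise_affinity_sketch(mask_prob, image, thresh=0.1):
    """Illustrative colour-driven pairwise affinity loss for box-supervised masks.

    mask_prob: (B, 1, H, W) predicted mask probabilities.
    image:     (B, 3, H, W) frame with values in [0, 1].
    """
    loss = 0.0
    for dy, dx in [(0, 1), (1, 0)]:  # right and down neighbours
        # (torch.roll wraps at image borders; ignored for brevity in this sketch)
        q = torch.roll(mask_prob, shifts=(-dy, -dx), dims=(2, 3))
        d = torch.roll(image, shifts=(-dy, -dx), dims=(2, 3))
        similar = ((image - d).abs().mean(1, keepdim=True) < thresh).float()
        # Probability that the two neighbours receive the same label.
        agree = mask_prob * q + (1 - mask_prob) * (1 - q)
        loss = loss - (similar * torch.log(agree.clamp(min=1e-6))).mean()
    return loss
```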
arXiv Detail & Related papers (2023-03-26T04:04:58Z)
- One-Shot Video Inpainting [5.7120338754738835]
We propose a unified pipeline for one-shot video inpainting (OSVI).
By jointly learning mask prediction and video completion in an end-to-end manner, the method can obtain results that are optimal for the entire task.
Our method is more reliable because the predicted masks can be used as the network's internal guidance.
arXiv Detail & Related papers (2023-02-28T07:30:36Z)
- MixMask: Revisiting Masking Strategy for Siamese ConvNets [24.20212182301359]
We propose a filling-based masking strategy called MixMask to prevent information incompleteness caused by the randomly erased regions in an image.
Our proposed framework achieves superior accuracy on linear probing, semi-supervised, and supervised finetuning, outperforming the state-of-the-art MSCN by a significant margin.
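A minimal sketch of a filling-based masking strategy, under assumed block sizes and mixing ratio: rather than erasing regions to zeros, each erased cell of one image is filled with the corresponding cell of another image, so the input stays complete.

```python
import torch
import torch.nn.functional as F

def mixmask_sketch(img_a, img_b, grid=7, p=0.5):
    """Fill masked-out cells of img_a with the matching cells of img_b (illustrative).

    img_a, img_b: (B, C, H, W) images from the same batch.
    """
    B, _, H, W = img_a.shape
    cells = (torch.rand(B, 1, grid, grid, device=img_a.device) < p).float()
    # Upsample the cell pattern to full resolution; 'nearest' keeps hard edges.
    m = F.interpolate(cells, size=(H, W), mode='nearest')
    return m * img_b + (1 - m) * img_a
```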
arXiv Detail & Related papers (2022-10-20T17:54:03Z)
- Video Mask Transfiner for High-Quality Video Instance Segmentation [102.50936366583106]
Video Mask Transfiner (VMT) is capable of leveraging fine-grained high-resolution features thanks to a highly efficient video transformer structure.
Based on our VMT architecture, we design an automated annotation refinement approach by iterative training and self-correction.
We compare VMT with the most recent state-of-the-art methods on HQ-YTVIS, as well as the YouTube-VIS, OVIS and BDD100K MOTS benchmarks.
arXiv Detail & Related papers (2022-07-28T11:13:37Z)
- Object Propagation via Inter-Frame Attentions for Temporally Stable Video Instance Segmentation [51.68840525174265]
Video instance segmentation aims to detect, segment, and track objects in a video.
Current approaches extend image-level segmentation algorithms to the temporal domain.
We propose a video instance segmentation method that alleviates the problems caused by missing detections.
arXiv Detail & Related papers (2021-11-15T04:15:57Z)
- MSN: Efficient Online Mask Selection Network for Video Instance Segmentation [7.208483056781188]
We present a novel solution for Video Instance Segmentation (VIS) that automatically generates instance-level segmentation masks together with object classes and tracks them in a video.
Our method improves the masks from the segmentation and propagation branches in an online manner using the Mask Selection Network (MSN).
Our method achieves a score of 49.1 mAP on the 2021 YouTube-VIS Challenge and was ranked third among more than 30 global teams.
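Purely as an assumed sketch of the selection step (the actual MSN architecture is not specified here): a small network can score the two candidate masks per instance and blend them pixel-wise.

```python
import torch
import torch.nn as nn

class MaskSelectSketch(nn.Module):
    """Illustrative selector that blends two candidate masks per instance."""

    def __init__(self):
        super().__init__()
        # Inputs: segmentation mask, propagated mask, and the current frame.
        self.net = nn.Sequential(
            nn.Conv2d(2 + 3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, seg_mask, prop_mask, frame):
        w = self.net(torch.cat([seg_mask, prop_mask, frame], dim=1))
        return w * seg_mask + (1 - w) * prop_mask  # per-pixel blend
```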
arXiv Detail & Related papers (2021-06-19T08:33:29Z)
- Generating Masks from Boxes by Mining Spatio-Temporal Consistencies in Videos [159.02703673838639]
We introduce a method for generating segmentation masks from per-frame bounding box annotations in videos.
We use our resulting accurate masks for weakly supervised training of video object segmentation (VOS) networks.
The additional data provides substantially better generalization performance, leading to state-of-the-art results in both the VOS and the more challenging tracking domains.
arXiv Detail & Related papers (2021-01-06T18:56:24Z)
- Spatiotemporal Graph Neural Network based Mask Reconstruction for Video Object Segmentation [70.97625552643493]
This paper addresses the task of segmenting class-agnostic objects in the semi-supervised setting.
We propose a novel graph neural network (TG-Net) which captures the local contexts by utilizing all proposals.
arXiv Detail & Related papers (2020-12-10T07:57:44Z)