Mask-Free Video Instance Segmentation
- URL: http://arxiv.org/abs/2303.15904v1
- Date: Tue, 28 Mar 2023 11:48:07 GMT
- Title: Mask-Free Video Instance Segmentation
- Authors: Lei Ke, Martin Danelljan, Henghui Ding, Yu-Wing Tai, Chi-Keung Tang,
Fisher Yu
- Abstract summary: Video masks are tedious and expensive to annotate, limiting the scale and diversity of existing VIS datasets.
We propose MaskFreeVIS, achieving highly competitive VIS performance, while only using bounding box annotations for the object state.
Our TK-Loss finds one-to-many matches across frames, through an efficient patch-matching step followed by a K-nearest neighbor selection.
- Score: 102.50936366583106
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The recent advancement in Video Instance Segmentation (VIS) has largely been
driven by the use of deeper and increasingly data-hungry transformer-based
models. However, video masks are tedious and expensive to annotate, limiting
the scale and diversity of existing VIS datasets. In this work, we aim to
remove the mask-annotation requirement. We propose MaskFreeVIS, achieving
highly competitive VIS performance, while only using bounding box annotations
for the object state. We leverage the rich temporal mask consistency
constraints in videos by introducing the Temporal KNN-patch Loss (TK-Loss),
providing strong mask supervision without any labels. Our TK-Loss finds
one-to-many matches across frames, through an efficient patch-matching step
followed by a K-nearest neighbor selection. A consistency loss is then enforced
on the found matches. Our mask-free objective is simple to implement, has no
trainable parameters, is computationally efficient, yet outperforms baselines
employing, e.g., state-of-the-art optical flow to enforce temporal mask
consistency. We validate MaskFreeVIS on the YouTube-VIS 2019/2021, OVIS and
BDD100K MOTS benchmarks. The results clearly demonstrate the efficacy of our
method by drastically narrowing the gap between fully and weakly-supervised VIS
performance. Our code and trained models are available at
https://github.com/SysCV/MaskFreeVis.
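As a rough, unofficial illustration of the TK-Loss described above, the sketch below builds one-to-many patch matches between two frames, keeps the K nearest neighbours per source patch, and penalizes mask disagreement across the retained matches. The descriptor choice, the exponential match weighting, and all names are assumptions made here for illustration; the authors' actual implementation is in the linked repository.

```python
import torch

def tk_loss_sketch(mask_t, mask_t1, feat_t, feat_t1, k=5):
    """Illustrative Temporal KNN-patch consistency loss (not the official code).

    mask_t, mask_t1: (N,) predicted mask probabilities for frames t and t+1,
                     flattened over spatial locations.
    feat_t, feat_t1: (N, C) per-location patch descriptors (e.g. raw RGB patches).
    """
    # 1) Patch matching: pairwise distances between all patch descriptors.
    d = torch.cdist(feat_t, feat_t1)             # (N, N)

    # 2) K-nearest-neighbour selection: one-to-many matches per source patch.
    dist, idx = d.topk(k, dim=1, largest=False)  # (N, k) smallest distances
    w = torch.exp(-dist)                         # assumed down-weighting of weak matches

    # 3) Consistency loss: matched locations should agree on the mask value.
    matched = mask_t1[idx]                       # (N, k)
    return (w * (mask_t.unsqueeze(1) - matched).abs()).mean()
```

Consistent with the abstract, such an objective has no trainable parameters, so it can be added as an extra term to any box-supervised VIS training loss.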
Related papers
- DropMAE: Masked Autoencoders with Spatial-Attention Dropout for Tracking
Tasks [76.24996889649744]
We study masked autoencoder (MAE) pretraining on videos for matching-based downstream tasks, including visual object tracking (VOT) and video object segmentation (VOS).
We propose DropMAE, which adaptively performs spatial-attention dropout in the frame reconstruction to facilitate temporal correspondence learning in videos.
Our model sets new state-of-the-art performance on 8 out of 9 highly competitive video tracking and segmentation datasets.
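A loose sketch of how the spatial-attention dropout above might operate; this is our assumed reading, not the authors' code. A random subset of within-frame (spatial) attention links is suppressed during frame reconstruction, pushing the model toward cross-frame, temporal matches.

```python
import torch

def drop_spatial_attention(attn_logits, same_frame, p=0.3):
    """Suppress a random fraction of within-frame attention links (illustrative).

    attn_logits: (B, heads, Q, K) raw attention scores over tokens of two frames.
    same_frame:  (Q, K) boolean map, True where query and key tokens belong to
                 the same frame (spatial rather than temporal attention).
    """
    drop = (torch.rand_like(attn_logits) < p) & same_frame
    # Dropped links are excluded before the softmax over keys.
    return attn_logits.masked_fill(drop, float('-inf'))
```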
arXiv Detail & Related papers (2023-04-02T16:40:42Z)
- BoxVIS: Video Instance Segmentation with Box Annotations [15.082477136581153]
We adapt the state-of-the-art pixel-supervised VIS models to a box-supervised VIS baseline and observe slight performance degradation.
We propose a box-center guided spatial-temporal pairwise affinity loss to predict instance masks for better spatial and temporal consistency.
It achieves comparable instance mask prediction performance and better generalization than state-of-the-art pixel-supervised VIS models, while using only 16% of their annotation time and cost.
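The pairwise affinity idea can be pictured with the colour-similarity sketch below: neighbouring pixels with similar colour are pushed toward the same mask label. The neighbourhood, threshold, and loss form are illustrative assumptions on our part; BoxVIS additionally guides the affinity with box centres and extends it across time.

```python
import torch

def pairwise_affinity_sketch(mask_prob, image, thresh=0.1):
    """Illustrative colour-driven pairwise affinity loss for box-supervised masks.

    mask_prob: (B, 1, H, W) predicted mask probabilities.
    image:     (B, 3, H, W) frame with values in [0, 1].
    """
    loss = 0.0
    for dy, dx in [(0, 1), (1, 0)]:  # right and down neighbours
        # (torch.roll wraps at image borders; ignored for brevity in this sketch)
        q = torch.roll(mask_prob, shifts=(-dy, -dx), dims=(2, 3))
        d = torch.roll(image, shifts=(-dy, -dx), dims=(2, 3))
        similar = ((image - d).abs().mean(1, keepdim=True) < thresh).float()
        # Probability that the two neighbours receive the same label.
        agree = mask_prob * q + (1 - mask_prob) * (1 - q)
        loss = loss - (similar * torch.log(agree.clamp(min=1e-6))).mean()
    return loss
```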
arXiv Detail & Related papers (2023-03-26T04:04:58Z)
- One-Shot Video Inpainting [5.7120338754738835]
We propose a unified pipeline for one-shot video inpainting (OSVI).
By jointly learning mask prediction and video completion in an end-to-end manner, the method can obtain results that are optimal for the entire task.
Our method is more reliable because the predicted masks can be used as the network's internal guidance.
arXiv Detail & Related papers (2023-02-28T07:30:36Z)
- MixMask: Revisiting Masking Strategy for Siamese ConvNets [24.20212182301359]
We propose a filling-based masking strategy called MixMask to prevent information incompleteness caused by the randomly erased regions in an image.
Our proposed framework achieves superior accuracy on linear probing, semi-supervised, and supervised finetuning, outperforming the state-of-the-art MSCN by a significant margin.
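A minimal sketch of a filling-based masking strategy, under assumed block sizes and mixing ratio: rather than erasing regions to zeros, each erased cell of one image is filled with the corresponding cell of another image, so the input stays complete.

```python
import torch
import torch.nn.functional as F

def mixmask_sketch(img_a, img_b, grid=7, p=0.5):
    """Fill masked-out cells of img_a with the matching cells of img_b (illustrative).

    img_a, img_b: (B, C, H, W) images from the same batch.
    """
    B, _, H, W = img_a.shape
    cells = (torch.rand(B, 1, grid, grid, device=img_a.device) < p).float()
    # Upsample the cell pattern to full resolution; 'nearest' keeps hard edges.
    m = F.interpolate(cells, size=(H, W), mode='nearest')
    return m * img_b + (1 - m) * img_a
```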
arXiv Detail & Related papers (2022-10-20T17:54:03Z)
- Video Mask Transfiner for High-Quality Video Instance Segmentation [102.50936366583106]
Video Mask Transfiner (VMT) is capable of leveraging fine-grained high-resolution features thanks to a highly efficient video transformer structure.
Based on our VMT architecture, we design an automated annotation refinement approach by iterative training and self-correction.
We compare VMT with the most recent state-of-the-art methods on HQ-YTVIS, as well as the YouTube-VIS, OVIS and BDD100K MOTS benchmarks.
arXiv Detail & Related papers (2022-07-28T11:13:37Z)
- Object Propagation via Inter-Frame Attentions for Temporally Stable Video Instance Segmentation [51.68840525174265]
Video instance segmentation aims to detect, segment, and track objects in a video.
Current approaches extend image-level segmentation algorithms to the temporal domain.
We propose a video instance segmentation method that alleviates the problems caused by missing detections.
arXiv Detail & Related papers (2021-11-15T04:15:57Z)
- MSN: Efficient Online Mask Selection Network for Video Instance Segmentation [7.208483056781188]
We present a novel solution for Video Instance Segmentation (VIS) that automatically generates instance-level segmentation masks together with object classes and tracks them in a video.
Our method improves the masks from the segmentation and propagation branches in an online manner using the Mask Selection Network (MSN).
Our method achieves a score of 49.1 mAP on the 2021 YouTube-VIS Challenge and was ranked third among more than 30 global teams.
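Purely as an assumed sketch of the selection step (the actual MSN architecture is not specified here): a small network can score the two candidate masks per instance and blend them pixel-wise.

```python
import torch
import torch.nn as nn

class MaskSelectSketch(nn.Module):
    """Illustrative selector that blends two candidate masks per instance."""

    def __init__(self):
        super().__init__()
        # Inputs: segmentation mask, propagated mask, and the current frame.
        self.net = nn.Sequential(
            nn.Conv2d(2 + 3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, seg_mask, prop_mask, frame):
        w = self.net(torch.cat([seg_mask, prop_mask, frame], dim=1))
        return w * seg_mask + (1 - w) * prop_mask  # per-pixel blend
```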
arXiv Detail & Related papers (2021-06-19T08:33:29Z)
- Generating Masks from Boxes by Mining Spatio-Temporal Consistencies in Videos [159.02703673838639]
We introduce a method for generating segmentation masks from per-frame bounding box annotations in videos.
We use our resulting accurate masks for weakly supervised training of video object segmentation (VOS) networks.
The additional data provides substantially better generalization performance, leading to state-of-the-art results in both the VOS and the more challenging tracking domains.
arXiv Detail & Related papers (2021-01-06T18:56:24Z)
- Spatiotemporal Graph Neural Network based Mask Reconstruction for Video Object Segmentation [70.97625552643493]
This paper addresses the task of segmenting class-agnostic objects in the semi-supervised setting.
We propose a novel graph neural network (TG-Net) which captures the local contexts by utilizing all proposals.
arXiv Detail & Related papers (2020-12-10T07:57:44Z)