Video Mask Transfiner for High-Quality Video Instance Segmentation
- URL: http://arxiv.org/abs/2207.14012v1
- Date: Thu, 28 Jul 2022 11:13:37 GMT
- Title: Video Mask Transfiner for High-Quality Video Instance Segmentation
- Authors: Lei Ke, Henghui Ding, Martin Danelljan, Yu-Wing Tai, Chi-Keung Tang,
Fisher Yu
- Abstract summary: Video Mask Transfiner (VMT) is capable of leveraging fine-grained high-resolution features thanks to a highly efficient video transformer structure.
Based on our VMT architecture, we design an automated annotation refinement approach by iterative training and self-correction.
We compare VMT with the most recent state-of-the-art methods on the HQ-YTVIS, as well as the Youtube-VIS, OVIS and BDD100K MOTS benchmarks.
- Score: 102.50936366583106
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: While Video Instance Segmentation (VIS) has seen rapid progress, current
approaches struggle to predict high-quality masks with accurate boundary
details. Moreover, the predicted segmentations often fluctuate over time,
suggesting that temporal consistency cues are neglected or not fully utilized.
In this paper, we set out to tackle these issues, with the aim of achieving
highly detailed and more temporally stable mask predictions for VIS. We first
propose the Video Mask Transfiner (VMT) method, capable of leveraging
fine-grained high-resolution features thanks to a highly efficient video
transformer structure. Our VMT detects and groups sparse error-prone
spatio-temporal regions of each tracklet in the video segment, which are then
refined using both local and instance-level cues. Second, we identify that the
coarse boundary annotations of the popular YouTube-VIS dataset constitute a
major limiting factor. Based on our VMT architecture, we therefore design an
automated annotation refinement approach by iterative training and
self-correction. To benchmark high-quality mask predictions for VIS, we
introduce the HQ-YTVIS dataset, consisting of a manually re-annotated test set
and our automatically refined training data. We compare VMT with the most
recent state-of-the-art methods on the HQ-YTVIS, as well as the Youtube-VIS,
OVIS and BDD100K MOTS benchmarks. Experimental results clearly demonstrate the
effectiveness of our method in segmenting complex and dynamic objects by
capturing precise details.
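The abstract gives the high-level recipe: locate sparse, error-prone spatio-temporal points in a tracklet's coarse masks and re-predict only those points with fine-grained features. Below is a minimal sketch of that idea in PyTorch; the uncertainty heuristic, the tensor shapes, and the `ToyRefineHead` module are illustrative assumptions for this page, not the actual VMT implementation.

```python
import torch
import torch.nn as nn

def uncertain_points(mask_logits: torch.Tensor, k: int = 512) -> torch.Tensor:
    """Pick the k most uncertain points per frame (probability closest to 0.5)."""
    T = mask_logits.shape[0]
    probs = mask_logits.sigmoid().view(T, -1)            # (T, H*W)
    uncertainty = -(probs - 0.5).abs()                   # larger = closer to 0.5
    return uncertainty.topk(k, dim=1).indices            # (T, k) flat pixel indices

class ToyRefineHead(nn.Module):
    """Hypothetical stand-in for a refinement module using local + instance cues."""
    def __init__(self, feat_dim: int = 32):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(feat_dim + 1, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, point_feats, point_logits):
        # point_feats: (T, k, C) fine-grained features at the selected points
        # point_logits: (T, k) coarse mask logits at the same points
        x = torch.cat([point_feats, point_logits.unsqueeze(-1)], dim=-1)
        return self.mlp(x).squeeze(-1)                    # refined logits, (T, k)

# Toy usage: a 5-frame tracklet with 64x64 coarse mask logits and fine features.
T, H, W, C = 5, 64, 64, 32
coarse = torch.randn(T, H, W)                             # coarse per-frame mask logits
fine_feats = torch.randn(T, H * W, C)                     # high-resolution features

idx = uncertain_points(coarse, k=256)                     # sparse error-prone points
point_feats = torch.gather(fine_feats, 1, idx.unsqueeze(-1).expand(-1, -1, C))
point_logits = torch.gather(coarse.view(T, -1), 1, idx)

refined = ToyRefineHead(C)(point_feats, point_logits)
coarse.view(T, -1).scatter_(1, idx, refined.detach())     # paste refined logits back
```

The point of the sketch is the sparsity: only a few hundred uncertain points per frame are re-predicted, which is what makes leveraging high-resolution features tractable over whole video segments.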
Related papers
- AU-vMAE: Knowledge-Guide Action Units Detection via Video Masked Autoencoder [38.04963261966939]
We propose a video-level pre-training scheme for facial action unit (FAU) detection.
At the heart of our design is a pre-trained video feature extractor based on the video-masked autoencoder.
Our approach demonstrates a substantial performance improvement over existing state-of-the-art methods on the BP4D and DISFA FAU datasets.
arXiv Detail & Related papers (2024-07-16T08:07:47Z) - PM-VIS: High-Performance Box-Supervised Video Instance Segmentation [30.453433078039133]
Box-supervised Video Instance Segmentation (VIS) methods have emerged as a viable solution to mitigate the labor-intensive annotation process.
We introduce a novel approach that harnesses instance box annotations to generate high-quality instance pseudo masks.
Our PM-VIS model, trained with high-quality pseudo mask annotations, demonstrates strong ability in instance mask prediction.
arXiv Detail & Related papers (2024-04-22T04:25:02Z) - Appearance-Based Refinement for Object-Centric Motion Segmentation [85.2426540999329]
We introduce an appearance-based refinement method that leverages temporal consistency in video streams to correct inaccurate flow-based proposals.
Our approach involves a sequence-level selection mechanism that identifies accurate flow-predicted masks as exemplars.
Its performance is evaluated on multiple video segmentation benchmarks, including DAVIS, YouTube, SegTrackv2, and FBMS-59.
arXiv Detail & Related papers (2023-12-18T18:59:51Z) - Temporal-aware Hierarchical Mask Classification for Video Semantic
Segmentation [62.275143240798236]
Video semantic segmentation datasets have limited categories per video.
Fewer than 10% of queries can be matched to receive meaningful gradient updates during VSS training.
Our method achieves state-of-the-art performance on the latest challenging VSS benchmark VSPW without bells and whistles.
arXiv Detail & Related papers (2023-09-14T20:31:06Z) - MGMAE: Motion Guided Masking for Video Masked Autoencoding [34.80832206608387]
Temporal redundancy has led to a high masking ratio and a customized masking strategy in VideoMAE.
Our motion-guided masking explicitly incorporates motion information to build a temporally consistent masking volume.
We perform experiments on the Something-Something V2 and Kinetics-400 datasets, demonstrating the superior performance of our MGMAE over the original VideoMAE.
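The entry above names the core mechanism (a masking volume kept consistent over time by following motion); the snippet below is a deliberately simplified illustration of that idea, using precomputed integer patch displacements instead of the optical-flow warping used in the actual MGMAE.

```python
import torch

def motion_guided_mask(motion: torch.Tensor, grid: int = 14, mask_ratio: float = 0.9):
    """motion: (T, 2) integer (dy, dx) patch displacements per frame (assumed given)."""
    T = motion.shape[0]
    num_mask = int(grid * grid * mask_ratio)
    base = torch.zeros(grid * grid, dtype=torch.bool)
    base[torch.randperm(grid * grid)[:num_mask]] = True   # random mask for frame 0
    base = base.view(grid, grid)

    masks = []
    for t in range(T):
        dy, dx = motion[:t + 1].sum(dim=0).tolist()        # accumulated motion up to frame t
        masks.append(torch.roll(base, shifts=(int(dy), int(dx)), dims=(0, 1)))
    return torch.stack(masks)                              # (T, grid, grid) masking volume

# Toy usage: 8 frames whose content drifts one patch to the right per frame.
mask_volume = motion_guided_mask(torch.tensor([[0, 1]] * 8))
print(mask_volume.shape, mask_volume.float().mean().item())  # ratio stays ~0.9
```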
arXiv Detail & Related papers (2023-08-21T15:39:41Z) - Solve the Puzzle of Instance Segmentation in Videos: A Weakly Supervised
Framework with Spatio-Temporal Collaboration [13.284951215948052]
We present a novel weakly supervised framework with Spatio-Temporal Collaboration for instance Segmentation in videos.
Our method achieves strong performance and even outperforms fully supervised TrackR-CNN and MaskTrack R-CNN.
arXiv Detail & Related papers (2022-12-15T02:44:13Z) - SVFormer: Semi-supervised Video Transformer for Action Recognition [88.52042032347173]
We introduce SVFormer, which adopts a steady pseudo-labeling framework to cope with unlabeled video samples.
In addition, we propose temporal warping to cover the complex temporal variation in videos.
In particular, SVFormer outperforms the state-of-the-art by 31.5% with fewer training epochs under the 1% labeling rate of Kinetics-400.
arXiv Detail & Related papers (2022-11-23T18:58:42Z) - Robust Online Video Instance Segmentation with Track Queries [15.834703258232002]
We propose a fully online transformer-based video instance segmentation model that performs comparably to top offline methods on the YouTube-VIS 2019 benchmark.
We show that, when combined with a strong enough image segmentation architecture, track queries can exhibit impressive accuracy while not being constrained to short videos.
arXiv Detail & Related papers (2022-11-16T18:50:14Z) - End-to-End Video Instance Segmentation with Transformers [84.17794705045333]
Video instance segmentation (VIS) is the task that requires simultaneously classifying, segmenting and tracking object instances of interest in video.
Here, we propose a new video instance segmentation framework built upon Transformers, termed VisTR, which views the VIS task as a direct end-to-end parallel sequence decoding/prediction problem.
For the first time, we demonstrate a much simpler and faster video instance segmentation framework built upon Transformers, achieving competitive accuracy.
arXiv Detail & Related papers (2020-11-30T02:03:50Z) - Fast Video Object Segmentation With Temporal Aggregation Network and
Dynamic Template Matching [67.02962970820505]
We introduce "tracking-by-detection" into Video Object (VOS)
We propose a new temporal aggregation network and a novel dynamic time-evolving template matching mechanism to achieve significantly improved performance.
We achieve new state-of-the-art performance in both speed and accuracy on the DAVIS benchmark without complicated bells and whistles, with a speed of 0.14 seconds per frame and a J&F measure of 75.9%.
arXiv Detail & Related papers (2020-07-11T05:44:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.