Self-Supervised Video Object Segmentation by Motion-Aware Mask
Propagation
- URL: http://arxiv.org/abs/2107.12569v1
- Date: Tue, 27 Jul 2021 03:07:56 GMT
- Title: Self-Supervised Video Object Segmentation by Motion-Aware Mask
Propagation
- Authors: Bo Miao, Mohammed Bennamoun, Yongsheng Gao, Ajmal Mian
- Abstract summary: We propose a self-supervised motion-aware matching method, coined MAMP, for semi-supervised video object segmentation.
We show that MAMP achieves state-of-the-art performance with stronger generalization ability compared to existing self-supervised methods.
- Score: 52.8407961172098
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We propose a self-supervised spatio-temporal matching method coined
Motion-Aware Mask Propagation (MAMP) for semi-supervised video object
segmentation. During training, MAMP leverages the frame reconstruction task to
train the model without the need for annotations. During inference, MAMP
extracts high-resolution features from each frame to build a memory bank from
the features as well as the predicted masks of selected past frames. MAMP then
propagates the masks from the memory bank to subsequent frames according to our
motion-aware spatio-temporal matching module, also proposed in this paper.
Evaluation on the DAVIS-2017 and YouTube-VOS datasets shows that MAMP achieves
state-of-the-art performance with stronger generalization ability compared to
existing self-supervised methods, i.e. 4.9\% higher mean
$\mathcal{J}\&\mathcal{F}$ on DAVIS-2017 and 4.85\% higher mean
$\mathcal{J}\&\mathcal{F}$ on the unseen categories of YouTube-VOS than the
nearest competitor. Moreover, MAMP performs on par with many supervised video
object segmentation methods. Our code is available at:
\url{https://github.com/bo-miao/MAMP}.
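For intuition, here is a minimal PyTorch-style sketch of the generic memory-bank label propagation that MAMP's inference builds on: features of the current frame are matched against the stored memory features, and the stored masks are blended according to the resulting affinities. The tensor shapes, the top-k/temperature parameters, and the function name are illustrative assumptions rather than the authors' implementation, and the motion-aware restriction of matching to a local spatio-temporal window is omitted.

```python
import torch
import torch.nn.functional as F

def propagate_masks(query_feat, memory_feats, memory_masks, topk=36, temperature=0.07):
    """Blend memory masks into the current frame using feature affinities.

    query_feat:   (C, H, W)     features of the current frame
    memory_feats: (T, C, H, W)  features of selected past frames in the memory bank
    memory_masks: (T, K, H, W)  soft masks (K channels, e.g. objects + background)

    Generic affinity-based propagation; MAMP's motion-aware spatio-temporal
    matching additionally restricts matches to a local window, omitted here.
    """
    C, H, W = query_feat.shape
    T, K = memory_masks.shape[0], memory_masks.shape[1]

    q = F.normalize(query_feat.reshape(C, H * W), dim=0)            # (C, HW)
    m = F.normalize(memory_feats.reshape(T, C, H * W), dim=1)       # (T, C, HW)
    m = m.permute(1, 0, 2).reshape(C, T * H * W)                    # (C, T*HW)
    v = memory_masks.permute(1, 0, 2, 3).reshape(K, T * H * W)      # (K, T*HW)

    affinity = q.t() @ m                                            # (HW, T*HW) cosine similarities
    vals, idx = affinity.topk(topk, dim=1)                          # keep top-k memory matches per pixel
    weights = F.softmax(vals / temperature, dim=1)                  # (HW, topk)
    gathered = v[:, idx]                                            # (K, HW, topk)
    out = (gathered * weights.unsqueeze(0)).sum(dim=-1)             # (K, HW)
    return out.reshape(K, H, W)
```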
Related papers
- Mask Propagation for Efficient Video Semantic Segmentation [63.09523058489429]
Video semantic segmentation (VSS) involves assigning a semantic label to each pixel in a video sequence.
We propose an efficient mask propagation framework for VSS, called MPVSS.
Our framework reduces FLOPs by up to 4x compared to the per-frame Mask2Former baseline, with only up to a 2% drop in mIoU on the Cityscapes validation set.
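The FLOPs reduction comes from running the expensive segmenter only on keyframes and propagating its masks to the frames in between. The loop below is a schematic of that scheduling pattern with placeholder segment() and propagate() callables; it is not MPVSS's actual propagation module.

```python
def segment_video(frames, segment, propagate, keyframe_interval=5):
    """Run an expensive segmenter only on keyframes and propagate masks to
    the remaining frames. `segment` and `propagate` are placeholders for a
    per-frame model (e.g. Mask2Former) and a lightweight propagation module.
    """
    masks, last_key_frame, last_key_mask = [], None, None
    for t, frame in enumerate(frames):
        if t % keyframe_interval == 0:
            mask = segment(frame)                              # heavy: full segmentation
            last_key_frame, last_key_mask = frame, mask
        else:
            # cheap: warp/refine the keyframe mask for the current frame
            mask = propagate(last_key_frame, last_key_mask, frame)
        masks.append(mask)
    return masks
```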
arXiv Detail & Related papers (2023-10-29T09:55:28Z)
- Motion-Guided Masking for Spatiotemporal Representation Learning [16.9547105658246]
We propose a motion-guided masking algorithm (MGM) which leverages motion vectors to guide the position of each mask over time.
On two challenging large-scale video benchmarks, we equip video MAE with our MGM and achieve up to +1.3% improvement compared to previous state-of-the-art methods.
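As a rough illustration of motion-guided mask placement, the sketch below ranks patches by motion magnitude and masks the most dynamic ones in each frame. The input layout and the independent per-frame top-k selection are assumptions for illustration; the actual MGM keeps a contiguous masked region that follows the motion over time.

```python
import torch

def motion_guided_mask(motion_vectors, mask_ratio=0.75):
    """Choose which patches to mask based on motion magnitude.

    motion_vectors: (T, H, W, 2) per-patch motion (e.g. codec motion vectors).
    Returns a boolean mask of shape (T, H, W) where True = masked patch.
    """
    T, H, W, _ = motion_vectors.shape
    magnitude = motion_vectors.norm(dim=-1).reshape(T, H * W)   # per-patch motion strength
    n_mask = int(mask_ratio * H * W)
    idx = magnitude.topk(n_mask, dim=1).indices                 # most dynamic patches per frame
    mask = torch.zeros(T, H * W, dtype=torch.bool)
    mask.scatter_(1, idx, True)
    return mask.reshape(T, H, W)
```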
arXiv Detail & Related papers (2023-08-24T17:58:04Z)
- MGMAE: Motion Guided Masking for Video Masked Autoencoding [34.80832206608387]
Temporal redundancy has led to a high masking ratio and a customized masking strategy in VideoMAE.
Our motion-guided masking explicitly incorporates motion information to build a temporally consistent masking volume.
We perform experiments on the Something-Something V2 and Kinetics-400 datasets, demonstrating the superior performance of our MGMAE over the original VideoMAE.
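One way to picture a temporally consistent masking volume is to warp an initial mask through time with estimated flow so that the same content stays masked in every frame. The sketch below does that with grid_sample; the flow convention and normalization are assumptions, and MGMAE's actual construction may differ.

```python
import torch
import torch.nn.functional as F

def build_masking_volume(init_mask, flows):
    """Warp an initial per-patch mask through time so masked content is consistent.

    init_mask: (H, W) float mask for the first frame (1 = masked).
    flows:     (T-1, 2, H, W) backward flow from frame t+1 to frame t,
               expressed in normalized [-1, 1] grid units (an assumption).
    Returns a (T, H, W) masking volume.
    """
    T = flows.shape[0] + 1
    H, W = init_mask.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing='ij')
    base_grid = torch.stack((xs, ys), dim=-1)             # (H, W, 2) in (x, y) order
    volume = [init_mask]
    for t in range(T - 1):
        grid = base_grid + flows[t].permute(1, 2, 0)      # where each pixel samples from frame t
        prev = volume[-1][None, None]                      # (1, 1, H, W)
        warped = F.grid_sample(prev, grid[None], align_corners=True)
        volume.append(warped[0, 0])
    return torch.stack(volume)                             # (T, H, W)
```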
arXiv Detail & Related papers (2023-08-21T15:39:41Z)
- Siamese Masked Autoencoders [76.35448665609998]
We present Siamese Masked Autoencoders (SiamMAE) for learning visual correspondence from videos.
SiamMAE operates on pairs of randomly sampled video frames and asymmetrically masks them.
It outperforms state-of-the-art self-supervised methods on video object segmentation, pose keypoint propagation, and semantic part propagation tasks.
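The core of the asymmetric design is that the past frame stays fully visible while almost all patches of the future frame are dropped, forcing reconstruction to rely on correspondence with the past frame. Below is a minimal sketch of that masking step only; the 95% ratio and the function signature are assumptions.

```python
import torch

def asymmetric_masking(patches_f1, patches_f2, future_mask_ratio=0.95):
    """Keep the past frame fully visible; drop most patches of the future frame.

    patches_f1, patches_f2: (N, D) patch embeddings of the two sampled frames.
    Returns the (unmasked) past-frame tokens, the few visible future-frame
    tokens, and the indices of the masked future-frame patches.
    """
    N = patches_f2.shape[0]
    n_keep = max(1, int(N * (1.0 - future_mask_ratio)))
    perm = torch.randperm(N)
    keep_idx, mask_idx = perm[:n_keep], perm[n_keep:]
    return patches_f1, patches_f2[keep_idx], mask_idx
```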
arXiv Detail & Related papers (2023-05-23T17:59:46Z)
- DropMAE: Masked Autoencoders with Spatial-Attention Dropout for Tracking Tasks [76.24996889649744]
We study masked autoencoder (MAE) pretraining on videos for matching-based downstream tasks, including visual object tracking (VOT) and video object segmentation (VOS).
We propose DropMAE, which adaptively performs spatial-attention dropout in the frame reconstruction to facilitate temporal correspondence learning in videos.
Our model sets new state-of-the-art performance on 8 out of 9 highly competitive video tracking and segmentation datasets.
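A simplified, non-adaptive way to picture spatial-attention dropout is to randomly suppress within-frame attention links before the softmax, so that reconstruction leans more on tokens from the other frame. The sketch below is only that simplified reading; DropMAE chooses which links to drop adaptively, and the tensor layout here is an assumption.

```python
import torch

def spatial_attention_dropout(attn_logits, same_frame, drop_prob=0.1):
    """Randomly suppress within-frame attention to favor temporal (cross-frame) cues.

    attn_logits: (Q, K) pre-softmax attention scores.
    same_frame:  (Q, K) boolean, True where query and key come from the same frame.
    """
    drop = torch.rand_like(attn_logits) < drop_prob
    drop &= same_frame                                    # only drop within-frame links
    logits = attn_logits.masked_fill(drop, float('-inf'))
    return torch.softmax(logits, dim=-1)
```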
arXiv Detail & Related papers (2023-04-02T16:40:42Z)
- Mask-Free Video Instance Segmentation [102.50936366583106]
Video masks are tedious and expensive to annotate, limiting the scale and diversity of existing VIS datasets.
We propose MaskFreeVIS, achieving highly competitive VIS performance, while only using bounding box annotations for the object state.
Our TK-Loss finds one-to-many matches across frames, through an efficient patch-matching step followed by a K-nearest neighbor selection.
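A loose sketch of the one-to-many temporal KNN idea: each patch in one frame is matched to its K most similar patches in the next frame, and the predicted mask probabilities of matched patches are pulled toward agreement. The feature shapes, the L1 penalty, and the function name are illustrative assumptions rather than the paper's exact TK-Loss.

```python
import torch
import torch.nn.functional as F

def tk_consistency_loss(feat_t, feat_t1, mask_t, mask_t1, k=5):
    """Encourage mask agreement between KNN-matched patches of consecutive frames.

    feat_t, feat_t1: (N, C) patch features of frames t and t+1.
    mask_t, mask_t1: (N,)   predicted foreground probabilities per patch.
    """
    sim = F.normalize(feat_t, dim=1) @ F.normalize(feat_t1, dim=1).t()   # (N, N) similarities
    knn_idx = sim.topk(k, dim=1).indices                                 # one-to-many matches
    matched = mask_t1[knn_idx]                                           # (N, k)
    return (mask_t.unsqueeze(1) - matched).abs().mean()
```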
arXiv Detail & Related papers (2023-03-28T11:48:07Z)
- Efficient Video Object Segmentation with Compressed Video [36.192735485675286]
We propose an efficient framework for semi-supervised video object segmentation by exploiting the temporal redundancy of the video.
Our method performs inference on selected keyframes and makes predictions for other frames via propagation based on motion vectors and residuals from the compressed video bitstream.
Using STM with top-k filtering as our base model, we achieve highly competitive results on DAVIS16 and YouTube-VOS, with substantial speed-ups of up to 4.9x and little loss in accuracy.
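The propagation step can be pictured as motion compensation: each block copies its mask value from the location its bitstream motion vector points to in the reference frame. The sketch below shows that crude warp only (the residual-based correction is omitted), with block-unit integer motion vectors assumed for simplicity.

```python
import torch

def warp_mask_with_motion_vectors(prev_mask, motion_vectors):
    """Motion-compensate a block-level mask using codec motion vectors.

    prev_mask:      (H, W) mask of the reference (key)frame, one value per block.
    motion_vectors: (H, W, 2) integer block displacements (dx, dy) in block units.
    """
    H, W = prev_mask.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing='ij')
    src_x = (xs + motion_vectors[..., 0]).clamp(0, W - 1)   # where each block reads from
    src_y = (ys + motion_vectors[..., 1]).clamp(0, H - 1)
    return prev_mask[src_y, src_x]
```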
arXiv Detail & Related papers (2021-07-26T12:57:04Z)
- Generating Masks from Boxes by Mining Spatio-Temporal Consistencies in Videos [159.02703673838639]
We introduce a method for generating segmentation masks from per-frame bounding box annotations in videos.
We use our resulting accurate masks for weakly supervised training of video object segmentation (VOS) networks.
The additional data provides substantially better generalization performance, leading to state-of-the-art results in both the VOS and the more challenging tracking domains.
arXiv Detail & Related papers (2021-01-06T18:56:24Z)