VideoMAC: Video Masked Autoencoders Meet ConvNets
- URL: http://arxiv.org/abs/2402.19082v1
- Date: Thu, 29 Feb 2024 12:09:25 GMT
- Title: VideoMAC: Video Masked Autoencoders Meet ConvNets
- Authors: Gensheng Pei, Tao Chen, Xiruo Jiang, Huafeng Liu, Zeren Sun, Yazhou Yao
- Abstract summary: VideoMAC employs symmetric masking on randomly sampled pairs of video frames.
We present a simple yet effective masked video modeling (MVM) approach built on a dual-encoder architecture.
VideoMAC, empowering classical (ResNet) / modern (ConvNeXt) convolutional encoders, outperforms ViT-based approaches on downstream tasks.
- Score: 26.723998063596635
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, the advancement of self-supervised learning techniques, like masked
autoencoders (MAE), has greatly influenced visual representation learning for
images and videos. Nevertheless, it is worth noting that the predominant
approaches in existing masked image / video modeling rely excessively on
resource-intensive vision transformers (ViTs) as the feature encoder. In this
paper, we propose a new approach termed \textbf{VideoMAC}, which combines
video masked autoencoders with resource-friendly ConvNets. Specifically,
VideoMAC employs symmetric masking on randomly sampled pairs of video frames.
To prevent mask pattern dissipation, we adopt ConvNets implemented with
sparse convolutional operators as encoders. Simultaneously, we
present a simple yet effective masked video modeling (MVM) approach: a
dual-encoder architecture comprising an online encoder and an exponential
moving average (EMA) target encoder, aimed at facilitating inter-frame reconstruction
consistency in videos. Additionally, we demonstrate that VideoMAC, empowering
classical (ResNet) / modern (ConvNeXt) convolutional encoders to harness the
benefits of MVM, outperforms ViT-based approaches on downstream tasks,
including video object segmentation (+\textbf{5.2\%} / \textbf{6.4\%}
$\mathcal{J}\&\mathcal{F}$), body part propagation (+\textbf{6.3\%} /
\textbf{3.1\%} mIoU), and human pose tracking (+\textbf{10.2\%} /
\textbf{11.1\%} PCK@0.1).
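The abstract names two mechanisms: a symmetric mask shared by both frames of a sampled pair, and a target encoder updated as an exponential moving average of the online encoder. A minimal NumPy sketch of both follows; the 75% mask ratio, momentum value, and the toy one-tensor "weights" are illustrative assumptions, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def symmetric_mask(num_patches, mask_ratio):
    """One binary mask shared by both frames of a sampled pair.

    True marks a masked patch. Symmetric masking means the identical
    pattern is applied to both frames.
    """
    num_masked = int(round(num_patches * mask_ratio))
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.choice(num_patches, size=num_masked, replace=False)] = True
    return mask

def ema_update(target, online, momentum=0.996):
    """Exponential moving average update of the target encoder's weights."""
    return momentum * target + (1.0 - momentum) * online

# Shared mask for a sampled frame pair (16 patches, 75% masked).
mask = symmetric_mask(16, 0.75)
frame_a = rng.standard_normal(16)
frame_b = rng.standard_normal(16)
visible_a = frame_a[~mask]   # the encoder sees only these patches
visible_b = frame_b[~mask]   # the same positions survive in the second frame

# Toy one-tensor "encoders": after each optimizer step on the online
# weights, the target weights drift toward them.
online_w = np.ones(4)
target_w = ema_update(np.zeros(4), online_w)
```

Because the mask is shared, the visible sets of the two frames occupy identical positions, which is what makes inter-frame reconstruction consistency well defined.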
Related papers
- Text-Guided Video Masked Autoencoder [12.321239366215426]
We introduce a novel text-guided masking algorithm (TGM) that masks the video regions with highest correspondence to paired captions.
We show that across existing masking algorithms, unifying MAE and masked video-text contrastive learning improves downstream performance compared to pure MAE.
arXiv Detail & Related papers (2024-08-01T17:58:19Z) - Mask Propagation for Efficient Video Semantic Segmentation [63.09523058489429]
Video semantic segmentation (VSS) involves assigning a semantic label to each pixel in a video sequence.
We propose an efficient mask propagation framework for VSS, called MPVSS.
Our framework reduces FLOPs by up to 4x compared to the per-frame Mask2Former, with only up to a 2% mIoU drop on the Cityscapes validation set.
arXiv Detail & Related papers (2023-10-29T09:55:28Z) - Regress Before Construct: Regress Autoencoder for Point Cloud
Self-supervised Learning [18.10704604275133]
Masked Autoencoders (MAE) have demonstrated promising performance in self-supervised learning for 2D and 3D computer vision.
We propose Point Regress AutoEncoder (Point-RAE), a new scheme for regressive autoencoders for point cloud self-supervised learning.
Our approach is efficient during pre-training and generalizes well on various downstream tasks.
arXiv Detail & Related papers (2023-09-25T17:23:33Z) - Siamese Masked Autoencoders [76.35448665609998]
We present Siamese Masked Autoencoders (SiamMAE) for learning visual correspondence from videos.
SiamMAE operates on pairs of randomly sampled video frames and asymmetrically masks them.
It outperforms state-of-the-art self-supervised methods on video object segmentation, pose keypoint propagation, and semantic part propagation tasks.
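SiamMAE's masking is the natural contrast to VideoMAC's: rather than hiding the same patches in both frames, it leaves the past frame visible and heavily masks the future one. A sketch under assumptions (the exact masking ratio here is illustrative, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

def asymmetric_masks(num_patches, future_ratio=0.95):
    """Asymmetric masking of a frame pair: the past frame stays fully
    visible while the future frame is masked at a very high ratio, so
    reconstructing it must rely on correspondence with the past frame."""
    past_mask = np.zeros(num_patches, dtype=bool)    # nothing hidden
    future_mask = np.zeros(num_patches, dtype=bool)
    num_masked = int(round(num_patches * future_ratio))
    future_mask[rng.choice(num_patches, size=num_masked, replace=False)] = True
    return past_mask, future_mask

past_mask, future_mask = asymmetric_masks(100)
```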
arXiv Detail & Related papers (2023-05-23T17:59:46Z) - VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking [57.552798046137646]
Video masked autoencoder (VideoMAE) is a scalable and general self-supervised pre-trainer for building video foundation models.
We successfully train a billion-parameter video ViT model that achieves new state-of-the-art performance.
arXiv Detail & Related papers (2023-03-29T14:28:41Z) - Contrastive Masked Autoencoders for Self-Supervised Video Hashing [54.636976693527636]
Self-Supervised Video Hashing (SSVH) models learn to generate short binary representations for videos without ground-truth supervision.
We propose a simple yet effective one-stage SSVH method called ConMH, which incorporates video semantic information and video similarity relationship understanding.
arXiv Detail & Related papers (2022-11-21T06:48:14Z) - A Coding Framework and Benchmark towards Low-Bitrate Video Understanding [63.05385140193666]
We propose a traditional-neural mixed coding framework that takes advantage of both traditional codecs and neural networks (NNs).
The framework is optimized by ensuring that a transportation-efficient semantic representation of the video is preserved.
We build a low-bitrate video understanding benchmark with three downstream tasks on eight datasets, demonstrating the notable superiority of our approach.
arXiv Detail & Related papers (2022-02-06T16:29:15Z) - VIOLET : End-to-End Video-Language Transformers with Masked Visual-token
Modeling [88.30109041658618]
A great challenge in video-language (VidL) modeling lies in the disconnection between fixed video representations extracted from image/video understanding models and downstream VidL data.
We present VIOLET, a fully end-to-end VIdeO-LanguagE Transformer, which adopts a video transformer to explicitly model the temporal dynamics of video inputs.
arXiv Detail & Related papers (2021-11-24T18:31:20Z) - MetaSCI: Scalable and Adaptive Reconstruction for Video Compressive
Sensing [21.243762976995544]
Video snapshot compressive imaging (SCI) is a promising system, where the video frames are coded by different masks and then compressed to a snapshot measurement.
We develop a Meta Modulated Convolutional Network for SCI reconstruction, dubbed MetaSCI.
arXiv Detail & Related papers (2021-03-02T14:53:00Z)
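The snapshot measurement described in the MetaSCI entry, where each frame is modulated by its own coding mask and the modulated frames are summed into a single measurement, can be sketched in a few lines; the shapes below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative shapes: T video frames of H x W pixels.
T, H, W = 8, 4, 4
frames = rng.standard_normal((T, H, W))
masks = rng.integers(0, 2, size=(T, H, W))  # one binary coding mask per frame

# Single snapshot measurement: modulate each frame by its mask,
# then sum over the time axis.
snapshot = (masks * frames).sum(axis=0)
```

Reconstruction methods such as MetaSCI then invert this many-to-one mapping to recover the individual frames from the snapshot and the known masks.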
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences.