Concatenated Masked Autoencoders as Spatial-Temporal Learner
- URL: http://arxiv.org/abs/2311.00961v2
- Date: Thu, 14 Dec 2023 08:15:13 GMT
- Title: Concatenated Masked Autoencoders as Spatial-Temporal Learner
- Authors: Zhouqiang Jiang, Bowen Wang, Tong Xiang, Zhaofeng Niu, Hong Tang,
Guangshun Li, Liangzhi Li
- Abstract summary: We introduce the Concatenated Masked Autoencoders (CatMAE) as a spatial-temporal learner for self-supervised video representation learning.
We propose a new data augmentation strategy, Video-Reverse (ViRe), which uses reversed video frames as the model's reconstruction targets.
- Score: 6.475592804311682
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Learning representations from videos requires understanding continuous motion
and visual correspondences between frames. In this paper, we introduce the
Concatenated Masked Autoencoders (CatMAE) as a spatial-temporal learner for
self-supervised video representation learning. For the input sequence of video
frames, CatMAE keeps the initial frame unchanged while applying substantial
masking (95%) to subsequent frames. The encoder in CatMAE is responsible for
encoding visible patches for each frame individually; subsequently, for each
masked frame, the decoder leverages visible patches from both previous and
current frames to reconstruct the original image. Our proposed method enables
the model to estimate the motion information between visible patches, match the
correspondences between preceding and succeeding frames, and ultimately learn
the evolution of scenes. Furthermore, we propose a new data augmentation
strategy, Video-Reverse (ViRe), which uses reversed video frames as the model's
reconstruction targets. This further encourages the model to utilize continuous
motion details and correspondences to complete the reconstruction, thereby
enhancing the model's capabilities. Compared to the most advanced pre-training
methods, CatMAE achieves a leading level in video segmentation tasks and action
recognition tasks.
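Below is a minimal, hypothetical sketch (PyTorch-style Python) of the pre-training step the abstract describes: the first frame stays fully visible, later frames are masked at roughly 95%, each frame's visible patches are encoded independently, and each masked frame is decoded from the concatenated visible tokens of the previous and current frames; a `vire` flag illustrates the Video-Reverse reconstruction target. All module names, sizes, the token-concatenation order, and the ViRe interpretation are assumptions for illustration, not the authors' released code.

```python
# Hypothetical CatMAE-style pre-training step (sketch only; positional embeddings
# and other details of the real method are omitted for brevity).
import torch
import torch.nn as nn


def patchify(frames, patch=16):
    # frames: (B, T, 3, H, W) -> (B, T, N, patch*patch*3)
    B, T, C, H, W = frames.shape
    x = frames.reshape(B, T, C, H // patch, patch, W // patch, patch)
    return x.permute(0, 1, 3, 5, 4, 6, 2).reshape(B, T, -1, patch * patch * C)


class CatMAESketch(nn.Module):
    def __init__(self, patch_dim=16 * 16 * 3, dim=256, mask_ratio=0.95):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.embed = nn.Linear(patch_dim, dim)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=4)
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=2)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.head = nn.Linear(dim, patch_dim)

    def forward(self, frames, vire=False):
        patches = patchify(frames)                      # (B, T, N, P)
        targets = patches.flip(1) if vire else patches  # ViRe: reversed-order targets
        B, T, N, _ = patches.shape
        tokens = self.embed(patches)                    # (B, T, N, D)

        keep = max(1, int(N * (1 - self.mask_ratio)))   # ~5% of patches stay visible
        loss = 0.0
        prev_visible = self.encoder(tokens[:, 0])       # frame 0 is left unmasked
        for t in range(1, T):
            ids = torch.rand(B, N, device=frames.device).argsort(dim=1)
            vis_ids, mask_ids = ids[:, :keep], ids[:, keep:]
            vis = torch.gather(tokens[:, t], 1,
                               vis_ids.unsqueeze(-1).expand(-1, -1, tokens.size(-1)))
            cur_visible = self.encoder(vis)             # each frame encoded separately

            # Decoder sees visible tokens of the previous and current frames,
            # concatenated, plus mask tokens standing in for the hidden patches.
            masks = self.mask_token.expand(B, N - keep, -1)
            dec_in = torch.cat([prev_visible, cur_visible, masks], dim=1)
            pred = self.head(self.decoder(dec_in)[:, -(N - keep):])

            tgt = torch.gather(targets[:, t], 1,
                               mask_ids.unsqueeze(-1).expand(-1, -1, targets.size(-1)))
            loss = loss + ((pred - tgt) ** 2).mean()    # MSE on masked patches only
            prev_visible = cur_visible
        return loss / (T - 1)


# Example: a batch of 2 clips, each with 3 RGB frames of 224x224.
model = CatMAESketch()
clips = torch.randn(2, 3, 3, 224, 224)
print(model(clips, vire=True))
```

In this sketch the ViRe target is simply the patch sequence in reversed frame order; the paper's exact formulation may differ.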
Related papers
- Efficient Image Pre-Training with Siamese Cropped Masked Autoencoders [89.12558126877532]
We propose CropMAE, an alternative approach to the Siamese pre-training introduced by SiamMAE.
Our method exclusively considers pairs of cropped images sourced from the same image but cropped differently, deviating from the conventional pairs of frames extracted from a video.
CropMAE achieves the highest masking ratio to date (98.5%), enabling the reconstruction of images using only two visible patches.
arXiv Detail & Related papers (2024-03-26T16:04:19Z) - Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the difficulty of modeling video dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is capable of both comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z) - Prompt Switch: Efficient CLIP Adaptation for Text-Video Retrieval [24.691270610091554]
In this paper, we aim to learn semantically-enhanced representations purely from the video, so that the video representations can be computed offline and reused for different texts.
We obtain state-of-the-art performances on three benchmark datasets, i.e., MSR-VTT, MSVD, and LSMDC.
arXiv Detail & Related papers (2023-08-15T08:54:25Z) - Siamese Masked Autoencoders [76.35448665609998]
We present Siamese Masked Autoencoders (SiamMAE) for learning visual correspondence from videos.
SiamMAE operates on pairs of randomly sampled video frames and asymmetrically masks them.
It outperforms state-of-the-art self-supervised methods on video object segmentation, pose keypoint propagation, and semantic part propagation tasks.
arXiv Detail & Related papers (2023-05-23T17:59:46Z) - DropMAE: Masked Autoencoders with Spatial-Attention Dropout for Tracking
Tasks [76.24996889649744]
We study masked autoencoder (MAE) pretraining on videos for matching-based downstream tasks, including visual object tracking (VOT) and video object segmentation (VOS).
We propose DropMAE, which adaptively performs spatial-attention dropout in the frame reconstruction to facilitate temporal correspondence learning in videos.
Our model sets new state-of-the-art performance on 8 out of 9 highly competitive video tracking and segmentation datasets.
arXiv Detail & Related papers (2023-04-02T16:40:42Z) - Masked Motion Encoding for Self-Supervised Video Representation Learning [84.24773072241945]
We present Masked Motion Encoding (MME), a new pre-training paradigm that reconstructs both appearance and motion information to explore temporal clues.
Motivated by the fact that humans are able to recognize an action by tracking objects' position changes and shape changes, we propose to reconstruct a motion trajectory that represents these two kinds of change in the masked regions.
Pre-trained with our MME paradigm, the model is able to anticipate long-term and fine-grained motion details.
arXiv Detail & Related papers (2022-10-12T11:19:55Z) - Frozen CLIP Models are Efficient Video Learners [86.73871814176795]
Video recognition has been dominated by the end-to-end learning paradigm.
Recent advances in Contrastive Vision-Language Pre-training pave the way for a new route for visual recognition tasks.
We present Efficient Video Learning (EVL), a framework for directly training high-quality video recognition models on top of frozen CLIP features.
arXiv Detail & Related papers (2022-08-06T17:38:25Z) - Siamese Network with Interactive Transformer for Video Object
Segmentation [34.202137199782804]
We propose a network with a specifically designed interactive transformer, called SITVOS, to enable effective context propagation from historical to current frames.
We employ a shared backbone to extract features of both past and current frames, which enables feature reuse and is more efficient than existing methods.
arXiv Detail & Related papers (2021-12-28T03:38:17Z) - Cycle-Contrast for Self-Supervised Video Representation Learning [10.395615031496064]
We present Cycle-Contrastive Learning (CCL), a novel self-supervised method for learning video representation.
In our method, the frame and video representations are learned from a single network based on an R3D architecture.
We demonstrate that the video representation learned by CCL can be transferred well to downstream tasks of video understanding.
arXiv Detail & Related papers (2020-10-28T08:27:58Z)