Siamese Masked Autoencoders
- URL: http://arxiv.org/abs/2305.14344v1
- Date: Tue, 23 May 2023 17:59:46 GMT
- Title: Siamese Masked Autoencoders
- Authors: Agrim Gupta, Jiajun Wu, Jia Deng, Li Fei-Fei
- Abstract summary: We present Siamese Masked Autoencoders (SiamMAE) for learning visual correspondence from videos.
SiamMAE operates on pairs of randomly sampled video frames and asymmetrically masks them.
It outperforms state-of-the-art self-supervised methods on video object segmentation, pose keypoint propagation, and semantic part propagation tasks.
- Score: 76.35448665609998
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Establishing correspondence between images or scenes is a significant
challenge in computer vision, especially given occlusions, viewpoint changes,
and varying object appearances. In this paper, we present Siamese Masked
Autoencoders (SiamMAE), a simple extension of Masked Autoencoders (MAE) for
learning visual correspondence from videos. SiamMAE operates on pairs of
randomly sampled video frames and asymmetrically masks them. These frames are
processed independently by an encoder network, and a decoder composed of a
sequence of cross-attention layers is tasked with predicting the missing
patches in the future frame. By masking a large fraction ($95\%$) of patches in
the future frame while leaving the past frame unchanged, SiamMAE encourages the
network to focus on object motion and learn object-centric representations.
Despite its conceptual simplicity, features learned via SiamMAE outperform
state-of-the-art self-supervised methods on video object segmentation, pose
keypoint propagation, and semantic part propagation tasks. SiamMAE achieves
competitive results without relying on data augmentation, handcrafted
tracking-based pretext tasks, or other techniques to prevent representational
collapse.
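The asymmetric masking scheme described in the abstract (keep the past frame fully visible, mask a large fraction of the future frame's patches) can be sketched as follows. This is a minimal illustrative sketch, not the authors' code: the function name, the 14x14 ViT patch grid, and the helper structure are assumptions for illustration.

```python
import numpy as np

def asymmetric_mask(num_patches, future_mask_ratio=0.95, rng=None):
    """Return boolean visibility masks for a (past, future) frame pair.

    Past frame: every patch stays visible (mask ratio 0).
    Future frame: only a (1 - future_mask_ratio) fraction of patches is
    kept visible, chosen uniformly at random, mirroring SiamMAE's
    asymmetric masking of randomly sampled frame pairs.
    """
    rng = rng or np.random.default_rng(0)
    # Past frame is left unchanged, so all patches are visible.
    past_visible = np.ones(num_patches, dtype=bool)
    # Keep only a small random subset of future-frame patches visible.
    num_keep = int(round(num_patches * (1.0 - future_mask_ratio)))
    keep_idx = rng.choice(num_patches, size=num_keep, replace=False)
    future_visible = np.zeros(num_patches, dtype=bool)
    future_visible[keep_idx] = True
    return past_visible, future_visible

# Example: a 14x14 ViT patch grid (196 patches) at 95% masking leaves
# only about 10 future-frame patches visible to the encoder.
past, future = asymmetric_mask(196, future_mask_ratio=0.95)
```

With so few visible future patches, the cross-attention decoder must rely on the fully visible past frame to reconstruct the missing content, which is what pushes the features toward motion- and object-centric information.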
Related papers
- CrossVideoMAE: Self-Supervised Image-Video Representation Learning with Masked Autoencoders [6.159948396712944]
CrossVideoMAE learns both video-level and frame-level rich temporal representations and semantic attributes.
Our method integrates mutual temporal information from videos with spatial information from sampled frames.
This is critical for acquiring rich, label-free guiding signals from both video and frame image modalities in a self-supervised manner.
arXiv Detail & Related papers (2025-02-08T06:15:39Z)
- FramePainter: Endowing Interactive Image Editing with Video Diffusion Priors [64.54220123913154]
We introduce FramePainter as an efficient instantiation of image-to-video generation problem.
It only uses a lightweight sparse control encoder to inject editing signals.
It decisively outperforms previous state-of-the-art methods with far less training data.
arXiv Detail & Related papers (2025-01-14T16:09:16Z)
- Text-Guided Video Masked Autoencoder [12.321239366215426]
We introduce a novel text-guided masking algorithm (TGM) that masks the video regions with the highest correspondence to paired captions.
We show that across existing masking algorithms, unifying MAE and masked video-text contrastive learning improves downstream performance compared to pure MAE.
arXiv Detail & Related papers (2024-08-01T17:58:19Z)
- Efficient Image Pre-Training with Siamese Cropped Masked Autoencoders [89.12558126877532]
We propose CropMAE, an alternative approach to the Siamese pre-training introduced by SiamMAE.
Our method considers only pairs of crops taken from the same image at different locations, rather than the conventional pairs of frames extracted from a video.
CropMAE achieves the highest masking ratio to date (98.5%), enabling the reconstruction of images using only two visible patches.
arXiv Detail & Related papers (2024-03-26T16:04:19Z)
- Concatenated Masked Autoencoders as Spatial-Temporal Learner [6.475592804311682]
We introduce the Concatenated Masked Autoencoders (CatMAE) as a spatial-temporal learner for self-supervised video representation learning.
We propose a new data augmentation strategy, Video-Reverse (ViRe), which uses reversed video frames as the model's reconstruction targets.
arXiv Detail & Related papers (2023-11-02T03:08:26Z)
- Masked Motion Encoding for Self-Supervised Video Representation Learning [84.24773072241945]
We present Masked Motion Encoding (MME), a new pre-training paradigm that reconstructs both appearance and motion information to explore temporal clues.
Motivated by the fact that humans can recognize an action by tracking objects' position and shape changes, we propose to reconstruct a motion trajectory that represents both kinds of change in the masked regions.
Pre-trained with our MME paradigm, the model is able to anticipate long-term and fine-grained motion details.
arXiv Detail & Related papers (2022-10-12T11:19:55Z)
- Differentiable Soft-Masked Attention [115.5770357189209]
"Differentiable Soft-Masked Attention" is used for the task of Weakly-Supervised Video Object Segmentation.
We develop a transformer-based network for training that can also benefit from cycle-consistency training on a video with just one annotated frame.
arXiv Detail & Related papers (2022-06-01T02:05:13Z)
- Context Autoencoder for Self-Supervised Representation Learning [64.63908944426224]
We pretrain an encoder by making predictions in the encoded representation space.
The network is an encoder-regressor-decoder architecture.
We demonstrate the effectiveness of our CAE through superior transfer performance in downstream tasks.
arXiv Detail & Related papers (2022-02-07T09:33:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.