Efficient Image Pre-Training with Siamese Cropped Masked Autoencoders
- URL: http://arxiv.org/abs/2403.17823v2
- Date: Thu, 18 Jul 2024 08:56:11 GMT
- Title: Efficient Image Pre-Training with Siamese Cropped Masked Autoencoders
- Authors: Alexandre Eymaël, Renaud Vandeghen, Anthony Cioppa, Silvio Giancola, Bernard Ghanem, Marc Van Droogenbroeck
- Abstract summary: We propose CropMAE, an alternative approach to the Siamese pre-training introduced by SiamMAE.
Our method exclusively considers pairs of cropped images sourced from the same image but cropped differently, deviating from the conventional pairs of frames extracted from a video.
CropMAE achieves the highest masking ratio to date (98.5%), enabling the reconstruction of images using only two visible patches.
- Score: 89.12558126877532
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-supervised pre-training of image encoders is omnipresent in the literature, particularly following the introduction of Masked autoencoders (MAE). Current efforts attempt to learn object-centric representations from motion in videos. In particular, SiamMAE recently introduced a Siamese network, training a shared-weight encoder from two frames of a video with a high asymmetric masking ratio (95%). In this work, we propose CropMAE, an alternative approach to the Siamese pre-training introduced by SiamMAE. Our method specifically differs by exclusively considering pairs of cropped images sourced from the same image but cropped differently, deviating from the conventional pairs of frames extracted from a video. CropMAE therefore alleviates the need for video datasets, while maintaining competitive performances and drastically reducing pre-training and learning time. Furthermore, we demonstrate that CropMAE learns similar object-centric representations without explicit motion, showing that current self-supervised learning methods do not learn such representations from explicit object motion, but rather thanks to the implicit image transformations that occur between the two views. Finally, CropMAE achieves the highest masking ratio to date (98.5%), enabling the reconstruction of images using only two visible patches. Our code is available at https://github.com/alexandre-eymael/CropMAE.
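The pairing and masking described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation (see their repository for that). It assumes a 224x224 input split into 16x16 patches, an MAE-style truncation when counting visible patches, and a crop scale range of 0.3 to 1.0; all function names are hypothetical and chosen for illustration only.

```python
import random


def num_visible_patches(image_size, patch_size, mask_ratio):
    """Count the patches left visible, truncating as MAE-style code commonly does."""
    num_patches = (image_size // patch_size) ** 2
    return int(num_patches * (1 - mask_ratio))


def random_crop_box(width, height, scale_range, rng):
    """Sample a (left, top, right, bottom) box covering a random fraction of the image area."""
    scale = rng.uniform(*scale_range)
    crop_w = max(1, int(width * scale ** 0.5))
    crop_h = max(1, int(height * scale ** 0.5))
    left = rng.randint(0, width - crop_w)
    top = rng.randint(0, height - crop_h)
    return (left, top, left + crop_w, top + crop_h)


def sample_visible_indices(num_patches, num_visible, rng):
    """Pick which patch indices of the masked view stay visible."""
    return sorted(rng.sample(range(num_patches), num_visible))


rng = random.Random(0)

# Two different crops of the SAME image replace the two video frames of SiamMAE.
# Assumption: the first view stays fully visible and the second is heavily masked.
view_a = random_crop_box(640, 480, (0.3, 1.0), rng)
view_b = random_crop_box(640, 480, (0.3, 1.0), rng)

# Assumed setup: 224x224 input, 16x16 patches, hence 14 * 14 = 196 patches.
visible = num_visible_patches(224, 16, 0.985)
kept = sample_visible_indices(196, visible, rng)
print(view_a, view_b, visible, kept)
```

Under these assumptions, 196 patches at a 98.5% masking ratio leave int(196 * 0.015) = 2 visible patches, which matches the "only two visible patches" figure in the abstract. The same arithmetic gives 9 visible patches at SiamMAE's 95% ratio.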
Related papers
- Bringing Masked Autoencoders Explicit Contrastive Properties for Point Cloud Self-Supervised Learning [116.75939193785143]
Contrastive learning (CL) for Vision Transformers (ViTs) in image domains has achieved performance comparable to CL for traditional convolutional backbones.
In 3D point cloud pretraining with ViTs, masked autoencoder (MAE) modeling remains dominant.
arXiv Detail & Related papers (2024-07-08T12:28:56Z)
- Concatenated Masked Autoencoders as Spatial-Temporal Learner [6.475592804311682]
We introduce the Concatenated Masked Autoencoders (CatMAE) as a spatial-temporal learner for self-supervised video representation learning.
We propose a new data augmentation strategy, Video-Reverse (ViRe), which uses reversed video frames as the model's reconstruction targets.
arXiv Detail & Related papers (2023-11-02T03:08:26Z)
- Motion-Guided Masking for Spatiotemporal Representation Learning [16.9547105658246]
We propose a motion-guided masking algorithm (MGM) which leverages motion vectors to guide the position of each mask over time.
On two challenging large-scale video benchmarks, we equip video MAE with our MGM and achieve up to +1.3% improvement compared to previous state-of-the-art methods.
arXiv Detail & Related papers (2023-08-24T17:58:04Z)
- Siamese Masked Autoencoders [76.35448665609998]
We present Siamese Masked Autoencoders (SiamMAE) for learning visual correspondence from videos.
SiamMAE operates on pairs of randomly sampled video frames and asymmetrically masks them.
It outperforms state-of-the-art self-supervised methods on video object segmentation, pose keypoint propagation, and semantic part propagation tasks.
arXiv Detail & Related papers (2023-05-23T17:59:46Z)
- DropMAE: Masked Autoencoders with Spatial-Attention Dropout for Tracking Tasks [76.24996889649744]
We study masked autoencoder (MAE) pretraining on videos for matching-based downstream tasks, including visual object tracking (VOT) and video object segmentation (VOS).
We propose DropMAE, which adaptively performs spatial-attention dropout in the frame reconstruction to facilitate temporal correspondence learning in videos.
Our model sets new state-of-the-art performance on 8 out of 9 highly competitive video tracking and segmentation datasets.
arXiv Detail & Related papers (2023-04-02T16:40:42Z)
- Masked Motion Encoding for Self-Supervised Video Representation Learning [84.24773072241945]
We present Masked Motion Encoding (MME), a new pre-training paradigm that reconstructs both appearance and motion information to explore temporal clues.
Motivated by the fact that humans are able to recognize an action by tracking objects' position changes and shape changes, we propose to reconstruct a motion trajectory that represents these two kinds of change in the masked regions.
Pre-trained with our MME paradigm, the model is able to anticipate long-term and fine-grained motion details.
arXiv Detail & Related papers (2022-10-12T11:19:55Z)
- Masked Autoencoders Are Scalable Vision Learners [60.97703494764904]
Masked autoencoders (MAE) are scalable self-supervised learners for computer vision.
Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels.
Coupling its two core designs, an asymmetric encoder-decoder architecture and a high masking ratio (e.g., 75%), enables us to train large models efficiently and effectively.
arXiv Detail & Related papers (2021-11-11T18:46:40Z)