A simple, efficient and scalable contrastive masked autoencoder for
learning visual representations
- URL: http://arxiv.org/abs/2210.16870v1
- Date: Sun, 30 Oct 2022 16:21:22 GMT
- Title: A simple, efficient and scalable contrastive masked autoencoder for
learning visual representations
- Authors: Shlok Mishra, Joshua Robinson, Huiwen Chang, David Jacobs, Aaron
Sarna, Aaron Maschinot, Dilip Krishnan
- Abstract summary: We introduce CAN, a simple, efficient and scalable method for self-supervised learning of visual representations.
Our framework is a minimal and conceptually clean synthesis of (C) contrastive learning, (A) masked autoencoders, and (N) the noise prediction approach used in diffusion models.
- Score: 21.440853288058452
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce CAN, a simple, efficient and scalable method for self-supervised
learning of visual representations. Our framework is a minimal and conceptually
clean synthesis of (C) contrastive learning, (A) masked autoencoders, and (N)
the noise prediction approach used in diffusion models. The learning mechanisms
are complementary to one another: contrastive learning shapes the embedding
space across a batch of image samples; masked autoencoders focus on
reconstruction of the low-frequency spatial correlations in a single image
sample; and noise prediction encourages the reconstruction of the
high-frequency components of an image. The combined approach results in a
robust, scalable and simple-to-implement algorithm. The training process is
symmetric, with 50% of patches in both views being masked at random, yielding a
considerable efficiency improvement over prior contrastive learning methods.
Extensive empirical studies demonstrate that CAN achieves strong downstream
performance under both linear and finetuning evaluations on transfer learning
and robustness tasks. CAN outperforms MAE and SimCLR when pre-training on
ImageNet, but is especially useful for pre-training on larger uncurated
datasets such as JFT-300M: for linear probe on ImageNet, CAN achieves 75.4%
compared to 73.4% for SimCLR and 64.1% for MAE. The finetuned performance on
ImageNet of our ViT-L model is 86.1%, compared to 85.5% for SimCLR, and 85.4%
for MAE. The overall FLOPs load of SimCLR is 70% higher than CAN for ViT-L
models.
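As a rough illustration of how the three ingredients (C), (A) and (N) might fit together, here is a minimal PyTorch-style sketch of a CAN-like training loss. The encoder, decoder and projection-head interfaces are hypothetical placeholders, not the authors' implementation, and the reconstruction term is simplified (the paper applies it only to masked patches and symmetrizes the contrastive term).

```python
# Minimal sketch of a CAN-style combined objective (hypothetical encoder/decoder
# interfaces; not the authors' code). Assumes PyTorch and ViT-style patch tokens.
import torch
import torch.nn.functional as F

def can_loss(encoder, decoder, proj_head, view1, view2, mask_ratio=0.5,
             noise_std=0.05, temperature=0.1):
    """view1, view2: two augmented views of the same image batch,
    already split into patch tokens of shape (B, N, D)."""
    recon_losses, embeddings = [], []
    for view in (view1, view2):
        B, N, D = view.shape
        # Randomly mask 50% of patches, symmetrically in both views.
        keep = int(N * (1 - mask_ratio))
        idx = torch.rand(B, N, device=view.device).argsort(dim=1)[:, :keep]
        visible = torch.gather(view, 1, idx.unsqueeze(-1).expand(-1, -1, D))
        # Add Gaussian noise to the visible patches (noise-prediction target).
        noise = noise_std * torch.randn_like(visible)
        latent = encoder(visible + noise)
        # Hypothetical decoder: returns patch reconstructions (B, N, D)
        # and a noise prediction for the visible patches (B, keep, D).
        recon, noise_pred = decoder(latent, idx, N)
        # (A) reconstruct patches and (N) predict the added noise
        # (the paper restricts the reconstruction loss to masked patches).
        recon_losses.append(F.mse_loss(recon, view) + F.mse_loss(noise_pred, noise))
        embeddings.append(F.normalize(proj_head(latent.mean(dim=1)), dim=-1))
    # (C) InfoNCE-style contrastive loss between pooled embeddings of the two views.
    logits = embeddings[0] @ embeddings[1].T / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels) + sum(recon_losses)
```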
Related papers
- Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think [72.48325960659822]
One main bottleneck in training large-scale diffusion models for generation lies in effectively learning high-quality visual representations.
We study this by introducing a straightforward regularization called REPresentation Alignment (REPA), which aligns the projections of noisy input hidden states in denoising networks with clean image representations obtained from external, pretrained visual encoders.
The results are striking: our simple strategy yields significant improvements in both training efficiency and generation quality when applied to popular diffusion and flow-based transformers, such as DiTs and SiTs.
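A hedged sketch of what such an alignment regularizer could look like follows; the projector and frozen external encoder are illustrative assumptions, not the paper's exact setup.

```python
# Sketch of a REPA-style representation-alignment regularizer (illustrative names).
import torch
import torch.nn.functional as F

def repa_regularizer(hidden_states, clean_image, frozen_encoder, projector, lam=0.5):
    """hidden_states: intermediate tokens from the denoising network, (B, N, D).
    frozen_encoder: external pretrained visual encoder applied to the clean image."""
    with torch.no_grad():
        target = frozen_encoder(clean_image)   # (B, N, D_enc), kept frozen
    pred = projector(hidden_states)            # small trainable head mapping to D_enc
    # Encourage patch-wise agreement between projections and clean-image features.
    align = 1.0 - F.cosine_similarity(pred, target, dim=-1).mean()
    return lam * align  # added on top of the usual diffusion/flow training loss
```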
arXiv Detail & Related papers (2024-10-09T14:34:53Z)
- ASteISR: Adapting Single Image Super-resolution Pre-trained Model for Efficient Stereo Image Super-resolution [6.154796157653035]
We propose a method for transferring a pre-trained single-image super-resolution (SISR) transformer network to the domain of stereo image super-resolution (SteISR).
Specifically, we introduce stereo adapters and spatial adapters, which are incorporated into the pre-trained SISR transformer network.
With this training strategy, the adapted SISR model improves stereo super-resolution accuracy by 0.79 dB on the Flickr1024 dataset.
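As an illustration only, an adapter inserted between frozen SISR transformer blocks might look like the sketch below; the module name, shapes and attention layout are assumptions, not the paper's code.

```python
# Illustrative stereo adapter for reusing a pretrained SISR transformer on stereo pairs.
import torch
import torch.nn as nn

class StereoAdapter(nn.Module):
    """Small trainable module placed between frozen SISR transformer blocks;
    it exchanges information between left- and right-view token features."""
    def __init__(self, dim, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.cross = nn.MultiheadAttention(bottleneck, num_heads=4, batch_first=True)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, left, right):
        # left, right: (B, N, dim) token features from the frozen backbone.
        l, r = self.down(left), self.down(right)
        l_fused, _ = self.cross(l, r, r)   # left view attends to the right view
        r_fused, _ = self.cross(r, l, l)   # right view attends to the left view
        return left + self.up(l_fused), right + self.up(r_fused)
```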
arXiv Detail & Related papers (2024-07-04T03:12:05Z)
- Inter-Instance Similarity Modeling for Contrastive Learning [22.56316444504397]
We propose a novel image mixing method, PatchMix, for contrastive learning with Vision Transformers (ViTs).
Compared to the existing sample mix methods, our PatchMix can flexibly and efficiently mix more than two images.
Our proposed method significantly outperforms the previous state-of-the-art on both ImageNet-1K and CIFAR datasets.
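A rough sketch of patch-level mixing across several images follows; it is illustrative only, and the actual PatchMix formulation (including how it builds contrastive targets) differs in its details.

```python
# Illustrative patch-level mixing across more than two images (not the authors' code).
import torch

def patch_mix(patches, num_sources=4):
    """patches: (B, N, D) patch tokens. Each output sample keeps some of its own
    patches and borrows the rest from up to `num_sources - 1` other images."""
    B, N, D = patches.shape
    # For every patch position, choose which image in the batch it is taken from.
    offsets = torch.randint(0, num_sources, (B, N), device=patches.device)
    source = (torch.arange(B, device=patches.device).unsqueeze(1) + offsets) % B
    mixed = patches[source, torch.arange(N, device=patches.device)]
    return mixed, source  # `source` can be used to define soft positive targets
```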
arXiv Detail & Related papers (2023-06-21T13:03:47Z)
- Transferring Pre-trained Multimodal Representations with Cross-modal Similarity Matching [49.730741713652435]
In this paper, we propose a method that can effectively transfer the representations of a large pre-trained multimodal model into a small target model.
For unsupervised transfer, we introduce cross-modal similarity matching (CSM) that enables a student model to learn the representations of a teacher model.
To better encode the text prompts, we design context-based prompt augmentation (CPA) that can alleviate the lexical ambiguity of input text prompts.
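A minimal sketch of cross-modal similarity matching as distillation is given below, under the assumption that student and teacher image embeddings share a dimension; the names are illustrative, not the paper's API.

```python
# Sketch of cross-modal similarity matching (CSM-style distillation), illustrative only.
import torch
import torch.nn.functional as F

def csm_loss(student_img, teacher_img, text_anchors, tau=0.04):
    """student_img: (B, D) student image embeddings,
    teacher_img: (B, D) frozen teacher image embeddings,
    text_anchors: (K, D) frozen text-prompt embeddings from the teacher."""
    anchors = F.normalize(text_anchors, dim=-1)
    s = F.normalize(student_img, dim=-1) @ anchors.T   # student image-to-text similarities
    t = F.normalize(teacher_img, dim=-1) @ anchors.T   # teacher image-to-text similarities
    # Make the student's similarity distribution over text prompts match the teacher's.
    return F.kl_div(F.log_softmax(s / tau, dim=-1),
                    F.softmax(t / tau, dim=-1), reduction="batchmean")
```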
arXiv Detail & Related papers (2023-01-07T17:24:11Z)
- EEG-based Image Feature Extraction for Visual Classification using Deep Learning [0.0]
We develop an efficient way of encoding EEG signals as images to facilitate a more subtle understanding of brain signals with deep learning models.
Our image classification approach using the combined EEG features achieved 82% accuracy, slightly below the accuracy of a pure deep learning approach.
arXiv Detail & Related papers (2022-09-27T00:50:56Z)
- SdAE: Self-distillated Masked Autoencoder [95.3684955370897]
This paper proposes SdAE, a self-distillated masked autoencoder network.
With only 300 epochs pre-training, a vanilla ViT-Base model achieves an 84.1% fine-tuning accuracy on ImageNet-1k classification.
arXiv Detail & Related papers (2022-07-31T15:07:25Z)
- Contrastive Learning Rivals Masked Image Modeling in Fine-tuning via Feature Distillation [42.37533586611174]
Masked image modeling (MIM) learns representations with remarkably good fine-tuning performance.
In this paper, we show that the inferior fine-tuning performance of contrastive pre-training approaches can be significantly improved by a simple post-processing step based on feature distillation.
arXiv Detail & Related papers (2022-05-27T17:59:36Z)
- Masked Autoencoders Are Scalable Vision Learners [60.97703494764904]
Masked autoencoders (MAE) are scalable self-supervised learners for computer vision.
Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels.
Coupling an asymmetric encoder-decoder design with a high masking ratio enables us to train large models efficiently and effectively.
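A compact sketch of this masking-and-reconstruction recipe is shown below; the encoder and decoder interfaces are placeholders, not the released MAE code.

```python
# Minimal MAE-style training step (illustrative encoder/decoder interfaces).
import torch
import torch.nn.functional as F

def mae_step(encoder, decoder, patches, mask_ratio=0.75):
    """patches: (B, N, D) flattened image patches; only visible patches are encoded."""
    B, N, D = patches.shape
    keep = int(N * (1 - mask_ratio))
    order = torch.rand(B, N, device=patches.device).argsort(dim=1)
    visible_idx, masked_idx = order[:, :keep], order[:, keep:]
    visible = torch.gather(patches, 1, visible_idx.unsqueeze(-1).expand(-1, -1, D))
    latent = encoder(visible)                        # encoder sees only ~25% of patches
    pred = decoder(latent, visible_idx, masked_idx)  # lightweight decoder predicts the rest
    target = torch.gather(patches, 1, masked_idx.unsqueeze(-1).expand(-1, -1, D))
    return F.mse_loss(pred, target)                  # pixel reconstruction on masked patches
```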
arXiv Detail & Related papers (2021-11-11T18:46:40Z)
- With a Little Help from My Friends: Nearest-Neighbor Contrastive Learning of Visual Representations [87.72779294717267]
Using the nearest neighbor as a positive in contrastive losses significantly improves performance on ImageNet classification.
We demonstrate empirically that our method is less reliant on complex data augmentations.
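A small sketch of swapping in a nearest neighbour from a support queue as the positive follows (illustrative, not the authors' code).

```python
# Nearest-neighbour positive for a contrastive loss, using a support queue (sketch).
import torch
import torch.nn.functional as F

def nn_positive_loss(z1, z2, queue, temperature=0.1):
    """z1, z2: (B, D) embeddings of two views; queue: (Q, D) embeddings of past samples."""
    z1, z2, queue = (F.normalize(t, dim=-1) for t in (z1, z2, queue))
    # Replace each z1 by its nearest neighbour in the queue before the contrastive loss.
    nn_idx = (z1 @ queue.T).argmax(dim=1)
    positives = queue[nn_idx]                        # (B, D)
    logits = positives @ z2.T / temperature          # (B, B)
    labels = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, labels)
```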
arXiv Detail & Related papers (2021-04-29T17:56:08Z)
- A Simple Framework for Contrastive Learning of Visual Representations [116.37752766922407]
This paper presents SimCLR: a simple framework for contrastive learning of visual representations.
We show that composition of data augmentations plays a critical role in defining effective predictive tasks.
We are able to considerably outperform previous methods for self-supervised and semi-supervised learning on ImageNet.
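For reference, here is a compact sketch of the NT-Xent contrastive loss used by SimCLR-style methods; it is a simplified standard formulation, not the original codebase.

```python
# NT-Xent loss over two augmented views (compact, simplified sketch).
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.5):
    """z1, z2: (B, D) projection-head outputs for two augmented views of the same batch."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=-1)   # (2B, D)
    sim = z @ z.T / temperature
    sim.fill_diagonal_(float("-inf"))                      # exclude self-similarity
    B = z1.size(0)
    # Each sample's positive is its counterpart in the other view.
    targets = torch.cat([torch.arange(B, 2 * B), torch.arange(0, B)]).to(z.device)
    return F.cross_entropy(sim, targets)
```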
arXiv Detail & Related papers (2020-02-13T18:50:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.