Improve Supervised Representation Learning with Masked Image Modeling
- URL: http://arxiv.org/abs/2312.00950v1
- Date: Fri, 1 Dec 2023 22:03:25 GMT
- Title: Improve Supervised Representation Learning with Masked Image Modeling
- Authors: Kaifeng Chen, Daniel Salz, Huiwen Chang, Kihyuk Sohn, Dilip Krishnan,
Mojtaba Seyedhosseini
- Abstract summary: We propose a simple yet effective setup that can easily integrate masked image modeling into existing supervised training paradigms.
We show with minimal change in architecture and no overhead in inference that this setup is able to improve the quality of the learned representations.
- Score: 30.30649867772395
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Training visual embeddings with labeled data supervision has been the de
facto setup for representation learning in computer vision. Inspired by recent
success of adopting masked image modeling (MIM) in self-supervised
representation learning, we propose a simple yet effective setup that can
easily integrate MIM into existing supervised training paradigms. In our
design, in addition to the original classification task applied to a vision
transformer image encoder, we add a shallow transformer-based decoder on top of
the encoder and introduce an MIM task which tries to reconstruct image tokens
based on masked image inputs. We show with minimal change in architecture and
no overhead in inference that this setup is able to improve the quality of the
learned representations for downstream tasks such as classification, image
retrieval, and semantic segmentation. We conduct a comprehensive study and
evaluation of our setup on public benchmarks. On ImageNet-1k, our ViT-B/14
model achieves 81.72% validation accuracy, 2.01% higher than the baseline
model. On K-Nearest-Neighbor image retrieval evaluation with ImageNet-1k, the
same model outperforms the baseline by 1.32%. We also show that this setup can
be easily scaled to larger models and datasets. Code and checkpoints will be
released.
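The abstract describes a joint objective: the usual classification loss on a ViT encoder plus a masked-image-modeling reconstruction loss computed by a shallow transformer decoder added on top. The sketch below only illustrates that setup under stated assumptions (an encoder that consumes embedded patch tokens, mean pooling for classification, MSE on masked-token targets, and placeholder names such as `mim_weight`); it is not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SupervisedMIM(nn.Module):
    """Sketch: supervised classification plus an auxiliary masked-image-modeling head."""

    def __init__(self, encoder, num_classes, dim=768, decoder_depth=2, mask_ratio=0.5):
        super().__init__()
        self.encoder = encoder                      # assumed: ViT backbone returning (B, N, dim) patch tokens
        self.classifier = nn.Linear(dim, num_classes)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        # Shallow stack of transformer blocks acting as the decoder, used only during training.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, num_layers=decoder_depth)
        self.reconstruct = nn.Linear(dim, dim)      # predicts target tokens for masked patches
        self.mask_ratio = mask_ratio

    def forward(self, patches, labels, targets, mim_weight=0.5):
        # patches: (B, N, dim) embedded patches; targets: (B, N, dim) reconstruction targets (assumption)
        B, N, _ = patches.shape
        mask = torch.rand(B, N, device=patches.device) < self.mask_ratio   # True = masked
        masked = torch.where(mask.unsqueeze(-1), self.mask_token.expand(B, N, -1), patches)

        tokens = self.encoder(masked)               # (B, N, dim)
        cls_loss = F.cross_entropy(self.classifier(tokens.mean(dim=1)), labels)

        pred = self.reconstruct(self.decoder(tokens))
        mim_loss = F.mse_loss(pred[mask], targets[mask])
        return cls_loss + mim_weight * mim_loss     # decoder is dropped at inference, so no extra cost
```

At inference only the encoder and classifier are kept, which is why the paper reports no inference overhead.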
Related papers
- Adapting LLaMA Decoder to Vision Transformer [65.47663195233802]
This work examines whether decoder-only Transformers such as LLaMA can be adapted to the computer vision field.
We first "LLaMAfy" a standard ViT step-by-step to align with LLaMA's architecture, and find that directly applying a causal mask to the self-attention brings an attention collapse issue.
We develop a soft mask strategy that gradually introduces a causal mask to the self-attention at the onset of training to facilitate the optimization behavior.
arXiv Detail & Related papers (2024-04-10T06:30:08Z)
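The "soft mask" idea above can be pictured as an attention mask that fades from fully bidirectional to fully causal over the first training steps. The schedule below is only an illustrative sketch of that idea, not the paper's exact formulation; `warmup_steps` is a placeholder name.

```python
import torch

def soft_causal_mask(seq_len: int, step: int, warmup_steps: int) -> torch.Tensor:
    """Additive attention mask that fades from fully bidirectional (all zeros)
    to fully causal (-inf above the diagonal) over `warmup_steps` steps."""
    alpha = min(step / max(warmup_steps, 1), 1.0)        # 0 -> bidirectional, 1 -> causal
    upper = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
    mask = torch.zeros(seq_len, seq_len)
    # Future positions are progressively down-weighted; at alpha = 1 they are fully blocked.
    mask[upper] = torch.log(torch.tensor(1.0 - alpha)) if alpha < 1.0 else float("-inf")
    return mask   # add to the attention logits before softmax
```

The returned matrix would be added to the attention logits, so future tokens stay visible early in training and become fully blocked once the schedule finishes.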
- Heuristic Vision Pre-Training with Self-Supervised and Supervised Multi-Task Learning [0.0]
We propose a novel pre-training framework that adopts both self-supervised and supervised visual pretext tasks in a multi-task manner.
Results show that our pre-trained models deliver performance on par with or better than state-of-the-art (SOTA) results on multiple visual tasks.
arXiv Detail & Related papers (2023-10-11T14:06:04Z)
- MULLER: Multilayer Laplacian Resizer for Vision [16.67232499096539]
We present an extremely lightweight multilayer Laplacian resizer with only a handful of trainable parameters, dubbed MULLER resizer.
We show that MULLER can be easily plugged into various training pipelines, and it effectively boosts the performance of the underlying vision task with little to no extra cost.
arXiv Detail & Related papers (2023-04-06T04:39:21Z)
- MAGE: MAsked Generative Encoder to Unify Representation Learning and Image Synthesis [33.46831766206675]
MAsked Generative Encoder (MAGE) is the first framework to unify SOTA image generation and self-supervised representation learning.
Inspired by previous generative models, MAGE uses semantic tokens learned by a vector-quantized GAN at inputs and outputs.
On ImageNet-1K, a single MAGE ViT-L model obtains 9.10 FID in the task of class-unconditional image generation.
arXiv Detail & Related papers (2022-11-16T18:59:02Z)
- A Unified View of Masked Image Modeling [117.79456335844439]
Masked image modeling has demonstrated great potential to eliminate the label-hungry problem of training large-scale vision Transformers.
We introduce a simple yet effective method, termed as MaskDistill, which reconstructs normalized semantic features from teacher models at the masked positions.
Experimental results on image classification and semantic segmentation show that MaskDistill achieves performance comparable or superior to state-of-the-art methods.
arXiv Detail & Related papers (2022-10-19T14:59:18Z)
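A hedged sketch of the masked feature-distillation objective summarized above: the student's predictions at masked positions are regressed onto the teacher's normalized features. The names (`student_pred`, `teacher_feat`) and the smooth-L1 choice are assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def mask_distill_loss(student_pred, teacher_feat, mask):
    """student_pred, teacher_feat: (B, N, D) patch features; mask: (B, N) bool, True = masked.
    Regress student predictions onto layer-normalized teacher features at masked positions."""
    target = F.layer_norm(teacher_feat, teacher_feat.shape[-1:])   # normalize per patch
    return F.smooth_l1_loss(student_pred[mask], target[mask])
```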
- Denoising Masked AutoEncoders are Certifiable Robust Vision Learners [37.04863068273281]
We propose a new self-supervised method called Denoising Masked AutoEncoders (DMAE).
DMAE corrupts each image by adding Gaussian noises to each pixel value and randomly masking several patches.
A Transformer-based encoder-decoder model is then trained to reconstruct the original image from the corrupted one.
arXiv Detail & Related papers (2022-10-10T12:37:59Z)
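The DMAE corruption described above (Gaussian noise on every pixel plus random patch masking) could be sketched as follows; `noise_std`, `mask_ratio`, and zeroing out the masked patches are illustrative assumptions rather than the paper's exact procedure.

```python
import torch

def dmae_corrupt(images, patch_size=16, noise_std=0.25, mask_ratio=0.75):
    """images: (B, C, H, W). Add Gaussian noise to every pixel, then zero out random patches."""
    noisy = images + noise_std * torch.randn_like(images)
    B, C, H, W = images.shape
    gh, gw = H // patch_size, W // patch_size
    keep = (torch.rand(B, 1, gh, gw, device=images.device) >= mask_ratio).float()
    keep = keep.repeat_interleave(patch_size, dim=2).repeat_interleave(patch_size, dim=3)
    return noisy * keep   # the encoder-decoder is trained to reconstruct the clean `images`
```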
- BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers [117.79456335844439]
We propose to use a semantic-rich visual tokenizer as the reconstruction target for masked prediction.
We then pretrain vision Transformers by predicting the original visual tokens for the masked image patches.
Experiments on image classification and semantic segmentation show that our approach outperforms all compared MIM methods.
arXiv Detail & Related papers (2022-08-12T16:48:10Z)
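In this style of masked prediction, the target for each masked patch is its discrete id under the visual tokenizer, so the loss reduces to a cross-entropy over the visual vocabulary at masked positions. This is a minimal sketch with placeholder tensor names, not the released BEiT v2 code.

```python
import torch
import torch.nn.functional as F

def tokenized_mim_loss(logits, token_ids, mask):
    """logits: (B, N, V) predictions over a visual vocabulary of size V,
    token_ids: (B, N) ids assigned by the visual tokenizer, mask: (B, N) bool, True = masked."""
    return F.cross_entropy(logits[mask], token_ids[mask])
```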
- Masked Unsupervised Self-training for Zero-shot Image Classification [98.23094305347709]
Masked Unsupervised Self-Training (MUST) is a new approach that leverages two different and complementary sources of supervision: pseudo-labels and raw images.
MUST improves upon CLIP by a large margin and narrows the performance gap between unsupervised and supervised classification.
arXiv Detail & Related papers (2022-06-07T02:03:06Z)
- Corrupted Image Modeling for Self-Supervised Visual Pre-Training [103.99311611776697]
We introduce Corrupted Image Modeling (CIM) for self-supervised visual pre-training.
CIM uses an auxiliary generator with a small trainable BEiT to corrupt the input image instead of using artificial mask tokens.
After pre-training, the enhancer (the network trained on the corrupted images) can be used as a high-capacity visual encoder for downstream tasks.
arXiv Detail & Related papers (2022-02-07T17:59:04Z)
- Masked Autoencoders Are Scalable Vision Learners [60.97703494764904]
Masked autoencoders (MAE) are scalable self-supervised learners for computer vision.
Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels.
Coupling these two designs enables us to train large models efficiently and effectively.
arXiv Detail & Related papers (2021-11-11T18:46:40Z)
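For contrast with the joint supervised setup of the main paper, a minimal MAE-style training step encodes only the visible patches and computes a pixel reconstruction loss on the masked ones. The snippet below is a sketch under assumptions; `encoder`, `decoder`, and the decoder's call signature are placeholders, not the released MAE code.

```python
import torch
import torch.nn.functional as F

def mae_step(patches, encoder, decoder, mask_ratio=0.75):
    """patches: (B, N, D) flattened image patches; the pixels themselves are the reconstruction targets."""
    B, N, D = patches.shape
    n_keep = int(N * (1 - mask_ratio))
    perm = torch.rand(B, N, device=patches.device).argsort(dim=1)     # random shuffle per image
    keep_idx = perm[:, :n_keep]
    visible = torch.gather(patches, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))

    latent = encoder(visible)               # encoder sees only the visible ~25% of patches
    pred = decoder(latent, keep_idx, N)     # assumed: decoder fills in mask tokens, predicts all N patches

    mask = torch.ones(B, N, dtype=torch.bool, device=patches.device)
    mask.scatter_(1, keep_idx, False)       # True where the patch was masked out
    return F.mse_loss(pred[mask], patches[mask])
```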
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.