MAGE: MAsked Generative Encoder to Unify Representation Learning and
Image Synthesis
- URL: http://arxiv.org/abs/2211.09117v2
- Date: Thu, 29 Jun 2023 15:30:25 GMT
- Title: MAGE: MAsked Generative Encoder to Unify Representation Learning and
Image Synthesis
- Authors: Tianhong Li, Huiwen Chang, Shlok Kumar Mishra, Han Zhang, Dina Katabi,
Dilip Krishnan
- Abstract summary: MAsked Generative Encoder (MAGE) is the first framework to unify SOTA image generation and self-supervised representation learning.
Inspired by previous generative models, MAGE uses semantic tokens learned by a vector-quantized GAN at its inputs and outputs.
On ImageNet-1K, a single MAGE ViT-L model obtains 9.10 FID in the task of class-unconditional image generation.
- Score: 33.46831766206675
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Generative modeling and representation learning are two key tasks in computer
vision. However, these models are typically trained independently, which
ignores the potential for each task to help the other, and leads to training
and model maintenance overheads. In this work, we propose MAsked Generative
Encoder (MAGE), the first framework to unify SOTA image generation and
self-supervised representation learning. Our key insight is that using variable
masking ratios in masked image modeling pre-training can allow generative
training (very high masking ratio) and representation learning (lower masking
ratio) under the same training framework. Inspired by previous generative
models, MAGE uses semantic tokens learned by a vector-quantized GAN at inputs
and outputs, combining this with masking. We can further improve the
representation by adding a contrastive loss to the encoder output. We
extensively evaluate the generation and representation learning capabilities of
MAGE. On ImageNet-1K, a single MAGE ViT-L model obtains 9.10 FID in the task of
class-unconditional image generation and 78.9% top-1 accuracy for linear
probing, achieving state-of-the-art performance in both image generation and
representation learning. Code is available at https://github.com/LTH14/mage.
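To make the variable-masking-ratio idea above concrete, below is a minimal sketch of the token-masking step in PyTorch. It is not the authors' implementation (see the repository above for that): the codebook size, the mask-token id, and the truncated-Gaussian parameters are illustrative assumptions, and the VQGAN tokenizer is stubbed out with random token indices.

# Sketch of MAGE-style variable-ratio masking over VQGAN token indices.
# Assumptions (not from the abstract): codebook size 1024, a dedicated
# [MASK] token id, and a truncated Gaussian over masking ratios in [0.5, 1.0].
import torch

CODEBOOK_SIZE = 1024           # assumed VQGAN codebook size
MASK_TOKEN_ID = CODEBOOK_SIZE  # assumed id reserved for the learnable [MASK] token
SEQ_LEN = 16 * 16              # 256 semantic tokens for a 256x256 image

def sample_mask_ratio(batch_size, mean=0.55, std=0.25, lo=0.5, hi=1.0):
    # Per-image masking ratio: high ratios push toward generative training,
    # lower ratios toward representation learning.
    return (torch.randn(batch_size) * std + mean).clamp(lo, hi)

def mask_tokens(tokens, mask_ratio):
    # Replace a random subset of semantic tokens with the mask id.
    b, n = tokens.shape
    num_mask = (mask_ratio * n).long()       # how many tokens to mask per image
    order = torch.rand(b, n).argsort(dim=1)  # random permutation per image
    ranks = order.argsort(dim=1)             # rank of each position in that order
    mask = ranks < num_mask.unsqueeze(1)     # True where the token gets masked
    return tokens.masked_fill(mask, MASK_TOKEN_ID), mask

if __name__ == "__main__":
    # Stand-in for token indices produced by a pre-trained VQGAN encoder.
    tokens = torch.randint(0, CODEBOOK_SIZE, (4, SEQ_LEN))
    ratio = sample_mask_ratio(batch_size=4)
    corrupted, mask = mask_tokens(tokens, ratio)
    # A ViT encoder-decoder would be trained to predict the original token ids
    # at masked positions (cross-entropy over the codebook), optionally with a
    # contrastive loss on the encoder output.
    print(ratio, mask.float().mean(dim=1))

Sampling the ratio per image, rather than fixing it, is what lets a single model cover both regimes: each batch naturally mixes almost fully masked examples (generation-like) with moderately masked ones (representation-like).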
Related papers
- MOCA: Self-supervised Representation Learning by Predicting Masked Online Codebook Assignments [72.6405488990753]
Self-supervised learning can be used to mitigate the need of Vision Transformer networks for very large fully-annotated datasets.
We propose a single-stage and standalone method, MOCA, which unifies both desired properties.
We achieve new state-of-the-art results on low-shot settings and strong experimental results in various evaluation protocols.
arXiv Detail & Related papers (2023-07-18T15:46:20Z)
- Not All Image Regions Matter: Masked Vector Quantization for Autoregressive Image Generation [78.13793505707952]
Existing autoregressive models follow the two-stage generation paradigm that first learns a codebook in the latent space for image reconstruction and then completes the image generation autoregressively based on the learned codebook.
We propose a novel two-stage framework, consisting of a Masked Quantization VAE (MQ-VAE) and a Stackformer, that relieves the model from modeling redundancy.
arXiv Detail & Related papers (2023-05-23T02:15:53Z)
- Improving Masked Autoencoders by Learning Where to Mask [65.89510231743692]
Masked image modeling is a promising self-supervised learning method for visual data.
We present AutoMAE, a framework that uses Gumbel-Softmax to interlink an adversarially-trained mask generator and a mask-guided image modeling process.
In our experiments, AutoMAE is shown to provide effective pretraining models on standard self-supervised benchmarks and downstream tasks.
arXiv Detail & Related papers (2023-03-12T05:28:55Z)
- EfficientTrain: Exploring Generalized Curriculum Learning for Training Visual Backbones [80.662250618795]
This paper presents a new curriculum learning approach for the efficient training of visual backbones (e.g., vision Transformers).
As an off-the-shelf method, it reduces the wall-time training cost of a wide variety of popular models by >1.5x on ImageNet-1K/22K without sacrificing accuracy.
arXiv Detail & Related papers (2022-11-17T17:38:55Z)
- Masked Contrastive Representation Learning [6.737710830712818]
This work presents Masked Contrastive Representation Learning (MACRL) for self-supervised visual pre-training.
We adopt an asymmetric setting for the siamese network (i.e., an encoder-decoder structure in both branches), where one branch applies a higher mask ratio and stronger data augmentation, while the other adopts weaker data corruption.
In our experiments, MACRL presents superior results on various vision benchmarks, including CIFAR-10, CIFAR-100, Tiny-ImageNet, and two other ImageNet subsets.
arXiv Detail & Related papers (2022-11-11T05:32:28Z)
- The Devil is in the Frequency: Geminated Gestalt Autoencoder for Self-Supervised Visual Pre-Training [13.087987450384036]
We present a new Masked Image Modeling (MIM) method, termed Geminated Gestalt Autoencoder (Ge$^2$-AE), for visual pre-training.
Specifically, we equip our model with geminated decoders in charge of reconstructing image contents from both pixel and frequency space.
arXiv Detail & Related papers (2022-04-18T09:22:55Z)
- Corrupted Image Modeling for Self-Supervised Visual Pre-Training [103.99311611776697]
We introduce Corrupted Image Modeling (CIM) for self-supervised visual pre-training.
CIM uses an auxiliary generator with a small trainable BEiT to corrupt the input image instead of using artificial [MASK] tokens; an enhancer network then learns either to recover the original image or to detect which visual tokens were replaced.
After pre-training, the enhancer can be used as a high-capacity visual encoder for downstream tasks.
arXiv Detail & Related papers (2022-02-07T17:59:04Z)
- Masked Autoencoders Are Scalable Vision Learners [60.97703494764904]
Masked autoencoders (MAE) are scalable self-supervised learners for computer vision.
Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels.
Coupling an asymmetric encoder-decoder design with a high masking ratio enables us to train large models efficiently and effectively (see the sketch after this list).
arXiv Detail & Related papers (2021-11-11T18:46:40Z)
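As referenced above, here is a comparable sketch of the MAE-style recipe from the last entry: mask random image patches and reconstruct the missing pixels. This is not the authors' code; the 16-pixel patch size and 75% masking ratio are assumed defaults, and the encoder/decoder are omitted.

# Sketch of MAE-style random patch masking (PyTorch).
# Assumptions: patch size 16, masking ratio 0.75; encoder and decoder omitted.
import torch

def patchify(images, patch=16):
    # Split (B, C, H, W) images into (B, N, patch*patch*C) patch vectors.
    b, c, h, w = images.shape
    x = images.unfold(2, patch, patch).unfold(3, patch, patch)  # B,C,H/p,W/p,p,p
    x = x.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * patch * patch)
    return x

def random_masking(patches, mask_ratio=0.75):
    # Keep a random subset of patches; return kept patches and the binary mask.
    b, n, d = patches.shape
    num_keep = int(n * (1 - mask_ratio))
    ids_shuffle = torch.rand(b, n).argsort(dim=1)  # random order of patches
    ids_keep = ids_shuffle[:, :num_keep]           # indices of visible patches
    visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, d))
    mask = torch.ones(b, n)
    mask.scatter_(1, ids_keep, 0.0)                # 0 = visible, 1 = masked
    return visible, mask

if __name__ == "__main__":
    imgs = torch.randn(2, 3, 224, 224)
    patches = patchify(imgs)                       # (2, 196, 768)
    visible, mask = random_masking(patches)
    # An encoder would see only `visible`; a lightweight decoder would then
    # reconstruct pixel values for the masked patches (loss on masked ones only).
    print(visible.shape, mask.sum(dim=1))

In contrast to MAGE, this operates directly on pixels rather than on VQGAN semantic tokens, which is the main design difference the MAGE abstract builds on.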