Exploring Stochastic Autoregressive Image Modeling for Visual
Representation
- URL: http://arxiv.org/abs/2212.01610v1
- Date: Sat, 3 Dec 2022 13:04:29 GMT
- Title: Exploring Stochastic Autoregressive Image Modeling for Visual
Representation
- Authors: Yu Qi, Fan Yang, Yousong Zhu, Yufei Liu, Liwei Wu, Rui Zhao, Wei Li
- Abstract summary: We propose a novel autoregressive image modeling method (named SAIM) built on two simple designs.
By introducing stochastic prediction and the parallel encoder-decoder, SAIM significantly improves the performance of autoregressive image modeling.
Our method achieves the best accuracy (83.9%) on the vanilla ViT-Base model among methods using only ImageNet-1K data.
- Score: 24.582376834198403
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Autoregressive language modeling (ALM) has been successfully used in
self-supervised pre-training in natural language processing (NLP). However,
this paradigm has not achieved results comparable to other self-supervised
approaches in computer vision (e.g., contrastive learning, masked image modeling).
In this paper, we try to find the reason why autoregressive modeling does not
work well on vision tasks. To tackle this problem, we fully analyze the
limitations of visual autoregressive methods and propose a novel stochastic
autoregressive image modeling method (named SAIM) built on two simple designs.
First, we employ a stochastic permutation strategy to generate effective and
robust image context, which is critical for vision tasks. Second, we create a
parallel encoder-decoder training process in which the encoder serves a role
similar to the standard vision transformer, focusing on learning the whole
contextual information, while the decoder predicts the content of the current
position, so that the encoder and decoder can reinforce each other. By
introducing stochastic prediction and the parallel encoder-decoder, SAIM
significantly improves the performance of autoregressive image modeling. Our
method achieves the best accuracy (83.9%) on the vanilla ViT-Base model among
methods using only ImageNet-1K data. Transfer performance on downstream tasks
also shows that our model achieves competitive performance.
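As a rough illustration of the stochastic permutation idea in the abstract, the sketch below samples a random prediction order over image patches and derives two attention masks from it. This is not the authors' implementation: the function name `stochastic_ar_masks`, the names `encoder_mask` and `decoder_mask`, and the assumption that the encoder side may see the current patch while the decoder side may only see strictly earlier patches are all hypothetical choices made for this example.

```python
import random


def stochastic_ar_masks(num_patches, seed=None):
    """Sample a random patch order and build attention masks for it.

    Hypothetical helper for illustration only; not taken from the paper.
    """
    rng = random.Random(seed)

    # order[k] is the index of the patch predicted at step k.
    order = list(range(num_patches))
    rng.shuffle(order)

    # rank[p] is the step at which patch p is predicted.
    rank = [0] * num_patches
    for step, patch in enumerate(order):
        rank[patch] = step

    # Assumption: the encoder side lets patch i attend to itself and to
    # every patch that comes earlier in the sampled order.
    encoder_mask = [
        [rank[j] <= rank[i] for j in range(num_patches)]
        for i in range(num_patches)
    ]

    # Assumption: the decoder side predicts patch i from strictly earlier
    # patches only, so it never sees the content it has to predict.
    decoder_mask = [
        [rank[j] < rank[i] for j in range(num_patches)]
        for i in range(num_patches)
    ]
    return order, encoder_mask, decoder_mask


order, encoder_mask, decoder_mask = stochastic_ar_masks(6, seed=0)
```

Calling the function again with a different seed gives a different order, so over training each patch is predicted from many different contexts rather than from a fixed raster-scan prefix.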
Related papers
- Corner-to-Center Long-range Context Model for Efficient Learned Image
Compression [70.0411436929495]
In the framework of learned image compression, the context model plays a pivotal role in capturing the dependencies among latent representations.
We propose the Corner-to-Center transformer-based Context Model (C$^3$M) designed to enhance context and latent predictions.
In addition, to enlarge the receptive field in the analysis and synthesis transformation, we use the Long-range Crossing Attention Module (LCAM) in the encoder/decoder.
arXiv Detail & Related papers (2023-11-29T21:40:28Z)
- MOCA: Self-supervised Representation Learning by Predicting Masked Online Codebook Assignments [72.6405488990753]
Self-supervised learning can be used for mitigating the greedy needs of Vision Transformer networks.
We propose a single-stage and standalone method, MOCA, which unifies both desired properties.
We achieve new state-of-the-art results on low-shot settings and strong experimental results in various evaluation protocols.
arXiv Detail & Related papers (2023-07-18T15:46:20Z)
- MAGE: MAsked Generative Encoder to Unify Representation Learning and
Image Synthesis [33.46831766206675]
MAsked Generative Encoder (MAGE) is the first framework to unify SOTA image generation and self-supervised representation learning.
Inspired by previous generative models, MAGE uses semantic tokens learned by a vector-quantized GAN at inputs and outputs.
On ImageNet-1K, a single MAGE ViT-L model obtains 9.10 FID in the task of class-unconditional image generation.
arXiv Detail & Related papers (2022-11-16T18:59:02Z)
- Exploring The Role of Mean Teachers in Self-supervised Masked
Auto-Encoders [64.03000385267339]
Masked image modeling (MIM) has become a popular strategy for self-supervised learning (SSL) of visual representations with Vision Transformers.
We present a simple SSL method, the Reconstruction-Consistent Masked Auto-Encoder (RC-MAE), which adds an EMA teacher to MAE.
RC-MAE converges faster and requires less memory than state-of-the-art self-distillation methods during pre-training.
arXiv Detail & Related papers (2022-10-05T08:08:55Z)
- Semantic Image Synthesis with Semantically Coupled VQ-Model [42.19799555533789]
We conditionally synthesize the latent space from a vector quantized model (VQ-model) pre-trained to autoencode images.
We show that our model improves semantic image synthesis using autoregressive models on popular semantic image datasets ADE20k, Cityscapes and COCO-Stuff.
arXiv Detail & Related papers (2022-09-06T14:37:01Z)
- Masked World Models for Visual Control [90.13638482124567]
We introduce a visual model-based RL framework that decouples visual representation learning and dynamics learning.
We demonstrate that our approach achieves state-of-the-art performance on a variety of visual robotic tasks.
arXiv Detail & Related papers (2022-06-28T18:42:27Z)
- Neural Data-Dependent Transform for Learned Image Compression [72.86505042102155]
We build a neural data-dependent transform and introduce a continuous online mode decision mechanism to jointly optimize the coding efficiency for each individual image.
The experimental results show the effectiveness of the proposed neural-syntax design and the continuous online mode decision mechanism.
arXiv Detail & Related papers (2022-03-09T14:56:48Z)
- Masked Autoencoders Are Scalable Vision Learners [60.97703494764904]
Masked autoencoders (MAE) are scalable self-supervised learners for computer vision.
Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels.
Coupling these two designs enables us to train large models efficiently and effectively.
arXiv Detail & Related papers (2021-11-11T18:46:40Z)
- Text-to-Image Generation with Attention Based Recurrent Neural Networks [1.2599533416395765]
We develop a tractable and stable caption-based image generation model.
Experiments are performed on Microsoft datasets.
Results show that the proposed model performs better than contemporary approaches.
arXiv Detail & Related papers (2020-01-18T12:19:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.