Vector-quantized Image Modeling with Improved VQGAN
- URL: http://arxiv.org/abs/2110.04627v1
- Date: Sat, 9 Oct 2021 18:36:00 GMT
- Title: Vector-quantized Image Modeling with Improved VQGAN
- Authors: Jiahui Yu, Xin Li, Jing Yu Koh, Han Zhang, Ruoming Pang, James Qin,
Alexander Ku, Yuanzhong Xu, Jason Baldridge, Yonghui Wu
- Abstract summary: We propose a Vector-quantized Image Modeling approach that involves pretraining a Transformer to predict image tokens autoregressively.
We first propose multiple improvements over vanilla VQGAN from architecture to codebook learning, yielding better efficiency and reconstruction fidelity.
When trained on ImageNet at 256x256 resolution, we achieve Inception Score (IS) of 175.1 and Fréchet Inception Distance (FID) of 4.17, a dramatic improvement over the vanilla VQGAN.
- Score: 93.8443646643864
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pretraining language models with next-token prediction on massive text
corpora has delivered phenomenal zero-shot, few-shot, transfer learning and
multi-tasking capabilities on both generative and discriminative language
tasks. Motivated by this success, we explore a Vector-quantized Image Modeling
(VIM) approach that involves pretraining a Transformer to predict rasterized
image tokens autoregressively. The discrete image tokens are encoded from a
learned Vision-Transformer-based VQGAN (ViT-VQGAN). We first propose multiple
improvements over vanilla VQGAN from architecture to codebook learning,
yielding better efficiency and reconstruction fidelity. The improved ViT-VQGAN
further improves vector-quantized image modeling tasks, including
unconditional, class-conditioned image generation and unsupervised
representation learning. When trained on ImageNet at 256x256 resolution, we
achieve Inception Score (IS) of 175.1 and Fréchet Inception Distance (FID) of
4.17, a dramatic improvement over the vanilla VQGAN, which obtains 70.6 and
17.04 for IS and FID, respectively. Based on ViT-VQGAN and unsupervised
pretraining, we further evaluate the pretrained Transformer by averaging
intermediate features, similar to Image GPT (iGPT). This ImageNet-pretrained
VIM-L significantly outperforms iGPT-L, improving linear-probe accuracy from 60.3% to
72.2% at a similar model size. VIM-L also outperforms iGPT-XL, which is trained with
extra web image data and a larger model size.
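For intuition, here is a minimal sketch (in PyTorch, with hypothetical class names and toy sizes) of the two-stage pipeline the abstract describes: stage 1 quantizes patch embeddings into discrete image tokens via a nearest-neighbour codebook lookup, and stage 2 trains a causal Transformer to predict the rasterized token sequence autoregressively. The paper's ViT-VQGAN architecture and its codebook-learning improvements are not reproduced here.

```python
# Minimal sketch of the two VIM stages (toy sizes, illustrative names; not the
# authors' implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyVectorQuantizer(nn.Module):
    """Stage 1: map continuous patch embeddings to discrete codebook indices."""

    def __init__(self, codebook_size=1024, dim=32):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, z):
        # z: (batch, num_patches, dim) -- stand-in for a ViT encoder's output.
        flat = z.reshape(-1, z.size(-1))
        dists = torch.cdist(flat, self.codebook.weight)       # distance to every code
        tokens = dists.argmin(dim=-1).reshape(z.shape[:-1])   # discrete image tokens
        return tokens, self.codebook(tokens)                  # indices, quantized vectors


class ToyTokenTransformer(nn.Module):
    """Stage 2: causal Transformer that predicts the next image token."""

    def __init__(self, vocab=1024, dim=64, heads=4, layers=2, max_len=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.pos = nn.Parameter(torch.zeros(1, max_len, dim))
        block = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(block, layers)
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens):
        # tokens: (batch, seq_len) rasterized image tokens.
        seq = tokens.size(1)
        x = self.embed(tokens) + self.pos[:, :seq]
        causal = torch.triu(torch.full((seq, seq), float("-inf")), diagonal=1)
        return self.head(self.encoder(x, mask=causal))        # next-token logits


# Toy usage: fake patch embeddings -> tokens -> autoregressive loss.
patches = torch.randn(2, 64, 32)                  # pretend ViT encoder output
tokens, _ = ToyVectorQuantizer()(patches)
logits = ToyTokenTransformer()(tokens[:, :-1])
loss = F.cross_entropy(logits.reshape(-1, 1024), tokens[:, 1:].reshape(-1))
```

For the linear-probe evaluation mentioned in the abstract, one would average intermediate features of the pretrained Transformer and fit a linear classifier on top; that step is omitted from this sketch.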
Related papers
- ViTamin: Designing Scalable Vision Models in the Vision-Language Era [26.878662961209997]
Vision Transformers (ViTs) remain the default choice for the image encoder.
ViTamin-L significantly outperforms ViT-L by 2.0% ImageNet zero-shot accuracy.
ViTamin-XL with only 436M parameters attains 82.9% ImageNet zero-shot accuracy.
arXiv Detail & Related papers (2024-04-02T17:40:29Z)
- VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation [79.02357561313785]
We introduce Vision-Language Generative Pre-trained Transformer (VL-GPT), a transformer model proficient at concurrently perceiving and generating visual and linguistic data.
VL-GPT achieves a unified pre-training approach for both image and text modalities by employing a straightforward auto-regressive objective.
arXiv Detail & Related papers (2023-12-14T18:59:43Z)
- Rejuvenating image-GPT as Strong Visual Representation Learners [28.77567067712619]
This paper enhances image-GPT, one of the pioneering works that introduced autoregressive pretraining to predict the next pixels.
We shift the prediction target from raw pixels to semantic tokens, enabling a higher-level understanding of visual content.
Experiments show that the resulting model, D-iGPT, excels as a strong learner of visual representations.
arXiv Detail & Related papers (2023-12-04T18:59:20Z)
- MAGE: MAsked Generative Encoder to Unify Representation Learning and Image Synthesis [33.46831766206675]
MAsked Generative Encoder (MAGE) is the first framework to unify SOTA image generation and self-supervised representation learning.
Inspired by previous generative models, MAGE uses semantic tokens learned by a vector-quantized GAN at inputs and outputs.
On ImageNet-1K, a single MAGE ViT-L model obtains 9.10 FID in the task of class-unconditional image generation.
arXiv Detail & Related papers (2022-11-16T18:59:02Z)
- MILAN: Masked Image Pretraining on Language Assisted Representation [30.24762638226569]
In this work, we propose masked image pretraining on language assisted representation, dubbed MILAN.
Instead of predicting raw pixels or low level features, our pretraining objective is to reconstruct the image features with substantial semantic signals.
Experimental results demonstrate that MILAN delivers higher accuracy than the previous works.
arXiv Detail & Related papers (2022-08-11T21:58:36Z)
- MVP: Multimodality-guided Visual Pre-training [215.11351064601303]
Masked image modeling (MIM) has become a promising direction for visual pre-training.
In this paper, we introduce guidance from other modalities and validate that such additional knowledge leads to impressive gains for visual pre-training.
The proposed approach is named Multimodality-guided Visual Pre-training (MVP), in which we replace the tokenizer with the vision branch of CLIP, a vision-language model pre-trained on 400 million image-text pairs.
arXiv Detail & Related papers (2022-03-10T06:11:20Z)
- An Empirical Study of Training End-to-End Vision-and-Language Transformers [50.23532518166621]
We present METER (Multimodal End-to-end TransformER), through which we investigate how to design and pre-train a fully transformer-based VL model.
Specifically, we dissect the model designs along multiple dimensions: vision encoders (e.g., CLIP-ViT, Swin transformer), text encoders (e.g., RoBERTa, DeBERTa), and multimodal fusion (e.g., merged attention vs. co-attention).
arXiv Detail & Related papers (2021-11-03T17:55:36Z)
- Scaling Vision Transformers [82.08465256393514]
We study how Vision Transformers scale and characterize the relationships between error rate, data, and compute.
We train a ViT model with two billion parameters, which attains a new state-of-the-art on ImageNet of 90.45% top-1 accuracy.
The model also performs well on few-shot learning, for example, attaining 84.86% top-1 accuracy on ImageNet with only 10 examples per class.
arXiv Detail & Related papers (2021-06-08T17:47:39Z)
- Pre-Trained Image Processing Transformer [95.93031793337613]
We develop a new pre-trained model, namely, the image processing transformer (IPT).
We utilize the well-known ImageNet benchmark to generate a large number of corrupted image pairs.
The IPT model is trained on these images with multi-heads and multi-tails.
arXiv Detail & Related papers (2020-12-01T09:42:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.