Not All Image Regions Matter: Masked Vector Quantization for
Autoregressive Image Generation
- URL: http://arxiv.org/abs/2305.13607v1
- Date: Tue, 23 May 2023 02:15:53 GMT
- Title: Not All Image Regions Matter: Masked Vector Quantization for
Autoregressive Image Generation
- Authors: Mengqi Huang, Zhendong Mao, Quan Wang, Yongdong Zhang
- Abstract summary: Existing autoregressive models follow the two-stage generation paradigm that first learns a codebook in the latent space for image reconstruction and then completes the image generation autoregressively based on the learned codebook.
We propose a novel two-stage framework, which consists of Masked Quantization VAE (MQ-VAE) and Stackformer, to relieve the model from modeling redundancy.
- Score: 78.13793505707952
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing autoregressive models follow the two-stage generation paradigm that
first learns a codebook in the latent space for image reconstruction and then
completes the image generation autoregressively based on the learned codebook.
However, existing codebook learning simply models all local region information
of images without distinguishing their different perceptual importance, which
brings redundancy in the learned codebook that not only limits the next stage's
autoregressive model's ability to model important structure but also results in
high training cost and slow generation speed. In this study, we borrow the idea
of importance perception from classical image coding theory and propose a novel
two-stage framework, which consists of Masked Quantization VAE (MQ-VAE) and
Stackformer, to relieve the model from modeling redundancy. Specifically,
MQ-VAE incorporates an adaptive mask module for masking redundant region
features before quantization and an adaptive de-mask module for recovering the
original grid image feature map to faithfully reconstruct the original images
after quantization. Then, Stackformer learns to predict the combination of the
next code and its position in the feature map. Comprehensive experiments on
various image generation tasks validate our effectiveness and efficiency. Code will
be released at https://github.com/CrossmodalGroup/MaskedVectorQuantization.
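
The abstract describes the mechanism only in prose, so the following is a minimal, hedged sketch of the masked-quantization idea: score grid features by importance, quantize only the kept regions against a codebook, then scatter the quantized features back onto the full grid for the decoder. This is not the authors' released code; every name here (AdaptiveMaskQuantizer, keep_ratio, placeholder, ...) is an assumption made for illustration.

```python
import torch
import torch.nn as nn


class AdaptiveMaskQuantizer(nn.Module):
    """Toy stand-in for a mask -> quantize -> de-mask pipeline (not the paper's code)."""

    def __init__(self, dim=64, codebook_size=512, keep_ratio=0.5):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)                    # importance score per region
        self.codebook = nn.Embedding(codebook_size, dim)   # discrete code embeddings
        self.placeholder = nn.Parameter(torch.zeros(dim))  # fills masked (dropped) slots
        self.keep_ratio = keep_ratio

    def forward(self, feats):                              # feats: (B, N, dim) grid features
        B, N, D = feats.shape
        k = max(1, int(N * self.keep_ratio))

        # adaptive mask: keep only the k regions scored as most important
        scores = self.scorer(feats).squeeze(-1)            # (B, N)
        keep_idx = scores.topk(k, dim=1).indices           # (B, k) positions to keep
        gather_idx = keep_idx.unsqueeze(-1).expand(-1, -1, D)
        kept = torch.gather(feats, 1, gather_idx)          # (B, k, D)

        # quantize only the kept features (nearest codebook entry)
        dists = torch.cdist(kept, self.codebook.weight.unsqueeze(0).expand(B, -1, -1))
        codes = dists.argmin(-1)                           # (B, k) code indices
        quant = self.codebook(codes)                       # (B, k, D)

        # de-mask: scatter quantized features back onto the full grid,
        # filling the dropped positions with a learned placeholder vector
        grid = self.placeholder.expand(B, N, D).clone()
        grid = grid.scatter(1, gather_idx, quant)          # (B, N, D) for the decoder
        return grid, codes, keep_idx


feats = torch.randn(2, 16, 64)                             # e.g. a flattened 4x4 feature map
grid, codes, positions = AdaptiveMaskQuantizer()(feats)
print(grid.shape, codes.shape, positions.shape)            # (2, 16, 64), (2, 8), (2, 8)
```

The (codes, positions) pair returned at the end is the kind of sequence the paper's second stage (Stackformer) would model autoregressively, predicting the next code together with where it sits in the feature map.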
Related papers
- VQAttack: Transferable Adversarial Attacks on Visual Question Answering
via Pre-trained Models [58.21452697997078]
We propose a novel VQAttack model, which generates both image and text perturbations through its designed modules.
Experimental results on two VQA datasets with five validated models demonstrate the effectiveness of the proposed VQAttack.
arXiv Detail & Related papers (2024-02-16T21:17:42Z)
- Reinforcement Learning from Diffusion Feedback: Q* for Image Search [2.5835347022640254]
We present two models for image generation using model-agnostic learning.
RLDF is a singular approach for visual imitation through prior-preserving reward function guidance.
It generates high-quality images over varied domains showcasing class-consistency and strong visual diversity.
arXiv Detail & Related papers (2023-11-27T09:20:12Z)
- Towards Accurate Image Coding: Improved Autoregressive Image Generation
with Dynamic Vector Quantization [73.52943587514386]
Existing vector quantization (VQ) based autoregressive models follow a two-stage generation paradigm.
We propose a novel two-stage framework: (1) Dynamic-Quantization VAE (DQ-VAE), which encodes image regions into variable-length codes based on their information densities for accurate representation.
arXiv Detail & Related papers (2023-05-19T14:56:05Z)
- Improving Masked Autoencoders by Learning Where to Mask [65.89510231743692]
Masked image modeling is a promising self-supervised learning method for visual data.
We present AutoMAE, a framework that uses Gumbel-Softmax to interlink an adversarially-trained mask generator and a mask-guided image modeling process (a toy sketch of this kind of differentiable mask sampling appears after this list).
In our experiments, AutoMAE is shown to provide effective pretraining models on standard self-supervised benchmarks and downstream tasks.
arXiv Detail & Related papers (2023-03-12T05:28:55Z)
- MAGE: MAsked Generative Encoder to Unify Representation Learning and
Image Synthesis [33.46831766206675]
MAsked Generative Encoder (MAGE) is the first framework to unify SOTA image generation and self-supervised representation learning.
Inspired by previous generative models, MAGE uses semantic tokens learned by a vector-quantized GAN at its inputs and outputs.
On ImageNet-1K, a single MAGE ViT-L model obtains 9.10 FID in the task of class-unconditional image generation.
arXiv Detail & Related papers (2022-11-16T18:59:02Z)
- Stare at What You See: Masked Image Modeling without Reconstruction [154.74533119863864]
Masked Autoencoders (MAE) have been prevailing paradigms for large-scale vision representation pre-training.
Recent approaches apply semantic-rich teacher models to extract image features as the reconstruction target, leading to better performance.
We argue the features extracted by powerful teacher models already encode rich semantic correlation across regions in an intact image.
arXiv Detail & Related papers (2022-11-16T12:48:52Z)
- Meta Internal Learning [88.68276505511922]
Internal learning for single-image generation is a framework in which a generator is trained to produce novel images based on a single image.
We propose a meta-learning approach that enables training over a collection of images, in order to model the internal statistics of the sample image more effectively.
Our results show that the models obtained are as suitable as single-image GANs for many common image applications.
arXiv Detail & Related papers (2021-10-06T16:27:38Z)
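
The AutoMAE entry above mentions Gumbel-Softmax as the link between an adversarially trained mask generator and masked image modeling. As referenced there, here is a toy, assumption-laden sketch (names like GumbelMaskGenerator are invented for illustration, not taken from that paper) of how per-patch keep/mask decisions can be sampled differentiably with PyTorch's gumbel_softmax:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GumbelMaskGenerator(nn.Module):
    """Toy differentiable patch-mask sampler, loosely in the spirit of a learned mask generator."""

    def __init__(self, dim=64):
        super().__init__()
        self.to_logits = nn.Linear(dim, 2)        # two classes per patch: keep vs. mask

    def forward(self, patch_feats, tau=1.0):      # patch_feats: (B, N, dim)
        logits = self.to_logits(patch_feats)
        # hard one-hot decisions in the forward pass, soft gradients in the backward pass
        one_hot = F.gumbel_softmax(logits, tau=tau, hard=True)
        return one_hot[..., 0]                    # (B, N): 1 = keep patch, 0 = mask it


mask = GumbelMaskGenerator()(torch.randn(2, 196, 64))   # 196 patches, as in a 14x14 ViT grid
print(mask.shape, mask.sum(dim=1))                      # per-image count of kept patches
```

Because hard=True yields discrete masks while gradients flow through the soft relaxation, such a mask generator could in principle be trained end-to-end alongside a masked image modeling objective.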