Autoregressive Image Generation using Residual Quantization
- URL: http://arxiv.org/abs/2203.01941v1
- Date: Thu, 3 Mar 2022 11:44:46 GMT
- Title: Autoregressive Image Generation using Residual Quantization
- Authors: Doyup Lee, Chiheon Kim, Saehoon Kim, Minsu Cho, Wook-Shin Han
- Abstract summary: We propose a two-stage framework to generate high-resolution images.
The framework consists of Residual-Quantized VAE (RQ-VAE) and RQ-Transformer.
Our approach generates high-quality images with a significantly faster sampling speed than previous AR models.
- Score: 40.04085054791994
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: For autoregressive (AR) modeling of high-resolution images, vector
quantization (VQ) represents an image as a sequence of discrete codes. A short
sequence length is important for an AR model to reduce the computational cost
of modeling long-range interactions among codes. However, we postulate that
previous VQ approaches cannot both shorten the code sequence and generate
high-fidelity images, owing to the rate-distortion trade-off. In this study, we
propose a two-stage framework, consisting of Residual-Quantized VAE (RQ-VAE)
and RQ-Transformer, to effectively generate high-resolution images. Given a
fixed codebook size, RQ-VAE can precisely approximate the feature map of an
image and represent the image as a stacked map of discrete codes. RQ-Transformer
then learns to predict the quantized feature vector at the next position by
predicting the next stack of codes. Thanks to the precise approximation of
RQ-VAE, we can represent a 256$\times$256 image as an 8$\times$8 feature map,
and RQ-Transformer can efficiently reduce its computational costs. Consequently,
our framework outperforms existing AR models on various benchmarks of
unconditional and conditional image generation. Our approach also generates
high-quality images with a significantly faster sampling speed than previous AR
models.
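As a rough illustration of the residual quantization behind RQ-VAE, the sketch
below approximates each encoder feature vector in D rounds, coding the remaining
residual at every round with a single shared codebook. The function name, tensor
shapes, and shared-codebook assumption are illustrative choices, not the paper's
released implementation.

```python
import torch

def residual_quantize(z, codebook, depth):
    """Approximate feature vectors z by a depth-D stack of discrete codes.

    z:        (N, dim) encoder feature vectors (N = H*W spatial positions)
    codebook: (K, dim) codebook shared across depths (an assumption here)
    Returns:  codes (N, depth) integer indices and z_hat (N, dim), the
              cumulative quantized approximation of z.
    """
    residual = z
    z_hat = torch.zeros_like(z)
    codes = []
    for _ in range(depth):
        # pick the nearest codebook entry for the current residual
        dists = torch.cdist(residual, codebook)   # (N, K)
        idx = dists.argmin(dim=1)                 # (N,)
        selected = codebook[idx]                  # (N, dim)
        z_hat = z_hat + selected                  # refine the approximation
        residual = residual - selected            # pass the error to the next depth
        codes.append(idx)
    return torch.stack(codes, dim=1), z_hat
```

The stacked indices are the "stacked map of discrete codes" referred to in the
abstract: with an 8$\times$8 feature map and depth $D$, RQ-Transformer models
them autoregressively, in essence factorizing
$p(c) = \prod_{t} \prod_{d=1}^{D} p(c_{t,d} \mid c_{<t}, c_{t,<d})$,
so the next position's codes are predicted depth by depth.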
Related papers
- PassionSR: Post-Training Quantization with Adaptive Scale in One-Step Diffusion based Image Super-Resolution [87.89013794655207]
Diffusion-based image super-resolution (SR) models have shown superior performance at the cost of multiple denoising steps.
We propose a novel post-training quantization approach with adaptive scale in one-step diffusion (OSD) image SR, PassionSR.
Our PassionSR achieves significant advantages over recent leading low-bit quantization methods for image SR.
arXiv Detail & Related papers (2024-11-26T04:49:42Z)
- Accelerating Auto-regressive Text-to-Image Generation with Training-free Speculative Jacobi Decoding [60.188309982690335]
We propose a training-free probabilistic parallel decoding algorithm, Speculative Jacobi Decoding (SJD), to accelerate auto-regressive text-to-image generation.
By introducing a probabilistic convergence criterion, our SJD accelerates the inference of auto-regressive text-to-image generation while maintaining the randomness in sampling-based token decoding.
arXiv Detail & Related papers (2024-10-02T16:05:27Z)
- ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models [77.59651787115546]
High-resolution Large Multimodal Models (LMMs) encounter the challenges of excessive visual tokens and quadratic visual complexity.
We propose ConvLLaVA, which employs ConvNeXt, a hierarchical backbone, as the visual encoder of LMM.
ConvLLaVA compresses high-resolution images into information-rich visual features, effectively preventing the generation of excessive visual tokens.
arXiv Detail & Related papers (2024-05-24T17:34:15Z)
- Soft Convex Quantization: Revisiting Vector Quantization with Convex Optimization [40.1651740183975]
We propose Soft Convex Quantization (SCQ) as a direct substitute for Vector Quantization (VQ).
SCQ works like a differentiable convex optimization (DCO) layer; a minimal soft-assignment sketch appears after this list.
We demonstrate its efficacy on the CIFAR-10, GTSRB and LSUN datasets.
arXiv Detail & Related papers (2023-10-04T17:45:14Z)
- Progressive Text-to-Image Generation [40.09326229583334]
We present a progressive model for high-fidelity text-to-image generation.
The proposed method takes effect by creating new image tokens from coarse to fine based on the existing context.
The resulting coarse-to-fine hierarchy makes the image generation process intuitive and interpretable.
arXiv Detail & Related papers (2022-10-05T14:27:20Z)
- Lightweight Image Codec via Multi-Grid Multi-Block-Size Vector Quantization (MGBVQ) [37.36588620264085]
We present a new method to remove pixel correlations.
By decomposing correlations into long- and short-range correlations, we represent long-range correlations in coarser grids.
We show that short-range correlations can be effectively coded by a suite of vector quantizers.
arXiv Detail & Related papers (2022-09-25T04:14:26Z)
- MoVQ: Modulating Quantized Vectors for High-Fidelity Image Generation [41.029441562130984]
Two-stage Vector Quantized (VQ) generative models allow for synthesizing high-fidelity and high-resolution images.
Our proposed modulated VQGAN is able to greatly improve the reconstructed image quality as well as provide high-fidelity image generation.
arXiv Detail & Related papers (2022-09-19T13:26:51Z)
- Hierarchical Residual Learning Based Vector Quantized Variational Autoencoder for Image Reconstruction and Generation [19.92324010429006]
We propose a multi-layer variational autoencoder method, which we call HR-VQVAE, that learns hierarchical discrete representations of the data.
We evaluate our method on the tasks of image reconstruction and generation.
arXiv Detail & Related papers (2022-08-09T06:04:25Z)
- Vector Quantized Diffusion Model for Text-to-Image Synthesis [47.09451151258849]
We present the vector quantized diffusion (VQ-Diffusion) model for text-to-image generation.
Our experiments show that the VQ-Diffusion produces significantly better text-to-image generation results.
arXiv Detail & Related papers (2021-11-29T18:59:46Z)
- Hierarchical Conditional Flow: A Unified Framework for Image Super-Resolution and Image Rescaling [139.25215100378284]
We propose a hierarchical conditional flow (HCFlow) as a unified framework for image SR and image rescaling.
HCFlow learns a mapping between HR and LR image pairs by simultaneously modelling the distribution of the LR image and the remaining high-frequency component.
To further enhance performance, losses such as perceptual loss and GAN loss are combined with the commonly used negative log-likelihood loss during training.
arXiv Detail & Related papers (2021-08-11T16:11:01Z)
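Referring back to the Soft Convex Quantization entry above, the minimal sketch
below illustrates the general idea of replacing hard nearest-neighbor VQ with a
differentiable convex combination of codebook vectors. It uses a temperature-scaled
softmax over negative squared distances, a simple relaxation chosen for
illustration rather than the convex-optimization (DCO) layer that SCQ actually
solves; the function name and temperature parameter are assumptions.

```python
import torch
import torch.nn.functional as F

def soft_quantize(z, codebook, temperature=0.5):
    """Differentiable stand-in for hard VQ: each output is a convex
    combination of codebook vectors weighted by a softmax over negative
    squared distances (an illustrative relaxation, not SCQ's DCO solver).

    z:        (N, dim) input feature vectors
    codebook: (K, dim) codebook
    Returns:  (N, dim) soft-quantized vectors and (N, K) convex weights.
    """
    dists = torch.cdist(z, codebook) ** 2             # (N, K) squared distances
    weights = F.softmax(-dists / temperature, dim=1)  # rows sum to 1
    z_soft = weights @ codebook                       # convex combination
    return z_soft, weights
```

Because every row of `weights` lies on the probability simplex, the output stays
inside the convex hull of the codebook, and gradients flow to both the inputs and
the codebook without a straight-through estimator.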