Towards Accurate Image Coding: Improved Autoregressive Image Generation
with Dynamic Vector Quantization
- URL: http://arxiv.org/abs/2305.11718v1
- Date: Fri, 19 May 2023 14:56:05 GMT
- Title: Towards Accurate Image Coding: Improved Autoregressive Image Generation
with Dynamic Vector Quantization
- Authors: Mengqi Huang, Zhendong Mao, Zhuowei Chen, Yongdong Zhang
- Abstract summary: Existing vector quantization (VQ) based autoregressive models follow a two-stage generation paradigm.
We propose a novel two-stage framework: (1) Dynamic-Quantization VAE (DQ-VAE) which encodes image regions into variable-length codes based on their information densities for accurate representation.
- Score: 73.52943587514386
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing vector quantization (VQ) based autoregressive models follow a
two-stage generation paradigm that first learns a codebook to encode images as
discrete codes, and then completes generation based on the learned codebook.
However, they encode fixed-size image regions into fixed-length codes and
ignore their naturally different information densities, which results in
insufficiency in important regions and redundancy in unimportant ones, and
finally degrades the generation quality and speed. Moreover, the fixed-length
coding leads to an unnatural raster-scan autoregressive generation. To address
the problem, we propose a novel two-stage framework: (1) Dynamic-Quantization
VAE (DQ-VAE) which encodes image regions into variable-length codes based on
their information densities for an accurate and compact code representation.
(2) DQ-Transformer which thereby generates images autoregressively from
coarse-grained (smooth regions with fewer codes) to fine-grained (detailed
regions with more codes) by modeling the position and content of codes in each
granularity alternately, through a novel stacked-transformer architecture and
shared-content, non-shared-position input layer designs. Comprehensive
experiments on various generation tasks validate the superiority of our approach in both
effectiveness and efficiency. Code will be released at
https://github.com/CrossmodalGroup/DynamicVectorQuantization.
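The variable-length coding idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the variance-based `density` proxy, the fixed `threshold`, the 2x2 region size, and all function names are assumptions chosen for illustration, and the paper's actual mechanism for choosing granularity may differ. Smooth regions get one code for their pooled feature, and detailed regions get one code per position.

```python
def nearest_code(vec, codebook):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    dists = [sum((v - c) ** 2 for v, c in zip(vec, code)) for code in codebook]
    return dists.index(min(dists))


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def density(vectors):
    """Mean per-dimension variance: a hypothetical proxy for information density."""
    mu = mean_vector(vectors)
    n = len(vectors)
    per_dim = [sum((v[k] - mu[k]) ** 2 for v in vectors) / n for k in range(len(mu))]
    return sum(per_dim) / len(per_dim)


def dynamic_encode(grid, codebook, threshold):
    """Encode each 2x2 region of an H x W feature grid (H, W even).

    Regions below the density threshold get 1 code (coarse);
    the others get 4 codes, one per position (fine).
    """
    out = []
    for i in range(0, len(grid), 2):
        for j in range(0, len(grid[0]), 2):
            region = [grid[i + di][j + dj] for di in (0, 1) for dj in (0, 1)]
            if density(region) < threshold:
                pooled = mean_vector(region)
                out.append(((i, j), "coarse", [nearest_code(pooled, codebook)]))
            else:
                out.append(((i, j), "fine", [nearest_code(v, codebook) for v in region]))
    return out


# Toy data: a 4x4 grid of 2-D features that is flat except for the top-left region.
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
grid = [[[0.0, 0.0] for _ in range(4)] for _ in range(4)]
grid[0][1] = [1.0, 1.0]
grid[1][0] = [0.0, 1.0]
grid[1][1] = [1.0, 0.0]

codes = dynamic_encode(grid, codebook, threshold=0.1)
total_codes = sum(len(indices) for _, _, indices in codes)
```

On this toy grid the single detailed region takes four codes and the three flat regions take one each, so the image is represented by 7 codes where a fixed fine-grained grid would use 16.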
Related papers
- HybridFlow: Infusing Continuity into Masked Codebook for Extreme Low-Bitrate Image Compression [51.04820313355164]
HybridFlow combines the continuous-feature-based and codebook-based streams to achieve both high perceptual quality and high fidelity at extremely low bitrates.
Experimental results demonstrate superior performance across several datasets at extremely low bitrates.
arXiv Detail & Related papers (2024-04-20T13:19:08Z) - Not All Image Regions Matter: Masked Vector Quantization for
Autoregressive Image Generation [78.13793505707952]
Existing autoregressive models follow the two-stage generation paradigm that first learns a codebook in the latent space for image reconstruction and then completes the image generation autoregressively based on the learned codebook.
We propose a novel two-stage framework consisting of a Masked Quantization VAE (MQ-VAE), which relieves the model from modeling redundancy, and a Stack model.
arXiv Detail & Related papers (2023-05-23T02:15:53Z) - SC-VAE: Sparse Coding-based Variational Autoencoder with Learned ISTA [0.6770292596301478]
We introduce a new VAE variant, termed sparse coding-based VAE with learned ISTA (SC-VAE), which integrates sparse coding within the variational autoencoder framework.
Experiments on two image datasets demonstrate that our model achieves improved image reconstruction results compared to state-of-the-art methods.
arXiv Detail & Related papers (2023-03-29T13:18:33Z) - MoVQ: Modulating Quantized Vectors for High-Fidelity Image Generation [41.029441562130984]
Two-stage Vector Quantized (VQ) generative models allow for synthesizing high-fidelity and high-resolution images.
Our proposed modulated VQGAN is able to greatly improve the reconstructed image quality as well as provide high-fidelity image generation.
arXiv Detail & Related papers (2022-09-19T13:26:51Z) - Style Transformer for Image Inversion and Editing [35.45674653596084]
Existing GAN inversion methods fail to provide latent codes for reliable reconstruction and flexible editing simultaneously.
This paper presents a transformer-based image inversion and editing model for pretrained StyleGAN.
The proposed model employs a CNN encoder to provide multi-scale image features as keys and values.
arXiv Detail & Related papers (2022-03-15T14:16:57Z) - Unpaired Image-to-Image Translation via Latent Energy Transport [61.62293304236371]
Image-to-image translation aims to preserve source contents while translating to discriminative target styles between two visual domains.
In this paper, we propose to deploy an energy-based model (EBM) in the latent space of a pretrained autoencoder for this task.
Our model is the first to be applicable to 1024$\times$1024-resolution unpaired image translation.
arXiv Detail & Related papers (2020-12-01T17:18:58Z) - Free-Form Image Inpainting via Contrastive Attention Network [64.05544199212831]
In image inpainting tasks, masks of any shape can appear anywhere in images, forming complex patterns.
It is difficult for encoders to capture such powerful representations under this complex situation.
We propose a self-supervised Siamese inference network to improve the robustness and generalization.
arXiv Detail & Related papers (2020-10-29T14:46:05Z) - Swapping Autoencoder for Deep Image Manipulation [94.33114146172606]
We propose the Swapping Autoencoder, a deep model designed specifically for image manipulation.
The key idea is to encode an image with two independent components and enforce that any swapped combination maps to a realistic image.
Experiments on multiple datasets show that our model produces better results and is substantially more efficient compared to recent generative models.
arXiv Detail & Related papers (2020-07-01T17:59:57Z) - Consistent Multiple Sequence Decoding [36.46573114422263]
We introduce a consistent multiple sequence decoding architecture.
This architecture allows for consistent and simultaneous decoding of an arbitrary number of sequences.
We show the efficacy of our consistent multiple sequence decoder on the task of dense relational image captioning.
arXiv Detail & Related papers (2020-04-02T00:43:54Z)