MoVQ: Modulating Quantized Vectors for High-Fidelity Image Generation
- URL: http://arxiv.org/abs/2209.09002v1
- Date: Mon, 19 Sep 2022 13:26:51 GMT
- Title: MoVQ: Modulating Quantized Vectors for High-Fidelity Image Generation
- Authors: Chuanxia Zheng and Long Tung Vuong and Jianfei Cai and Dinh Phung
- Abstract summary: Two-stage Vector Quantized (VQ) generative models allow for synthesizing high-fidelity and high-resolution images.
Our proposed modulated VQGAN is able to greatly improve the reconstructed image quality as well as provide high-fidelity image generation.
- Score: 41.029441562130984
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Although two-stage Vector Quantized (VQ) generative models allow for
synthesizing high-fidelity and high-resolution images, their quantization
operator encodes similar patches within an image into the same index, resulting
in a repeated artifact for similar adjacent regions using existing decoder
architectures. To address this issue, we propose to incorporate the spatially
conditional normalization to modulate the quantized vectors so as to insert
spatially variant information to the embedded index maps, encouraging the
decoder to generate more photorealistic images. Moreover, we use multichannel
quantization to increase the recombination capability of the discrete codes
without increasing the cost of model and codebook. Additionally, to generate
discrete tokens at the second stage, we adopt a Masked Generative Image
Transformer (MaskGIT) to learn an underlying prior distribution in the
compressed latent space, which is much faster than the conventional
autoregressive model. Experiments on two benchmark datasets demonstrate that
our proposed modulated VQGAN is able to greatly improve the reconstructed image
quality as well as provide high-fidelity image generation.
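The multichannel quantization idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes each latent vector is split into equal-sized chunks and that every chunk is matched against one shared codebook by squared L2 distance. The names `nearest_index` and `quantize_multichannel`, the toy codebook, and the patch values are all hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def nearest_index(chunk: Vector, codebook: Sequence[Vector]) -> int:
    """Return the index of the codeword closest to `chunk` (squared L2)."""
    def sq_dist(code: Vector) -> float:
        return sum((a - b) ** 2 for a, b in zip(chunk, code))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def quantize_multichannel(vec: Vector, codebook: Sequence[Vector],
                          channels: int) -> List[int]:
    """Split `vec` into `channels` equal chunks and quantize each chunk
    against the same shared codebook (assumed scheme, one index per chunk)."""
    if len(vec) % channels != 0:
        raise ValueError("vector length must be divisible by channels")
    size = len(vec) // channels
    if any(len(code) != size for code in codebook):
        raise ValueError("codeword length must equal the chunk size")
    return [nearest_index(vec[c * size:(c + 1) * size], codebook)
            for c in range(channels)]


# Toy shared codebook with 3 codewords of dimension 2 (hypothetical values).
codebook = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]

patch_a = (0.1, -0.1, 0.9, 1.2)
patch_b = (0.05, 0.0, 1.1, 0.8)   # similar to patch_a
patch_c = (0.1, 0.9, 0.0, 0.1)    # clearly different

codes_a = quantize_multichannel(patch_a, codebook, channels=2)
codes_b = quantize_multichannel(patch_b, codebook, channels=2)
codes_c = quantize_multichannel(patch_c, codebook, channels=2)

print(codes_a, codes_b, codes_c)
```

With two channels, the three codewords can be combined into 3 × 3 = 9 distinct index tuples without enlarging the codebook. The similar patches `patch_a` and `patch_b` still receive identical indices, which is the situation the abstract says the spatially conditional normalization is meant to address.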
Related papers
- SQ-GAN: Semantic Image Communications Using Masked Vector Quantization [55.02795214161371]
This work introduces Semantically Masked VQ-GAN (SQ-GAN), a novel approach to optimize image compression for semantic/task-oriented communications.
SQ-GAN employs off-the-shelf semantic segmentation and a new semantic-conditioned adaptive mask module (SAMM) to selectively encode semantically significant features of the images.
arXiv Detail & Related papers (2025-02-13T17:35:57Z) - Meissonic: Revitalizing Masked Generative Transformers for Efficient High-Resolution Text-to-Image Synthesis [62.06970466554273]
We present Meissonic, which elevates non-autoregressive masked image modeling (MIM) text-to-image to a level comparable with state-of-the-art diffusion models like SDXL.
We leverage high-quality training data, integrate micro-conditions informed by human preference scores, and employ feature compression layers.
Our model not only matches but often exceeds the performance of existing models like SDXL in generating high-quality, high-resolution images.
arXiv Detail & Related papers (2024-10-10T17:59:17Z) - AICT: An Adaptive Image Compression Transformer [18.05997169440533]
We propose a more straightforward yet effective Transformer-based channel-wise auto-regressive prior model, resulting in an absolute image compression transformer (ICT).
The proposed ICT can capture both global and local contexts from the latent representations.
We leverage a learnable scaling module with a sandwich ConvNeXt-based pre/post-processor to accurately extract more compact latent representation.
arXiv Detail & Related papers (2023-07-12T11:32:02Z) - Not All Image Regions Matter: Masked Vector Quantization for Autoregressive Image Generation [78.13793505707952]
Existing autoregressive models follow the two-stage generation paradigm that first learns a codebook in the latent space for image reconstruction and then completes the image generation autoregressively based on the learned codebook.
We propose a novel two-stage framework, which consists of a Masked Quantization VAE (MQ-VAE) and a Stack model, to relieve the model from modeling redundancy.
arXiv Detail & Related papers (2023-05-23T02:15:53Z) - Towards Accurate Image Coding: Improved Autoregressive Image Generation with Dynamic Vector Quantization [73.52943587514386]
Existing vector quantization (VQ) based autoregressive models follow a two-stage generation paradigm.
We propose a novel two-stage framework: (1) Dynamic-Quantization VAE (DQ-VAE), which encodes image regions into variable-length codes based on their information densities for accurate representation.
arXiv Detail & Related papers (2023-05-19T14:56:05Z) - Neural Data-Dependent Transform for Learned Image Compression [72.86505042102155]
We build a neural data-dependent transform and introduce a continuous online mode decision mechanism to jointly optimize the coding efficiency for each individual image.
The experimental results show the effectiveness of the proposed neural-syntax design and the continuous online mode decision mechanism.
arXiv Detail & Related papers (2022-03-09T14:56:48Z) - Autoregressive Image Generation using Residual Quantization [40.04085054791994]
We propose a two-stage framework to generate high-resolution images.
The framework consists of Residual-Quantized VAE (RQ-VAE) and RQ-Transformer.
Our approach has a significantly faster sampling speed than previous AR models to generate high-quality images.
arXiv Detail & Related papers (2022-03-03T11:44:46Z) - Transformer-based Image Compression [18.976159633970177]
A Transformer-based Image Compression (TIC) approach is developed, which reuses the canonical variational autoencoder (VAE) architecture with paired main and hyper encoder-decoders.
TIC rivals state-of-the-art approaches, including deep convolutional neural network (CNN) based learnt image coding (LIC) methods and the handcrafted rules-based intra profile of the recently approved Versatile Video Coding (VVC) standard.
arXiv Detail & Related papers (2021-11-12T13:13:20Z) - Generating Images with Sparse Representations [21.27273495926409]
High dimensionality of images presents architecture and sampling-efficiency challenges for likelihood-based generative models.
We present an alternative approach, inspired by common image compression methods like JPEG, and convert images to quantized discrete cosine transform (DCT) blocks.
We propose a Transformer-based autoregressive architecture, which is trained to sequentially predict the conditional distribution of the next element in such sequences.
arXiv Detail & Related papers (2021-03-05T17:56:03Z) - Generalized Octave Convolutions for Learned Multi-Frequency Image Compression [20.504561050200365]
We propose the first learned multi-frequency image compression and entropy coding approach.
It is based on the recently developed octave convolutions to factorize the latents into high and low frequency (resolution) components.
We show that the proposed generalized octave convolution can improve the performance of other auto-encoder-based computer vision tasks.
arXiv Detail & Related papers (2020-02-24T01:35:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.