MoVQ: Modulating Quantized Vectors for High-Fidelity Image Generation
- URL: http://arxiv.org/abs/2209.09002v1
- Date: Mon, 19 Sep 2022 13:26:51 GMT
- Title: MoVQ: Modulating Quantized Vectors for High-Fidelity Image Generation
- Authors: Chuanxia Zheng and Long Tung Vuong and Jianfei Cai and Dinh Phung
- Abstract summary: Two-stage Vector Quantized (VQ) generative models allow for synthesizing high-fidelity and high-resolution images.
Our proposed modulated VQGAN is able to greatly improve the reconstructed image quality as well as provide high-fidelity image generation.
- Score: 41.029441562130984
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Although two-stage Vector Quantized (VQ) generative models allow for
synthesizing high-fidelity and high-resolution images, their quantization
operator encodes similar patches within an image into the same index, resulting
in repeated artifacts across similar adjacent regions with existing decoder
architectures. To address this issue, we propose to incorporate the spatially
conditional normalization to modulate the quantized vectors so as to insert
spatially variant information to the embedded index maps, encouraging the
decoder to generate more photorealistic images. Moreover, we use multichannel
quantization to increase the recombination capability of the discrete codes
without increasing the cost of the model or codebook. Additionally, to generate
discrete tokens at the second stage, we adopt a Masked Generative Image
Transformer (MaskGIT) to learn an underlying prior distribution in the
compressed latent space, which is much faster than the conventional
autoregressive model. Experiments on two benchmark datasets demonstrate that
our proposed modulated VQGAN is able to greatly improve the reconstructed image
quality as well as provide high-fidelity image generation.
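To make the first-stage fix concrete, below is a minimal sketch, assuming a SPADE-style layer: decoder features are normalized without affine parameters and then modulated per pixel by scale and shift maps predicted from the quantized latents, so adjacent regions that share a code index no longer feed identical activations to the decoder. The module name, layer widths, and the use of InstanceNorm are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatiallyConditionalNorm(nn.Module):
    """Sketch: modulate normalized decoder features with per-pixel
    scale/shift predicted from the quantized latent map (SPADE-style)."""

    def __init__(self, feat_channels: int, cond_channels: int, hidden: int = 128):
        super().__init__()
        self.norm = nn.InstanceNorm2d(feat_channels, affine=False)  # parameter-free base norm
        self.shared = nn.Sequential(
            nn.Conv2d(cond_channels, hidden, 3, padding=1), nn.ReLU(inplace=True)
        )
        self.to_gamma = nn.Conv2d(hidden, feat_channels, 3, padding=1)
        self.to_beta = nn.Conv2d(hidden, feat_channels, 3, padding=1)

    def forward(self, feat: torch.Tensor, zq: torch.Tensor) -> torch.Tensor:
        # zq: quantized vectors (B, cond_channels, h, w), resized to feat's grid
        zq = F.interpolate(zq, size=feat.shape[-2:], mode="nearest")
        h = self.shared(zq)
        # spatially variant scale/shift breaks the tie between identical code indices
        return self.norm(feat) * (1 + self.to_gamma(h)) + self.to_beta(h)

# usage: modulate 256-channel decoder features with 4-channel quantized latents
scn = SpatiallyConditionalNorm(feat_channels=256, cond_channels=4)
out = scn(torch.randn(2, 256, 32, 32), torch.randn(2, 4, 16, 16))  # -> (2, 256, 32, 32)
```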
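Multichannel quantization admits a similarly compact sketch. The assumption here (ours, not necessarily the paper's exact scheme) is that the latent is split channel-wise into groups, each quantized independently against one shared codebook, so a single spatial position can express K^groups code combinations while model and codebook sizes stay fixed.

```python
import torch
import torch.nn as nn

class MultiChannelQuantizer(nn.Module):
    """Sketch: group-wise vector quantization with one shared codebook."""

    def __init__(self, num_codes: int = 1024, code_dim: int = 64, groups: int = 4):
        super().__init__()
        self.groups = groups
        self.codebook = nn.Embedding(num_codes, code_dim)  # shared across all groups

    def forward(self, z: torch.Tensor):
        # z: (B, groups * code_dim, H, W); quantize each code_dim slice separately
        b, c, h, w = z.shape
        d = c // self.groups
        z = z.view(b, self.groups, d, h, w).permute(0, 1, 3, 4, 2)  # (B, g, H, W, d)
        flat = z.reshape(-1, d)
        dist = torch.cdist(flat, self.codebook.weight)   # L2 distance to every code
        idx = dist.argmin(dim=1)                         # nearest-code indices
        zq = self.codebook(idx).view(b, self.groups, h, w, d)
        zq = z + (zq - z).detach()                       # straight-through gradients
        zq = zq.permute(0, 1, 4, 2, 3).reshape(b, c, h, w)
        return zq, idx.view(b, self.groups, h, w)        # g index maps per image
```

With 1024 codes and 4 groups, each position can realize 1024^4 distinct quantized vectors without a single extra codebook entry.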
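For the second stage, the speed-up over autoregressive sampling comes from MaskGIT's parallel, iterative decoding. The sketch below is a hedged reconstruction: all positions start masked, a bidirectional transformer predicts every token at once, high-confidence predictions are kept, and a cosine schedule re-masks the rest for refinement. `transformer` and `mask_id` are placeholders, and confidence is taken greedily rather than with MaskGIT's annealed sampling noise.

```python
import math
import torch

@torch.no_grad()
def maskgit_decode(transformer, seq_len: int, mask_id: int,
                   steps: int = 8, device: str = "cpu") -> torch.Tensor:
    # start from a fully masked token grid (flattened to a sequence)
    tokens = torch.full((1, seq_len), mask_id, dtype=torch.long, device=device)
    for t in range(steps):
        logits = transformer(tokens)          # (1, seq_len, vocab_size), in parallel
        logits[..., mask_id] = float("-inf")  # never predict the mask token itself
        probs = logits.softmax(dim=-1)
        conf, pred = probs.max(dim=-1)        # per-position confidence and argmax token
        # already-fixed positions are never reconsidered
        conf = torch.where(tokens == mask_id, conf, torch.ones_like(conf))
        tokens = torch.where(tokens == mask_id, pred, tokens)  # accept all predictions
        # cosine schedule: fraction of positions left masked after this step
        n_mask = int(seq_len * math.cos(math.pi / 2 * (t + 1) / steps))
        if n_mask > 0:
            # re-mask the least confident positions and refine them next step
            remask = conf.topk(n_mask, largest=False).indices
            tokens[0, remask[0]] = mask_id
    return tokens

# usage with a stand-in prior over a 16x16 index map and a 1025-entry vocabulary
# (the last id acting as the mask token):
dummy = lambda tok: torch.randn(1, tok.shape[1], 1025)
out = maskgit_decode(dummy, seq_len=256, mask_id=1024)  # 8 steps vs. 256 AR steps
```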
Related papers
- Meissonic: Revitalizing Masked Generative Transformers for Efficient High-Resolution Text-to-Image Synthesis [62.06970466554273]
We present Meissonic, which elevates non-autoregressive masked image modeling (MIM) text-to-image synthesis to a level comparable with state-of-the-art diffusion models like SDXL.
We leverage high-quality training data, integrate micro-conditions informed by human preference scores, and employ feature compression layers to further enhance image fidelity and resolution.
Our model not only matches but often exceeds the performance of existing models like SDXL in generating high-quality, high-resolution images.
arXiv Detail & Related papers (2024-10-10T17:59:17Z) - MaskBit: Embedding-free Image Generation via Bit Tokens [54.827480008982185]
We present an empirical and systematic examination of VQGANs, leading to a modernized VQGAN.
A novel embedding-free generation network operating directly on bit tokens achieves a new state-of-the-art FID of 1.52 on the ImageNet 256x256 benchmark, with a compact generator model of a mere 305M parameters.
arXiv Detail & Related papers (2024-09-24T16:12:12Z) - Content-aware Masked Image Modeling Transformer for Stereo Image Compression [15.819672238043786]
We propose a stereo image compression framework, named CAMSIC.
CAMSIC transforms each image into a latent representation and employs a powerful decoder-free Transformer entropy model.
Experiments show that our framework achieves state-of-the-art rate-distortion performance on two stereo image datasets.
arXiv Detail & Related papers (2024-03-13T13:12:57Z) - AICT: An Adaptive Image Compression Transformer [18.05997169440533]
We propose a more straightforward yet effective Transformer-based channel-wise auto-regressive prior model, resulting in an absolute image compression transformer (ICT).
The proposed ICT can capture both global and local contexts from the latent representations.
We leverage a learnable scaling module with a sandwich ConvNeXt-based pre/post-processor to accurately extract a more compact latent representation.
arXiv Detail & Related papers (2023-07-12T11:32:02Z) - Not All Image Regions Matter: Masked Vector Quantization for Autoregressive Image Generation [78.13793505707952]
Existing autoregressive models follow the two-stage generation paradigm that first learns a codebook in the latent space for image reconstruction and then completes the image generation autoregressively based on the learned codebook.
We propose a novel two-stage framework consisting of a Masked Quantization VAE (MQ-VAE) and a Stackformer, which relieves the model from modeling redundancy.
arXiv Detail & Related papers (2023-05-23T02:15:53Z) - Towards Accurate Image Coding: Improved Autoregressive Image Generation with Dynamic Vector Quantization [73.52943587514386]
Existing vector quantization (VQ) based autoregressive models follow a two-stage generation paradigm.
We propose a novel two-stage framework: (1) Dynamic-Quantization VAE (DQ-VAE), which encodes image regions into variable-length codes based on their information densities for accurate representation.
arXiv Detail & Related papers (2023-05-19T14:56:05Z) - Neural Data-Dependent Transform for Learned Image Compression [72.86505042102155]
We build a neural data-dependent transform and introduce a continuous online mode decision mechanism to jointly optimize the coding efficiency for each individual image.
The experimental results show the effectiveness of the proposed neural-syntax design and the continuous online mode decision mechanism.
arXiv Detail & Related papers (2022-03-09T14:56:48Z) - Transformer-based Image Compression [18.976159633970177]
A Transformer-based Image Compression (TIC) approach is developed that reuses the canonical variational autoencoder (VAE) architecture with paired main and hyper encoder-decoders.
TIC rivals state-of-the-art approaches, including deep convolutional neural network (CNN) based learned image coding (LIC) methods and the handcrafted, rules-based intra profile of the recently approved Versatile Video Coding (VVC) standard.
arXiv Detail & Related papers (2021-11-12T13:13:20Z) - Generating Images with Sparse Representations [21.27273495926409]
The high dimensionality of images presents architecture and sampling-efficiency challenges for likelihood-based generative models.
We present an alternative approach, inspired by common image compression methods like JPEG, and convert images to quantized discrete cosine transform (DCT) blocks.
We propose a Transformer-based autoregressive architecture, which is trained to sequentially predict the conditional distribution of the next element in such sequences.
arXiv Detail & Related papers (2021-03-05T17:56:03Z)