Related papers: MoVQ: Modulating Quantized Vectors for High-Fidelity Image Generation

URL: http://arxiv.org/abs/2209.09002v1
Date: Mon, 19 Sep 2022 13:26:51 GMT
Title: MoVQ: Modulating Quantized Vectors for High-Fidelity Image Generation
Authors: Chuanxia Zheng and Long Tung Vuong and Jianfei Cai and Dinh Phung
Abstract summary: Two-stage Vector Quantized (VQ) generative models allow for synthesizing high-fidelity and high-resolution images. Our proposed modulated VQGAN is able to greatly improve the reconstructed image quality as well as provide high-fidelity image generation.
Score: 41.029441562130984
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Although two-stage Vector Quantized (VQ) generative models allow for synthesizing high-fidelity and high-resolution images, their quantization operator encodes similar patches within an image into the same index, resulting in a repeated artifact for similar adjacent regions using existing decoder architectures. To address this issue, we propose to incorporate the spatially conditional normalization to modulate the quantized vectors so as to insert spatially variant information to the embedded index maps, encouraging the decoder to generate more photorealistic images. Moreover, we use multichannel quantization to increase the recombination capability of the discrete codes without increasing the cost of model and codebook. Additionally, to generate discrete tokens at the second stage, we adopt a Masked Generative Image Transformer (MaskGIT) to learn an underlying prior distribution in the compressed latent space, which is much faster than the conventional autoregressive model. Experiments on two benchmark datasets demonstrate that our proposed modulated VQGAN is able to greatly improve the reconstructed image quality as well as provide high-fidelity image generation.
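
The multichannel quantization idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the tiny hand-written codebooks, and the two example patch vectors are all assumptions made for illustration. It assumes "multichannel" means splitting each feature vector into equal chunks and quantizing every chunk against one shared codebook, so that the number of stored codebook values stays the same while the number of representable index combinations grows.

```python
# Toy sketch only: names, codebooks, and patch values are hypothetical.

def nearest_index(vector, codebook):
    """Return the index of the codebook entry closest to `vector` (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))

def quantize(vector, codebook):
    """Single-channel VQ: the whole feature vector maps to one index."""
    return (nearest_index(vector, codebook),)

def quantize_multichannel(vector, codebook, num_chunks):
    """Split the vector into equal chunks; quantize each with a shared codebook."""
    size = len(vector) // num_chunks
    chunks = [vector[i * size:(i + 1) * size] for i in range(num_chunks)]
    return tuple(nearest_index(chunk, codebook) for chunk in chunks)

# Both codebooks store 8 numbers: 2 entries x 4 dims versus 4 entries x 2 dims.
codebook_full = [[0, 0, 0, 0], [1, 1, 1, 1]]
codebook_chunk = [[0, 0], [0, 1], [1, 0], [1, 1]]

# Two different patch features (assumed values).
patch_a = [0.9, 1.1, 0.1, 0.0]
patch_b = [0.1, 0.0, 0.9, 1.1]

print(quantize(patch_a, codebook_full))                    # (1,)
print(quantize(patch_b, codebook_full))                    # (1,)
print(quantize_multichannel(patch_a, codebook_chunk, 2))   # (3, 0)
print(quantize_multichannel(patch_b, codebook_chunk, 2))   # (0, 3)
```

In this toy setting the single codebook collapses both patches to the same index, while the chunked variant keeps them distinct: 4 entries used twice give 16 possible index pairs from the same 8 stored values. The spatially conditional normalization and the MaskGIT prior described in the abstract are not covered by this sketch.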

Related papers

One-Way Ticket: Time-Independent Unified Encoder for Distilling Text-to-Image Diffusion Models [65.96186414865747]
Text-to-Image (T2I) diffusion models face a trade-off between inference speed and image quality. We introduce the first Time-independent Unified Encoder (TiUE) for the student model UNet architecture. Using a one-pass scheme, TiUE shares encoder features across multiple decoder time steps, enabling parallel sampling.
arXiv Detail & Related papers (2025-05-28T04:23:22Z)
SQ-GAN: Semantic Image Communications Using Masked Vector Quantization [55.02795214161371]
This work introduces Semantically Masked VQ-GAN (SQ-GAN), a novel approach to optimize image compression for semantic/task-oriented communications. SQ-GAN employs off-the-shelf semantic segmentation and a new semantic-conditioned adaptive mask module (SAMM) to selectively encode semantically significant features of the images.
arXiv Detail & Related papers (2025-02-13T17:35:57Z)
Meissonic: Revitalizing Masked Generative Transformers for Efficient High-Resolution Text-to-Image Synthesis [62.06970466554273]
We present Meissonic, which elevates non-autoregressive masked image modeling (MIM) text-to-image generation to a level comparable with state-of-the-art diffusion models like SDXL. We leverage high-quality training data, integrate micro-conditions informed by human preference scores, and employ feature compression layers to further enhance image fidelity and resolution. Our model not only matches but often exceeds the performance of existing models like SDXL in generating high-quality, high-resolution images.
arXiv Detail & Related papers (2024-10-10T17:59:17Z)
MaskBit: Embedding-free Image Generation via Bit Tokens [54.827480008982185]
We present an empirical and systematic examination of VQGANs, leading to a modernized VQGAN. A novel embedding-free generation network operating directly on bit tokens achieves a new state-of-the-art FID of 1.52 on the ImageNet 256x256 benchmark, with a compact generator model of a mere 305M parameters.
arXiv Detail & Related papers (2024-09-24T16:12:12Z)
Content-aware Masked Image Modeling Transformer for Stereo Image Compression [15.819672238043786]
We propose a stereo image compression framework, named CAMSIC. CAMSIC transforms each image to latent representation and employs a powerful decoder-free Transformer entropy model. Experiments show that our framework achieves state-of-the-art rate-distortion performance on two stereo image datasets.
arXiv Detail & Related papers (2024-03-13T13:12:57Z)
AICT: An Adaptive Image Compression Transformer [18.05997169440533]
We propose a more straightforward yet effective Transformer-based channel-wise auto-regressive prior model, resulting in an absolute image compression transformer (ICT). The proposed ICT can capture both global and local contexts from the latent representations. We leverage a learnable scaling module with a sandwich ConvNeXt-based pre/post-processor to accurately extract a more compact latent representation.
arXiv Detail & Related papers (2023-07-12T11:32:02Z)
Not All Image Regions Matter: Masked Vector Quantization for Autoregressive Image Generation [78.13793505707952]
Existing autoregressive models follow the two-stage generation paradigm that first learns a codebook in the latent space for image reconstruction and then completes the image generation autoregressively based on the learned codebook. We propose a novel two-stage framework, which consists of Masked Quantization VAE (MQ-VAE) and Stackformer, to relieve the model from modeling redundancy.
arXiv Detail & Related papers (2023-05-23T02:15:53Z)
Towards Accurate Image Coding: Improved Autoregressive Image Generation with Dynamic Vector Quantization [73.52943587514386]
Existing vector quantization (VQ) based autoregressive models follow a two-stage generation paradigm. We propose a novel two-stage framework: (1) Dynamic-Quantization VAE (DQ-VAE), which encodes image regions into variable-length codes based on their information densities for accurate representation.
arXiv Detail & Related papers (2023-05-19T14:56:05Z)
Neural Data-Dependent Transform for Learned Image Compression [72.86505042102155]
We build a neural data-dependent transform and introduce a continuous online mode decision mechanism to jointly optimize the coding efficiency for each individual image. The experimental results show the effectiveness of the proposed neural-syntax design and the continuous online mode decision mechanism.
arXiv Detail & Related papers (2022-03-09T14:56:48Z)
Autoregressive Image Generation using Residual Quantization [40.04085054791994]
We propose a two-stage framework to generate high-resolution images. The framework consists of Residual-Quantized VAE (RQ-VAE) and RQ-Transformer. Our approach has a significantly faster sampling speed than previous AR models to generate high-quality images.
arXiv Detail & Related papers (2022-03-03T11:44:46Z)
Transformer-based Image Compression [18.976159633970177]
A Transformer-based Image Compression (TIC) approach is developed, which reuses the canonical variational autoencoder (VAE) architecture with paired main and hyper encoder-decoders. TIC rivals state-of-the-art approaches, including deep convolutional neural network (CNN) based learnt image coding (LIC) methods and the handcrafted rules-based intra profile of the recently approved Versatile Video Coding (VVC) standard.
arXiv Detail & Related papers (2021-11-12T13:13:20Z)
Generating Images with Sparse Representations [21.27273495926409]
High dimensionality of images presents architecture and sampling-efficiency challenges for likelihood-based generative models. We present an alternative approach, inspired by common image compression methods like JPEG, and convert images to quantized discrete cosine transform (DCT) blocks. We propose a Transformer-based autoregressive architecture, which is trained to sequentially predict the conditional distribution of the next element in such sequences.
arXiv Detail & Related papers (2021-03-05T17:56:03Z)
Generalized Octave Convolutions for Learned Multi-Frequency Image Compression [20.504561050200365]
We propose the first learned multi-frequency image compression and entropy coding approach. It is based on the recently developed octave convolutions to factorize the latents into high and low frequency (resolution) components. We show that the proposed generalized octave convolution can improve the performance of other auto-encoder-based computer vision tasks.
arXiv Detail & Related papers (2020-02-24T01:35:29Z)

This list is automatically generated from the titles and abstracts of the papers on this site.