Related papers: MGVQ: Could VQ-VAE Beat VAE? A Generalizable Tokenizer with Multi-group Quantization


URL: http://arxiv.org/abs/2507.07997v2
Date: Mon, 14 Jul 2025 09:33:58 GMT
Title: MGVQ: Could VQ-VAE Beat VAE? A Generalizable Tokenizer with Multi-group Quantization
Authors: Mingkai Jia, Wei Yin, Xiaotao Hu, Jiaxin Guo, Xiaoyang Guo, Qian Zhang, Xiao-Xiao Long, Ping Tan
Abstract summary: We propose MGVQ, a novel method to augment the representation capability of discrete codebooks. MGVQ achieves state-of-the-art performance on both ImageNet and 8 zero-shot benchmarks. Results highlight the superiority of MGVQ in reconstruction and pave the way for preserving fidelity in HD image processing tasks.
Score: 35.57897644198773
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vector Quantized Variational Autoencoders (VQ-VAEs) are fundamental models that compress continuous visual data into discrete tokens. Existing methods have tried to improve the quantization strategy for better reconstruction quality; however, a large gap remains between VQ-VAEs and VAEs. To narrow this gap, we propose MGVQ, a novel method that augments the representation capability of discrete codebooks, making codebooks easier to optimize and minimizing information loss, thereby enhancing reconstruction quality. Specifically, we propose to retain the latent dimension to preserve encoded features and to incorporate a set of sub-codebooks for quantization. Furthermore, we construct comprehensive zero-shot benchmarks at resolutions of 512p and 2k to rigorously evaluate the reconstruction performance of existing methods. Among all VQ-VAEs, MGVQ achieves state-of-the-art performance on both ImageNet and 8 zero-shot benchmarks. Notably, compared with SD-VAE, MGVQ performs significantly better on ImageNet, with rFID 0.49 vs. 0.91, and achieves superior PSNR on all zero-shot benchmarks. These results highlight the superiority of MGVQ in reconstruction and pave the way for preserving fidelity in HD image processing tasks. Code will be publicly available at https://github.com/MKJia/MGVQ.
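
The abstract's idea of keeping the full latent dimension and quantizing it with a set of sub-codebooks can be illustrated with a generic group-wise nearest-neighbour lookup. The following is a minimal sketch under assumptions, not the authors' implementation: the function names, the sizes `G`, `K`, `D`, the equal-width split of the latent vector, and the squared-L2 distance are all hypothetical choices made for illustration.

```python
import random


def nearest_code(vec, codebook):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    def sq_dist(code):
        return sum((a - b) ** 2 for a, b in zip(vec, code))
    return min(range(len(codebook)), key=lambda k: sq_dist(codebook[k]))


def multi_group_quantize(z, codebooks):
    """Quantize one latent vector z with one sub-codebook per group.

    z: list of D floats.
    codebooks: G lists, each holding K codes of length D // G.
    Returns (quantized vector of length D, list of G code indices).
    """
    num_groups = len(codebooks)
    if len(z) % num_groups != 0:
        raise ValueError("latent dimension must be divisible by the number of groups")
    width = len(z) // num_groups
    quantized, indices = [], []
    for g, codebook in enumerate(codebooks):
        # Each group sees only its own slice of the latent vector.
        part = z[g * width:(g + 1) * width]
        k = nearest_code(part, codebook)
        indices.append(k)
        quantized.extend(codebook[k])
    return quantized, indices


# Hypothetical sizes: 4 groups, 16 codes per sub-codebook, 32-dim latent.
random.seed(0)
G, K, D = 4, 16, 32
codebooks = [
    [[random.gauss(0, 1) for _ in range(D // G)] for _ in range(K)]
    for _ in range(G)
]
z = [random.gauss(0, 1) for _ in range(D)]
z_q, idx = multi_group_quantize(z, codebooks)
print(len(z_q), idx)
```

In this sketch the quantized vector keeps the original dimension `D`, and each latent vector is described by `G` indices rather than one, so the number of distinct quantized vectors is `K ** G` while only `G * K` codes are stored.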

Related papers

Quantize-then-Rectify: Efficient VQ-VAE Training [71.92014859992263]
This work demonstrates that a pre-trained VAE can be efficiently transformed into a VQ-VAE by quantization noise within the VAE's tolerance threshold. We present Quantize-then-Rectify (ReVQ), a framework leveraging pre-trained VAEs to enable rapid VQ-VAE training with minimal computational overhead.
arXiv Detail & Related papers (2025-07-14T17:59:41Z)
CODA: Repurposing Continuous VAEs for Discrete Tokenization [52.58960429582813]
CODA (COntinuous-to-Discrete Adaptation) is a framework that decouples compression and discretization. Our approach achieves a remarkable codebook utilization of 100% and notable reconstruction FID (rFID) of $\mathbf{0.43}$ and $\mathbf{1.34}$ for $8\times$ and $16\times$ compression on the ImageNet $256 \times 256$ benchmark.
arXiv Detail & Related papers (2025-03-22T12:59:00Z)
Dual Codebook VQ: Enhanced Image Reconstruction with Reduced Codebook Size [0.0]
Vector Quantization (VQ) techniques face challenges in codebook utilization, limiting reconstruction fidelity in image modeling. We introduce a Dual Codebook mechanism that effectively addresses this limitation by partitioning the representation into complementary global and local components. Our approach achieves significant FID improvements across diverse image domains, particularly excelling in scene and face reconstruction tasks.
arXiv Detail & Related papers (2025-03-13T19:31:18Z)
Scalable Image Tokenization with Index Backpropagation Quantization [74.15447383432262]
Index Backpropagation Quantization (IBQ) is a new VQ method for the joint optimization of all codebook embeddings and the visual encoder. IBQ enables scalable training of visual tokenizers and, for the first time, achieves a large-scale codebook with high dimension ($256$) and high utilization.
arXiv Detail & Related papers (2024-12-03T18:59:10Z)
XQ-GAN: An Open-source Image Tokenization Framework for Autoregressive Generation [54.2574228021317]
We present XQ-GAN, an image tokenization framework designed for both image reconstruction and generation tasks. Our framework integrates state-of-the-art quantization techniques, including vector quantization (VQ), residual quantization (RQ), multi-scale residual quantization (MSVQ), product quantization (PQ), and binary spherical quantization (BSQ). On the standard ImageNet 256x256 benchmark, our released model achieves an rFID of 0.64, significantly surpassing MAGVIT-v2 (0.9 rFID) and VAR (0.9 rFID).
arXiv Detail & Related papers (2024-12-02T17:58:06Z)
ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models [77.59651787115546]
High-resolution Large Multimodal Models (LMMs) encounter the challenges of excessive visual tokens and quadratic visual complexity. We propose ConvLLaVA, which employs ConvNeXt, a hierarchical backbone, as the visual encoder of LMM. ConvLLaVA compresses high-resolution images into information-rich visual features, effectively preventing the generation of excessive visual tokens.
arXiv Detail & Related papers (2024-05-24T17:34:15Z)
HQ-VAE: Hierarchical Discrete Representation Learning with Variational Bayes [18.57499609338579]
We propose a novel framework to learn hierarchical discrete representation on the basis of the variational Bayes framework, called hierarchically quantized variational autoencoder (HQ-VAE). HQ-VAE naturally generalizes the hierarchical variants of VQ-VAE, such as VQ-VAE-2 and residual-quantized VAE (RQ-VAE). Our comprehensive experiments on image datasets show that HQ-VAE enhances codebook usage and improves reconstruction performance.
arXiv Detail & Related papers (2023-12-31T01:39:38Z)
VQFR: Blind Face Restoration with Vector-Quantized Dictionary and Parallel Decoder [83.63843671885716]
We propose a VQ-based face restoration method -- VQFR. VQFR takes advantage of high-quality low-level feature banks extracted from high-quality faces. To further fuse low-level features from inputs while not "contaminating" the realistic details generated from the VQ codebook, we proposed a parallel decoder.
arXiv Detail & Related papers (2022-05-13T17:54:40Z)
Autoregressive Image Generation using Residual Quantization [40.04085054791994]
We propose a two-stage framework to generate high-resolution images. The framework consists of Residual-Quantized VAE (RQ-VAE) and RQ-Transformer. Our approach has a significantly faster sampling speed than previous AR models to generate high-quality images.
arXiv Detail & Related papers (2022-03-03T11:44:46Z)
Hierarchical Quantized Autoencoders [3.9146761527401432]
We motivate the use of a hierarchy of Vector Quantized Variational Autoencoders (VQ-VAEs) to attain high factors of compression. We show that a combination of quantization and hierarchical latent structure aids likelihood-based image compression. Our resulting scheme produces a Markovian series of latent variables that reconstruct images of high perceptual quality.
arXiv Detail & Related papers (2020-02-19T11:26:34Z)

This list is automatically generated from the titles and abstracts of the papers on this site.