Quantize-then-Rectify: Efficient VQ-VAE Training
- URL: http://arxiv.org/abs/2507.10547v1
- Date: Mon, 14 Jul 2025 17:59:41 GMT
- Title: Quantize-then-Rectify: Efficient VQ-VAE Training
- Authors: Borui Zhang, Qihang Rao, Wenzhao Zheng, Jie Zhou, Jiwen Lu
- Abstract summary: This work demonstrates that a pre-trained VAE can be efficiently transformed into a VQ-VAE by keeping quantization noise within the VAE's tolerance threshold. We present Quantize-then-Rectify (ReVQ), a framework leveraging pre-trained VAEs to enable rapid VQ-VAE training with minimal computational overhead.
- Score: 71.92014859992263
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual tokenizers are pivotal in multimodal large models, acting as bridges between continuous inputs and discrete tokens. Nevertheless, training high-compression-rate VQ-VAEs remains computationally demanding, often necessitating thousands of GPU hours. This work demonstrates that a pre-trained VAE can be efficiently transformed into a VQ-VAE by controlling quantization noise within the VAE's tolerance threshold. We present Quantize-then-Rectify (ReVQ), a framework leveraging pre-trained VAEs to enable rapid VQ-VAE training with minimal computational overhead. By integrating channel multi-group quantization to enlarge codebook capacity and a post rectifier to mitigate quantization errors, ReVQ compresses ImageNet images into at most 512 tokens while sustaining competitive reconstruction quality (rFID = 1.06). Significantly, ReVQ reduces training costs by over two orders of magnitude relative to state-of-the-art approaches: ReVQ finishes full training on a single NVIDIA 4090 in approximately 22 hours, whereas comparable methods require 4.5 days on 32 A100 GPUs. Experimental results show that ReVQ achieves superior efficiency-reconstruction trade-offs.
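The two components named in the abstract can be sketched in a few lines. Below is a minimal PyTorch sketch under assumed shapes and hyperparameters (16 latent channels, 4 groups, 256 codewords, a small convolutional rectifier); it illustrates the mechanism, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class ChannelMultiGroupQuantizer(nn.Module):
    """Split the C latent channels into G groups, each with its own codebook,
    so capacity grows to K**G code combinations per spatial position."""
    def __init__(self, channels=16, groups=4, codebook_size=256):
        super().__init__()
        assert channels % groups == 0
        self.groups, self.dim = groups, channels // groups
        self.codebooks = nn.Parameter(torch.randn(groups, codebook_size, self.dim))

    def forward(self, z):                                   # z: (B, C, H, W)
        B, C, H, W = z.shape
        zf = z.permute(0, 2, 3, 1).reshape(-1, self.groups, self.dim)
        parts = []
        for g in range(self.groups):
            dists = torch.cdist(zf[:, g], self.codebooks[g])  # (B*H*W, K)
            parts.append(self.codebooks[g][dists.argmin(-1)])
        zq = torch.stack(parts, 1).reshape(B, H, W, C).permute(0, 3, 1, 2)
        return z + (zq - z).detach()             # straight-through estimator

class Rectifier(nn.Module):
    """Small trainable network that pushes the quantized latent back toward
    the continuous one, keeping quantization noise inside the VAE's tolerance."""
    def __init__(self, channels=16, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, hidden, 3, padding=1), nn.GELU(),
            nn.Conv2d(hidden, channels, 3, padding=1))

    def forward(self, zq):
        return zq + self.net(zq)                 # residual correction

# Training idea per the abstract: freeze the pre-trained VAE, encode x -> z,
# then minimize || rectifier(quantize(z)) - z ||^2 (plus a codebook loss),
# so only the quantizer and rectifier are trained.
```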
Related papers
- MGVQ: Could VQ-VAE Beat VAE? A Generalizable Tokenizer with Multi-group Quantization [35.57897644198773]
We propose MGVQ, a novel method to augment the representation capability of discrete codebooks. MGVQ achieves state-of-the-art performance on both ImageNet and 8 zero-shot benchmarks. The results highlight the superiority of MGVQ in reconstruction and pave the way for preserving fidelity in HD image processing tasks.
arXiv Detail & Related papers (2025-07-10T17:59:54Z)
- SoftVQ-VAE: Efficient 1-Dimensional Continuous Tokenizer [45.720721058671856]
SoftVQ-VAE is a continuous image tokenizer that leverages soft categorical posteriors to aggregate multiple codewords into each latent token. Our approach compresses 256x256 and 512x512 images using as few as 32 or 64 1-dimensional tokens. Remarkably, SoftVQ-VAE improves inference throughput by up to 18x for generating 256x256 images and 55x for 512x512 images.
arXiv Detail & Related papers (2024-12-14T20:29:29Z)
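A minimal sketch of the soft categorical posterior idea above: each latent token becomes a softmax-weighted mixture of codewords instead of a hard nearest-neighbor lookup. The dot-product similarity and temperature are assumptions, not details from the abstract.

```python
import torch
import torch.nn as nn

class SoftQuantizer(nn.Module):
    """Each token is a softmax mixture over codewords, so it stays continuous
    and differentiable while remaining tied to a discrete codebook."""
    def __init__(self, dim=64, codebook_size=1024, temperature=1.0):
        super().__init__()
        self.codebook = nn.Parameter(torch.randn(codebook_size, dim))
        self.temperature = temperature

    def forward(self, z):                                 # z: (B, N, dim)
        logits = z @ self.codebook.t() / self.temperature  # (B, N, K)
        weights = logits.softmax(dim=-1)      # soft categorical posterior
        return weights @ self.codebook        # aggregate codewords per token
```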
- XQ-GAN: An Open-source Image Tokenization Framework for Autoregressive Generation [54.2574228021317]
We present XQ-GAN, an image tokenization framework designed for both image reconstruction and generation tasks. Our framework integrates state-of-the-art quantization techniques, including vector quantization (VQ), residual quantization (RQ), multi-scale residual quantization (MSVQ), product quantization (PQ), and binary spherical quantization (BSQ). On the standard ImageNet 256x256 benchmark, our released model achieves an rFID of 0.64, significantly surpassing MAGVIT-v2 (0.9 rFID) and VAR (0.9 rFID).
arXiv Detail & Related papers (2024-12-02T17:58:06Z)
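Among the quantizers XQ-GAN integrates, residual quantization (RQ) is the easiest to illustrate: each level quantizes what the previous levels missed. A hedged sketch with illustrative sizes:

```python
import torch

def residual_quantize(z, codebooks):
    """Residual quantization: quantize, subtract, and re-quantize the
    remainder with the next codebook, so depth refines the approximation.
    z: (N, D); codebooks: list of (K, D) tensors, one per RQ level."""
    residual, zq, indices = z.clone(), torch.zeros_like(z), []
    for cb in codebooks:
        idx = torch.cdist(residual, cb).argmin(dim=-1)
        chosen = cb[idx]
        zq = zq + chosen
        residual = residual - chosen
        indices.append(idx)
    return zq, indices  # zq approximates z; indices are the discrete tokens

# Usage: three levels of 256 codewords each
cbs = [torch.randn(256, 8) for _ in range(3)]
zq, ids = residual_quantize(torch.randn(100, 8), cbs)
```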
- P4Q: Learning to Prompt for Quantization in Visual-language Models [38.87018242616165]
We propose a method that balances fine-tuning and quantization, named "Prompt for Quantization" (P4Q).
Our method can effectively reduce the gap between image features and text features caused by low-bit quantization.
Our 8-bit P4Q can theoretically compress the CLIP-ViT/B-32 by 4x while achieving 66.94% Top-1 accuracy.
arXiv Detail & Related papers (2024-09-26T08:31:27Z)
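The "theoretical 4x" above follows from storing each fp32 weight (32 bits) as int8 (8 bits). A minimal sketch of symmetric per-tensor int8 quantization; P4Q's prompt-learning components are not shown, and this scheme is only an illustration.

```python
import torch

def quantize_int8(w: torch.Tensor):
    """Symmetric per-tensor int8 quantization: fp32 -> int8 is the source
    of the 4x compression (32 bits down to 8 bits per weight)."""
    scale = w.abs().max() / 127.0
    q = (w / scale).round().clamp(-127, 127).to(torch.int8)
    return q, scale

def dequantize(q, scale):
    return q.float() * scale

w = torch.randn(512, 512)
q, s = quantize_int8(w)
print((w - dequantize(q, s)).abs().max())  # error is bounded by ~scale/2
```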
- ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models [77.59651787115546]
High-resolution Large Multimodal Models (LMMs) encounter the challenges of excessive visual tokens and quadratic visual complexity.
We propose ConvLLaVA, which employs ConvNeXt, a hierarchical backbone, as the visual encoder of LMM.
ConvLLaVA compresses high-resolution images into information-rich visual features, effectively preventing the generation of excessive visual tokens.
arXiv Detail & Related papers (2024-05-24T17:34:15Z)
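Why a hierarchical encoder helps: attention cost grows quadratically with token count, and downsampling stages shrink that count sharply. Illustrative arithmetic only; the 14-pixel patch and 64x total downsampling factor are assumptions for the sake of the example, not ConvLLaVA's exact configuration.

```python
# Token counts for a plain patchify encoder vs. a hierarchical backbone.
for res in (336, 672, 1344):
    vit_tokens = (res // 14) ** 2    # ViT-style 14x14 patches
    conv_tokens = (res // 64) ** 2   # e.g. six 2x downsampling stages
    print(res, vit_tokens, conv_tokens)
# 336 -> 576 vs 25; 672 -> 2304 vs 100; 1344 -> 9216 vs 441
```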
- An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models [65.37846460916042]
We find that attention computation over visual tokens is highly inefficient in the deep layers of popular LVLMs.
We introduce FastV, a versatile plug-and-play method designed to optimize computational efficiency.
arXiv Detail & Related papers (2024-03-11T14:35:32Z)
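A schematic of the FastV idea above: after an early layer, rank visual tokens by the attention they receive and keep only the top fraction. The function name, keep ratio, and token layout are assumptions, not the released implementation.

```python
import torch

def fastv_prune(hidden, attn, vis_slice, keep_ratio=0.5):
    """Drop the least-attended visual tokens after an early layer.
    hidden: (B, T, D); attn: (B, heads, T, T); vis_slice: visual-token span."""
    # Average attention each visual token receives, over heads and queries.
    scores = attn.mean(dim=1)[:, :, vis_slice].mean(dim=1)   # (B, n_visual)
    n_keep = max(1, int(scores.shape[-1] * keep_ratio))
    keep = scores.topk(n_keep, dim=-1).indices + vis_slice.start
    keep, _ = keep.sort(dim=-1)            # preserve token order
    batch = torch.arange(hidden.size(0)).unsqueeze(-1)
    pruned_visual = hidden[batch, keep]    # (B, n_keep, D)
    pre = hidden[:, :vis_slice.start]      # tokens before the image
    post = hidden[:, vis_slice.stop:]      # tokens after the image
    return torch.cat([pre, pruned_visual, post], dim=1)
```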
- GPTVQ: The Blessing of Dimensionality for LLM Quantization [15.324536807105032]
We show that the size-versus-accuracy trade-off of neural network quantization can be significantly improved by increasing the quantization dimensionality. We propose GPTVQ, a new fast method for post-training vector quantization (VQ) that scales well to Large Language Models (LLMs). Our method interleaves quantization of one or more columns with updates to the remaining unquantized weights, using information from the Hessian of the per-layer output reconstruction MSE.
arXiv Detail & Related papers (2024-02-23T13:39:16Z)
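A hedged sketch of the core move, post-training vector quantization of weights: reshape the matrix into short vectors (the "dimensionality" GPTVQ exploits) and cluster them into a shared codebook. Plain k-means stands in for the paper's Hessian-aware interleaved updates, so this illustrates the data layout, not the method's accuracy.

```python
import torch

def vq_weights(W, vec_dim=2, codebook_size=256, iters=10):
    """Group weight entries into vec_dim-sized vectors and k-means them."""
    assert W.numel() % vec_dim == 0
    vecs = W.reshape(-1, vec_dim)                         # (N, vec_dim)
    cb = vecs[torch.randperm(vecs.size(0))[:codebook_size]].clone()
    for _ in range(iters):                                # plain k-means
        assign = torch.cdist(vecs, cb).argmin(dim=-1)
        for k in range(codebook_size):
            members = vecs[assign == k]
            if members.numel():
                cb[k] = members.mean(dim=0)
    return cb[assign].reshape(W.shape), assign, cb
```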
- Towards Efficient Post-training Quantization of Pre-trained Language Models [85.68317334241287]
We study post-training quantization (PTQ) of PLMs, and propose module-wise reconstruction error minimization (MREM), an efficient solution to mitigate the accuracy gap between PTQ and quantization-aware training (QAT).
Experiments on GLUE and SQuAD benchmarks show that our proposed PTQ solution not only performs close to QAT, but also enjoys significant reductions in training time, memory overhead, and data consumption.
arXiv Detail & Related papers (2021-09-30T12:50:06Z)
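The module-wise objective is simple to express: each quantized module is tuned to reproduce its full-precision counterpart's outputs on a small calibration set, independently of the others. A schematic sketch; the optimizer choice, step count, and learning rate are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def module_wise_ptq(fp_module: nn.Module, q_module: nn.Module,
                    calib_inputs, steps=100, lr=1e-4):
    """Tune only the quantized module so its outputs match the
    full-precision module; modules can be handled independently."""
    opt = torch.optim.Adam(q_module.parameters(), lr=lr)
    for _ in range(steps):
        for x in calib_inputs:
            with torch.no_grad():
                target = fp_module(x)       # full-precision reference
            loss = F.mse_loss(q_module(x), target)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return q_module
```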
- Mastering Visual Continuous Control: Improved Data-Augmented Reinforcement Learning [114.35801511501639]
We present DrQ-v2, a model-free reinforcement learning algorithm for visual continuous control.
DrQ-v2 builds on DrQ, an off-policy actor-critic approach that uses data augmentation to learn directly from pixels.
Notably, DrQ-v2 is able to solve complex humanoid locomotion tasks directly from pixel observations.
arXiv Detail & Related papers (2021-07-20T17:29:13Z)
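The augmentation at the heart of DrQ and DrQ-v2 is a small random pixel shift of each observation, implemented as replicate-padding plus a random crop. A minimal sketch; the pad size of 4 is a commonly used value, assumed here.

```python
import torch
import torch.nn.functional as F

def random_shift(imgs, pad=4):
    """Randomly shift each image by up to `pad` pixels via pad-and-crop.
    imgs: (B, C, H, W) pixel observations."""
    B, C, H, W = imgs.shape
    padded = F.pad(imgs, (pad,) * 4, mode='replicate')
    out = torch.empty_like(imgs)
    for b in range(B):                  # independent crop per sample
        top = torch.randint(0, 2 * pad + 1, (1,)).item()
        left = torch.randint(0, 2 * pad + 1, (1,)).item()
        out[b] = padded[b, :, top:top + H, left:left + W]
    return out
```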
- BRECQ: Pushing the Limit of Post-Training Quantization by Block Reconstruction [29.040991149922615]
We study the challenging task of neural network quantization without end-to-end retraining, called Post-training Quantization (PTQ).
We propose a novel PTQ framework, dubbed BRECQ, which pushes the limits of bitwidth in PTQ down to INT2 for the first time.
For the first time, we show that, without bells and whistles, PTQ can attain 4-bit ResNet and MobileNetV2 models comparable with QAT, while producing quantized models 240 times faster.
arXiv Detail & Related papers (2021-02-10T13:46:16Z)
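One ingredient that makes block reconstruction work at very low bitwidths is adaptive rounding: the round-up/round-down decision for each weight is learned against the block's reconstruction error rather than fixed to nearest. A hedged sketch of that idea only; the annealing that drives the soft decision to binary values, and BRECQ's Fisher-information weighting, are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveRoundLinear(nn.Module):
    """Learn a per-weight choice between floor and floor+1 while matching
    block outputs on calibration data. A minimal illustration, not BRECQ."""
    def __init__(self, linear: nn.Linear, scale: float):
        super().__init__()
        self.scale = scale
        self.register_buffer('floor', (linear.weight.detach() / scale).floor())
        self.alpha = nn.Parameter(torch.zeros_like(linear.weight))  # rounding logits
        self.bias = linear.bias

    def forward(self, x):
        h = torch.sigmoid(self.alpha)        # soft up/down decision in (0, 1)
        w = (self.floor + h) * self.scale    # annealed toward {0, 1} in practice
        return F.linear(x, w, self.bias)
```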