VQRAE: Representation Quantization Autoencoders for Multimodal Understanding, Generation and Reconstruction
- URL: http://arxiv.org/abs/2511.23386v1
- Date: Fri, 28 Nov 2025 17:26:34 GMT
- Title: VQRAE: Representation Quantization Autoencoders for Multimodal Understanding, Generation and Reconstruction
- Authors: Sinan Du, Jiahao Guo, Bo Li, Shuhao Cui, Zhengzhuo Xu, Yifu Luo, Yongxian Wei, Kun Gai, Xinggang Wang, Kai Wu, Chun Yuan,
- Abstract summary: VQRAE produces Continuous semantic features for image understanding and tokens for visual generation within a unified tokenizer.<n>Design enables negligible semantic information for maintaining the ability of multimodal understanding, discrete tokens.<n>VQRAE presents competitive performance on several benchmarks of visual understanding, generation and reconstruction.
- Score: 83.50898344094153
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Unifying multimodal understanding, generation and reconstruction representation in a single tokenizer remains a key challenge in building unified models. Previous research predominantly attempts to address this in a dual encoder paradigm, e.g., utilizing the separate encoders for understanding and generation respectively or balancing semantic representations and low-level features with contrastive loss. In this paper, we propose VQRAE, a Vector Quantization version of Representation AutoEncoders, which pioneers the first exploration in unified representation to produce Continuous semantic features for image understanding and Discrete tokens for visual generation within a unified tokenizer. Specifically, we build upon pretrained vision foundation models with a symmetric ViT decoder and adopt a two-stage training strategy: first, it freezes the encoder and learns a high-dimensional semantic VQ codebook with pixel reconstruction objective; then jointly optimizes the encoder with self-distillation constraints. This design enables negligible semantic information for maintaining the ability of multimodal understanding, discrete tokens that are compatible for generation and fine-grained reconstruction. Besides, we identify the intriguing property in quantizing semantic encoders that rely on high-dimensional codebook in contrast to the previous common practice of low-dimensional codebook in image reconstruction. The semantic VQ codebook can achieve a 100% utilization ratio at a dimension of 1536. VQRAE presents competitive performance on several benchmarks of visual understanding, generation and reconstruction with promising scaling property in the autoregressive paradigm for its discrete merits.
Related papers
- ResTok: Learning Hierarchical Residuals in 1D Visual Tokenizers for Autoregressive Image Generation [64.84095852784714]
Residual Tokenizer (ResTok) is a 1D visual tokenizer that builds hierarchical residuals for both image tokens and latent tokens.<n>We show that restoring hierarchical residual priors in visual tokenization significantly improves AR image generation, achieving a gFID of 2.34 on ImageNet-256 with only 9 sampling steps.
arXiv Detail & Related papers (2026-01-07T14:09:18Z) - Both Semantics and Reconstruction Matter: Making Representation Encoders Ready for Text-to-Image Generation and Editing [62.94394079771687]
A burgeoning trend is to adopt high-dimensional features from representation encoders as generative latents.<n>We propose a systematic framework to adapt understanding-oriented encoder features for generative tasks.<n>We show that our approach achieves state-of-the-art reconstruction, faster convergence, and substantial performance gains in both Text-to-Image (T2I) and image editing tasks.
arXiv Detail & Related papers (2025-12-19T18:59:57Z) - Aligning Visual Foundation Encoders to Tokenizers for Diffusion Models [37.59115132356727]
We propose aligning pretrained visual encoders to serve as tokenizers for latent diffusion models in image generation.<n>On ImageNet 256$times$256, our tokenizer accelerates the convergence of diffusion models, reaching a gFID of 1.90 within just 64 epochs.<n>Our method is simple, scalable, and establishes a semantically grounded paradigm for continuous tokenizer design.
arXiv Detail & Related papers (2025-09-29T17:57:39Z) - Unified Multimodal Model as Auto-Encoder [69.38946823657592]
We introduce a paradigm regarding understanding as the encoder (I2T) that compresses images into text, and generation as the decoder (T2I) that reconstructs images from that text.<n>Our empirical results suggest that understanding can largely enhance generation (verified on GenEval), while generation, in turn, notably strengthens fine-grained visual perception.
arXiv Detail & Related papers (2025-09-11T17:57:59Z) - Harmonizing Visual Representations for Unified Multimodal Understanding and Generation [53.01486796503091]
We present emphHarmon, a unified autoregressive framework that harmonizes understanding and generation tasks with a shared MAR encoder.<n>Harmon achieves state-of-the-art image generation results on the GenEval, MJHQ30K and WISE benchmarks.
arXiv Detail & Related papers (2025-03-27T20:50:38Z) - Dual Codebook VQ: Enhanced Image Reconstruction with Reduced Codebook Size [0.0]
Vector Quantization (VQ) techniques face challenges in codebook utilization, limiting reconstruction fidelity in image modeling.<n>We introduce a Dual Codebook mechanism that effectively addresses this limitation by partitioning the representation into complementary global and local components.<n>Our approach achieves significant FID improvements across diverse image domains, particularly excelling in scene and face reconstruction tasks.
arXiv Detail & Related papers (2025-03-13T19:31:18Z) - SemHiTok: A Unified Image Tokenizer via Semantic-Guided Hierarchical Codebook for Multimodal Understanding and Generation [71.68085485928007]
We introduce SemHiTok, a unified image tokenizer via Semantic-Guided Hierarchical codebook.<n>We show that SemHiTok achieves SOTA performance in image reconstruction and multimodal understanding under LLaVA-v1.5 setting.<n>We also develop a unified MLLM with SemHiTok, which exhibits superior performance across multimodal understanding and generation tasks.
arXiv Detail & Related papers (2025-03-09T20:42:34Z) - Epsilon-VAE: Denoising as Visual Decoding [61.29255979767292]
We propose denoising as decoding, shifting from single-step reconstruction to iterative refinement.<n>Specifically, we replace the decoder with a diffusion process that iteratively refines noise to recover the original image.<n>By adopting iterative reconstruction through diffusion, our autoencoder, namely Epsilon-VAE, achieves high reconstruction quality.
arXiv Detail & Related papers (2024-10-05T08:27:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.