VUGEN: Visual Understanding priors for GENeration
- URL: http://arxiv.org/abs/2510.06529v1
- Date: Wed, 08 Oct 2025 00:04:47 GMT
- Title: VUGEN: Visual Understanding priors for GENeration
- Authors: Xiangyi Chen, Théophane Vallaeys, Maha Elbayad, John Nguyen, Jakob Verbeek
- Abstract summary: VUGEN is a novel framework that explicitly leverages a VLM's pretrained visual understanding priors for efficient and high-quality image generation. Our approach first transforms the high-dimensional latent space of the VLM's native vision encoder into a lower-dimensional, tractable distribution. A dedicated pixel decoder then maps the generated latents back to image space.
- Score: 18.840804846528865
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in Vision-Language Models (VLMs) have enabled unified understanding across text and images, yet equipping these models with robust image generation capabilities remains challenging. Existing approaches often rely on reconstruction-oriented autoencoders or complex bridging mechanisms, leading to misalignment between understanding and generation representations, or to architectural complexity. In this work, we propose VUGEN, a novel framework that explicitly leverages a VLM's pretrained visual understanding priors for efficient and high-quality image generation. Our approach first transforms the high-dimensional latent space of the VLM's native vision encoder into a lower-dimensional, tractable distribution that maximally preserves visual information. The VLM is then trained to sample within this reduced latent space, ensuring alignment with its visual understanding capabilities. Finally, a dedicated pixel decoder maps the generated latents back to image space. We find that a VAE-free pixel diffusion decoder is on par with, or better than, the commonly used and more complex latent diffusion decoders that internally rely on VAE latents. Extensive experiments demonstrate that VUGEN achieves superior image generation performance, improving DPG Bench from 71.17 to 74.32 and FID from 11.86 to 9.06 on COCO, while fully preserving the VLM's original understanding capabilities.
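The pipeline described in the abstract has three stages: reduce the vision encoder's high-dimensional latents to a tractable low-dimensional space, train the VLM to sample in that space, and decode latents back to pixels. The first stage can be sketched with PCA as a stand-in for the learned reduction; all sizes and names below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Hypothetical sketch of VUGEN's first stage: project a VLM vision encoder's
# high-dimensional latents to a lower-dimensional space that retains as much
# visual information (variance) as possible. PCA via SVD stands in for the
# learned transformation described in the paper.
rng = np.random.default_rng(0)
d_enc, d_lat, n = 1024, 64, 512          # assumed dimensions, not from the paper
mixing = rng.normal(size=(d_enc, d_enc)) # correlates features so PCA has structure
latents = rng.normal(size=(n, d_enc)) @ mixing * 0.1

mu = latents.mean(axis=0)
centered = latents - mu
# SVD of the centered data: rows of vt are directions of decreasing variance
_, s, vt = np.linalg.svd(centered, full_matrices=False)
proj = vt[:d_lat]                        # (d_lat, d_enc) projection matrix

reduced = centered @ proj.T              # the VLM would be trained to sample here
restored = reduced @ proj + mu           # a pixel decoder would map latents to images

explained = (s[:d_lat] ** 2).sum() / (s ** 2).sum()
print(f"kept {d_lat}/{d_enc} dims, explained variance {explained:.2f}")
```

The fraction of explained variance is a rough proxy for "maximally preserves visual information"; the paper's actual reduction is learned rather than a fixed linear projection.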
Related papers
- Improving Reconstruction of Representation Autoencoder [52.817427902597416]
We propose LV-RAE, a representation autoencoder that augments semantic features with missing low-level information. Our experiments demonstrate that LV-RAE significantly improves reconstruction fidelity while preserving semantic abstraction.
arXiv Detail & Related papers (2026-02-09T13:12:35Z)
- One Layer Is Enough: Adapting Pretrained Visual Encoders for Image Generation [33.56782043207013]
Feature Auto-Encoder (FAE) adapts pre-trained visual representations into low-dimensional latents suitable for generation using as little as a single attention layer. FAE achieves strong performance across class-conditional and text-to-image benchmarks.
arXiv Detail & Related papers (2025-12-08T18:57:26Z)
- Visual Generation Tuning [84.50113837230333]
We propose VGT, Visual Generation Tuning, a novel paradigm designed to stimulate the underlying visual generation capabilities of vision language models. In image reconstruction tasks, we achieve 26.67 PSNR and 0.50 rFID at a 28x compression ratio, outperforming specialized VAEs. VGT shows significant scaling promise and can endow any VLM trained for multimodal understanding with visual generation capabilities.
arXiv Detail & Related papers (2025-11-28T18:57:13Z)
- VQRAE: Representation Quantization Autoencoders for Multimodal Understanding, Generation and Reconstruction [83.50898344094153]
VQRAE produces continuous semantic features for image understanding and discrete tokens for visual generation within a unified tokenizer. This design sacrifices negligible semantic information, maintaining multimodal understanding ability alongside discrete-token generation. VQRAE presents competitive performance on several benchmarks of visual understanding, generation and reconstruction.
arXiv Detail & Related papers (2025-11-28T17:26:34Z)
- UniFusion: Vision-Language Model as Unified Encoder in Image Generation [12.811191961286852]
We present UniFusion, a diffusion-based generative model conditioned on a frozen large vision-language model (VLM) that serves as a unified multimodal encoder. We demonstrate that LAP outperforms other shallow fusion architectures on text-image alignment for generation and on faithful transfer of visual information from the VLM to the diffusion model, which is key for editing. We also propose VLM-Enabled Rewriting Injection with Flexible Inference (VERIFI), which conditions a diffusion transformer (DiT) only on the text tokens generated by the VLM during in-model prompt rewriting.
arXiv Detail & Related papers (2025-10-14T17:57:56Z)
- Reading Images Like Texts: Sequential Image Understanding in Vision-Language Models [9.24989979549793]
Vision-Language Models (VLMs) have demonstrated remarkable performance across a variety of real-world tasks. These models typically process visual information by serializing images, a method that diverges significantly from the parallel nature of human vision. We introduce an instruction-agnostic token compression algorithm based on a plug-and-play visual decoder to improve decoding efficiency.
arXiv Detail & Related papers (2025-09-23T16:07:18Z)
- ViSpec: Accelerating Vision-Language Models with Vision-Aware Speculative Decoding [13.295759874474767]
We introduce Vision-Aware Speculative Decoding (ViSpec), a novel framework tailored for vision-language models (VLMs). ViSpec employs a lightweight vision adaptor module to compress image tokens into a compact representation. Our training strategy mitigates the risk of the draft model exploiting direct access to the target model's hidden states.
arXiv Detail & Related papers (2025-09-17T11:28:58Z)
- EVEv2: Improved Baselines for Encoder-Free Vision-Language Models [72.07868838411474]
Existing encoder-free vision-language models (VLMs) are narrowing the performance gap with their encoder-based counterparts. We develop efficient strategies for encoder-free VLMs that rival mainstream encoder-based ones. We show that properly and hierarchically associating vision and language within a unified model reduces interference between modalities.
arXiv Detail & Related papers (2025-02-10T18:59:58Z)
- FLIER: Few-shot Language Image Models Embedded with Latent Representations [2.443383032451177]
We propose FLIER, a Few-shot Language Image model Embedded with latent Representations, for image recognition.
We first generate images and corresponding latent representations via Stable Diffusion with the textual inputs from GPT-3.
With latent representations as "models-understandable pixels", we introduce a flexible convolutional neural network with two convolutional layers to be the latent encoder.
arXiv Detail & Related papers (2024-10-10T06:27:46Z)
- Pixel-Aligned Multi-View Generation with Depth Guided Decoder [86.1813201212539]
We propose a novel method for pixel-level image-to-multi-view generation.
Unlike prior work, we incorporate attention layers across multi-view images in the VAE decoder of a latent video diffusion model.
Our model enables better pixel alignment across multi-view images.
arXiv Detail & Related papers (2024-08-26T04:56:41Z)
- Unveiling Encoder-Free Vision-Language Models [62.52803514667452]
Existing vision-language models (VLMs) mostly rely on vision encoders to extract visual features followed by large language models (LLMs) for visual-language tasks.
We bridge the gap between encoder-based and encoder-free models, and present a simple yet effective training recipe towards pure VLMs.
We launch EVE, an encoder-free vision-language model that can be trained and forwarded efficiently.
arXiv Detail & Related papers (2024-06-17T17:59:44Z)
- High Fidelity Image Synthesis With Deep VAEs In Latent Space [0.0]
We present fast, realistic image generation on high-resolution, multimodal datasets using hierarchical variational autoencoders (VAEs).
In this two-stage setup, the autoencoder compresses the image into its semantic features, which are then modeled with a deep VAE.
We demonstrate the effectiveness of our two-stage approach, achieving an FID of 9.34 on the ImageNet-256 dataset, comparable to BigGAN.
arXiv Detail & Related papers (2023-03-23T23:45:19Z)
- Towards Coding for Human and Machine Vision: A Scalable Image Coding Approach [104.02201472370801]
We come up with a novel image coding framework by leveraging both the compressive and the generative models.
By introducing advanced generative models, we train a flexible network to reconstruct images from compact feature representations and the reference pixels.
Experimental results demonstrate the superiority of our framework in both human visual quality and facial landmark detection.
arXiv Detail & Related papers (2020-01-09T10:37:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.