OpenVision 3: A Family of Unified Visual Encoders for Both Understanding and Generation
- URL: http://arxiv.org/abs/2601.15369v1
- Date: Wed, 21 Jan 2026 18:47:12 GMT
- Title: OpenVision 3: A Family of Unified Visual Encoders for Both Understanding and Generation
- Authors: Letian Zhang, Sucheng Ren, Yanqing Liu, Xianhang Li, Zeyu Wang, Yuyin Zhou, Huaxiu Yao, Zeyu Zheng, Weili Nie, Guilin Liu, Zhiding Yu, Cihang Xie
- Abstract summary: This paper presents a family of advanced vision encoders, named OpenVision 3, that learns a single, unified visual representation. Our core architecture is simple: we feed VAE-compressed image latents to a ViT encoder and train its output to support two complementary roles. For multimodal understanding, we plug the encoder into the LLaVA-1.5 framework; for generation, we test it under the RAE framework.
- Score: 101.82480298904225
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents a family of advanced vision encoders, named OpenVision 3, that learns a single, unified visual representation serving both image understanding and image generation. Our core architecture is simple: we feed VAE-compressed image latents to a ViT encoder and train its output to support two complementary roles. First, the encoder output is passed to the ViT-VAE decoder to reconstruct the original image, encouraging the representation to capture generative structure. Second, the same representation is optimized with contrastive learning and image-captioning objectives, strengthening semantic features. By jointly optimizing reconstruction- and semantics-driven signals in a shared latent space, the encoder learns representations that synergize and generalize well across both regimes. We validate this unified design through extensive downstream evaluations with the encoder frozen. For multimodal understanding, we plug the encoder into the LLaVA-1.5 framework: it performs comparably to a standard CLIP vision encoder (e.g., 62.4 vs 62.2 on SeedBench, and 83.7 vs 82.9 on POPE). For generation, we test it under the RAE framework: ours substantially surpasses the standard CLIP-based encoder (e.g., gFID: 1.89 vs 2.54 on ImageNet). We hope this work can spur future research on unified modeling.
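The abstract fully specifies the loss structure, so it can be summarized in code. Below is a minimal PyTorch sketch of the described joint objective; the submodule interfaces (`encoder`, `vae_decoder`, `text_encoder`, `caption_head`), the mean-pooled contrastive embedding, the fixed temperature, and the loss weights are all assumptions for illustration, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UnifiedEncoderObjective(nn.Module):
    """Joint reconstruction + contrastive + captioning objective.
    All submodule interfaces are hypothetical stand-ins."""
    def __init__(self, encoder, vae_decoder, text_encoder, caption_head,
                 w_rec=1.0, w_con=1.0, w_cap=1.0, temperature=0.07):
        super().__init__()
        self.encoder = encoder            # ViT over VAE-compressed latents
        self.vae_decoder = vae_decoder    # ViT-VAE decoder back to pixels
        self.text_encoder = text_encoder  # CLIP-style text tower
        self.caption_head = caption_head  # assumed to return a captioning loss
        self.w_rec, self.w_con, self.w_cap = w_rec, w_con, w_cap
        self.temperature = temperature

    def forward(self, vae_latents, images, text_tokens):
        z = self.encoder(vae_latents)                    # (B, N, D) tokens
        # Role 1: reconstruct the original image from the shared tokens.
        loss_rec = F.mse_loss(self.vae_decoder(z), images)
        # Role 2a: CLIP-style contrastive alignment (mean-pooled tokens).
        img_emb = F.normalize(z.mean(dim=1), dim=-1)
        txt_emb = F.normalize(self.text_encoder(text_tokens), dim=-1)
        logits = img_emb @ txt_emb.t() / self.temperature
        targets = torch.arange(images.size(0), device=images.device)
        loss_con = 0.5 * (F.cross_entropy(logits, targets)
                          + F.cross_entropy(logits.t(), targets))
        # Role 2b: caption the image from the same visual tokens.
        loss_cap = self.caption_head(z, text_tokens)
        return (self.w_rec * loss_rec + self.w_con * loss_con
                + self.w_cap * loss_cap)
```

The point this sketch makes concrete is that all three losses backpropagate into the same encoder output `z`, which is what forces a single representation to serve both the understanding and generation regimes.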
Related papers
- OneVision-Encoder: Codec-Aligned Sparsity as a Foundational Principle for Multimodal Intelligence [113.73007911004446]
OneVision-Encoder encodes video by compressing visual structure into semantic meaning. Codec-aligned, patch-level sparsity is a foundational principle, enabling OV-Encoder to serve as a scalable engine for next-generation visual generalists.
arXiv Detail & Related papers (2026-02-09T14:06:17Z)
- VQRAE: Representation Quantization Autoencoders for Multimodal Understanding, Generation and Reconstruction [83.50898344094153]
VQRAE produces continuous semantic features for image understanding and discrete tokens for visual generation within a unified tokenizer. The design sacrifices negligible semantic information, preserving multimodal understanding while yielding discrete tokens for generation. VQRAE presents competitive performance on several benchmarks of visual understanding, generation and reconstruction.
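The summary does not spell out VQRAE's quantizer, but the continuous-plus-discrete split it describes is typically realized with vector quantization. A generic sketch, assuming a standard nearest-neighbor codebook with a straight-through estimator (VQRAE's actual design may differ):

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Generic VQ layer: continuous features for understanding,
    discrete code indices for generation. Not VQRAE's actual code."""
    def __init__(self, num_codes=8192, dim=256):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):  # z: (B, N, D) continuous encoder features
        # Nearest codebook entry per token under Euclidean distance.
        dists = torch.cdist(z, self.codebook.weight.expand(z.size(0), -1, -1))
        ids = dists.argmin(dim=-1)          # (B, N) discrete tokens
        zq = self.codebook(ids)             # quantized features
        zq = z + (zq - z).detach()          # straight-through gradients
        return z, zq, ids                   # continuous, quantized, discrete
```

A full VQ-VAE-style setup would add codebook and commitment losses; they are omitted here for brevity.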
arXiv Detail & Related papers (2025-11-28T17:26:34Z)
- VUGEN: Visual Understanding priors for GENeration [18.840804846528865]
VUGEN is a novel framework that explicitly leverages a VLM's pretrained visual understanding priors for efficient and high-quality image generation. Our approach first transforms the high-dimensional latent space of the VLM's native vision encoder into a lower-dimensional, tractable distribution. A dedicated pixel decoder maps these generated latents back to the image space.
arXiv Detail & Related papers (2025-10-08T00:04:47Z)
- Aligning Visual Foundation Encoders to Tokenizers for Diffusion Models [37.59115132356727]
We propose aligning pretrained visual encoders to serve as tokenizers for latent diffusion models in image generation. On ImageNet 256×256, our tokenizer accelerates the convergence of diffusion models, reaching a gFID of 1.90 within just 64 epochs. Our method is simple, scalable, and establishes a semantically grounded paradigm for continuous tokenizer design.
arXiv Detail & Related papers (2025-09-29T17:57:39Z)
- Unified Multimodal Model as Auto-Encoder [69.38946823657592]
We introduce a paradigm that treats understanding as the encoder (I2T), compressing images into text, and generation as the decoder (T2I), reconstructing images from that text. Our empirical results suggest that understanding can largely enhance generation (verified on GenEval), while generation, in turn, notably strengthens fine-grained visual perception.
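Under this reading, one training step is literally an autoencoder step with text as the bottleneck. A hypothetical sketch, where `i2t`, `t2i`, and `perceptual_loss` are placeholder callables rather than the paper's actual components:

```python
def autoencoder_step(image, i2t, t2i, perceptual_loss):
    """Unified model as auto-encoder: understanding compresses,
    generation reconstructs. All callables are hypothetical."""
    caption = i2t(image)                   # encoder: image -> text (I2T)
    recon = t2i(caption)                   # decoder: text -> image (T2I)
    return perceptual_loss(recon, image)   # reconstruction objective
```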
arXiv Detail & Related papers (2025-09-11T17:57:59Z)
- FLIER: Few-shot Language Image Models Embedded with Latent Representations [2.443383032451177]
We propose a Few-shot Language Image model Embedded with latent Representations (FLIER) for image recognition.
We first generate images and corresponding latent representations via Stable Diffusion with the textual inputs from GPT-3.
With latent representations as "models-understandable pixels", we introduce a flexible convolutional neural network with two convolutional layers as the latent encoder.
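The summary pins down only the depth of this latent encoder (two convolutional layers over Stable Diffusion latents); everything else below (channel widths, strides, pooling) is an assumed configuration for illustration:

```python
import torch.nn as nn

# Hypothetical FLIER-style latent encoder. Stable Diffusion latents have
# 4 channels; the widths and strides here are illustrative guesses.
latent_encoder = nn.Sequential(
    nn.Conv2d(4, 64, kernel_size=3, stride=2, padding=1),
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),
    nn.ReLU(inplace=True),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),  # -> (B, 128) embedding of the latent "pixels"
)
```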
arXiv Detail & Related papers (2024-10-10T06:27:46Z)
- VNVC: A Versatile Neural Video Coding Framework for Efficient Human-Machine Vision [59.632286735304156]
It is more efficient to enhance/analyze the coded representations directly without decoding them into pixels.
We propose a versatile neural video coding (VNVC) framework, which targets learning compact representations to support both reconstruction and direct enhancement/analysis.
arXiv Detail & Related papers (2023-06-19T03:04:57Z)
- LoopITR: Combining Dual and Cross Encoder Architectures for Image-Text Retrieval [117.15862403330121]
We propose LoopITR, which combines dual encoders and cross encoders in the same network for joint learning.
Specifically, we let the dual encoder provide hard negatives to the cross encoder, and use the more discriminative cross encoder to distill its predictions back to the dual encoder.
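A sketch of how such a loop can be wired up, assuming a `dual_enc` that returns pooled embeddings and a `cross_enc` that rescores (image, text) pairs given mined negative indices; both interfaces and the temperature are assumptions, not LoopITR's exact API:

```python
import torch
import torch.nn.functional as F

def loopitr_step(images, texts, dual_enc, cross_enc, k=4, tau=2.0):
    """Hypothetical LoopITR-style step: the dual encoder mines hard
    negatives for the cross encoder, whose scores are distilled back."""
    img_emb, txt_emb = dual_enc(images, texts)        # (B, D) each
    sims = img_emb @ txt_emb.t()                      # (B, B) similarities
    # Hard negatives: top-k most similar non-matching texts per image.
    masked = sims - torch.eye(len(sims), device=sims.device) * 1e4
    neg_ids = masked.topk(k, dim=1).indices           # (B, k)
    # Dual-encoder logits for the positive (index 0) and mined negatives.
    dual_logits = torch.cat([sims.diagonal().unsqueeze(1),
                             sims.gather(1, neg_ids)], dim=1)
    # Cross encoder rescores the same pairs (assumed interface).
    cross_logits = cross_enc(images, texts, neg_ids)  # (B, 1 + k)
    # Distill the cross encoder's soft predictions into the dual encoder.
    loss_kd = F.kl_div(F.log_softmax(dual_logits / tau, dim=1),
                       F.softmax(cross_logits.detach() / tau, dim=1),
                       reduction='batchmean') * tau ** 2
    # The cross encoder itself learns to rank the positive first.
    targets = torch.zeros(len(images), dtype=torch.long, device=sims.device)
    return loss_kd + F.cross_entropy(cross_logits, targets)
```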
arXiv Detail & Related papers (2022-03-10T16:41:12Z)
- Distilled Dual-Encoder Model for Vision-Language Understanding [50.42062182895373]
We propose a cross-modal attention distillation framework to train a dual-encoder model for vision-language understanding tasks.
We show that applying the cross-modal attention distillation for both pre-training and fine-tuning stages achieves further improvements.
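The summary leaves the distillation target implicit; a common instantiation matches the dual-encoder student's attention distributions to the fusion teacher's. A minimal sketch under that assumption:

```python
import torch.nn.functional as F

def attention_distill_loss(student_attn, teacher_attn, eps=1e-8):
    """Cross-modal attention distillation sketch: both inputs are
    (B, heads, L, L) attention probabilities (rows sum to 1)."""
    # KL(teacher || student), computed row-wise over attention maps.
    return F.kl_div((student_attn + eps).log(), teacher_attn,
                    reduction='batchmean')
```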
arXiv Detail & Related papers (2021-12-16T09:21:18Z)