UniFlow: A Unified Pixel Flow Tokenizer for Visual Understanding and Generation
- URL: http://arxiv.org/abs/2510.10575v1
- Date: Sun, 12 Oct 2025 12:50:23 GMT
- Title: UniFlow: A Unified Pixel Flow Tokenizer for Visual Understanding and Generation
- Authors: Zhengrong Yue, Haiyu Zhang, Xiangyu Zeng, Boyu Chen, Chenting Wang, Shaobin Zhuang, Lu Dong, KunPeng Du, Yi Wang, Limin Wang, Yali Wang
- Abstract summary: We propose a generic and unified tokenizer, namely UniFlow, by flexibly adapting any visual encoder with a concise reconstruction decoder. Specifically, we introduce layer-wise adaptive self-distillation applied to the well-pretrained visual encoders, which enables UniFlow to simultaneously inherit the strong semantic features for visual understanding and flexibly adapt to model fine-grained details for visual generation. We also propose a lightweight patch-wise pixel flow decoder, which efficiently achieves high-fidelity pixel reconstruction.
- Score: 34.32666840641278
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The tokenizer is a crucial component for both visual understanding and generation. To advance toward the ultimate goal of universal modeling, recent research has focused on developing a unified tokenizer. However, existing tokenizers face a significant performance trade-off between understanding and generation, stemming from the inherent conflict between high-level semantic abstraction and low-level pixel reconstruction. To tackle this challenge, we propose a generic and unified tokenizer, namely UniFlow, by flexibly adapting any visual encoder with a concise reconstruction decoder. Specifically, we introduce layer-wise adaptive self-distillation applied to the well-pretrained visual encoders, which enables UniFlow to simultaneously inherit the strong semantic features for visual understanding and flexibly adapt to model fine-grained details for visual generation. Moreover, we propose a lightweight patch-wise pixel flow decoder, which efficiently achieves high-fidelity pixel reconstruction by modeling a conditional flow from the noisy state back to the patch-wise pixel domain. By leveraging the semantic features as visual conditions for the decoder, we effectively alleviate the training conflicts between understanding and generation. Furthermore, the patch-wise learning strategy simplifies the data distribution, thereby improving training efficiency. Extensive experiments across 13 challenging benchmarks spanning 7 widely studied visual understanding and generation tasks demonstrate that UniFlow achieves a win-win outcome. For instance, our 7B UniFlow-XL not only surpasses the 14B TokenFlow-XL by 7.75% on average across understanding benchmarks, but also achieves competitive results in both visual reconstruction and generation, surpassing UniTok by 0.15 in rFID and 0.09 in gFID (without guidance), respectively.
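The abstract's central mechanism is a lightweight patch-wise pixel flow decoder that models a conditional flow from a noisy state back to the patch-wise pixel domain, using the encoder's semantic features as the visual condition. The sketch below illustrates how such a patch-wise conditional flow-matching objective could look; all module and variable names (PatchFlowDecoder, semantic_cond, the straight-line interpolation path) are assumptions for illustration, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class PatchFlowDecoder(nn.Module):
    """Illustrative patch-wise conditional flow decoder (names are assumptions)."""
    def __init__(self, patch_dim: int, cond_dim: int, hidden: int = 512):
        super().__init__()
        # Small MLP predicting the velocity field for each patch,
        # conditioned on the encoder's semantic feature and the flow time t.
        self.net = nn.Sequential(
            nn.Linear(patch_dim + cond_dim + 1, hidden),
            nn.SiLU(),
            nn.Linear(hidden, hidden),
            nn.SiLU(),
            nn.Linear(hidden, patch_dim),
        )

    def forward(self, x_t, t, semantic_cond):
        # x_t: (B, N, patch_dim) noisy patches; t: (B, N, 1); semantic_cond: (B, N, cond_dim)
        return self.net(torch.cat([x_t, semantic_cond, t], dim=-1))

def flow_matching_loss(decoder, pixels, semantic_cond):
    """Rectified-flow style objective: regress the straight-line velocity (x1 - x0)."""
    noise = torch.randn_like(pixels)                           # x0 ~ N(0, I) per patch
    t = torch.rand(pixels.shape[0], 1, 1, device=pixels.device)
    t = t.expand(-1, pixels.shape[1], -1)                      # broadcast time to all patches
    x_t = (1 - t) * noise + t * pixels                         # point on the linear path x0 -> x1
    target_velocity = pixels - noise
    pred_velocity = decoder(x_t, t, semantic_cond)
    return ((pred_velocity - target_velocity) ** 2).mean()
```

Because each patch is regressed toward a simple straight-line velocity target, the per-patch data distribution is far simpler than that of whole images, which is consistent with the abstract's claim that patch-wise learning improves training efficiency.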
Related papers
- VQRAE: Representation Quantization Autoencoders for Multimodal Understanding, Generation and Reconstruction [83.50898344094153]
VQRAE produces continuous semantic features for image understanding and discrete tokens for visual generation within a unified tokenizer.
This design sacrifices negligible semantic information, preserving multimodal understanding ability while providing discrete tokens for generation.
VQRAE presents competitive performance on several benchmarks of visual understanding, generation and reconstruction.
arXiv Detail & Related papers (2025-11-28T17:26:34Z) - Learning to Expand Images for Efficient Visual Autoregressive Modeling [26.400433163290586]
We introduce Expanding Autoregressive Representation (EAR), a novel generation paradigm that emulates the human visual system's center-outward perception pattern.
EAR unfolds image tokens in a spiral order from the center and progressively expands outward, preserving spatial continuity and enabling efficient parallel decoding.
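EAR's summary describes unfolding image tokens in a spiral order from the center outward. The snippet below is one simple way to compute such a center-outward ordering over a token grid (ring by ring, with ties broken by angle); the actual ordering and parallel-decoding schedule used by the paper may differ.

```python
import math

def center_outward_order(h: int, w: int) -> list[tuple[int, int]]:
    """Return grid coordinates sorted ring by ring from the center outward."""
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    coords = [(r, c) for r in range(h) for c in range(w)]

    def key(rc):
        r, c = rc
        # Chebyshev distance groups cells into square rings around the center;
        # the angle orders cells within a ring so the traversal sweeps around it.
        ring = max(abs(r - cy), abs(c - cx))
        angle = math.atan2(r - cy, c - cx)
        return (ring, angle)

    return sorted(coords, key=key)

# Example: ordering for a 4x4 token grid, center tokens first.
print(center_outward_order(4, 4)[:4])
```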
arXiv Detail & Related papers (2025-11-19T14:55:07Z) - Blockwise Flow Matching: Improving Flow Matching Models For Efficient High-Quality Generation [33.177998521195114]
Flow Matching models have pushed the boundaries of high-fidelity data generation across a wide range of domains.
We propose Blockwise Flow Matching (BFM), a novel framework that partitions the generative trajectory into multiple temporal segments.
BFM achieves 2.1x to 4.9x acceleration in inference at comparable generation performance.
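BFM's summary states that the generative trajectory is partitioned into multiple temporal segments. A hedged sketch of one way to realize this, routing each flow time t to a segment-specific velocity network, is given below; the number of blocks, module names, and equal-width segmentation are assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

class BlockwiseVelocity(nn.Module):
    """Illustrative blockwise velocity field: [0, 1) is split into equal
    temporal segments, each served by its own small network."""
    def __init__(self, dim: int, num_blocks: int = 4, hidden: int = 256):
        super().__init__()
        self.num_blocks = num_blocks
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Linear(dim + 1, hidden), nn.SiLU(), nn.Linear(hidden, dim))
            for _ in range(num_blocks)
        )

    def forward(self, x, t):
        # x: (B, dim), t: (B, 1) in [0, 1); pick the sub-network whose segment contains t.
        idx = torch.clamp((t * self.num_blocks).long(), max=self.num_blocks - 1)
        out = torch.zeros_like(x)
        for b, net in enumerate(self.blocks):
            mask = (idx.squeeze(-1) == b)
            if mask.any():
                out[mask] = net(torch.cat([x[mask], t[mask]], dim=-1))
        return out
```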
arXiv Detail & Related papers (2025-10-24T05:41:23Z) - Aligning Visual Foundation Encoders to Tokenizers for Diffusion Models [37.59115132356727]
We propose aligning pretrained visual encoders to serve as tokenizers for latent diffusion models in image generation.
On ImageNet 256×256, our tokenizer accelerates the convergence of diffusion models, reaching a gFID of 1.90 within just 64 epochs.
Our method is simple, scalable, and establishes a semantically grounded paradigm for continuous tokenizer design.
arXiv Detail & Related papers (2025-09-29T17:57:39Z) - Harmonizing Visual Representations for Unified Multimodal Understanding and Generation [53.01486796503091]
We present Harmon, a unified autoregressive framework that harmonizes understanding and generation tasks with a shared MAR encoder.
Harmon achieves state-of-the-art image generation results on the GenEval, MJHQ30K and WISE benchmarks.
arXiv Detail & Related papers (2025-03-27T20:50:38Z) - "Principal Components" Enable A New Language of Images [79.45806370905775]
We introduce a novel visual tokenization framework that embeds a provable PCA-like structure into the latent token space.
Our approach achieves state-of-the-art reconstruction performance and enables better interpretability to align with the human vision system.
arXiv Detail & Related papers (2025-03-11T17:59:41Z) - TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation [26.29803524047736]
TokenFlow is a novel unified image tokenizer that bridges the gap between multimodal understanding and generation.
We demonstrate for the first time that discrete visual input can surpass LLaVA-1.5 13B in understanding performance.
We also establish state-of-the-art performance in autoregressive image generation with a GenEval score of 0.55 at 256×256 resolution.
arXiv Detail & Related papers (2024-12-04T06:46:55Z) - FlowerFormer: Empowering Neural Architecture Encoding using a Flow-aware Graph Transformer [22.99322654701362]
We introduce FlowerFormer, a powerful graph transformer that incorporates the information flows within a neural architecture.
Our experiments demonstrate the superiority of FlowerFormer over existing neural encoding methods.
arXiv Detail & Related papers (2024-03-19T15:21:10Z) - Perceiving Longer Sequences With Bi-Directional Cross-Attention Transformers [13.480259378415505]
BiXT scales linearly with input size in terms of computational cost and memory consumption.
BiXT is inspired by the Perceiver architectures but replaces iterative attention with an efficient bi-directional cross-attention module.
By combining efficiency with the generality and performance of a full Transformer architecture, BiXT can process longer sequences.
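BiXT's summary replaces Perceiver-style iterative attention with a bi-directional cross-attention module between a small latent array and the full token sequence. The sketch below shows the general pattern using standard PyTorch attention; it is a loose illustration under assumed names, not BiXT's exact module.

```python
import torch
import torch.nn as nn

class BidirectionalCrossAttention(nn.Module):
    """Illustrative bi-directional cross-attention between a small latent array
    and the full token sequence (not BiXT's exact module)."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # dim must be divisible by num_heads.
        self.tokens_to_latents = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.latents_to_tokens = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, latents, tokens):
        # latents: (B, M, dim) with M << N; tokens: (B, N, dim).
        # Latents gather information from all tokens (cost ~ O(M * N)) ...
        latents = latents + self.tokens_to_latents(latents, tokens, tokens)[0]
        # ... and tokens are refined from the updated latents (also ~ O(M * N)),
        # so the overall cost stays linear in the sequence length N.
        tokens = tokens + self.latents_to_tokens(tokens, latents, latents)[0]
        return latents, tokens
```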
arXiv Detail & Related papers (2024-02-19T13:38:15Z) - In-N-Out Generative Learning for Dense Unsupervised Video Segmentation [89.21483504654282]
In this paper, we focus on the unsupervised Video Object (VOS) task which learns visual correspondence from unlabeled videos.
We propose the In-aNd-Out (INO) generative learning from a purely generative perspective, which captures both high-level and fine-grained semantics.
Our INO outperforms previous state-of-the-art methods by significant margins.
arXiv Detail & Related papers (2022-03-29T07:56:21Z) - Adaptive Fourier Neural Operators: Efficient Token Mixers for Transformers [55.90468016961356]
We propose an efficient token mixer that learns to mix in the Fourier domain.
AFNO is based on a principled foundation of operator learning.
It can handle a sequence size of 65k and outperforms other efficient self-attention mechanisms.
arXiv Detail & Related papers (2021-11-24T05:44:31Z)
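AFNO's summary describes a token mixer that learns to mix in the Fourier domain. The minimal sketch below captures the general recipe (FFT along the token axis, a learned per-frequency operation, inverse FFT) but omits AFNO's block-diagonal weights and soft-thresholding, so it should be read as an assumption-laden illustration rather than the paper's operator.

```python
import torch
import torch.nn as nn

class FourierTokenMixer(nn.Module):
    """Simplified Fourier-domain token mixer (not the full AFNO operator)."""
    def __init__(self, num_tokens: int, dim: int):
        super().__init__()
        freq = num_tokens // 2 + 1  # rfft output length over the token axis
        # One learnable complex-valued scale per (frequency, channel).
        self.weight = nn.Parameter(torch.randn(freq, dim, 2) * 0.02)

    def forward(self, x):
        # x: (B, N, dim) token sequence.
        spec = torch.fft.rfft(x, dim=1)                        # mix along the token axis
        w = torch.view_as_complex(self.weight)                 # (freq, dim)
        spec = spec * w                                        # learned per-frequency mixing
        return torch.fft.irfft(spec, n=x.shape[1], dim=1)      # back to token space
```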