UniToken: Harmonizing Multimodal Understanding and Generation through Unified Visual Encoding
- URL: http://arxiv.org/abs/2504.04423v1
- Date: Sun, 06 Apr 2025 09:20:49 GMT
- Title: UniToken: Harmonizing Multimodal Understanding and Generation through Unified Visual Encoding
- Authors: Yang Jiao, Haibo Qiu, Zequn Jie, Shaoxiang Chen, Jingjing Chen, Lin Ma, Yu-Gang Jiang,
- Abstract summary: We introduce UniToken, an auto-regressive generation model that encodes visual inputs through a combination of discrete and continuous representations.<n>Our unified visual encoding framework captures both high-level semantics and low-level details, delivering multidimensional information.
- Score: 84.87802580670579
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We introduce UniToken, an auto-regressive generation model that encodes visual inputs through a combination of discrete and continuous representations, enabling seamless integration of unified visual understanding and image generation tasks. Unlike previous approaches that rely on unilateral visual representations, our unified visual encoding framework captures both high-level semantics and low-level details, delivering multidimensional information that empowers heterogeneous tasks to selectively assimilate domain-specific knowledge based on their inherent characteristics. Through in-depth experiments, we uncover key principles for developing a unified model capable of both visual understanding and image generation. Extensive evaluations across a diverse range of prominent benchmarks demonstrate that UniToken achieves state-of-the-art performance, surpassing existing approaches. These results establish UniToken as a robust foundation for future research in this domain. The code and models are available at https://github.com/SxJyJay/UniToken.
Related papers
- Towards Generalized Multi-Image Editing for Unified Multimodal Models [56.620038824933566]
Unified Multimodal Models (UMMs) integrate multimodal understanding and generation.<n>UMMs are limited to maintaining visual consistency and disambiguating visual cues when referencing details across multiple input images.<n>We propose a scalable multi-image editing framework for UMMs that explicitly distinguishes image identities and generalizes to variable input counts.
arXiv Detail & Related papers (2026-01-09T06:42:49Z) - TokenAR: Multiple Subject Generation via Autoregressive Token-level enhancement [87.82338951215131]
TokenAR is a simple but effective token-level enhancement mechanism to address reference identity confusion problem.<n>Instruct Token Injection plays as a role of extra visual feature container to inject detailed and complementary priors for reference tokens.<n>The identity-token disentanglement strategy (ITD) explicitly guides the token representations toward independently representing the features of each identity.
arXiv Detail & Related papers (2025-10-18T03:36:26Z) - Ming-UniVision: Joint Image Understanding and Generation with a Unified Continuous Tokenizer [50.69959748410398]
We introduce MingTok, a new family of visual tokenizers with a continuous latent space for unified autoregressive generation and understanding.<n>MingTok adopts a three-stage sequential architecture involving low-level encoding, semantic expansion, and visual reconstruction.<n>Built on top of it, Ming-UniVision eliminates the need for task-specific visual representations, and unifies diverse vision-language tasks under a single autoregrsssive prediction paradigm.
arXiv Detail & Related papers (2025-10-08T02:50:14Z) - Harmonizing Visual Representations for Unified Multimodal Understanding and Generation [53.01486796503091]
We present emphHarmon, a unified autoregressive framework that harmonizes understanding and generation tasks with a shared MAR encoder.<n>Harmon achieves state-of-the-art image generation results on the GenEval, MJHQ30K and WISE benchmarks.
arXiv Detail & Related papers (2025-03-27T20:50:38Z) - DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies [25.77487827338777]
A vision tokenizer trained for reconstruction excels at capturing low-level perceptual details.<n>A vision encoder trained via contrastive learning aligns well with language but struggles to decode back into the pixel space for generation tasks.<n>We propose DualToken, a method that unifies representations for both understanding and generation within a single tokenizer.
arXiv Detail & Related papers (2025-03-18T14:56:46Z) - Unified Autoregressive Visual Generation and Understanding with Continuous Tokens [52.21981295470491]
We present UniFluid, a unified autoregressive framework for joint visual generation and understanding.
Our unified autoregressive architecture processes multimodal image and text inputs, generating discrete tokens for text and continuous tokens for image.
We find though there is an inherent trade-off between the image generation and understanding task, a carefully tuned training recipe enables them to improve each other.
arXiv Detail & Related papers (2025-03-17T17:58:30Z) - SemHiTok: A Unified Image Tokenizer via Semantic-Guided Hierarchical Codebook for Multimodal Understanding and Generation [71.68085485928007]
We introduce SemHiTok, a unified image tokenizer via Semantic-Guided Hierarchical codebook.<n>We show that SemHiTok achieves SOTA performance in image reconstruction and multimodal understanding under LLaVA-v1.5 setting.<n>We also develop a unified MLLM with SemHiTok, which exhibits superior performance across multimodal understanding and generation tasks.
arXiv Detail & Related papers (2025-03-09T20:42:34Z) - UniTok: A Unified Tokenizer for Visual Generation and Understanding [69.09699034036124]
Visual generative and understanding models typically rely on distinct tokenizers to process images.<n>We introduce UniTok, a unified tokenizer featuring a novel multi-codebook quantization mechanism.<n>In terms of final performance, UniTok sets a new record of 0.38 rFID and 78.6% zero-shot accuracy on ImageNet.
arXiv Detail & Related papers (2025-02-27T17:47:01Z) - VARGPT: Unified Understanding and Generation in a Visual Autoregressive Multimodal Large Language Model [38.61292051733335]
We present VARGPT, a novel multimodal large language model (MLLM) that unifies visual understanding and generation within a single autoregressive framework.<n>VarGPT employs a next-token prediction paradigm for visual understanding and a next-scale prediction paradigm for visual autoregressive generation.<n> Notably, VARGPT naturally supports capabilities in autoregressive visual generation and instruction-to-image synthesis, showcasing its versatility in both visual understanding and generation tasks.
arXiv Detail & Related papers (2025-01-21T17:50:43Z) - Importance-Based Token Merging for Efficient Image and Video Generation [41.94334394794811]
We show that preserving high-information tokens during merging significantly improves sample quality.
We propose an importance-based token merging method that prioritizes the most critical tokens in computational resource allocation.
arXiv Detail & Related papers (2024-11-23T02:01:49Z) - More Pictures Say More: Visual Intersection Network for Open Set Object Detection [4.206612461069489]
We introduce a strong DETR-based model, Visual Intersection Network for Open Set Object Detection (VINO)
VINO constructs a multi-image visual bank to preserve the semantic intersections of each category across all time steps.
Our approach guarantees a more precise alignment between target category semantics and region semantics, while significantly reducing pre-training time and resource demands.
arXiv Detail & Related papers (2024-08-26T05:52:35Z) - Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization [52.935150075484074]
We introduce a well-designed visual tokenizer to translate the non-linguistic image into a sequence of discrete tokens like a foreign language.
The resulting visual tokens encompass high-level semantics worthy of a word and also support dynamic sequence length varying from the image.
This unification empowers LaVIT to serve as an impressive generalist interface to understand and generate multi-modal content simultaneously.
arXiv Detail & Related papers (2023-09-09T03:01:38Z) - UniDiff: Advancing Vision-Language Models with Generative and
Discriminative Learning [86.91893533388628]
This paper presents UniDiff, a unified multi-modal model that integrates image-text contrastive learning (ITC), text-conditioned image synthesis learning (IS), and reciprocal semantic consistency modeling (RSC)
UniDiff demonstrates versatility in both multi-modal understanding and generative tasks.
arXiv Detail & Related papers (2023-06-01T15:39:38Z) - Single-Stream Multi-Level Alignment for Vision-Language Pretraining [103.09776737512078]
We propose a single stream model that aligns the modalities at multiple levels.
We achieve this using two novel tasks: symmetric cross-modality reconstruction and a pseudo-labeled key word prediction.
We demonstrate top performance on a set of Vision-Language downstream tasks such as zero-shot/fine-tuned image/text retrieval, referring expression, and VQA.
arXiv Detail & Related papers (2022-03-27T21:16:10Z) - Generating Annotated High-Fidelity Images Containing Multiple Coherent
Objects [10.783993190686132]
We propose a multi-object generation framework that can synthesize images with multiple objects without explicitly requiring contextual information.
We demonstrate how coherency and fidelity are preserved with our method through experiments on the Multi-MNIST and CLEVR datasets.
arXiv Detail & Related papers (2020-06-22T11:33:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.