DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies
- URL: http://arxiv.org/abs/2503.14324v2
- Date: Wed, 19 Mar 2025 12:58:33 GMT
- Title: DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies
- Authors: Wei Song, Yuran Wang, Zijia Song, Yadong Li, Haoze Sun, Weipeng Chen, Zenan Zhou, Jianhua Xu, Jiaqi Wang, Kaicheng Yu
- Abstract summary: A vision tokenizer trained for reconstruction excels at capturing low-level perceptual details. A vision encoder trained via contrastive learning aligns well with language but struggles to decode back into the pixel space for generation tasks. We propose DualToken, a method that unifies representations for both understanding and generation within a single tokenizer.
- Score: 25.77487827338777
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The differing representation spaces required for visual understanding and generation pose a challenge in unifying them within the autoregressive paradigm of large language models. A vision tokenizer trained for reconstruction excels at capturing low-level perceptual details, making it well-suited for visual generation but lacking high-level semantic representations for understanding tasks. Conversely, a vision encoder trained via contrastive learning aligns well with language but struggles to decode back into the pixel space for generation tasks. To bridge this gap, we propose DualToken, a method that unifies representations for both understanding and generation within a single tokenizer. However, directly integrating reconstruction and semantic objectives in a single tokenizer creates conflicts, degrading both reconstruction quality and semantic performance. Instead of forcing a single codebook to handle both semantic and perceptual information, DualToken disentangles them by introducing separate codebooks for high- and low-level features, effectively transforming their inherent conflict into a synergistic relationship. As a result, DualToken achieves state-of-the-art performance in both reconstruction and semantic tasks while demonstrating remarkable effectiveness in downstream MLLM understanding and generation tasks. Notably, we also show that DualToken, as a unified tokenizer, surpasses the naive combination of two distinct types of vision encoders, providing superior performance within a unified MLLM.
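As a rough illustration of the dual-vocabulary idea described in the abstract, the PyTorch sketch below quantizes high-level (semantic) and low-level (perceptual) features with two separate codebooks. The module names, dimensions, and loss wiring are assumptions made for this example, not details taken from the paper; the abstract only specifies that the two feature levels are quantized by separate codebooks rather than a shared one.

```python
# Minimal sketch of a dual-codebook tokenizer (illustrative, not the paper's implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F


class VectorQuantizer(nn.Module):
    """Plain VQ layer with a straight-through estimator."""

    def __init__(self, num_codes: int, dim: int):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):
        # z: (batch, tokens, dim) -> nearest codebook entry per token
        w = self.codebook.weight                                   # (K, dim)
        dist = (z.pow(2).sum(-1, keepdim=True)
                - 2 * z @ w.t()
                + w.pow(2).sum(-1))                                # (B, T, K)
        idx = dist.argmin(dim=-1)                                  # (B, T)
        z_q = self.codebook(idx)                                   # (B, T, dim)
        commit = F.mse_loss(z, z_q.detach()) + F.mse_loss(z.detach(), z_q)
        z_q = z + (z_q - z).detach()                               # straight-through gradient
        return z_q, idx, commit


class DualCodebookTokenizer(nn.Module):
    """Encoder with two quantized branches: low-level tokens for reconstruction,
    high-level tokens for semantic (e.g. contrastive) alignment."""

    def __init__(self, dim=256, num_codes=8192):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.low_head = nn.Linear(dim, dim)    # perceptual branch
        self.high_head = nn.Linear(dim, dim)   # semantic branch
        self.vq_low = VectorQuantizer(num_codes, dim)
        self.vq_high = VectorQuantizer(num_codes, dim)

    def forward(self, patch_embeds):
        h = self.encoder(patch_embeds)
        low_q, low_idx, low_loss = self.vq_low(self.low_head(h))
        high_q, high_idx, high_loss = self.vq_high(self.high_head(h))
        return (low_q, low_idx), (high_q, high_idx), low_loss + high_loss


tok = DualCodebookTokenizer()
dummy = torch.randn(2, 196, 256)          # 2 images, 196 patch embeddings each
(low_q, _), (high_q, _), vq_loss = tok(dummy)
print(low_q.shape, high_q.shape, vq_loss.item())
```

In a full model, the low-level tokens would feed a pixel decoder trained with a reconstruction loss while the high-level tokens would be aligned with text embeddings via a contrastive loss, so the two objectives no longer compete for the same codebook.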
Related papers
- UniToken: Harmonizing Multimodal Understanding and Generation through Unified Visual Encoding [84.87802580670579]
We introduce UniToken, an auto-regressive generation model that encodes visual inputs through a combination of discrete and continuous representations.
Our unified visual encoding framework captures both high-level semantics and low-level details, delivering multidimensional information.
arXiv Detail & Related papers (2025-04-06T09:20:49Z)
- Harmonizing Visual Representations for Unified Multimodal Understanding and Generation [53.01486796503091]
We present Harmon, a unified autoregressive framework that harmonizes understanding and generation tasks with a shared MAR encoder.
Harmon achieves state-of-the-art image generation results on the GenEval, MJHQ30K and WISE benchmarks.
arXiv Detail & Related papers (2025-03-27T20:50:38Z)
- SemHiTok: A Unified Image Tokenizer via Semantic-Guided Hierarchical Codebook for Multimodal Understanding and Generation [32.853205155830366]
SemHiTok is a unified image tokenizer built on a semantic-guided hierarchical codebook. We show that SemHiTok achieves a state-of-the-art rFID score at 256×256 resolution compared to other unified tokenizers.
arXiv Detail & Related papers (2025-03-09T20:42:34Z)
- QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive Multimodal Understanding and Generation [101.28446308930367]
Quantized Language-Image Pretraining (QLIP) combines state-of-the-art reconstruction quality with state-of-the-art zero-shot image understanding. QLIP trains a binary-spherical-quantization-based autoencoder with reconstruction and language-image alignment objectives. We demonstrate that QLIP enables a unified mixed-modality auto-regressive model for understanding and generation.
arXiv Detail & Related papers (2025-02-07T18:59:57Z)
- Towards the Next Frontier in Speech Representation Learning Using Disentanglement [34.21745744502759]
We propose a framework for learning disentangled self-supervised representations of speech, termed Learn2Diss, which consists of frame-level and utterance-level encoder modules.
We show that Learn2Diss achieves state-of-the-art results on a variety of tasks, with the frame-level encoder representations improving semantic tasks and the utterance-level representations improving non-semantic tasks.
arXiv Detail & Related papers (2024-07-02T07:13:35Z)
- DualCoOp++: Fast and Effective Adaptation to Multi-Label Recognition with Limited Annotations [79.433122872973]
Multi-label image recognition in the low-label regime is a challenging task of great practical significance.
We leverage the powerful alignment between textual and visual features pretrained with millions of auxiliary image-text pairs.
We introduce an efficient and effective framework called Evidence-guided Dual Context Optimization (DualCoOp++).
arXiv Detail & Related papers (2023-08-03T17:33:20Z)
- MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining [138.86293836634323]
MaskCLIP incorporates a newly proposed masked self-distillation into contrastive language-image pretraining.
MaskCLIP achieves superior results in linear probing, finetuning, and zero-shot performance with the guidance of the language encoder.
arXiv Detail & Related papers (2022-08-25T17:59:58Z)
- Single-Stream Multi-Level Alignment for Vision-Language Pretraining [103.09776737512078]
We propose a single-stream model that aligns the modalities at multiple levels.
We achieve this using two novel tasks: symmetric cross-modality reconstruction and pseudo-labeled keyword prediction.
We demonstrate top performance on a set of Vision-Language downstream tasks such as zero-shot/fine-tuned image/text retrieval, referring expression, and VQA.
arXiv Detail & Related papers (2022-03-27T21:16:10Z)