Related papers: OmniTokenizer: A Joint Image-Video Tokenizer for Visual Generation

OmniTokenizer: A Joint Image-Video Tokenizer for Visual Generation

URL: http://arxiv.org/abs/2406.09399v1
Date: Thu, 13 Jun 2024 17:59:26 GMT
Title: OmniTokenizer: A Joint Image-Video Tokenizer for Visual Generation
Authors: Junke Wang, Yi Jiang, Zehuan Yuan, Binyue Peng, Zuxuan Wu, Yu-Gang Jiang,
Abstract summary: Tokenizer serves as a translator to map the intricate visual data into a compact latent space. This paper presents OmniTokenizer, a transformer-based tokenizer for joint image and video tokenization.
Score: 95.29102596532854
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Tokenizer, serving as a translator to map the intricate visual data into a compact latent space, lies at the core of visual generative models. Based on the finding that existing tokenizers are tailored to image or video inputs, this paper presents OmniTokenizer, a transformer-based tokenizer for joint image and video tokenization. OmniTokenizer is designed with a spatial-temporal decoupled architecture, which integrates window and causal attention for spatial and temporal modeling. To exploit the complementary nature of image and video data, we further propose a progressive training strategy, where OmniTokenizer is first trained on image data on a fixed resolution to develop the spatial encoding capacity and then jointly trained on image and video data on multiple resolutions to learn the temporal dynamics. OmniTokenizer, for the first time, handles both image and video inputs within a unified framework and proves the possibility of realizing their synergy. Extensive experiments demonstrate that OmniTokenizer achieves state-of-the-art (SOTA) reconstruction performance on various image and video datasets, e.g., 1.11 reconstruction FID on ImageNet and 42 reconstruction FVD on UCF-101, beating the previous SOTA methods by 13% and 26%, respectively. Additionally, we also show that when integrated with OmniTokenizer, both language model-based approaches and diffusion models can realize advanced visual synthesis performance, underscoring the superiority and versatility of our method. Code is available at https://github.com/FoundationVision/OmniTokenizer.

Related papers

UniWeTok: An Unified Binary Tokenizer with Codebook Size $\mathit{2^{128}}$ for Unified Multimodal Large Language Model [50.68870074090426]
We introduce UniWeTok, a unified discrete tokenizer for Unified Multimodal Large Language Models.<n>For training framework, we introduce Pre-Post Distillation and a Generative-Aware Prior to enhance the semantic extraction and generative prior of the discrete tokens.<n>We propose a three-stage training framework designed to enhance UniWeTok's adaptability cross various image resolutions and perception-sensitive scenarios.
arXiv Detail & Related papers (2026-02-15T15:07:19Z)
AToken: A Unified Tokenizer for Vision [26.55839382749872]
We present AToken, the first unified visual tokenizer that achieves both high-fidelity reconstruction and semantic understanding.<n>By employing a progressive training curriculum, AToken gradually expands from single images, videos, and 3D, and supports both continuous and discrete latent tokens.
arXiv Detail & Related papers (2025-09-17T23:11:18Z)
Hita: Holistic Tokenizer for Autoregressive Image Generation [56.81871174745175]
We introduce textitHita, a novel image tokenizer for autoregressive (AR) image generation.<n>It introduces a holistic-to-local tokenization scheme with learnable holistic queries and local patch tokens.
arXiv Detail & Related papers (2025-07-03T06:44:26Z)
Token-Shuffle: Towards High-Resolution Image Generation with Autoregressive Models [92.18057318458528]
Token-Shuffle is a novel method that reduces the number of image tokens in Transformer. Our strategy requires no additional pretrained text-encoder and enables MLLMs to support extremely high-resolution image synthesis. In GenAI-benchmark, our 2.7B model achieves 0.77 overall score on hard prompts, outperforming AR models LlamaGen by 0.18 and diffusion models LDM by 0.15.
arXiv Detail & Related papers (2025-04-24T17:59:56Z)
UniToken: Harmonizing Multimodal Understanding and Generation through Unified Visual Encoding [84.87802580670579]
We introduce UniToken, an auto-regressive generation model that encodes visual inputs through a combination of discrete and continuous representations. Our unified visual encoding framework captures both high-level semantics and low-level details, delivering multidimensional information.
arXiv Detail & Related papers (2025-04-06T09:20:49Z)
ILLUME+: Illuminating Unified MLLM with Dual Visual Tokenization and Diffusion Refinement [68.05833403672274]
Existing unified models have struggled to handle the three fundamental capabilities in a unified model: understanding, generation, and editing. ILLUME+ introduces a unified dual visual tokenizer, DualViTok, which preserves fine-grained textures and text-aligned semantics. We also employ a diffusion model as the image detokenizer for enhanced generation quality and efficient super-resolution.
arXiv Detail & Related papers (2025-04-02T17:45:00Z)
TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation [26.29803524047736]
TokenFlow is a novel unified image tokenizer that bridges the gap between multimodal understanding and generation. We demonstrate for the first time that discrete visual input can surpass LLaVA-1.5 13B in understanding performance. We also establish state-of-the-art performance in autoregressive image generation with a GenEval score of 0.55 at 256*256 resolution.
arXiv Detail & Related papers (2024-12-04T06:46:55Z)
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the modeling of its dynamics video. In this paper, we address such limitations in video pre-training with an efficient video decomposition. Our framework is both capable of comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z)
Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation [122.63617171522316]
Large Language Models (LLMs) are the dominant models for generative tasks in language. We introduce MAGVIT-v2, a video tokenizer designed to generate concise and expressive tokens for both videos and images.
arXiv Detail & Related papers (2023-10-09T14:10:29Z)
UMIFormer: Mining the Correlations between Similar Tokens for Multi-View 3D Reconstruction [9.874357856580447]
We propose a novel transformer network for Unstructured Multiple Images (UMIFormer) It exploits transformer blocks for decoupled intra-view encoding and designed blocks for token rectification. All tokens acquired from various branches are compressed into a fixed-size compact representation.
arXiv Detail & Related papers (2023-02-27T17:27:45Z)
OmniVL:One Foundation Model for Image-Language and Video-Language Tasks [117.57580168859512]
We present OmniVL, a new foundation model to support both image-language and video-language tasks using one universal architecture. We demonstrate, for the first time, such a paradigm benefits both image and video tasks, as opposed to the conventional one-directional transfer. We introduce a novel unified vision-language contrastive (UniVLC) loss to leverage image-text, video-text, image-label (e.g., image classification), video-label (e.g., video action recognition) data together.
arXiv Detail & Related papers (2022-09-15T17:59:59Z)
Frozen CLIP Models are Efficient Video Learners [86.73871814176795]
Video recognition has been dominated by the end-to-end learning paradigm. Recent advances in Contrastive Vision-Language Pre-training pave the way for a new route for visual recognition tasks. We present Efficient Video Learning -- an efficient framework for directly training high-quality video recognition models.
arXiv Detail & Related papers (2022-08-06T17:38:25Z)
Improving Visual Quality of Image Synthesis by A Token-based Generator with Transformers [51.581926074686535]
We present a new perspective of achieving image synthesis by viewing this task as a visual token generation problem. The proposed TokenGAN has achieved state-of-the-art results on several widely-used image synthesis benchmarks.
arXiv Detail & Related papers (2021-11-05T12:57:50Z)
Unified Image and Video Saliency Modeling [21.701431656717112]
We ask: Can image and video saliency modeling be approached via a unified model? We propose four novel domain adaptation techniques and an improved formulation of learned Gaussian priors. We integrate these techniques into a simple and lightweight encoder-RNN-decoder-style network, UNISAL, and train it jointly with image and video saliency data. We evaluate our method on the video saliency datasets DHF1K, Hollywood-2 and UCF-Sports, and the image saliency datasets SALICON and MIT300.
arXiv Detail & Related papers (2020-03-11T18:28:29Z)

This list is automatically generated from the titles and abstracts of the papers in this site.