Image Understanding Makes for A Good Tokenizer for Image Generation
- URL: http://arxiv.org/abs/2411.04406v1
- Date: Thu, 07 Nov 2024 03:55:23 GMT
- Title: Image Understanding Makes for A Good Tokenizer for Image Generation
- Authors: Luting Wang, Yang Zhao, Zijian Zhang, Jiashi Feng, Si Liu, Bingyi Kang
- Abstract summary: We introduce a token-based IG framework, which relies on effective tokenizers to project images into token sequences.
We show that tokenizers with strong IU capabilities achieve superior IG performance across a variety of metrics, datasets, tasks, and proposal networks.
- Score: 62.875788091204626
- Abstract: Modern image generation (IG) models have been shown to capture rich semantics valuable for image understanding (IU) tasks. However, the potential of IU models to improve IG performance remains uncharted. We address this issue using a token-based IG framework, which relies on effective tokenizers to project images into token sequences. Currently, pixel reconstruction (e.g., VQGAN) dominates the training objective for image tokenizers. In contrast, our approach adopts the feature reconstruction objective, where tokenizers are trained by distilling knowledge from pretrained IU encoders. Comprehensive comparisons indicate that tokenizers with strong IU capabilities achieve superior IG performance across a variety of metrics, datasets, tasks, and proposal networks. Notably, VQ-KD CLIP achieves $4.10$ FID on ImageNet-1k (IN-1k). Visualization suggests that the superiority of VQ-KD can be partly attributed to the rich semantics within the VQ-KD codebook. We further introduce a straightforward pipeline to directly transform IU encoders into tokenizers, demonstrating exceptional effectiveness for IG tasks. These discoveries may energize further exploration into image tokenizer research and inspire the community to reassess the relationship between IU and IG. The code is released at https://github.com/magic-research/vector_quantization.
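To make the feature-reconstruction objective concrete, below is a minimal PyTorch sketch of a VQ-KD-style tokenizer loss: patch features are quantized against a codebook and trained to reproduce the features of a frozen IU teacher (e.g., CLIP) rather than raw pixels. The module sizes, the commitment weight, and the cosine-similarity loss form are illustrative assumptions, not the released implementation.

```python
# Minimal sketch of a feature-reconstruction (VQ-KD style) tokenizer objective.
# Module names, dimensions, and loss weights are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VQKDTokenizer(nn.Module):
    def __init__(self, dim=768, codebook_size=8192):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.codebook = nn.Embedding(codebook_size, dim)
        self.decoder = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def quantize(self, z):
        # Nearest-neighbour lookup in the codebook, with a straight-through estimator.
        d = torch.cdist(z, self.codebook.weight)   # (N, K) distances to all codes
        idx = d.argmin(dim=-1)                      # discrete token ids
        z_q = self.codebook(idx)
        z_q = z + (z_q - z).detach()                # straight-through gradient
        return z_q, idx

    def forward(self, patch_feats, teacher_feats):
        z = self.encoder(patch_feats)
        z_q, idx = self.quantize(z)
        recon = self.decoder(z_q)
        # Feature reconstruction: match the frozen IU teacher's features
        # (e.g., CLIP) via cosine similarity instead of reconstructing pixels.
        loss = 1 - F.cosine_similarity(recon, teacher_feats, dim=-1).mean()
        # Commitment term keeps encoder outputs close to their assigned codes.
        loss = loss + 0.25 * F.mse_loss(z, z_q.detach())
        return loss, idx
```

In this setup the codebook is optimized to carry the teacher's semantics, which is consistent with the paper's observation that the rich semantics of the VQ-KD codebook partly explain its IG gains.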
Related papers
- GaussianToken: An Effective Image Tokenizer with 2D Gaussian Splatting [64.84383010238908]
We propose GaussianToken, an effective image tokenizer based on 2D Gaussian Splatting.
Our framework integrates the local influence of 2D Gaussian distributions into the discrete space.
Competitive reconstruction performance on CIFAR, Mini-ImageNet, and ImageNet-1K demonstrates the effectiveness of our framework.
arXiv Detail & Related papers (2025-01-26T17:56:11Z)
- TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation [26.29803524047736]
TokenFlow is a novel unified image tokenizer that bridges the gap between multimodal understanding and generation.
We demonstrate for the first time that discrete visual input can surpass LLaVA-1.5 13B in understanding performance.
We also establish state-of-the-art performance in autoregressive image generation with a GenEval score of 0.55 at 256×256 resolution.
arXiv Detail & Related papers (2024-12-04T06:46:55Z)
- Factorized Visual Tokenization and Generation [37.56136469262736]
We introduce Factorized Quantization (FQ), a novel approach that revitalizes VQ-based tokenizers by decomposing a large codebook into multiple independent sub-codebooks.
This factorization reduces the lookup complexity of large codebooks, enabling more efficient and scalable visual tokenization.
Experiments show that the proposed FQGAN model substantially improves the reconstruction quality of visual tokenizers, achieving state-of-the-art performance.
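As a rough illustration of the factorization, the sketch below splits the latent across independent sub-codebooks, so that m books of size K give K^m effective codes while each lookup scans only K entries. All names and sizes are hypothetical, not the FQGAN implementation.

```python
# Hedged sketch of factorized quantization: each sub-codebook quantizes one
# slice of the latent, reducing lookup cost versus one monolithic codebook.
import torch
import torch.nn as nn

class FactorizedQuantizer(nn.Module):
    def __init__(self, dim=256, num_books=2, book_size=1024):
        super().__init__()
        assert dim % num_books == 0
        self.slice = dim // num_books
        self.books = nn.ModuleList(
            nn.Embedding(book_size, self.slice) for _ in range(num_books)
        )

    def forward(self, z):                                  # z: (N, dim)
        outs, ids = [], []
        for i, book in enumerate(self.books):
            zi = z[:, i * self.slice:(i + 1) * self.slice]
            d = torch.cdist(zi, book.weight)               # (N, book_size)
            idx = d.argmin(dim=-1)                         # ids in this sub-book
            zq = book(idx)
            outs.append(zi + (zq - zi).detach())           # straight-through
            ids.append(idx)
        return torch.cat(outs, dim=-1), torch.stack(ids, dim=-1)
```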
arXiv Detail & Related papers (2024-11-25T18:59:53Z)
- SGC-VQGAN: Towards Complex Scene Representation via Semantic Guided Clustering Codebook [9.993066868670283]
We introduce SGC-VQGAN, which uses a Semantic Online Clustering method and Consistent Semantic Learning to enhance token semantics.
Our approach constructs a temporospatially consistent semantic codebook, addressing issues of codebook collapse and imbalanced token semantics.
arXiv Detail & Related papers (2024-09-09T23:12:43Z)
- Rejuvenating image-GPT as Strong Visual Representation Learners [28.77567067712619]
This paper enhances image-GPT, one of the pioneering works to introduce autoregressive pretraining for next-pixel prediction.
We shift the prediction target from raw pixels to semantic tokens, enabling a higher-level understanding of visual content.
Experiments showcase that D-iGPT excels as a strong learner of visual representations.
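A minimal sketch of the shifted objective, assuming a causal model that returns per-position logits and a discretized IU encoder that supplies `semantic_ids` (both hypothetical interfaces): the loss is ordinary next-token cross-entropy, but over semantic tokens rather than raw pixels.

```python
# Illustrative next-token objective with semantic targets instead of raw pixels.
# `model` and `semantic_ids` are assumed interfaces, not the paper's code.
import torch
import torch.nn.functional as F

def d_igpt_step(model, patch_embeds, semantic_ids):
    """patch_embeds: (B, T, D) input patches; semantic_ids: (B, T) target tokens."""
    logits = model(patch_embeds[:, :-1])        # causal model predicts position t+1
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),    # (B*(T-1), vocab)
        semantic_ids[:, 1:].reshape(-1),        # shifted semantic-token targets
    )
```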
arXiv Detail & Related papers (2023-12-04T18:59:20Z)
- ADDP: Learning General Representations for Image Recognition and Generation with Alternating Denoising Diffusion Process [94.41510903676837]
We propose an Alternating Denoising Diffusion Process (ADDP) that integrates two spaces within a single representation learning framework.
In each denoising step, our method first decodes pixels from previous VQ tokens, then generates new VQ tokens from the decoded pixels.
The learned representations can be used to generate diverse high-fidelity images and also demonstrate excellent transfer performance on recognition tasks.
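The alternation can be sketched as below. This omits the noise/masking schedule of the actual diffusion process and uses greedy decoding purely for illustration; `pixel_decoder` and `token_predictor` are assumed interfaces, not the paper's released modules.

```python
# Minimal sketch of the ADDP-style alternation between token and pixel spaces.
import torch

@torch.no_grad()
def addp_step(tokens, pixel_decoder, token_predictor):
    """tokens: (B, T) current VQ token ids -> refreshed token ids."""
    pixels = pixel_decoder(tokens)              # 1) decode pixels from VQ tokens
    logits = token_predictor(pixels)            # 2) predict new tokens from pixels
    return logits.argmax(dim=-1)                # greedy choice, for illustration only

def addp_sample(init_tokens, pixel_decoder, token_predictor, steps=8):
    tokens = init_tokens
    for _ in range(steps):                      # alternate token -> pixel -> token
        tokens = addp_step(tokens, pixel_decoder, token_predictor)
    return pixel_decoder(tokens)                # final decoded image
```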
arXiv Detail & Related papers (2023-06-08T17:59:32Z)
- Unpaired Image Captioning by Image-level Weakly-Supervised Visual Concept Recognition [83.93422034664184]
Unpaired image captioning (UIC) aims to describe images without using image-caption pairs during training.
Most existing studies use off-the-shelf algorithms to obtain the visual concepts.
We propose a novel approach to achieve cost-effective UIC using image-level labels.
arXiv Detail & Related papers (2022-03-07T08:02:23Z)
- Vector-quantized Image Modeling with Improved VQGAN [93.8443646643864]
We propose a Vector-quantized Image Modeling approach that involves pretraining a Transformer to predict image tokens autoregressively.
We first propose multiple improvements over vanilla VQGAN from architecture to codebook learning, yielding better efficiency and reconstruction fidelity.
When trained on ImageNet at 256×256 resolution, we achieve an Inception Score (IS) of 175.1 and a Fréchet Inception Distance (FID) of 4.17, a dramatic improvement over the vanilla VQGAN.
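A hedged sketch of the overall generation recipe, assuming a causal transformer over token ids and a VQGAN decoder (interfaces and the start-token id are hypothetical): sample image tokens one at a time, then decode the finished sequence to pixels.

```python
# Sketch of vector-quantized image modeling: sample image tokens autoregressively,
# then decode with a VQGAN decoder. Interfaces here are assumptions.
import torch

@torch.no_grad()
def sample_image(transformer, vqgan_decoder, seq_len=256, vocab=8192, device="cpu"):
    # Placeholder start-of-sequence token id (assumed, not from the paper).
    tokens = torch.full((1, 1), vocab - 1, dtype=torch.long, device=device)
    for _ in range(seq_len):
        logits = transformer(tokens)[:, -1]             # next-token logits
        probs = torch.softmax(logits, dim=-1)
        nxt = torch.multinomial(probs, num_samples=1)   # sample one token id
        tokens = torch.cat([tokens, nxt], dim=1)
    return vqgan_decoder(tokens[:, 1:])                 # drop start token, decode
```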
arXiv Detail & Related papers (2021-10-09T18:36:00Z)
- Attention-Driven Dynamic Graph Convolutional Network for Multi-Label Image Recognition [53.17837649440601]
We propose an Attention-Driven Dynamic Graph Convolutional Network (ADD-GCN) to dynamically generate a specific graph for each image.
Experiments on public multi-label benchmarks demonstrate the effectiveness of our method.
arXiv Detail & Related papers (2020-12-05T10:10:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.