An Image is Worth 32 Tokens for Reconstruction and Generation
- URL: http://arxiv.org/abs/2406.07550v1
- Date: Tue, 11 Jun 2024 17:59:56 GMT
- Title: An Image is Worth 32 Tokens for Reconstruction and Generation
- Authors: Qihang Yu, Mark Weber, Xueqing Deng, Xiaohui Shen, Daniel Cremers, Liang-Chieh Chen
- Abstract summary: Transformer-based 1-Dimensional Tokenizer (TiTok) is an innovative approach that tokenizes images into 1D latent sequences.
TiTok achieves performance competitive with state-of-the-art approaches.
Our best-performing variant significantly surpasses DiT-XL/2 (gFID 2.13 vs. 3.04) while generating high-quality samples 74x faster.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advancements in generative models have highlighted the crucial role of image tokenization in the efficient synthesis of high-resolution images. Tokenization, which transforms images into latent representations, reduces computational demands compared to directly processing pixels and enhances the effectiveness and efficiency of the generation process. Prior methods, such as VQGAN, typically utilize 2D latent grids with fixed downsampling factors. However, these 2D tokenizations struggle to manage the inherent redundancies present in images, where adjacent regions frequently display similarities. To overcome this issue, we introduce the Transformer-based 1-Dimensional Tokenizer (TiTok), an innovative approach that tokenizes images into 1D latent sequences. TiTok provides a more compact latent representation, yielding substantially more efficient and effective representations than conventional techniques. For example, a 256 x 256 x 3 image can be reduced to just 32 discrete tokens, a significant reduction from the 256 or 1024 tokens obtained by prior methods. Despite its compact nature, TiTok achieves performance competitive with state-of-the-art approaches. Specifically, using the same generator framework, TiTok attains 1.97 gFID, significantly outperforming the MaskGIT baseline by 4.21 on the ImageNet 256 x 256 benchmark. The advantages of TiTok become even more pronounced at higher resolutions. On the ImageNet 512 x 512 benchmark, TiTok not only outperforms the state-of-the-art diffusion model DiT-XL/2 (gFID 2.74 vs. 3.04), but also reduces the number of image tokens by 64x, leading to a 410x faster generation process. Our best-performing variant significantly surpasses DiT-XL/2 (gFID 2.13 vs. 3.04) while still generating high-quality samples 74x faster.
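To make the 1D tokenization flow concrete: a ViT-style encoder reads the patch tokens together with a small set of learnable 1D latent tokens, keeps only the latents, and vector-quantizes them into 32 discrete ids (a matching decoder, omitted here, reconstructs pixels from those ids plus mask tokens). The following is a minimal sketch of that flow; all names, layer counts, and hyperparameters are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

# Minimal sketch of a TiTok-style 1D tokenizer (illustrative assumptions).
class OneDTokenizerSketch(nn.Module):
    def __init__(self, image_size=256, patch_size=16, dim=512,
                 num_latent_tokens=32, codebook_size=4096):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        self.patch_embed = nn.Conv2d(3, dim, patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.randn(num_patches, dim) * 0.02)
        # Learnable 1D latent tokens: the whole image is distilled into these.
        self.latent_tokens = nn.Parameter(torch.randn(num_latent_tokens, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)
        self.codebook = nn.Embedding(codebook_size, dim)   # VQ codebook

    def forward(self, img):                                # img: (B, 3, 256, 256)
        x = self.patch_embed(img).flatten(2).transpose(1, 2) + self.pos_embed
        lat = self.latent_tokens.expand(img.shape[0], -1, -1)
        # Jointly encode patch and latent tokens; keep only the latents.
        z = self.encoder(torch.cat([x, lat], dim=1))[:, -lat.shape[1]:]
        # Nearest-codeword quantization -> 32 discrete token ids per image.
        d = torch.cdist(z, self.codebook.weight[None].expand(z.shape[0], -1, -1))
        ids = d.argmin(-1)
        return ids, self.codebook(ids)

ids, zq = OneDTokenizerSketch()(torch.randn(2, 3, 256, 256))
print(ids.shape)   # torch.Size([2, 32]): 196,608 pixel values -> 32 integers
```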
Related papers
- Token-Shuffle: Towards High-Resolution Image Generation with Autoregressive Models [92.18057318458528]
Token-Shuffle is a novel method that reduces the number of image tokens in the Transformer.
Our strategy requires no additional pretrained text encoder and enables MLLMs to support extremely high-resolution image synthesis.
On the GenAI benchmark, our 2.7B model achieves a 0.77 overall score on hard prompts, outperforming the AR model LlamaGen by 0.18 and the diffusion model LDM by 0.15.
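The summary does not spell out the mechanism, but the title suggests a pixel-shuffle-style operation on token grids: merging each s x s window of spatially local visual tokens into a single token along the channel dimension before the Transformer, and expanding back afterwards. The sketch below is a hedged guess at such an operator; the fusion/expansion layers and all shapes are assumptions.

```python
import torch
import torch.nn as nn

# Hedged sketch of a token-shuffle / token-unshuffle pair (assumptions).
class TokenShuffle(nn.Module):
    def __init__(self, dim, s=2):
        super().__init__()
        self.s = s
        self.fuse = nn.Linear(dim * s * s, dim)     # compress merged channels
        self.unfuse = nn.Linear(dim, dim * s * s)   # expand back for unshuffle

    def shuffle(self, x, h, w):                     # x: (B, h*w, C)
        b, _, c = x.shape
        x = x.view(b, h // self.s, self.s, w // self.s, self.s, c)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, (h * w) // self.s ** 2, -1)
        return self.fuse(x)                         # (B, h*w/s^2, C)

    def unshuffle(self, x, h, w):                   # inverse of shuffle
        b = x.shape[0]
        x = self.unfuse(x).view(b, h // self.s, w // self.s, self.s, self.s, -1)
        return x.permute(0, 1, 3, 2, 4, 5).reshape(b, h * w, -1)

ts = TokenShuffle(dim=64, s=2)
tokens = torch.randn(1, 16 * 16, 64)      # a 16x16 grid of image tokens
short = ts.shuffle(tokens, 16, 16)        # (1, 64, 64): 4x fewer tokens
restored = ts.unshuffle(short, 16, 16)    # (1, 256, 64)
```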
arXiv Detail & Related papers (2025-04-24T17:59:56Z)
- FlexTok: Resampling Images into 1D Token Sequences of Flexible Length [16.76602756308683]
We introduce FlexTok, a tokenizer that projects 2D images into variable-length, ordered 1D token sequences.
We evaluate our approach in an autoregressive generation setting using a simple GPT-style Transformer.
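A common way to obtain ordered, variable-length 1D sequences is a resampler whose learnable queries cross-attend to patch features, trained with nested dropout so that every prefix of the token sequence remains a valid encoding. The sketch below illustrates that pattern; it is an assumption-laden illustration, not FlexTok's actual architecture.

```python
import torch
import torch.nn as nn

# Sketch of a resampler for ordered, flexible-length 1D tokens (assumptions).
class FlexResamplerSketch(nn.Module):
    def __init__(self, dim=256, max_tokens=256):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(max_tokens, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, patch_feats, train=True):       # patch_feats: (B, N, D)
        q = self.queries.expand(patch_feats.shape[0], -1, -1)
        tokens, _ = self.attn(q, patch_feats, patch_feats)  # ordered 1D tokens
        if train:
            # Nested dropout: truncate to a random prefix length.
            k = int(torch.randint(1, tokens.shape[1] + 1, ()))
            tokens = tokens[:, :k]
        return tokens

rs = FlexResamplerSketch()
print(rs(torch.randn(2, 196, 256)).shape[1] <= 256)   # flexible token count
```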
arXiv Detail & Related papers (2025-02-19T18:59:44Z)
- SoftVQ-VAE: Efficient 1-Dimensional Continuous Tokenizer [45.720721058671856]
SoftVQ-VAE is a continuous image tokenizer that leverages soft categorical posteriors to aggregate multiple codewords into each latent token.
Our approach compresses 256x256 and 512x512 images using as few as 32 or 64 1-dimensional tokens.
Remarkably, SoftVQ-VAE improves inference throughput by up to 18x for generating 256x256 images and 55x for 512x512 images.
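The key difference from hard vector quantization is directly visible in code: instead of snapping each latent to its nearest codeword, a softmax over codeword distances produces a soft categorical posterior, and the output token is the probability-weighted mixture of codewords. A minimal sketch, with the temperature and distance metric as assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hedged sketch of soft vector quantization: each latent token becomes a
# softmax-weighted mixture of codewords rather than a single nearest codeword.
class SoftVQ(nn.Module):
    def __init__(self, codebook_size=4096, dim=256, tau=1.0):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)
        self.tau = tau

    def forward(self, z):                             # z: (B, L, D) latent tokens
        cb = self.codebook.weight[None].expand(z.shape[0], -1, -1)
        logits = -torch.cdist(z, cb) ** 2             # distances act as logits
        probs = F.softmax(logits / self.tau, dim=-1)  # soft categorical posterior
        # Aggregate multiple codewords into each continuous latent token.
        return probs @ self.codebook.weight           # (B, L, D)

vq = SoftVQ()
print(vq(torch.randn(2, 32, 256)).shape)   # 32 continuous 1D tokens per image
```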
arXiv Detail & Related papers (2024-12-14T20:29:29Z)
- Language-Guided Image Tokenization for Generation [63.0859685332583]
TexTok is a simple yet effective tokenization framework that leverages language to provide a compact, high-level semantic representation.
Compared to a conventional tokenizer without text conditioning, TexTok achieves average reconstruction FID improvements of 29.2% and 48.1% on the ImageNet-256 and -512 benchmarks, respectively.
TexTok with a vanilla DiT generator achieves state-of-the-art FID scores of 1.46 and 1.62 on ImageNet-256 and -512 respectively.
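One plausible reading of "leveraging language" is to inject projected caption embeddings into the tokenizer's input sequence, so the text carries the high-level semantics and the visual tokens can spend their capacity on appearance. The sketch below shows that conditioning pattern; the text encoder, fusion scheme, and all dimensions are assumptions, not TexTok's actual design.

```python
import torch
import torch.nn as nn

# Hedged sketch of language-guided tokenization via sequence concatenation.
class TextGuidedTokenizerSketch(nn.Module):
    def __init__(self, dim=512, text_dim=768, num_latent=64):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, dim)     # project caption embeddings
        self.latents = nn.Parameter(torch.randn(num_latent, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, img_tokens, text_emb):  # (B, N, dim), (B, T, text_dim)
        b = img_tokens.shape[0]
        txt = self.text_proj(text_emb)
        lat = self.latents.expand(b, -1, -1)
        x = torch.cat([img_tokens, txt, lat], dim=1)
        return self.encoder(x)[:, -lat.shape[1]:]     # text-conditioned tokens

tok = TextGuidedTokenizerSketch()
z = tok(torch.randn(2, 256, 512), torch.randn(2, 77, 768))
print(z.shape)   # torch.Size([2, 64, 512])
```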
arXiv Detail & Related papers (2024-12-08T03:18:17Z)
- 3D representation in 512-Byte: Variational tokenizer is the key for autoregressive 3D generation [4.221298212125194]
Variational Tokenizer (VAT) transforms unordered 3D data into compact latent tokens with an implicit hierarchy.
VAT enables scalable and efficient 3D generation, outperforming existing methods in quality, efficiency, and generalization.
arXiv Detail & Related papers (2024-12-03T06:31:25Z)
- Simpler Diffusion (SiD2): 1.5 FID on ImageNet512 with pixel-space diffusion [34.70370851239368]
We show that pixel-space models can in fact be very competitive with latent approaches in both quality and efficiency.
We present a simple recipe for scaling end-to-end pixel-space diffusion models to high resolutions.
arXiv Detail & Related papers (2024-10-25T06:20:06Z)
- MaskBit: Embedding-free Image Generation via Bit Tokens [54.827480008982185]
We present an empirical and systematic examination of VQGANs, leading to a modernized VQGAN.
A novel embedding-free generation network operating directly on bit tokens achieves a new state-of-the-art FID of 1.52 on the ImageNet 256x256 benchmark, with a compact generator of only 305M parameters.
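"Embedding-free" bit tokens can be pictured as binarizing each latent channel: a K-channel latent becomes a K-bit integer token with no codebook embedding table, and a straight-through estimator keeps training differentiable. A hedged sketch follows; the binarization and bit packing shown are one common construction, not necessarily the paper's exact formulation.

```python
import torch

# Hedged sketch of bit-token quantization (assumptions, not MaskBit's code).
def to_bit_tokens(z):                                  # z: (B, L, K) latents
    bits = torch.where(z > 0, torch.ones_like(z), -torch.ones_like(z))
    bits = z + (bits - z).detach()                     # straight-through gradient
    # Pack the K sign bits of each latent into one integer token id.
    ids = ((bits > 0).long() * (2 ** torch.arange(z.shape[-1]))).sum(-1)
    return bits, ids                                   # bit embeddings + token ids

z = torch.randn(2, 256, 12)                            # 12 bits -> 2^12 = 4096 tokens
bits, ids = to_bit_tokens(z)
assert int(ids.max()) < 2 ** 12                        # ids live in [0, 4096)
```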
arXiv Detail & Related papers (2024-09-24T16:12:12Z)
- Alleviating Distortion in Image Generation via Multi-Resolution Diffusion Models [26.926712014346432]
This paper presents innovative enhancements to diffusion models by integrating a novel multi-resolution network and time-dependent layer normalization.
Our method's efficacy is demonstrated on the class-conditional ImageNet generation benchmark, setting new state-of-the-art FID scores of 1.70 on ImageNet 256 x 256 and 2.89 on ImageNet 512 x 512.
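Time-dependent layer normalization typically means the normalization's scale and shift are regressed from the diffusion timestep embedding, adaLN-style. A minimal sketch of such a layer, with all dimensions as assumptions:

```python
import torch
import torch.nn as nn

# Hedged sketch of time-dependent layer normalization: scale and shift come
# from the timestep embedding rather than fixed learned affine parameters.
class TimeDependentLayerNorm(nn.Module):
    def __init__(self, dim, time_dim=256):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(time_dim, 2 * dim)

    def forward(self, x, t_emb):          # x: (B, L, D), t_emb: (B, time_dim)
        scale, shift = self.to_scale_shift(t_emb).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale[:, None]) + shift[:, None]

ln = TimeDependentLayerNorm(512)
print(ln(torch.randn(2, 64, 512), torch.randn(2, 256)).shape)
```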
arXiv Detail & Related papers (2024-06-13T17:59:58Z)
- A Pytorch Reproduction of Masked Generative Image Transformer [4.205139792076062]
We present a reproduction of MaskGIT: Masked Generative Image Transformer, using PyTorch.
The approach leverages a masked bidirectional transformer architecture, enabling image generation in only a few steps.
We achieve results that closely align with the findings presented in the original paper.
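The few-step generation works by parallel iterative decoding: start from an all-mask sequence, predict every token at once, keep the most confident predictions, and re-mask the rest on a decreasing cosine schedule. A hedged sketch of that loop, where `model` stands for any bidirectional token predictor returning per-position logits:

```python
import torch

# Hedged sketch of MaskGIT-style parallel iterative decoding.
@torch.no_grad()
def maskgit_decode(model, seq_len=256, vocab=1024, steps=8):
    mask_id = vocab                                    # id outside the codebook
    tokens = torch.full((1, seq_len), mask_id)
    for step in range(steps):
        logits = model(tokens)                         # (1, seq_len, vocab)
        conf, pred = logits.softmax(-1).max(-1)
        # Already-fixed tokens are never re-masked: give them infinite confidence.
        conf = torch.where(tokens == mask_id, conf, torch.full_like(conf, float("inf")))
        tokens = torch.where(tokens == mask_id, pred, tokens)
        # Cosine schedule: fraction of positions re-masked after this step.
        n_mask = int(torch.cos(torch.tensor((step + 1) / steps) * torch.pi / 2) * seq_len)
        if n_mask > 0:
            idx = conf.topk(n_mask, largest=False).indices
            tokens.scatter_(1, idx, mask_id)           # re-mask least confident
    return tokens

# Toy usage with random logits standing in for a trained transformer.
dummy = lambda t: torch.randn(t.shape[0], t.shape[1], 1024)
print(maskgit_decode(dummy).shape)                     # torch.Size([1, 256])
```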
arXiv Detail & Related papers (2023-10-22T20:21:11Z)
- CageViT: Convolutional Activation Guided Efficient Vision Transformer [90.69578999760206]
This paper presents an efficient vision Transformer, called CageViT, that is guided by convolutional activation to reduce computation.
Our CageViT, unlike current Transformers, utilizes a new encoder to handle the rearranged tokens.
Experimental results demonstrate that the proposed CageViT outperforms the most recent state-of-the-art backbones by a large margin in terms of efficiency.
arXiv Detail & Related papers (2023-05-17T03:19:18Z)
- CoordFill: Efficient High-Resolution Image Inpainting via Parameterized Coordinate Querying [52.91778151771145]
In this paper, we break these resolution limitations for the first time, building on recent developments in continuous implicit representations.
Experiments show that the proposed method achieves real-time performance on 2048x2048 images using a single GTX 2080 Ti GPU.
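Continuous implicit representation here means pixels are produced by querying a coordinate-conditioned MLP, so the output resolution is simply the density of the query grid. The sketch below shows only the querying step; CoordFill additionally predicts the MLP's parameters per image from an encoder, which is omitted, and all layer sizes are assumptions.

```python
import torch
import torch.nn as nn

# Hedged sketch of coordinate querying: an MLP maps normalized (y, x) pixel
# coordinates to RGB, so any output resolution is a denser grid of queries.
class CoordMLP(nn.Module):
    def __init__(self, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, h, w):
        ys = torch.linspace(-1, 1, h)
        xs = torch.linspace(-1, 1, w)
        grid = torch.stack(torch.meshgrid(ys, xs, indexing="ij"), dim=-1)
        return self.net(grid.view(-1, 2)).view(h, w, 3)  # query every pixel

img = CoordMLP()(512, 512)            # resolution is chosen at query time
print(img.shape)                      # torch.Size([512, 512, 3])
```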
arXiv Detail & Related papers (2023-03-15T11:13:51Z)
- StraIT: Non-autoregressive Generation with Stratified Image Transformer [63.158996766036736]
Stratified Image Transformer (StraIT) is a pure non-autoregressive (NAR) generative model.
Our experiments demonstrate that StraIT significantly improves NAR generation and outperforms existing diffusion models (DMs) and AR methods.
arXiv Detail & Related papers (2023-03-01T18:59:33Z)
- CUF: Continuous Upsampling Filters [25.584630142930123]
In this paper, we consider one of the most important operations in image processing: upsampling.
We propose to parameterize upsampling kernels as neural fields.
This parameterization leads to a compact architecture that obtains a 40-fold reduction in the number of parameters when compared with competing arbitrary-scale super-resolution architectures.
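Parameterizing upsampling kernels as a neural field means a tiny MLP maps each sub-pixel offset to a filter, replacing per-scale tables of learned kernels and allowing arbitrary scale factors. A hedged sketch for integer scale factors; the kernel size, network width, and depthwise-convolution formulation are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hedged sketch of upsampling filters parameterized as a neural field.
class ContinuousUpsamplingFilter(nn.Module):
    def __init__(self, channels=64, ksize=3, hidden=64):
        super().__init__()
        self.channels, self.ksize = channels, ksize
        self.field = nn.Sequential(                 # offset -> filter weights
            nn.Linear(2, hidden), nn.ReLU(),
            nn.Linear(hidden, channels * ksize * ksize),
        )

    def forward(self, feat, scale):                 # feat: (B, C, H, W)
        b, c, h, w = feat.shape
        outs = []
        for dy in range(scale):                     # one filter per sub-pixel offset
            for dx in range(scale):
                offset = torch.tensor([dy / scale, dx / scale])
                k = self.field(offset).view(c, 1, self.ksize, self.ksize)
                outs.append(F.conv2d(feat, k, padding=self.ksize // 2, groups=c))
        out = torch.stack(outs, dim=2)              # (B, C, scale*scale, H, W)
        return F.pixel_shuffle(out.view(b, c * scale * scale, h, w), scale)

up = ContinuousUpsamplingFilter()
print(up(torch.randn(1, 64, 32, 32), scale=2).shape)  # (1, 64, 64, 64)
```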
arXiv Detail & Related papers (2022-10-13T12:45:51Z)
- PixelFolder: An Efficient Progressive Pixel Synthesis Network for Image Generation [88.55256389703082]
Pixel synthesis is a promising research paradigm for image generation that can exploit pixel-wise prior knowledge.
In this paper, we propose a progressive pixel synthesis network for efficient image generation, named PixelFolder.
With much lower expenditure, PixelFolder obtains new state-of-the-art (SOTA) performance on two benchmark datasets.
arXiv Detail & Related papers (2022-04-02T10:55:11Z)