An Image is Worth 32 Tokens for Reconstruction and Generation
- URL: http://arxiv.org/abs/2406.07550v1
- Date: Tue, 11 Jun 2024 17:59:56 GMT
- Title: An Image is Worth 32 Tokens for Reconstruction and Generation
- Authors: Qihang Yu, Mark Weber, Xueqing Deng, Xiaohui Shen, Daniel Cremers, Liang-Chieh Chen,
- Abstract summary: Transformer-based 1-Dimensional Tokenizer (TiTok) is an innovative approach that tokenizes images into 1D latent sequences.
TiTok achieves competitive performance to state-of-the-art approaches.
Our best-performing variant can significantly surpasses DiT-XL/2 (gFID 2.13 vs. 3.04) while still generating high-quality samples 74x faster.
- Score: 54.24414696392026
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advancements in generative models have highlighted the crucial role of image tokenization in the efficient synthesis of high-resolution images. Tokenization, which transforms images into latent representations, reduces computational demands compared to directly processing pixels and enhances the effectiveness and efficiency of the generation process. Prior methods, such as VQGAN, typically utilize 2D latent grids with fixed downsampling factors. However, these 2D tokenizations face challenges in managing the inherent redundancies present in images, where adjacent regions frequently display similarities. To overcome this issue, we introduce Transformer-based 1-Dimensional Tokenizer (TiTok), an innovative approach that tokenizes images into 1D latent sequences. TiTok provides a more compact latent representation, yielding substantially more efficient and effective representations than conventional techniques. For example, a 256 x 256 x 3 image can be reduced to just 32 discrete tokens, a significant reduction from the 256 or 1024 tokens obtained by prior methods. Despite its compact nature, TiTok achieves competitive performance to state-of-the-art approaches. Specifically, using the same generator framework, TiTok attains 1.97 gFID, outperforming MaskGIT baseline significantly by 4.21 at ImageNet 256 x 256 benchmark. The advantages of TiTok become even more significant when it comes to higher resolution. At ImageNet 512 x 512 benchmark, TiTok not only outperforms state-of-the-art diffusion model DiT-XL/2 (gFID 2.74 vs. 3.04), but also reduces the image tokens by 64x, leading to 410x faster generation process. Our best-performing variant can significantly surpasses DiT-XL/2 (gFID 2.13 vs. 3.04) while still generating high-quality samples 74x faster.
Related papers
- FlexTok: Resampling Images into 1D Token Sequences of Flexible Length [16.76602756308683]
We introduce FlexTok, a tokenizer that projects 2D images into variable-length, ordered 1D token sequences.
We evaluate our approach in an autoregressive generation setting using a simple GPT-style Transformer.
arXiv Detail & Related papers (2025-02-19T18:59:44Z) - SoftVQ-VAE: Efficient 1-Dimensional Continuous Tokenizer [45.720721058671856]
SoftVQ-VAE is a continuous image tokenizer that leverages soft categorical posteriors to aggregate multiple codewords into each latent token.
Our approach compresses 256x256 and 512x512 images using as few as 32 or 64 1-dimensional tokens.
Remarkably, SoftVQ-VAE improves inference throughput by up to 18x for generating 256x256 images and 55x for 512x512 images.
arXiv Detail & Related papers (2024-12-14T20:29:29Z) - Language-Guided Image Tokenization for Generation [63.0859685332583]
TexTok is a simple yet effective tokenization framework that leverages language to provide high-level semantics.
By conditioning the tokenization process on descriptive text captions, TexTok allows the tokenization process to focus on encoding fine-grained visual details into latent tokens.
TexTok with a vanilla DiT generator achieves state-of-the-art FID scores of 1.46 and 1.62 on ImageNet-256 and -512 respectively.
arXiv Detail & Related papers (2024-12-08T03:18:17Z) - 3D representation in 512-Byte:Variational tokenizer is the key for autoregressive 3D generation [4.221298212125194]
Variational Tokenizer (VAT) transforms unordered 3D data into compact latent tokens with an implicit hierarchy.
VAT enables scalable and efficient 3D generation, outperforming existing methods in quality, efficiency, and generalization.
arXiv Detail & Related papers (2024-12-03T06:31:25Z) - MaskBit: Embedding-free Image Generation via Bit Tokens [54.827480008982185]
We present an empirical and systematic examination of VQGANs, leading to a modernized VQGAN.
Second contribution demonstrates that embedding-free image generation using bit tokens achieves a new state-of-the-art FID of 1.52 on the ImageNet 256x256 benchmark.
arXiv Detail & Related papers (2024-09-24T16:12:12Z) - CoordFill: Efficient High-Resolution Image Inpainting via Parameterized
Coordinate Querying [52.91778151771145]
In this paper, we try to break the limitations for the first time thanks to the recent development of continuous implicit representation.
Experiments show that the proposed method achieves real-time performance on the 2048$times$2048 images using a single GTX 2080 Ti GPU.
arXiv Detail & Related papers (2023-03-15T11:13:51Z) - CUF: Continuous Upsampling Filters [25.584630142930123]
In this paper, we consider one of the most important operations in image processing: upsampling.
We propose to parameterize upsampling kernels as neural fields.
This parameterization leads to a compact architecture that obtains a 40-fold reduction in the number of parameters when compared with competing arbitrary-scale super-resolution architectures.
arXiv Detail & Related papers (2022-10-13T12:45:51Z) - PixelFolder: An Efficient Progressive Pixel Synthesis Network for Image
Generation [88.55256389703082]
Pixel is a promising research paradigm for image generation, which can well exploit pixel-wise prior knowledge for generation.
In this paper, we propose a progressive pixel synthesis network towards efficient image generation, as Pixel.
With much less expenditure, Pixel obtains new state-of-the-art (SOTA) performance on two benchmark datasets.
arXiv Detail & Related papers (2022-04-02T10:55:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.