Language-Guided Image Tokenization for Generation
- URL: http://arxiv.org/abs/2412.05796v1
- Date: Sun, 08 Dec 2024 03:18:17 GMT
- Title: Language-Guided Image Tokenization for Generation
- Authors: Kaiwen Zha, Lijun Yu, Alireza Fathi, David A. Ross, Cordelia Schmid, Dina Katabi, Xiuye Gu
- Abstract summary: TexTok is a simple yet effective tokenization framework that leverages language to provide high-level semantics.
By conditioning the tokenization process on descriptive text captions, TexTok allows the tokenization process to focus on encoding fine-grained visual details into latent tokens.
TexTok with a vanilla DiT generator achieves state-of-the-art FID scores of 1.46 and 1.62 on ImageNet-256 and -512 respectively.
- Abstract: Image tokenization, the process of transforming raw image pixels into a compact low-dimensional latent representation, has proven crucial for scalable and efficient image generation. However, mainstream image tokenization methods generally have limited compression rates, making high-resolution image generation computationally expensive. To address this challenge, we propose to leverage language for efficient image tokenization, and we call our method Text-Conditioned Image Tokenization (TexTok). TexTok is a simple yet effective tokenization framework that leverages language to provide high-level semantics. By conditioning the tokenization process on descriptive text captions, TexTok allows the tokenization process to focus on encoding fine-grained visual details into latent tokens, leading to enhanced reconstruction quality and higher compression rates. Compared to the conventional tokenizer without text conditioning, TexTok achieves average reconstruction FID improvements of 29.2% and 48.1% on ImageNet-256 and -512 benchmarks respectively, across varying numbers of tokens. These tokenization improvements consistently translate to 16.3% and 34.3% average improvements in generation FID. By simply replacing the tokenizer in Diffusion Transformer (DiT) with TexTok, our system can achieve a 93.5x inference speedup while still outperforming the original DiT using only 32 tokens on ImageNet-512. TexTok with a vanilla DiT generator achieves state-of-the-art FID scores of 1.46 and 1.62 on ImageNet-256 and -512 respectively. Furthermore, we demonstrate TexTok's superiority on the text-to-image generation task, effectively utilizing the off-the-shelf text captions in tokenization.
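The core mechanism described in the abstract, conditioning the tokenizer on caption embeddings so the latent tokens can spend their capacity on fine-grained visual detail, can be sketched in a few lines. The following is a minimal PyTorch illustration, not the authors' implementation: the module names (TextConditionedTokenizer, text_proj), the frozen-text-encoder input, and all shapes are assumptions for illustration.

```python
# Hedged sketch of text-conditioned tokenization: caption embeddings from a
# frozen text encoder are concatenated with image patch tokens and learnable
# latent tokens inside a ViT-style encoder; the latent slots become the tokens.
import torch
import torch.nn as nn

class TextConditionedTokenizer(nn.Module):
    def __init__(self, dim=768, num_latent_tokens=32, patch_size=16,
                 text_dim=512, depth=6):
        super().__init__()
        self.patchify = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.text_proj = nn.Linear(text_dim, dim)  # project caption embeddings
        self.latents = nn.Parameter(torch.randn(1, num_latent_tokens, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, image, text_emb):
        # image: (B, 3, H, W); text_emb: (B, T, text_dim) from a frozen text encoder
        patches = self.patchify(image).flatten(2).transpose(1, 2)  # (B, N, dim)
        text = self.text_proj(text_emb)                            # (B, T, dim)
        lat = self.latents.expand(image.size(0), -1, -1)           # (B, K, dim)
        x = self.encoder(torch.cat([patches, text, lat], dim=1))
        # keep only the latent slots: these are the image tokens
        return x[:, -lat.size(1):]                                 # (B, K, dim)
```

A symmetric decoder would receive the same caption embeddings alongside the latent tokens, which is what lets so few tokens (e.g. 32) suffice for reconstruction.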
Related papers
- FlexTok: Resampling Images into 1D Token Sequences of Flexible Length [16.76602756308683]
We introduce FlexTok, a tokenizer that projects 2D images into variable-length, ordered 1D token sequences.
We evaluate our approach in an autoregressive generation setting using a simple GPT-style Transformer.
arXiv Detail & Related papers (2025-02-19T18:59:44Z)
- SoftVQ-VAE: Efficient 1-Dimensional Continuous Tokenizer [45.720721058671856]
SoftVQ-VAE is a continuous image tokenizer that leverages soft categorical posteriors to aggregate multiple codewords into each latent token.
Our approach compresses 256x256 and 512x512 images using as few as 32 or 64 1-dimensional tokens.
Remarkably, SoftVQ-VAE improves inference throughput by up to 18x for generating 256x256 images and 55x for 512x512 images.
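Read literally, "soft categorical posteriors" suggests replacing hard nearest-codeword assignment with a softmax-weighted mixture of codewords. A minimal sketch under that reading (the squared-distance logits and the temperature parameter are assumptions, not details from the paper):

```python
import torch
import torch.nn.functional as F

def soft_quantize(z, codebook, temperature=1.0):
    # z: (B, K, D) encoder outputs; codebook: (V, D) learnable codewords.
    # A softmax over negative squared distances gives a categorical posterior
    # per token; each latent token is the posterior-weighted sum of codewords,
    # aggregating multiple codewords into one continuous token.
    cb = codebook.unsqueeze(0).expand(z.size(0), -1, -1)      # (B, V, D)
    logits = -torch.cdist(z, cb) ** 2 / temperature           # (B, K, V)
    posterior = F.softmax(logits, dim=-1)
    return posterior @ codebook                               # (B, K, D)
```

Because the mixture is differentiable, no straight-through estimator is needed, unlike hard vector quantization.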
arXiv Detail & Related papers (2024-12-14T20:29:29Z)
- An Image is Worth 32 Tokens for Reconstruction and Generation [54.24414696392026]
Transformer-based 1-Dimensional Tokenizer (TiTok) is an innovative approach that tokenizes images into 1D latent sequences.
TiTok achieves performance competitive with state-of-the-art approaches.
Our best-performing variant significantly surpasses DiT-XL/2 (gFID 2.13 vs. 3.04) while generating high-quality samples 74x faster.
arXiv Detail & Related papers (2024-06-11T17:59:56Z)
- Transformer based Pluralistic Image Completion with Reduced Information Loss [72.92754600354199]
Transformer-based methods have recently achieved great success in image inpainting.
Because they regard each pixel as a token, they suffer from information loss.
We propose a new transformer-based framework called "PUT".
arXiv Detail & Related papers (2024-03-31T01:20:16Z)
- A Layer-Wise Tokens-to-Token Transformer Network for Improved Historical Document Image Enhancement [13.27528507177775]
We propose T2T-BinFormer, a novel encoder-decoder architecture for document binarization based on a Tokens-to-Token vision transformer.
Experiments on various DIBCO and H-DIBCO benchmarks demonstrate that the proposed model outperforms the existing CNN and ViT-based state-of-the-art methods.
arXiv Detail & Related papers (2023-12-06T23:01:11Z)
- Perceptual Image Compression with Cooperative Cross-Modal Side Information [53.356714177243745]
We propose a novel deep image compression method with text-guided side information to achieve a better rate-perception-distortion tradeoff.
Specifically, we employ the CLIP text encoder and an effective Semantic-Spatial Aware block to fuse the text and image features.
arXiv Detail & Related papers (2023-11-23T08:31:11Z)
- Scaling Autoregressive Models for Content-Rich Text-to-Image Generation [95.02406834386814]
Parti treats text-to-image generation as a sequence-to-sequence modeling problem.
Parti uses a Transformer-based image tokenizer, ViT-VQGAN, to encode images as sequences of discrete tokens.
PartiPrompts (P2) is a new holistic benchmark of over 1600 English prompts.
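The sequence-to-sequence framing above amounts to: tokenize the caption, autoregressively decode discrete image-token ids with an encoder-decoder Transformer, then map the ids back to pixels with the image tokenizer's decoder. A hedged sketch with placeholder components (seq2seq and image_detokenizer are stand-ins, not Parti's API):

```python
import torch

@torch.no_grad()
def generate(text_ids, seq2seq, image_detokenizer, num_image_tokens=1024, bos_id=0):
    # text_ids: (B, T) caption token ids fed to the Transformer encoder.
    out = torch.full((text_ids.size(0), 1), bos_id, dtype=torch.long)
    for _ in range(num_image_tokens):
        logits = seq2seq(text_ids, out)                    # (B, S, vocab)
        next_id = logits[:, -1].argmax(-1, keepdim=True)   # greedy; sampling also works
        out = torch.cat([out, next_id], dim=1)
    # map the predicted discrete token ids back to an image
    return image_detokenizer(out[:, 1:])
```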
arXiv Detail & Related papers (2022-06-22T01:11:29Z)
- Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors [58.71128866226768]
Recent text-to-image generation methods have incrementally improved the generated image fidelity and text relevancy.
We propose a novel text-to-image method that addresses these gaps by enabling a simple control mechanism, complementary to text, in the form of a scene.
Our model achieves state-of-the-art FID and human evaluation results, unlocking the ability to generate high-fidelity images at a resolution of 512x512 pixels.
arXiv Detail & Related papers (2022-03-24T15:44:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.