CAT: Content-Adaptive Image Tokenization
- URL: http://arxiv.org/abs/2501.03120v1
- Date: Mon, 06 Jan 2025 16:28:47 GMT
- Title: CAT: Content-Adaptive Image Tokenization
- Authors: Junhong Shen, Kushal Tirumala, Michihiro Yasunaga, Ishan Misra, Luke Zettlemoyer, Lili Yu, Chunting Zhou
- Abstract summary: We introduce Content-Adaptive Tokenizer (CAT), which adjusts representation capacity based on the image content and encodes simpler images into fewer tokens.
We design a caption-based evaluation system that leverages large language models (LLMs) to predict content complexity and determine the optimal compression ratio for a given image.
By optimizing token allocation, CAT improves the FID score over fixed-ratio baselines trained with the same FLOPs and boosts inference throughput by 18.5%.
- Score: 92.2116487267877
- Abstract: Most existing image tokenizers encode images into a fixed number of tokens or patches, overlooking the inherent variability in image complexity. To address this, we introduce Content-Adaptive Tokenizer (CAT), which dynamically adjusts representation capacity based on the image content and encodes simpler images into fewer tokens. We design a caption-based evaluation system that leverages large language models (LLMs) to predict content complexity and determine the optimal compression ratio for a given image, taking into account factors critical to human perception. Trained on images with diverse compression ratios, CAT demonstrates robust performance in image reconstruction. We also utilize its variable-length latent representations to train Diffusion Transformers (DiTs) for ImageNet generation. By optimizing token allocation, CAT improves the FID score over fixed-ratio baselines trained with the same FLOPs and boosts inference throughput by 18.5%.
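The paper's pipeline is not included here; the sketch below only illustrates the idea of routing an image to a compression ratio from an LLM's caption-based complexity score. The prompt wording, thresholds, and ratio values are hypothetical assumptions, not CAT's actual configuration.

```python
# Illustrative sketch of caption-based complexity routing; the prompt,
# thresholds, and ratios are hypothetical, not taken from the paper.

def choose_compression_ratio(caption: str, llm_score) -> int:
    """Map an LLM-predicted complexity score in [0, 1] to a compression ratio.

    llm_score: a callable taking a prompt string and returning a float in [0, 1].
    """
    prompt = (
        "Rate the visual complexity of the image described below on a 0-1 "
        "scale, considering detail, texture, and any text content.\n" + caption
    )
    score = llm_score(prompt)
    # Simpler content tolerates heavier compression (fewer latent tokens).
    if score < 0.3:
        return 32   # aggressive compression for simple images
    if score < 0.7:
        return 16
    return 8        # complex image: keep more tokens
```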
Related papers
- One-D-Piece: Image Tokenizer Meets Quality-Controllable Compression [1.7942265700058988]
We introduce One-D-Piece, a discrete image tokenizer for variable-length tokenization.
It introduces a regularization mechanism named "Tail Token Drop" into discrete one-dimensional image tokenizers.
We evaluate our tokenizer across multiple reconstruction quality metrics and find that it delivers significantly better perceptual quality than existing quality-controllable compression methods.
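One-D-Piece's exact training schedule is not reproduced here; a minimal sketch of the tail-drop idea, assuming random prefix truncation of the 1D token sequence during training:

```python
import torch

def tail_token_drop(tokens: torch.Tensor, min_keep: int = 32) -> torch.Tensor:
    """Keep a random prefix of a 1D token sequence during training (sketch).

    tokens: (batch, length, dim) latents from a one-dimensional tokenizer.
    Truncating the tail forces earlier tokens to carry the most information,
    so shorter prefixes still decode to reasonable reconstructions.
    """
    length = tokens.shape[1]
    keep = int(torch.randint(min_keep, length + 1, (1,)))
    return tokens[:, :keep, :]
```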
arXiv Detail & Related papers (2025-01-17T09:29:33Z) - ImageFolder: Autoregressive Image Generation with Folded Tokens [51.815319504939396]
Increasing token length is a common approach to improve the image reconstruction quality.
There exists a trade-off between reconstruction and generation quality regarding token length.
We propose ImageFolder, a semantic tokenizer that provides spatially aligned image tokens that can be folded during autoregressive modeling.
arXiv Detail & Related papers (2024-10-02T17:06:39Z) - DiT: Efficient Vision Transformers with Dynamic Token Routing [37.808078064528374]
We propose a data-dependent token routing strategy that constructs routing paths for image tokens in a Dynamic Vision Transformer, dubbed DiT.
The proposed framework generates a data-dependent path per token, adapting to the object scales and visual discrimination of tokens.
In experiments, our DiT achieves superior performance and more favorable complexity/accuracy trade-offs than many SoTA methods on ImageNet classification, object detection, instance segmentation, and semantic segmentation.
arXiv Detail & Related papers (2023-08-07T08:55:48Z) - HAAV: Hierarchical Aggregation of Augmented Views for Image Captioning [25.728621355173626]
We propose to regard the encodings as augmented views of the input image.
The image captioning model efficiently encodes each view independently with a shared encoder.
We demonstrate significant performance improvements of +5.6% CIDEr on MS-COCO and +12.9% CIDEr on Flickr30k compared to the state of the art.
arXiv Detail & Related papers (2023-05-25T17:50:17Z) - DynaST: Dynamic Sparse Transformer for Exemplar-Guided Image Generation [56.514462874501675]
We propose a dynamic sparse attention based Transformer model to achieve fine-level matching with favorable efficiency.
The heart of our approach is a novel dynamic-attention unit dedicated to covering the variation in the optimal number of tokens each position should attend to.
Experiments on three applications, pose-guided person image generation, edge-based face synthesis, and undistorted image style transfer, demonstrate that DynaST achieves superior performance in local details.
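DynaST predicts a per-position token budget; the sketch below fixes k for simplicity and only illustrates the underlying top-k sparse attention (shapes and names are assumptions, not the paper's implementation):

```python
import torch

def topk_sparse_attention(q, k, v, kmax: int):
    """Attend only to each query's top-k keys (fixed k for illustration).

    q: (batch, Nq, dim), k and v: (batch, Nk, dim).
    """
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5    # (b, Nq, Nk)
    topv, topi = scores.topk(kmax, dim=-1)                   # (b, Nq, kmax)
    attn = topv.softmax(dim=-1)
    # Gather the selected value vectors for each query position.
    v_exp = v.unsqueeze(1).expand(-1, q.shape[1], -1, -1)    # (b, Nq, Nk, d)
    idx = topi.unsqueeze(-1).expand(-1, -1, -1, v.shape[-1])
    selected = torch.gather(v_exp, 2, idx)                   # (b, Nq, kmax, d)
    return (attn.unsqueeze(-1) * selected).sum(dim=2)        # (b, Nq, d)
```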
arXiv Detail & Related papers (2022-07-13T11:12:03Z) - XCiT: Cross-Covariance Image Transformers [73.33400159139708]
We propose a "transposed" version of self-attention that operates across feature channels rather than tokens.
The resulting cross-covariance attention (XCA) has linear complexity in the number of tokens, and allows efficient processing of high-resolution images.
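A minimal PyTorch sketch of the channel-wise attention described above; the L2 normalization and per-head temperature follow common XCA implementations and are stated here as assumptions rather than verified against the paper's code:

```python
import torch
import torch.nn.functional as F

def xca(q, k, v, temperature: float = 1.0):
    """Cross-covariance attention sketch: attend over channels, not tokens.

    q, k, v: (batch, heads, dim_head, tokens). The attention map is
    (dim_head x dim_head), so the cost is linear in the number of tokens.
    """
    q = F.normalize(q, dim=-1)                      # L2-normalize along tokens
    k = F.normalize(k, dim=-1)
    attn = (q @ k.transpose(-2, -1)) * temperature  # (b, h, d, d)
    attn = attn.softmax(dim=-1)
    return attn @ v                                 # (b, h, d, tokens)
```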
arXiv Detail & Related papers (2021-06-17T17:33:35Z) - Scalable Visual Transformers with Hierarchical Pooling [61.05787583247392]
We propose a Hierarchical Visual Transformer (HVT) which progressively pools visual tokens to shrink the sequence length.
This makes it possible to scale depth/width/resolution/patch size without introducing extra computational complexity.
Our HVT outperforms the competitive baselines on ImageNet and CIFAR-100 datasets.
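A minimal sketch of pooling a token sequence between transformer stages; HVT's exact pooling operator and configuration are assumptions here:

```python
import torch.nn as nn

class TokenPool(nn.Module):
    """Halve a visual token sequence with 1D max pooling (illustrative)."""

    def __init__(self, kernel: int = 3, stride: int = 2):
        super().__init__()
        self.pool = nn.MaxPool1d(kernel, stride=stride, padding=kernel // 2)

    def forward(self, x):                # x: (batch, tokens, dim)
        # MaxPool1d expects (batch, channels, length), so pool over tokens.
        return self.pool(x.transpose(1, 2)).transpose(1, 2)
```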
arXiv Detail & Related papers (2021-03-19T03:55:58Z) - Generating Images with Sparse Representations [21.27273495926409]
The high dimensionality of images presents architecture and sampling-efficiency challenges for likelihood-based generative models.
We present an alternative approach, inspired by common image compression methods like JPEG, that converts images to quantized discrete cosine transform (DCT) blocks.
We propose a Transformer-based autoregressive architecture, which is trained to sequentially predict the conditional distribution of the next element in such sequences.
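A JPEG-style sketch of the DCT-block conversion described above, assuming 8x8 blocks and uniform quantization; the paper's actual pipeline and quantization tables differ:

```python
import numpy as np
from scipy.fft import dctn

def image_to_dct_blocks(img: np.ndarray, block: int = 8, q: float = 16.0):
    """Convert a grayscale image into quantized 8x8 DCT blocks (JPEG-style).

    img: (H, W) array with H and W divisible by `block`. A single uniform
    quantization step `q` stands in for real quantization tables.
    """
    h, w = img.shape
    blocks = []
    for i in range(0, h, block):
        for j in range(0, w, block):
            coeffs = dctn(img[i:i + block, j:j + block], norm="ortho")
            blocks.append(np.round(coeffs / q).astype(np.int32))
    return np.stack(blocks)              # (num_blocks, block, block) integers
```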
arXiv Detail & Related papers (2021-03-05T17:56:03Z) - Visual Transformers: Token-based Image Representation and Processing for Computer Vision [67.55770209540306]
Visual Transformer (VT) operates in a semantic token space, judiciously attending to different image parts based on context.
Using an advanced training recipe, our VTs significantly outperform their convolutional counterparts.
For semantic segmentation on LIP and COCO-stuff, VT-based feature pyramid networks (FPN) achieve 0.35 points higher mIoU while reducing the FPN module's FLOPs by 6.5x.
arXiv Detail & Related papers (2020-06-05T20:49:49Z)