Adaptive Length Image Tokenization via Recurrent Allocation
- URL: http://arxiv.org/abs/2411.02393v1
- Date: Mon, 04 Nov 2024 18:58:01 GMT
- Title: Adaptive Length Image Tokenization via Recurrent Allocation
- Authors: Shivam Duggal, Phillip Isola, Antonio Torralba, William T. Freeman
- Abstract summary: Current vision systems assign fixed-length representations to images, regardless of the information content.
Inspired by this, we propose an approach to learn variable-length token representations for 2D images.
- Score: 81.10081670396956
- License:
- Abstract: Current vision systems typically assign fixed-length representations to images, regardless of the information content. This contrasts with human intelligence - and even large language models - which allocate varying representational capacities based on entropy, context and familiarity. Inspired by this, we propose an approach to learn variable-length token representations for 2D images. Our encoder-decoder architecture recursively processes 2D image tokens, distilling them into 1D latent tokens over multiple iterations of recurrent rollouts. Each iteration refines the 2D tokens, updates the existing 1D latent tokens, and adaptively increases representational capacity by adding new tokens. This enables compression of images into a variable number of tokens, ranging from 32 to 256. We validate our tokenizer using reconstruction loss and FID metrics, demonstrating that token count aligns with image entropy, familiarity and downstream task requirements. Recurrent token processing with increasing representational capacity in each iteration shows signs of token specialization, revealing potential for object / part discovery.
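As a toy illustration of the allocation idea (not the paper's actual encoder-decoder), the recurrent rollout can be mimicked with principal components standing in for learned 1D latent tokens: each iteration grants more capacity, and allocation stops once reconstruction is good enough, so simpler images end up with fewer tokens. The function name, the PCA stand-in, and the stopping tolerance are all illustrative assumptions; only the 32-to-256 budget comes from the abstract.

```python
import numpy as np

def allocate_tokens(image_tokens, min_tokens=32, step=32,
                    max_tokens=256, tol=0.05):
    """Toy sketch of recurrent token allocation. Principal components
    stand in for learned 1D latent tokens: each 'rollout' adds `step`
    more of them, stopping once the relative reconstruction error
    drops below `tol` or the budget `max_tokens` is exhausted."""
    u, s, vt = np.linalg.svd(image_tokens, full_matrices=False)
    total = np.linalg.norm(image_tokens)
    k = min_tokens
    while True:
        # Reconstruct from the first k components only.
        recon = (u[:, :k] * s[:k]) @ vt[:k]
        err = np.linalg.norm(image_tokens - recon) / total
        if err < tol or k >= max_tokens:
            return k, err
        k += step
```

Under this proxy, a near-rank-one "simple" image is reconstructed within tolerance at the 32-token floor, while an i.i.d.-noise "complex" image consumes far more of the budget, loosely mirroring the paper's observation that token count tracks image entropy.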
Related papers
- FlexTok: Resampling Images into 1D Token Sequences of Flexible Length [16.76602756308683]
We introduce FlexTok, a tokenizer that projects 2D images into variable-length, ordered 1D token sequences.
We evaluate our approach in an autoregressive generation setting using a simple GPT-style Transformer.
arXiv Detail & Related papers (2025-02-19T18:59:44Z)
- CAT: Content-Adaptive Image Tokenization [92.2116487267877]
We introduce Content-Adaptive Tokenizer (CAT), which adjusts representation capacity based on the image content and encodes simpler images into fewer tokens.
We design a caption-based evaluation system that leverages large language models (LLMs) to predict content complexity and determine the optimal compression ratio for a given image.
By optimizing token allocation, CAT improves the FID score over fixed-ratio baselines trained with the same flops and boosts the inference throughput by 18.5%.
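CAT's LLM-predicted content complexity can be viewed as selecting among a few fixed compression ratios. A minimal sketch of such an allocation rule follows; the thresholds and token budgets are purely illustrative and do not come from the paper:

```python
def tokens_for_complexity(score, budgets=(64, 256, 1024)):
    """Map an LLM-predicted complexity score in [0, 1] to one of a few
    fixed token budgets. Thresholds and budgets are illustrative only."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("complexity score must lie in [0, 1]")
    if score < 1 / 3:
        return budgets[0]   # simple image: compress aggressively
    if score < 2 / 3:
        return budgets[1]   # moderate detail
    return budgets[2]       # complex image: spend more tokens
```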
arXiv Detail & Related papers (2025-01-06T16:28:47Z)
- Spectral Image Tokenizer [21.84385276311364]
Image tokenizers map images to sequences of discrete tokens.
We propose to tokenize the image spectrum instead, obtained from a discrete wavelet transform (DWT)
We evaluate the tokenizer metrics as multiscale image generation, text-guided image upsampling and editing.
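A one-level 2D Haar transform, the simplest DWT, shows the kind of subband decomposition such a tokenizer operates on. This is a generic Haar step under standard definitions, not the paper's actual wavelet or tokenization scheme:

```python
import numpy as np

def haar_dwt2(img):
    """One level of a 2D Haar DWT on an image with even height and width:
    returns the low-frequency approximation and three detail subbands."""
    a = (img[0::2] + img[1::2]) / 2.0      # row averages
    d = (img[0::2] - img[1::2]) / 2.0      # row differences
    ll = (a[:, 0::2] + a[:, 1::2]) / 2.0   # approximation
    lh = (a[:, 0::2] - a[:, 1::2]) / 2.0   # horizontal detail
    hl = (d[:, 0::2] + d[:, 1::2]) / 2.0   # vertical detail
    hh = (d[:, 0::2] - d[:, 1::2]) / 2.0   # diagonal detail
    return ll, lh, hl, hh
```

Recursing on `ll` yields a coarse-to-fine pyramid; ordering tokens from the coarsest subband outward is what makes multiscale generation and upsampling natural in this setting.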
arXiv Detail & Related papers (2024-12-12T18:59:31Z)
- ImageFolder: Autoregressive Image Generation with Folded Tokens [51.815319504939396]
Increasing token length is a common approach to improve the image reconstruction quality.
There exists a trade-off between reconstruction and generation quality regarding token length.
We propose ImageFolder, a semantic tokenizer that provides spatially aligned image tokens that can be folded during autoregressive modeling.
arXiv Detail & Related papers (2024-10-02T17:06:39Z)
- SA$^2$VP: Spatially Aligned-and-Adapted Visual Prompt [59.280491260635266]
Methods for visual prompt tuning follow the sequential modeling paradigm stemming from NLP.
The SA$^2$VP model learns a two-dimensional prompt token map of equal (or scaled) size relative to the image token map.
Our model can conduct individual prompting for different image tokens in a fine-grained manner.
arXiv Detail & Related papers (2023-12-16T08:23:43Z)
- ADDP: Learning General Representations for Image Recognition and Generation with Alternating Denoising Diffusion Process [94.41510903676837]
We propose an Alternating Denoising Diffusion Process (ADDP) that integrates two spaces within a single representation learning framework.
In each denoising step, our method first decodes pixels from previous VQ tokens, then generates new VQ tokens from the decoded pixels.
The learned representations can be used to generate diverse high-fidelity images and also demonstrate excellent transfer performance on recognition tasks.
arXiv Detail & Related papers (2023-06-08T17:59:32Z)
- Improving Visual Quality of Image Synthesis by A Token-based Generator with Transformers [51.581926074686535]
We present a new perspective of achieving image synthesis by viewing this task as a visual token generation problem.
The proposed TokenGAN has achieved state-of-the-art results on several widely-used image synthesis benchmarks.
arXiv Detail & Related papers (2021-11-05T12:57:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.