Images are Worth Variable Length of Representations
- URL: http://arxiv.org/abs/2506.03643v2
- Date: Thu, 05 Jun 2025 10:20:34 GMT
- Title: Images are Worth Variable Length of Representations
- Authors: Lingjun Mao, Rodolfo Corona, Xin Liang, Wenhao Yan, Zineng Tang,
- Abstract summary: Most vision encoders map images into a fixed-length sequence of tokens, overlooking the fact that different images contain varying amounts of information. We propose DOVE, a dynamic vision encoder that produces a variable number of visual tokens to reconstruct each image. Our results show that DOVE significantly reduces the average number of tokens while maintaining high reconstruction quality.
- Score: 13.136831256070343
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Most existing vision encoders map images into a fixed-length sequence of tokens, overlooking the fact that different images contain varying amounts of information. For example, a visually complex image (e.g., a cluttered room) inherently carries more information and thus deserves more tokens than a simple image (e.g., a blank wall). To address this inefficiency, we propose DOVE, a dynamic vision encoder that produces a variable number of visual tokens (i.e., continuous representation vectors) to reconstruct each image. Our results show that DOVE significantly reduces the average number of tokens while maintaining high reconstruction quality. In several linear probing and downstream multimodal tasks, it outperforms existing autoencoder-based tokenization methods when using far fewer tokens, capturing more expressive semantic features compared to fixed-length encoding. We further extend DOVE with query-conditioned tokenization. By guiding the model to focus on query-relevant regions, it achieves more efficient and targeted semantic extraction. Our code and checkpoints are available at https://dove-encoder.github.io/dove-encoder.
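The core idea of the abstract, allocating more tokens to visually complex images than to simple ones, can be sketched in a few lines. The following is an illustrative toy, not DOVE's actual mechanism (DOVE learns when to stop emitting tokens; here a hand-crafted gradient-magnitude heuristic stands in for learned complexity, and random vectors stand in for a learned encoder's continuous representations):

```python
import numpy as np

def token_budget(image, min_tokens=16, max_tokens=256):
    """Map a crude complexity proxy (mean gradient magnitude) to a token count.

    Hypothetical heuristic for illustration only: flat images get the
    minimum budget, busy images approach the maximum.
    """
    gy, gx = np.gradient(image.astype(np.float64))
    complexity = np.hypot(gx, gy).mean()       # ~0 for flat images
    frac = min(complexity / 0.5, 1.0)          # clamp proxy to [0, 1]
    return int(min_tokens + frac * (max_tokens - min_tokens))

def encode(image, dim=32, seed=0):
    """Emit a variable-length sequence of continuous token vectors."""
    rng = np.random.default_rng(seed)
    n = token_budget(image)
    # Random vectors stand in for a learned encoder's outputs.
    return rng.standard_normal((n, dim))

blank = np.zeros((64, 64))                           # "blank wall"
clutter = np.random.default_rng(1).random((64, 64))  # "cluttered room"
print(encode(blank).shape, encode(clutter).shape)
```

The blank image receives the minimum budget of 16 tokens, while the cluttered one receives a far larger sequence, mirroring the paper's motivating example of a blank wall versus a cluttered room.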
Related papers
- Composable Visual Tokenizers with Generator-Free Diagnostics of Learnability [30.139325285692568]
We introduce CompTok, a training framework for learning visual tokenizers whose tokens are enhanced for compositionality. By employing an InfoGAN-style objective, we train a recognition model to predict the tokens used to condition a diffusion decoder. Experiments show that CompTok improves on both metrics while also supporting state-of-the-art generators for class-conditioned generation.
arXiv Detail & Related papers (2026-02-03T10:02:51Z)
- Improving Flexible Image Tokenizers for Autoregressive Image Generation [53.238708824055664]
ReToK is a flexible tokenizer with Redundant Token Padding and Hierarchical Semantic Regularization. Our method achieves superior generation performance compared with both flexible and fixed-length tokenizers.
arXiv Detail & Related papers (2026-01-04T14:11:45Z)
- TrimTokenator-LC: Towards Adaptive Visual Token Pruning for Large Multimodal Models with Long Contexts [6.465999214817427]
A growing number of visual tokens greatly increases inference cost. Visual token pruning has emerged as a promising solution. Our approach can remove up to 80% of visual tokens while maintaining performance in long-context settings.
arXiv Detail & Related papers (2025-12-28T02:40:56Z)
- TokenAR: Multiple Subject Generation via Autoregressive Token-level enhancement [87.82338951215131]
TokenAR is a simple but effective token-level enhancement mechanism that addresses the reference identity confusion problem. Instruct Token Injection serves as an extra visual-feature container, injecting detailed and complementary priors for reference tokens. The identity-token disentanglement strategy (ITD) explicitly guides the token representations toward independently representing the features of each identity.
arXiv Detail & Related papers (2025-10-18T03:36:26Z)
- Token-Shuffle: Towards High-Resolution Image Generation with Autoregressive Models [92.18057318458528]
Token-Shuffle is a novel method that reduces the number of image tokens in the Transformer. Our strategy requires no additional pretrained text encoder and enables MLLMs to support extremely high-resolution image synthesis. On the GenAI benchmark, our 2.7B model achieves a 0.77 overall score on hard prompts, outperforming the AR model LlamaGen by 0.18 and the diffusion model LDM by 0.15.
arXiv Detail & Related papers (2025-04-24T17:59:56Z)
- Adaptive Length Image Tokenization via Recurrent Allocation [81.10081670396956]
Current vision systems assign fixed-length representations to images, regardless of the information content.
Inspired by this, we propose an approach to learn variable-length token representations for 2D images.
arXiv Detail & Related papers (2024-11-04T18:58:01Z)
- ElasticTok: Adaptive Tokenization for Image and Video [109.75935878130582]
We introduce ElasticTok, a method that conditions on prior frames to adaptively encode a frame into a variable number of tokens. During inference, ElasticTok can dynamically allocate tokens as needed. Our evaluations on images and video demonstrate the effectiveness of our approach in efficient token usage.
arXiv Detail & Related papers (2024-10-10T20:54:15Z)
- ImageFolder: Autoregressive Image Generation with Folded Tokens [51.815319504939396]
Increasing token length is a common approach to improving image reconstruction quality. However, there is a trade-off between reconstruction and generation quality with respect to token length. We propose ImageFolder, a semantic tokenizer that provides spatially aligned image tokens that can be folded during autoregressive modeling.
arXiv Detail & Related papers (2024-10-02T17:06:39Z)
- TCFormer: Visual Recognition via Token Clustering Transformer [79.24723479088097]
We propose the Token Clustering Transformer (TCFormer), which generates dynamic vision tokens based on semantic meaning.
Our dynamic tokens possess two crucial characteristics: (1) representing image regions with similar semantic meanings using the same vision token, even if those regions are not adjacent, and (2) concentrating on regions with valuable details and representing them using fine tokens.
arXiv Detail & Related papers (2024-07-16T02:26:18Z)
- HAAV: Hierarchical Aggregation of Augmented Views for Image Captioning [25.728621355173626]
We propose to regard the encodings as augmented views of the input image.
The image captioning model encodes each view independently with a shared encoder efficiently.
We demonstrate significant performance improvements of +5.6% CIDEr on MS-COCO and +12.9% CIDEr on Flickr30k compared to the state of the art.
arXiv Detail & Related papers (2023-05-25T17:50:17Z)
- SparseFormer: Sparse Visual Recognition via Limited Latent Tokens [30.494412497158237]
We present a new method, coined SparseFormer, to imitate the sparsity of human visual recognition in an end-to-end manner.
SparseFormer circumvents most dense operations in image space and has much lower computational cost.
Experiments on the ImageNet classification benchmark dataset show that SparseFormer achieves performance on par with canonical or well-established models.
arXiv Detail & Related papers (2023-04-07T17:59:58Z)
- Accurate Image Restoration with Attention Retractable Transformer [50.05204240159985]
We propose Attention Retractable Transformer (ART) for image restoration.
ART presents both dense and sparse attention modules in the network.
We conduct extensive experiments on image super-resolution, denoising, and JPEG compression artifact reduction tasks.
arXiv Detail & Related papers (2022-10-04T07:35:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.