Images are Worth Variable Length of Representations
- URL: http://arxiv.org/abs/2506.03643v2
- Date: Thu, 05 Jun 2025 10:20:34 GMT
- Title: Images are Worth Variable Length of Representations
- Authors: Lingjun Mao, Rodolfo Corona, Xin Liang, Wenhao Yan, Zineng Tang,
- Abstract summary: Most vision encoders map images into a fixed-length sequence of tokens, overlooking the fact that different images contain varying amounts of information. We propose DOVE, a dynamic vision encoder that produces a variable number of visual tokens to reconstruct each image. Our results show that DOVE significantly reduces the average number of tokens while maintaining high reconstruction quality.
- Score: 13.136831256070343
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Most existing vision encoders map images into a fixed-length sequence of tokens, overlooking the fact that different images contain varying amounts of information. For example, a visually complex image (e.g., a cluttered room) inherently carries more information and thus deserves more tokens than a simple image (e.g., a blank wall). To address this inefficiency, we propose DOVE, a dynamic vision encoder that produces a variable number of visual tokens (i.e., continuous representation vectors) to reconstruct each image. Our results show that DOVE significantly reduces the average number of tokens while maintaining high reconstruction quality. In several linear probing and downstream multimodal tasks, it outperforms existing autoencoder-based tokenization methods when using far fewer tokens, capturing more expressive semantic features compared to fixed-length encoding. We further extend DOVE with query-conditioned tokenization. By guiding the model to focus on query-relevant regions, it achieves more efficient and targeted semantic extraction. Our code and checkpoints are available at https://dove-encoder.github.io/dove-encoder.
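The abstract does not spell out how DOVE chooses its token budget, so the following is only a loose sketch of variable-length tokenization: a set of learned queries cross-attends to patch features, and an ACT-style halting head truncates the sequence earlier for simpler images. Every name here (`VariableLengthEncoder`, `halt_head`, the halting rule) is a hypothetical stand-in, not DOVE's actual design.

```python
# Hypothetical sketch of variable-length image tokenization (PyTorch).
import torch
import torch.nn as nn

class VariableLengthEncoder(nn.Module):
    def __init__(self, dim=256, max_tokens=64, patch=16):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.queries = nn.Parameter(torch.randn(max_tokens, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.halt_head = nn.Linear(dim, 1)  # per-token halting score

    def forward(self, images):
        b = images.size(0)
        feats = self.patch_embed(images).flatten(2).transpose(1, 2)  # (B, P, D)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)              # (B, T, D)
        tokens, _ = self.attn(q, feats, feats)                       # (B, T, D)
        halt = torch.sigmoid(self.halt_head(tokens)).squeeze(-1)     # (B, T)
        # ACT-style rule: keep tokens until the cumulative halting mass
        # reaches 1; complex images should accumulate mass more slowly.
        keep = (halt.cumsum(dim=1) < 1.0).float()
        return tokens * keep.unsqueeze(-1), keep.sum(dim=1)

enc = VariableLengthEncoder()
tokens, lengths = enc(torch.randn(2, 3, 224, 224))
print(tokens.shape, lengths)  # (2, 64, 256) plus per-image token counts
```

In a real system the halting head would be trained jointly with a reconstruction decoder, so that the kept prefix suffices to rebuild the image.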
Related papers
- Token-Shuffle: Towards High-Resolution Image Generation with Autoregressive Models [92.18057318458528]
Token-Shuffle is a novel method that reduces the number of image tokens in Transformers. Our strategy requires no additional pretrained text encoder and enables MLLMs to support extremely high-resolution image synthesis. On the GenAI benchmark, our 2.7B model achieves a 0.77 overall score on hard prompts, outperforming the AR model LlamaGen by 0.18 and the diffusion model LDM by 0.15.
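As a loose illustration of the token-reduction idea, the sketch below merges each local window of visual tokens along the channel dimension before the Transformer and restores them afterwards, in the spirit of pixel-shuffle; the function names and the 2x2 window are assumptions, not the paper's exact operations.

```python
# Illustrative token shuffle/unshuffle: fewer, wider tokens for the Transformer.
import torch

def token_shuffle(tokens, h, w, s=2):
    """(B, h*w, D) -> (B, h*w/s^2, D*s^2): fuse each s x s window channel-wise."""
    b, _, d = tokens.shape
    x = tokens.view(b, h, w, d)
    x = x.view(b, h // s, s, w // s, s, d).permute(0, 1, 3, 2, 4, 5)
    return x.reshape(b, (h // s) * (w // s), d * s * s)

def token_unshuffle(tokens, h, w, s=2):
    """Inverse: (B, h*w/s^2, D*s^2) -> (B, h*w, D)."""
    b, _, ds = tokens.shape
    d = ds // (s * s)
    x = tokens.view(b, h // s, w // s, s, s, d).permute(0, 1, 3, 2, 4, 5)
    return x.reshape(b, h * w, d)

x = torch.randn(2, 16 * 16, 64)
y = token_shuffle(x, 16, 16)                  # (2, 64, 256): 4x fewer tokens
assert torch.allclose(token_unshuffle(y, 16, 16), x)
```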
arXiv Detail & Related papers (2025-04-24T17:59:56Z)
- Adaptive Length Image Tokenization via Recurrent Allocation [81.10081670396956]
Current vision systems assign fixed-length representations to images, regardless of the information content.
Motivated by this, we propose an approach to learn variable-length token representations for 2D images.
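A toy sketch of the recurrent-allocation pattern follows: tokens are added in rounds, and encoding stops once a decoder reconstructs the input well enough, so simple images end up with fewer tokens. `encoder_step` and `decoder` below are trivial stand-ins (hypothetical), not the paper's learned modules.

```python
# Toy recurrent token allocation: add tokens until reconstruction is good enough.
import torch
import torch.nn.functional as F

def recurrent_allocate(feat, encoder_step, decoder, max_rounds=8, tol=0.01):
    """feat: (B, P, D) patch features -> variable-length token set."""
    tokens = feat[:, :0]                                  # start with zero tokens
    for _ in range(max_rounds):
        tokens = torch.cat([tokens, encoder_step(feat, tokens)], dim=1)
        err = F.mse_loss(decoder(tokens, feat.size(1)), feat)
        if err < tol:                                     # easy inputs exit early
            break
    return tokens

# Trivial stand-ins so the sketch executes: "encode" copies the next 8 patch
# features, "decode" zero-pads back to full length.
step = lambda feat, toks: feat[:, toks.size(1): toks.size(1) + 8]
dec = lambda toks, p: F.pad(toks, (0, 0, 0, p - toks.size(1)))
out = recurrent_allocate(torch.randn(2, 64, 32), step, dec)
print(out.shape)  # token count depends on how fast the error drops
```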
arXiv Detail & Related papers (2024-11-04T18:58:01Z)
- ElasticTok: Adaptive Tokenization for Image and Video [109.75935878130582]
We introduce ElasticTok, a method that conditions on prior frames to adaptively encode a frame into a variable number of tokens. During inference, ElasticTok can dynamically allocate tokens when needed. Our evaluations on images and video demonstrate the effectiveness of our approach in efficient token usage.
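One way to realize such elasticity (sketched loosely here, as an assumption about the mechanism): train with random tail-masking so that every token prefix remains a valid code, then at inference search for the shortest prefix whose reconstruction meets a threshold.

```python
# Sketch: tail-masked training plus shortest-sufficient-prefix inference.
import torch
import torch.nn.functional as F

def random_tail_mask(tokens):
    """Zero out a random-length tail so every prefix is a valid code."""
    b, t, _ = tokens.shape
    keep = torch.randint(1, t + 1, (b,))                  # per-sample prefix length
    mask = torch.arange(t).unsqueeze(0) < keep.unsqueeze(1)
    return tokens * mask.unsqueeze(-1), keep

@torch.no_grad()
def shortest_sufficient_prefix(tokens, decode, target, tol=0.02):
    for k in range(1, tokens.size(1) + 1):
        prefix = F.pad(tokens[:, :k], (0, 0, 0, tokens.size(1) - k))
        if (decode(prefix) - target).pow(2).mean() < tol:
            return k                                      # fewer tokens for easy frames
    return tokens.size(1)

toks = torch.randn(1, 16, 8)
masked, lens = random_tail_mask(toks)
k = shortest_sufficient_prefix(toks, lambda z: z, toks)   # identity "decoder"
print(k)  # 16 here, since only the full prefix matches the target exactly
```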
arXiv Detail & Related papers (2024-10-10T20:54:15Z)
- ImageFolder: Autoregressive Image Generation with Folded Tokens [51.815319504939396]
Increasing token length is a common approach to improving image reconstruction quality, but there is a trade-off between reconstruction and generation quality with respect to token length. We propose ImageFolder, a semantic tokenizer that provides spatially aligned image tokens that can be folded during autoregressive modeling.
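The summary leaves the folding mechanics implicit; as a rough, hypothetical illustration, two spatially aligned token streams can share one autoregressive position by summing their embeddings, halving the sequence the AR model must process.

```python
# Rough illustration of "folding" two aligned token streams into one AR step.
import torch
import torch.nn as nn

sem = torch.randint(0, 1024, (2, 64))     # semantic codebook indices (toy)
det = torch.randint(0, 1024, (2, 64))     # detail codebook indices (toy)

emb_sem = nn.Embedding(1024, 128)
emb_det = nn.Embedding(1024, 128)

folded = emb_sem(sem) + emb_det(det)      # (2, 64, 128): one position per pair
print(folded.shape)                       # the AR model sees 64 steps, not 128
```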
arXiv Detail & Related papers (2024-10-02T17:06:39Z)
- TCFormer: Visual Recognition via Token Clustering Transformer [79.24723479088097]
We propose the Token Clustering Transformer (TCFormer), which generates dynamic vision tokens based on semantic meaning.
Our dynamic tokens possess two crucial characteristics: (1) representing image regions with similar semantic meanings using the same vision token, even if those regions are not adjacent, and (2) concentrating on regions with valuable details and representing them using fine tokens.
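A toy illustration of semantic token clustering (not TCFormer's actual module): run a few k-means steps on patch features and average each cluster into one dynamic token, so similar but non-adjacent regions share a representation.

```python
# Toy k-means clustering of patch features into dynamic tokens.
import torch

def cluster_tokens(feats, k=8, iters=10):
    """feats: (P, D) patch features -> (k, D) clustered tokens."""
    centers = feats[torch.randperm(feats.size(0))[:k]]
    for _ in range(iters):
        assign = torch.cdist(feats, centers).argmin(dim=1)  # nearest center
        for c in range(k):
            members = feats[assign == c]
            if members.numel():
                centers[c] = members.mean(dim=0)            # update center
    return centers

tokens = cluster_tokens(torch.randn(196, 64))
print(tokens.shape)  # torch.Size([8, 64]): 196 patches -> 8 dynamic tokens
```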
arXiv Detail & Related papers (2024-07-16T02:26:18Z)
- HAAV: Hierarchical Aggregation of Augmented Views for Image Captioning [25.728621355173626]
We propose to regard the encodings as augmented views of the input image.
The image captioning model efficiently encodes each view independently with a shared encoder.
We demonstrate significant performance improvements of +5.6% CIDEr on MS-COCO and +12.9% CIDEr on Flickr30k over the state of the art.
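A hedged sketch of the shared-encoder aggregation pattern: each encoding is treated as a view, passed through one weight-shared encoder, and mixed with learned per-view weights. The module names and the softmax mix are illustrative assumptions, not HAAV's exact hierarchy.

```python
# Shared encoder over multiple views, aggregated with learned weights.
import torch
import torch.nn as nn

class ViewAggregator(nn.Module):
    def __init__(self, dim=256, n_views=3):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.shared_encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.view_weights = nn.Parameter(torch.zeros(n_views))  # learned mix

    def forward(self, views):                 # list of (B, L, D) view encodings
        encoded = [self.shared_encoder(v) for v in views]       # shared weights
        w = torch.softmax(self.view_weights, dim=0)
        return sum(wi * vi for wi, vi in zip(w, encoded))       # (B, L, D)

agg = ViewAggregator()
out = agg([torch.randn(2, 49, 256) for _ in range(3)])
print(out.shape)  # torch.Size([2, 49, 256])
```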
arXiv Detail & Related papers (2023-05-25T17:50:17Z)
- SparseFormer: Sparse Visual Recognition via Limited Latent Tokens [30.494412497158237]
We present a new method, coined SparseFormer, that imitates human sparse visual recognition in an end-to-end manner.
SparseFormer circumvents most dense operations in image space and has much lower computational cost.
Experiments on the ImageNet classification benchmark dataset show that SparseFormer achieves performance on par with canonical or well-established models.
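As a rough analogue (a Perceiver-style stand-in, not SparseFormer itself), recognition with a limited latent-token budget can look like this: a few dozen latent tokens cross-attend to patch features, so compute scales with the latent count rather than the full grid.

```python
# Recognition with a small latent-token budget via cross-attention.
import torch
import torch.nn as nn

class LatentTokenRecognizer(nn.Module):
    def __init__(self, dim=192, n_latents=49, n_classes=1000):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(n_latents, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=6, batch_first=True)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, patch_feats):                    # (B, P, D), P can be large
        q = self.latents.unsqueeze(0).expand(patch_feats.size(0), -1, -1)
        z, _ = self.cross_attn(q, patch_feats, patch_feats)
        return self.head(z.mean(dim=1))                # cost scales with n_latents

model = LatentTokenRecognizer()
logits = model(torch.randn(2, 196, 192))
print(logits.shape)  # torch.Size([2, 1000])
```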
arXiv Detail & Related papers (2023-04-07T17:59:58Z)
- Accurate Image Restoration with Attention Retractable Transformer [50.05204240159985]
We propose Attention Retractable Transformer (ART) for image restoration.
ART presents both dense and sparse attention modules in the network.
We conduct extensive experiments on image super-resolution, denoising, and JPEG compression artifact reduction tasks.
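The dense/sparse pairing can be sketched with two grouping functions (toy shapes, assumed names): "dense" attention runs within local windows, while "sparse" attention groups tokens at strided positions so attention can reach across the image.

```python
# Grouping logic only: local windows (dense) vs. strided groups (sparse).
import torch

def window_groups(x, h, w, win=4):
    """Dense: (B, h*w, D) -> (B*nw, win*win, D) local windows."""
    b, _, d = x.shape
    x = x.view(b, h // win, win, w // win, win, d).permute(0, 1, 3, 2, 4, 5)
    return x.reshape(-1, win * win, d)

def strided_groups(x, h, w, stride=4):
    """Sparse: tokens at the same (i % stride, j % stride) attend together."""
    b, _, d = x.shape
    x = x.view(b, h // stride, stride, w // stride, stride, d)
    x = x.permute(0, 2, 4, 1, 3, 5)                    # group by phase
    return x.reshape(-1, (h // stride) * (w // stride), d)

x = torch.randn(2, 16 * 16, 64)
print(window_groups(x, 16, 16).shape, strided_groups(x, 16, 16).shape)
# both (32, 16, 64): local windows vs. long-range strided groups
```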
arXiv Detail & Related papers (2022-10-04T07:35:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.