Single-pass Adaptive Image Tokenization for Minimum Program Search
- URL: http://arxiv.org/abs/2507.07995v1
- Date: Thu, 10 Jul 2025 17:59:53 GMT
- Title: Single-pass Adaptive Image Tokenization for Minimum Program Search
- Authors: Shivam Duggal, Sanghyun Byun, William T. Freeman, Antonio Torralba, Phillip Isola
- Abstract summary: We propose a single-pass adaptive tokenizer, KARL, which predicts the appropriate number of tokens for an image in a single forward pass. KARL matches the performance of recent adaptive tokenizers while operating in a single pass.
- Score: 75.59409288259151
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: According to Algorithmic Information Theory (AIT), intelligent representations compress data into the shortest possible program that can reconstruct its content, exhibiting low Kolmogorov Complexity (KC). In contrast, most visual representation learning systems use fixed-length representations for all inputs, ignoring variations in complexity or familiarity. Recent adaptive tokenization methods address this by allocating variable-length representations but typically require test-time search over multiple encodings to find the most predictive one. Inspired by Kolmogorov Complexity principles, we propose a single-pass adaptive tokenizer, KARL, which predicts the appropriate number of tokens for an image in a single forward pass, halting once its approximate KC is reached. The token count serves as a proxy for the minimum description length. KARL's training procedure closely resembles the Upside-Down Reinforcement Learning paradigm, as it learns to conditionally predict token halting based on a desired reconstruction quality. KARL matches the performance of recent adaptive tokenizers while operating in a single pass. We present scaling laws for KARL, analyzing the role of encoder/decoder size, continuous vs. discrete tokenization, and more. Additionally, we offer a conceptual study drawing an analogy between Adaptive Image Tokenization and Algorithmic Information Theory, examining the predicted image complexity (KC) across axes such as structure vs. noise and in- vs. out-of-distribution familiarity, revealing alignment with human intuition.
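The abstract describes one concrete interface: a single forward pass that emits latent tokens together with a predicted halting point, conditioned on a desired reconstruction quality (in the spirit of Upside-Down RL), with the resulting token count serving as a proxy for minimum description length. Below is a minimal sketch of that interface, not the authors' implementation; all module names, dimensions, and the 0.5 halting threshold are assumptions for illustration.

```python
# Minimal sketch of a single-pass adaptive tokenizer: one encoder pass produces a fixed
# maximum set of latent tokens plus per-token halting scores, conditioned on a target
# reconstruction quality; only tokens before the predicted halting point are kept.
# All names, sizes, and the 0.5 threshold are assumptions, not the paper's code.
import torch
import torch.nn as nn


class SinglePassAdaptiveTokenizer(nn.Module):
    def __init__(self, dim=256, max_tokens=64, patch=16, img_size=128):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.latent_tokens = nn.Parameter(torch.randn(max_tokens, dim))
        self.quality_embed = nn.Linear(1, dim)  # conditions the pass on desired reconstruction quality
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.halt_head = nn.Linear(dim, 1)      # per-token halting score

    def forward(self, images, target_quality):
        b = images.shape[0]
        patches = self.patch_embed(images).flatten(2).transpose(1, 2)          # (B, P, D)
        quality = self.quality_embed(target_quality.view(b, 1)).unsqueeze(1)   # (B, 1, D)
        latents = self.latent_tokens.unsqueeze(0).expand(b, -1, -1)            # (B, T, D)
        x = self.encoder(torch.cat([quality, patches, latents], dim=1))
        tokens = x[:, -latents.shape[1]:]                                      # (B, T, D)
        halt_prob = torch.sigmoid(self.halt_head(tokens)).squeeze(-1)          # (B, T)
        # Keep tokens up to the first position whose halting probability crosses 0.5;
        # the resulting count acts as a proxy for the image's minimum description length.
        keep = (halt_prob > 0.5).long().cumsum(dim=1) == 0                     # (B, T) bool
        return tokens * keep.unsqueeze(-1), keep.sum(dim=1)


# Example: two images, each encoded in one forward pass at a different target quality.
model = SinglePassAdaptiveTokenizer()
tokens, counts = model(torch.randn(2, 3, 128, 128), torch.tensor([0.9, 0.5]))
print(counts)  # predicted token count per image
```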
Related papers
- "Principal Components" Enable A New Language of Images [79.45806370905775]
We introduce a novel visual tokenization framework that embeds a provable PCA-like structure into the latent token space. Our approach achieves state-of-the-art reconstruction performance and enables better interpretability to align with the human vision system.
arXiv Detail & Related papers (2025-03-11T17:59:41Z) - One-D-Piece: Image Tokenizer Meets Quality-Controllable Compression [1.7942265700058988]
We introduce One-D-Piece, a discrete image tokenizer for variable-length tokenization. It incorporates a regularization mechanism named "Tail Token Drop" into discrete one-dimensional image tokenizers. We evaluate our tokenizer across multiple reconstruction quality metrics and find that it delivers significantly better perceptual quality than existing quality-controllable compression methods.
arXiv Detail & Related papers (2025-01-17T09:29:33Z) - CAT: Content-Adaptive Image Tokenization [92.2116487267877]
We introduce Content-Adaptive Tokenizer (CAT), which adjusts representation capacity based on the image content and encodes simpler images into fewer tokens. We design a caption-based evaluation system that leverages large language models (LLMs) to predict content complexity and determine the optimal compression ratio for a given image. By optimizing token allocation, CAT improves the FID score over fixed-ratio baselines trained with the same FLOPs and boosts the inference throughput by 18.5%.
arXiv Detail & Related papers (2025-01-06T16:28:47Z) - Inference Optimal VLMs Need Fewer Visual Tokens and More Parameters [54.01228554126122]
Vision Language Models (VLMs) have demonstrated strong capabilities across various visual understanding and reasoning tasks. To reduce inference costs, one can either downsize the Large Language Models (LLMs) or reduce the number of input tokens needed to represent the image. We take the first steps toward designing token compression algorithms tailored for high-compression settings.
arXiv Detail & Related papers (2024-11-05T18:54:21Z) - Subobject-level Image Tokenization [60.80949852899857]
Patch-based image tokenization ignores the morphology of the visual world. Inspired by subword tokenization, we introduce subobject-level adaptive token segmentation. We show that subobject tokenization enables faster convergence and better generalization while using fewer visual tokens.
arXiv Detail & Related papers (2024-02-22T06:47:44Z) - Minimum Description Length and Generalization Guarantees for Representation Learning [16.2444595840653]
This paper presents a framework that allows us to derive upper bounds on the generalization error of a representation learning algorithm.
Rather than the mutual information between the encoder's input and the representation, our new bounds involve the "multi-letter" relative entropy.
To the best of the authors' knowledge, the established generalization bounds are the first of their kind for Information Bottleneck (IB) type encoders and representation learning.
arXiv Detail & Related papers (2024-02-05T18:12:28Z) - Keyword Spotting Simplified: A Segmentation-Free Approach using Character Counting and CTC re-scoring [8.6134769826665]
Recent advances in segmentation-free keyword spotting approach this problem within an object detection paradigm.
We propose a novel segmentation-free system that efficiently scans a document image to find rectangular areas that include the query information.
arXiv Detail & Related papers (2023-08-07T12:11:04Z) - A Learning Framework for Diffeomorphic Image Registration based on Quasi-conformal Geometry [1.2891210250935146]
We propose the quasi-conformal registration network (QCRegNet), an unsupervised learning framework, to obtain diffeomorphic 2D image registrations.
QCRegNet consists of the estimator network and the Beltrami solver network (BSNet). Results show that the registration accuracy is comparable to state-of-the-art methods and that diffeomorphism is largely guaranteed.
arXiv Detail & Related papers (2021-10-20T14:23:24Z) - SAC: Accelerating and Structuring Self-Attention via Sparse Adaptive Connection [51.376723069962]
We present a method for accelerating and structuring self-attention: Sparse Adaptive Connection (SAC).
In SAC, we regard the input sequence as a graph, and attention operations are performed between linked nodes.
We show that SAC is competitive with state-of-the-art models while significantly reducing memory cost.
arXiv Detail & Related papers (2020-03-22T07:58:44Z)
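The SAC entry above summarizes the core idea as treating the input sequence as a graph and attending only between linked nodes. A minimal sketch of such adjacency-masked attention follows; SAC's learned edge-prediction policy is omitted, so the adjacency matrix is simply taken as given, and all names here are illustrative rather than the paper's implementation.

```python
# Adjacency-masked attention: scores between unlinked nodes are masked out before the
# softmax, so each node attends only to its graph neighbours. The edge-selection step
# that SAC learns is not modelled here; the adjacency matrix is assumed to be supplied.
import torch
import torch.nn.functional as F


def graph_masked_attention(q, k, v, adjacency):
    """q, k, v: (B, N, D); adjacency: (B, N, N) boolean, True where two nodes are linked."""
    n = adjacency.shape[-1]
    # Keep self-loops so every node attends to at least itself (avoids empty rows).
    adjacency = adjacency | torch.eye(n, dtype=torch.bool, device=adjacency.device)
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)   # (B, N, N)
    scores = scores.masked_fill(~adjacency, float("-inf"))    # attend only along edges
    return F.softmax(scores, dim=-1) @ v
```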