Composable Visual Tokenizers with Generator-Free Diagnostics of Learnability
- URL: http://arxiv.org/abs/2602.03339v1
- Date: Tue, 03 Feb 2026 10:02:51 GMT
- Title: Composable Visual Tokenizers with Generator-Free Diagnostics of Learnability
- Authors: Bingchen Zhao, Qiushan Guo, Ye Wang, Yixuan Huang, Zhonghua Zhai, Yu Tian,
- Abstract summary: We introduce CompTok, a training framework for learning visual tokenizers whose tokens are enhanced for compositionality. By employing an InfoGAN-style objective, we train a recognition model to predict the tokens used to condition a diffusion decoder. We show in experiments that CompTok improves on both metrics while supporting state-of-the-art generators for class-conditioned generation.
- Score: 30.139325285692568
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce CompTok, a training framework for learning visual tokenizers whose tokens are enhanced for compositionality. CompTok uses a token-conditioned diffusion decoder. By employing an InfoGAN-style objective, in which a recognition model is trained to predict, from the decoded images, the tokens used to condition the diffusion decoder, we force the decoder not to ignore any of the tokens. To promote compositional control, besides the original images, CompTok also trains on tokens formed by swapping token subsets between images, giving the tokens more compositional control over the decoder. As the swapped tokens have no ground-truth image targets, we apply a manifold constraint via an adversarial flow regularizer to keep unpaired swap generations on the natural-image distribution. The resulting tokenizer not only achieves state-of-the-art performance on class-conditioned image generation, but also supports operations such as swapping tokens between images to achieve high-level semantic editing. Additionally, we propose two metrics that measure the landscape of the token space, describing not only the compositionality of the tokens but also how easily a generator can be trained on this space. We show in experiments that CompTok improves on both metrics while supporting state-of-the-art generators for class-conditioned generation.
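The training signals described in the abstract can be sketched in a few lines. All module interfaces (`encoder`, `decoder`, `recognizer`) and the random half-swap scheme below are illustrative assumptions, not the paper's actual implementation; the adversarial flow regularizer is noted but omitted.

```python
import torch
import torch.nn.functional as F

def comptok_losses(encoder, decoder, recognizer, images):
    """Sketch of the InfoGAN-style token-recognition loss and the
    token-swap augmentation described in the abstract (assumed APIs)."""
    tokens = encoder(images)           # (B, N, D) tokens per image
    recon = decoder(tokens)            # token-conditioned decode

    # InfoGAN-style objective: a recognition model must recover the
    # conditioning tokens from the decoded image, so the decoder
    # cannot ignore any token.
    pred_tokens = recognizer(recon)
    info_loss = F.mse_loss(pred_tokens, tokens)

    # Token-swap augmentation: exchange a random subset of token slots
    # between shuffled image pairs to promote compositional control.
    perm = torch.randperm(images.size(0))
    mask = (torch.rand(1, tokens.size(1), 1) < 0.5).to(tokens)
    swapped = mask * tokens[perm] + (1 - mask) * tokens
    swap_recon = decoder(swapped)

    # Swapped tokens have no ground-truth target; the paper applies an
    # adversarial flow regularizer here to keep swap_recon on the
    # natural-image manifold (omitted in this sketch).
    return recon, info_loss, swap_recon
```

With an autoencoder that reconstructs perfectly, the recognition loss vanishes, which is the intended fixed point of the objective.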
Related papers
- Improving Flexible Image Tokenizers for Autoregressive Image Generation [53.238708824055664]
ReToK is a flexible tokenizer with Redundant Token Padding and Hierarchical Semantic Regularization. Our method achieves superior generation performance compared with both flexible and fixed-length tokenizers.
arXiv Detail & Related papers (2026-01-04T14:11:45Z) - Switchable Token-Specific Codebook Quantization For Face Image Compression [72.44596412563503]
We propose a Switchable Token-Specific Codebook Quantization for face image compression. By recording the codebook group to which each token belongs with a small number of bits, our method can reduce the loss incurred when decreasing the size of each codebook group. Our method has demonstrated its effectiveness on face recognition datasets, achieving an average accuracy of 93.51% for reconstructed images at 0.05 bpp.
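The rate accounting implied by this scheme can be sketched as back-of-the-envelope arithmetic: each token spends bits to name its codebook group plus bits for the code index within that group. The configuration numbers in the note below are illustrative assumptions, not the paper's actual settings.

```python
import math

def bits_per_pixel(n_tokens, n_groups, group_codebook_size, height, width):
    """Illustrative rate for token-specific codebooks (assumed setup):
    per token, log2(n_groups) bits identify the group and
    log2(group_codebook_size) bits identify the code within it."""
    bits_per_token = math.log2(n_groups) + math.log2(group_codebook_size)
    return n_tokens * bits_per_token / (height * width)
```

For example, 256 tokens with 4 groups of 512 codes each on a 256x256 image give 256 x (2 + 9) / 65536, roughly 0.043 bpp, in the neighborhood of the 0.05 bpp regime the abstract reports.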
arXiv Detail & Related papers (2025-10-27T02:56:17Z) - Hita: Holistic Tokenizer for Autoregressive Image Generation [56.81871174745175]
We introduce Hita, a novel image tokenizer for autoregressive (AR) image generation. It introduces a holistic-to-local tokenization scheme with learnable holistic queries and local patch tokens.
arXiv Detail & Related papers (2025-07-03T06:44:26Z) - Highly Compressed Tokenizer Can Generate Without Training [0.5033155053523042]
1D image tokenizers represent images as highly compressed one-dimensional sequences of as few as 32 discrete tokens. We find that the high degree of compression achieved by a 1D tokenizer with vector quantization enables image editing and generative capabilities. Our approach is demonstrated for inpainting and text-guided image editing use cases, and can generate diverse and realistic samples without requiring training of any generative model.
arXiv Detail & Related papers (2025-06-09T21:45:03Z) - ImageFolder: Autoregressive Image Generation with Folded Tokens [51.815319504939396]
Increasing token length is a common approach to improve the image reconstruction quality. There exists a trade-off between reconstruction and generation quality regarding token length. We propose ImageFolder, a semantic tokenizer that provides spatially aligned image tokens that can be folded during autoregressive modeling.
arXiv Detail & Related papers (2024-10-02T17:06:39Z) - Don't Look into the Dark: Latent Codes for Pluralistic Image Inpainting [8.572133295533643]
We present a method for large-mask pluralistic image inpainting based on the generative framework of discrete latent codes.
Our method learns latent priors, discretized as tokens, by only performing computations at the visible locations of the image.
arXiv Detail & Related papers (2024-03-27T01:28:36Z) - TokenCompose: Text-to-Image Diffusion with Token-level Supervision [43.307556249485366]
TokenCompose is a Latent Diffusion Model for text-to-image generation.
It achieves enhanced consistency between user-specified text prompts and model-generated images.
arXiv Detail & Related papers (2023-12-06T17:13:15Z) - Token-Label Alignment for Vision Transformers [93.58540411138164]
Data mixing strategies (e.g., CutMix) have shown the ability to greatly improve the performance of convolutional neural networks (CNNs).
We identify a token fluctuation phenomenon that has suppressed the potential of data mixing strategies.
We propose a token-label alignment (TL-Align) method to trace the correspondence between transformed tokens and the original tokens to maintain a label for each token.
arXiv Detail & Related papers (2022-10-12T17:54:32Z) - Improving Visual Quality of Image Synthesis by A Token-based Generator with Transformers [51.581926074686535]
We present a new perspective of achieving image synthesis by viewing this task as a visual token generation problem.
The proposed TokenGAN has achieved state-of-the-art results on several widely-used image synthesis benchmarks.
arXiv Detail & Related papers (2021-11-05T12:57:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.