Hita: Holistic Tokenizer for Autoregressive Image Generation
- URL: http://arxiv.org/abs/2507.02358v4
- Date: Fri, 11 Jul 2025 09:06:39 GMT
- Title: Hita: Holistic Tokenizer for Autoregressive Image Generation
- Authors: Anlin Zheng, Haochen Wang, Yucheng Zhao, Weipeng Deng, Tiancai Wang, Xiangyu Zhang, Xiaojuan Qi
- Abstract summary: We introduce Hita, a novel image tokenizer for autoregressive (AR) image generation. It introduces a holistic-to-local tokenization scheme with learnable holistic queries and local patch tokens.
- Score: 56.81871174745175
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vanilla autoregressive image generation models generate visual tokens step-by-step, limiting their ability to capture holistic relationships among token sequences. Moreover, because most visual tokenizers map local image patches into latent tokens, global information is limited. To address this, we introduce Hita, a novel image tokenizer for autoregressive (AR) image generation. It introduces a holistic-to-local tokenization scheme with learnable holistic queries and local patch tokens. Hita incorporates two key strategies to better align with the AR generation process: 1) arranging a sequential structure with holistic tokens at the beginning, followed by patch-level tokens, and using causal attention to maintain awareness of previous tokens; and 2) adopting a lightweight fusion module before feeding the de-quantized tokens into the decoder to control information flow and prioritize holistic tokens. Extensive experiments show that Hita accelerates the training speed of AR generators and outperforms those trained with vanilla tokenizers, achieving 2.59 FID and 281.9 IS on the ImageNet benchmark. Detailed analysis of the holistic representation highlights its ability to capture global image properties, such as textures, materials, and shapes. Additionally, Hita also demonstrates effectiveness in zero-shot style transfer and image in-painting. The code is available at https://github.com/CVMI-Lab/Hita.
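The first strategy in the abstract (holistic tokens placed first, patch tokens after, with causal attention) can be illustrated with a small sketch. This is not the authors' implementation: the function names `arrange_tokens` and `causal_mask`, the token counts, and the string placeholders standing in for token embeddings are all assumptions made for illustration.

```python
def arrange_tokens(holistic, patches):
    """Build one sequence: holistic tokens first, then patch-level tokens."""
    return list(holistic) + list(patches)


def causal_mask(seq_len):
    """Boolean mask where position i may attend only to positions 0..i."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


# Hypothetical sizes; string labels stand in for real token embeddings.
holistic = [f"h{i}" for i in range(4)]
patches = [f"p{i}" for i in range(16)]

sequence = arrange_tokens(holistic, patches)
mask = causal_mask(len(sequence))

# Tokens visible to the first patch token under the causal mask.
first_patch = len(holistic)
visible = [tok for tok, ok in zip(sequence, mask[first_patch]) if ok]
print(visible)  # ['h0', 'h1', 'h2', 'h3', 'p0']
```

Under this ordering every patch token can attend to all holistic tokens, while no holistic token can attend to any patch token. This is one way to read the abstract's phrase "maintain awareness of previous tokens".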
Related papers
- Composable Visual Tokenizers with Generator-Free Diagnostics of Learnability [30.139325285692568]
We introduce CompTok, a training framework for learning visual tokenizers whose tokens are enhanced for compositionality. By employing an InfoGAN-style objective, we train a recognition model to predict the tokens used to condition a diffusion decoder. We show in experiments that CompTok can improve on both of the metrics as well as supporting state-of-the-art generators for class conditioned generation.
arXiv Detail & Related papers (2026-02-03T10:02:51Z)
- ResTok: Learning Hierarchical Residuals in 1D Visual Tokenizers for Autoregressive Image Generation [64.84095852784714]
Residual Tokenizer (ResTok) is a 1D visual tokenizer that builds hierarchical residuals for both image tokens and latent tokens. We show that restoring hierarchical residual priors in visual tokenization significantly improves AR image generation, achieving a gFID of 2.34 on ImageNet-256 with only 9 sampling steps.
arXiv Detail & Related papers (2026-01-07T14:09:18Z)
- Improving Flexible Image Tokenizers for Autoregressive Image Generation [53.238708824055664]
ReToK is a flexible tokenizer with Redundant Token Padding and Hierarchical Semantic Regularization. Our method achieves superior generation performance compared with both flexible and fixed-length tokenizers.
arXiv Detail & Related papers (2026-01-04T14:11:45Z)
- TokenAR: Multiple Subject Generation via Autoregressive Token-level enhancement [87.82338951215131]
TokenAR is a simple but effective token-level enhancement mechanism to address reference identity confusion problem. Instruct Token Injection plays as a role of extra visual feature container to inject detailed and complementary priors for reference tokens. The identity-token disentanglement strategy (ITD) explicitly guides the token representations toward independently representing the features of each identity.
arXiv Detail & Related papers (2025-10-18T03:36:26Z)
- Group Critical-token Policy Optimization for Autoregressive Image Generation [32.472222192052044]
Key obstacle lies in how to identify more critical image tokens during AR generation and implement effective token-wise optimization for them. We identify the critical tokens in RLVR-based AR generation from three perspectives, specifically: (1) Causal dependency: early tokens fundamentally determine the later tokens and final image effect due to unidirectional dependency; (2) Entropy-induced spatial structure: tokens with high entropy gradients correspond to image structure and bridge distinct visual regions. Experiments on multiple text-to-image benchmarks for both AR models and unified multimodal models demonstrate the effectiveness
arXiv Detail & Related papers (2025-09-26T15:33:18Z)
- Token-Shuffle: Towards High-Resolution Image Generation with Autoregressive Models [92.18057318458528]
Token-Shuffle is a novel method that reduces the number of image tokens in Transformer. Our strategy requires no additional pretrained text-encoder and enables MLLMs to support extremely high-resolution image synthesis. In GenAI-benchmark, our 2.7B model achieves 0.77 overall score on hard prompts, outperforming AR models LlamaGen by 0.18 and diffusion models LDM by 0.15.
arXiv Detail & Related papers (2025-04-24T17:59:56Z)
- PAINT: Paying Attention to INformed Tokens to Mitigate Hallucination in Large Vision-Language Model [0.0]
Hallucinations often arise from the progressive weakening of attention weight to visual tokens. PAINT (Paying Attention to INformed Tokens) is a plug-and-play framework that intervenes in the self-attention mechanism of the Large Vision Language Models.
arXiv Detail & Related papers (2025-01-21T15:22:31Z)
- SweetTok: Semantic-Aware Spatial-Temporal Tokenizer for Compact Video Discretization [20.109136454526233]
We propose SweetTok, a novel video tokenizer to overcome the limitations in current video tokenization methods. SweetTok compresses visual inputs through distinct spatial and temporal queries via Decoupled Query AutoEncoder (DQAE). We show that SweetTok significantly improves video reconstruction results by 42.8% w.r.t. rFVD on UCF-101 dataset.
arXiv Detail & Related papers (2024-12-11T13:48:06Z)
- OmniTokenizer: A Joint Image-Video Tokenizer for Visual Generation [95.29102596532854]
Tokenizer serves as a translator to map the intricate visual data into a compact latent space.
This paper presents OmniTokenizer, a transformer-based tokenizer for joint image and video tokenization.
arXiv Detail & Related papers (2024-06-13T17:59:26Z)
- SA$^2$VP: Spatially Aligned-and-Adapted Visual Prompt [59.280491260635266]
Methods for visual prompt tuning follow the sequential modeling paradigm stemming from NLP.
The SA$^2$VP model learns a two-dimensional prompt token map with equal (or scaled) size to the image token map.
Our model can conduct individual prompting for different image tokens in a fine-grained manner.
arXiv Detail & Related papers (2023-12-16T08:23:43Z)
- Character-Centric Story Visualization via Visual Planning and Token Alignment [53.44760407148918]
Story visualization advances the traditional text-to-image generation by enabling multiple image generation based on a complete story.
Key challenge of consistent story visualization is to preserve characters that are essential in stories.
We propose to adapt a recent work that augments Vector-Quantized Variational Autoencoders with a text-to-visual-token architecture.
arXiv Detail & Related papers (2022-10-16T06:50:39Z)
- Improving Visual Quality of Image Synthesis by A Token-based Generator with Transformers [51.581926074686535]
We present a new perspective of achieving image synthesis by viewing this task as a visual token generation problem.
The proposed TokenGAN has achieved state-of-the-art results on several widely-used image synthesis benchmarks.
arXiv Detail & Related papers (2021-11-05T12:57:50Z)