Related papers: ResTok: Learning Hierarchical Residuals in 1D Visual Tokenizers for Autoregressive Image Generation

ResTok: Learning Hierarchical Residuals in 1D Visual Tokenizers for Autoregressive Image Generation

URL: http://arxiv.org/abs/2601.03955v1
Date: Wed, 07 Jan 2026 14:09:18 GMT
Title: ResTok: Learning Hierarchical Residuals in 1D Visual Tokenizers for Autoregressive Image Generation
Authors: Xu Zhang, Cheng Da, Huan Yang, Kun Gai, Ming Lu, Zhan Ma,
Abstract summary: Residual Tokenizer (ResTok) is a 1D visual tokenizer that builds hierarchical residuals for both image tokens and latent tokens.<n>We show that restoring hierarchical residual priors in visual tokenization significantly improves AR image generation, achieving a gFID of 2.34 on ImageNet-256 with only 9 sampling steps.
Score: 64.84095852784714
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Existing 1D visual tokenizers for autoregressive (AR) generation largely follow the design principles of language modeling, as they are built directly upon transformers whose priors originate in language, yielding single-hierarchy latent tokens and treating visual data as flat sequential token streams. However, this language-like formulation overlooks key properties of vision, particularly the hierarchical and residual network designs that have long been essential for convergence and efficiency in visual models. To bring "vision" back to vision, we propose the Residual Tokenizer (ResTok), a 1D visual tokenizer that builds hierarchical residuals for both image tokens and latent tokens. The hierarchical representations obtained through progressively merging enable cross-level feature fusion at each layer, substantially enhancing representational capacity. Meanwhile, the semantic residuals between hierarchies prevent information overlap, yielding more concentrated latent distributions that are easier for AR modeling. Cross-level bindings consequently emerge without any explicit constraints. To accelerate the generation process, we further introduce a hierarchical AR generator that substantially reduces sampling steps by predicting an entire level of latent tokens at once rather than generating them strictly token-by-token. Extensive experiments demonstrate that restoring hierarchical residual priors in visual tokenization significantly improves AR image generation, achieving a gFID of 2.34 on ImageNet-256 with only 9 sampling steps. Code is available at https://github.com/Kwai-Kolors/ResTok.

Related papers

SFTok: Bridging the Performance Gap in Discrete Tokenizers [72.9996757048065]
We propose textbfSFTok, a discrete tokenizer that incorporates a multi-step iterative mechanism for precise reconstruction.<n>At a high compression rate of only 64 tokens per image, SFTok achieves state-of-the-art reconstruction quality on ImageNet.
arXiv Detail & Related papers (2025-12-18T18:59:04Z)
Learning to Expand Images for Efficient Visual Autoregressive Modeling [26.400433163290586]
We introduce Expanding Autoregressive Representation (EAR), a novel generation paradigm that emulates the human visual system's center-outward perception pattern.<n>EAR unfolds image tokens in a spiral order from the center and progressively expands outward, preserving spatial continuity and enabling efficient parallel decoding.
arXiv Detail & Related papers (2025-11-19T14:55:07Z)
TokenAR: Multiple Subject Generation via Autoregressive Token-level enhancement [87.82338951215131]
TokenAR is a simple but effective token-level enhancement mechanism to address reference identity confusion problem.<n>Instruct Token Injection plays as a role of extra visual feature container to inject detailed and complementary priors for reference tokens.<n>The identity-token disentanglement strategy (ITD) explicitly guides the token representations toward independently representing the features of each identity.
arXiv Detail & Related papers (2025-10-18T03:36:26Z)
Hierarchical Masked Autoregressive Models with Low-Resolution Token Pivots [103.48424042986271]
We introduce a new autoregressive design to model a hierarchy from a few low-resolution image tokens to the typical dense image tokens.<n>We present a Hierarchical Masked Autoregressive models (Hi-MAR) that pivot on low-resolution image tokens to trigger hierarchical autoregressive modeling.
arXiv Detail & Related papers (2025-05-26T17:59:07Z)
Plug-and-Play Context Feature Reuse for Efficient Masked Generation [36.563229330549284]
Masked generative models (MGMs) have emerged as a powerful framework for image synthesis.<n>We introduce ReCAP (Reused Context-Aware Prediction), a plug-and-play module that accelerates inference in MGMs.
arXiv Detail & Related papers (2025-05-25T10:57:35Z)
Token-Shuffle: Towards High-Resolution Image Generation with Autoregressive Models [92.18057318458528]
Token-Shuffle is a novel method that reduces the number of image tokens in Transformer.<n>Our strategy requires no additional pretrained text-encoder and enables MLLMs to support extremely high-resolution image synthesis.<n>In GenAI-benchmark, our 2.7B model achieves 0.77 overall score on hard prompts, outperforming AR models LlamaGen by 0.18 and diffusion models LDM by 0.15.
arXiv Detail & Related papers (2025-04-24T17:59:56Z)
GigaTok: Scaling Visual Tokenizers to 3 Billion Parameters for Autoregressive Image Generation [81.58846231702026]
We introduce GigaTok, the first approach to improve image reconstruction, generation, and representation learning when scaling visual tokenizers.<n>We identify the growing complexity of latent space as the key factor behind the reconstruction vs. generation dilemma.<n>By scaling to $bf3 space billion$ parameters, GigaTok achieves state-of-the-art performance in reconstruction, downstream AR generation, and downstream AR representation quality.
arXiv Detail & Related papers (2025-04-11T17:59:58Z)
Frequency Autoregressive Image Generation with Continuous Tokens [31.833852108014312]
We introduce the frequency progressive autoregressive (textbfFAR) paradigm and instantiate FAR with the continuous tokenizer.<n>We demonstrate the efficacy of FAR through comprehensive experiments on the ImageNet dataset.
arXiv Detail & Related papers (2025-03-07T10:34:04Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.