Towards Sequence Modeling Alignment between Tokenizer and Autoregressive Model
- URL: http://arxiv.org/abs/2506.05289v2
- Date: Fri, 10 Oct 2025 16:20:43 GMT
- Title: Towards Sequence Modeling Alignment between Tokenizer and Autoregressive Model
- Authors: Pingyu Wu, Kai Zhu, Yu Liu, Longxiang Tang, Jian Yang, Yansong Peng, Wei Zhai, Yang Cao, Zheng-Jun Zha
- Abstract summary: AliTok is a novel Aligned Tokenizer that alters the dependency structure of the token sequence. Building upon AliTok, a standard decoder-only autoregressive model with just 177M parameters achieves a gFID of 1.44 and an IS of 319.5 on the ImageNet-256 benchmark.
- Score: 69.79418000132995
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Autoregressive image generation aims to predict the next token based on previous ones. However, this process is challenged by the bidirectional dependencies inherent in conventional image tokenizations, which creates a fundamental misalignment with the unidirectional nature of autoregressive models. To resolve this, we introduce AliTok, a novel Aligned Tokenizer that alters the dependency structure of the token sequence. AliTok employs a bidirectional encoder constrained by a causal decoder, a design that compels the encoder to produce a token sequence with both semantic richness and forward-dependency. Furthermore, by incorporating prefix tokens and employing a two-stage tokenizer training process to enhance reconstruction performance, AliTok achieves high fidelity and predictability simultaneously. Building upon AliTok, a standard decoder-only autoregressive model with just 177M parameters achieves a gFID of 1.44 and an IS of 319.5 on the ImageNet-256 benchmark. Scaling up to 662M parameters, our model reaches a gFID of 1.28, surpassing the state-of-the-art diffusion method while achieving a 10x faster sampling speed. The code and weights are available at https://github.com/ali-vilab/alitok.
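The encoder/decoder asymmetry described in the abstract (a bidirectional encoder constrained by a causal decoder, plus globally visible prefix tokens) can be illustrated with attention masks. A minimal sketch, using a boolean convention where True means "may attend"; the function names are illustrative, not from the AliTok codebase:

```python
import numpy as np

def encoder_mask(n):
    # Bidirectional encoder: every token may attend to every other token.
    return np.ones((n, n), dtype=bool)

def causal_decoder_mask(n, n_prefix=0):
    # Causal decoder: position i attends only to positions <= i, so the
    # reconstruction of token i depends only on earlier tokens. This is the
    # constraint that pushes the encoder toward forward-dependent sequences.
    # Prefix tokens (if any) stay visible to all positions.
    mask = np.tril(np.ones((n, n), dtype=bool))
    mask[:, :n_prefix] = True
    return mask

# 4 prefix tokens followed by 8 content tokens.
m = causal_decoder_mask(12, n_prefix=4)
assert m[0, 3]       # the first position already sees all prefix tokens
assert not m[4, 5]   # a content token cannot see a future content token
assert m[11].all()   # the last position sees everything
```

The key contrast: `encoder_mask` places no ordering constraint, while the causal mask on the decoder forces the token sequence it reconstructs from to carry its information in left-to-right order, matching a decoder-only autoregressive generator.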
Related papers
- BIGFix: Bidirectional Image Generation with Token Fixing [21.40682276355247]
We propose a method for self-correcting image generation by iteratively refining sampled tokens. We achieve this with a novel training scheme that injects random tokens in the context, improving robustness and enabling token fixing during sampling. We evaluate our approach on image generation using the ImageNet-256 and CIFAR-10 datasets, as well as on video generation with UCF-101 and NuScenes, demonstrating substantial improvements across both modalities.
arXiv Detail & Related papers (2025-10-14T07:34:44Z)
- REAR: Rethinking Visual Autoregressive Models via Generator-Tokenizer Consistency Regularization [130.46612643194973]
reAR is a simple training strategy introducing a token-wise regularization objective. On ImageNet, it reduces gFID from 3.02 to 1.86 and improves IS to 316.9 using a standardization-based tokenizer. When applied to advanced tokenizers, it achieves a gFID of 1.42 with only 177M parameters, matching the performance of larger state-of-the-art diffusion models (675M).
arXiv Detail & Related papers (2025-10-06T02:48:13Z)
- NextStep-1: Toward Autoregressive Image Generation with Continuous Tokens at Scale [101.57871281101747]
NextStep-1 achieves state-of-the-art performance for autoregressive models in text-to-image generation tasks. Our method shows strong performance in image editing, highlighting the power and versatility of our unified approach.
arXiv Detail & Related papers (2025-08-14T14:54:22Z)
- HMAR: Efficient Hierarchical Masked Auto-Regressive Image Generation [91.08481618973111]
Visual Auto-Regressive modeling (VAR) has shown promise in bridging the speed and quality gap between autoregressive image models and diffusion models. We introduce Hierarchical Masked Auto-Regressive modeling (HMAR) to generate high-quality images with fast sampling. HMAR reformulates next-scale prediction as a Markovian process, wherein the prediction of each resolution scale is conditioned only on tokens in its immediate predecessor.
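The Markovian next-scale conditioning described in the HMAR summary can be sketched as a block attention mask in which each scale attends only to the scale immediately before it. A toy illustration (the helper name is hypothetical, and the coarsest scale is simply left unconditioned here for brevity):

```python
import numpy as np

def markov_scale_mask(scale_sizes):
    # Assign each token its scale index, then allow token i to attend to
    # token j iff j belongs to the scale immediately preceding i's scale.
    # This is the Markov property: no attention to scales further back.
    scales = np.concatenate([np.full(s, k) for k, s in enumerate(scale_sizes)])
    return scales[:, None] - 1 == scales[None, :]

# Three resolution scales with 1, 4 and 16 tokens.
m = markov_scale_mask([1, 4, 16])
assert m[1, 0]                   # scale-1 tokens condition on the scale-0 token
assert m[5, 1] and not m[5, 0]   # scale-2 sees scale-1, but not scale-0
```

Compared with conditioning on all previous scales, this mask keeps the attention cost of each scale proportional to the size of one predecessor scale, which is where the sampling-speed benefit comes from.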
arXiv Detail & Related papers (2025-06-04T20:08:07Z)
- D-AR: Diffusion via Autoregressive Models [21.03363985989625]
Diffusion via Autoregressive models (D-AR) is a new paradigm recasting the image diffusion process as a vanilla autoregressive procedure. Our method achieves 2.09 FID using a 775M Llama backbone with 256 discrete tokens.
arXiv Detail & Related papers (2025-05-29T17:09:25Z)
- Token-Shuffle: Towards High-Resolution Image Generation with Autoregressive Models [92.18057318458528]
Token-Shuffle is a novel method that reduces the number of image tokens in the Transformer. Our strategy requires no additional pretrained text encoder and enables MLLMs to support extremely high-resolution image synthesis. On the GenAI benchmark, our 2.7B model achieves a 0.77 overall score on hard prompts, outperforming the AR model LlamaGen by 0.18 and the diffusion model LDM by 0.15.
arXiv Detail & Related papers (2025-04-24T17:59:56Z)
- Bridging Continuous and Discrete Tokens for Autoregressive Visual Generation [85.82112629564942]
We propose TokenBridge, which maintains the strong representation capacity of continuous tokens while preserving the modeling simplicity of discrete tokens. We introduce a dimension-wise quantization strategy that independently discretizes each feature dimension, paired with a lightweight autoregressive prediction mechanism. Our approach achieves reconstruction and generation quality on par with continuous methods while using standard categorical prediction.
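Dimension-wise quantization of the kind TokenBridge describes can be sketched with uniform per-dimension binning: each feature dimension is discretized on its own, rather than vector-quantizing whole feature vectors against a joint codebook. A toy illustration, assuming a fixed value range and uniform bins (the paper's actual quantizer may differ):

```python
import numpy as np

def dimensionwise_quantize(x, n_bins=16, lo=-1.0, hi=1.0):
    # Discretize every feature dimension independently into uniform bins.
    # Interior bin edges split [lo, hi] into n_bins equal-width cells.
    edges = np.linspace(lo, hi, n_bins + 1)[1:-1]
    return np.digitize(np.clip(x, lo, hi), edges)

def dimensionwise_dequantize(idx, n_bins=16, lo=-1.0, hi=1.0):
    # Map each bin index back to its bin center.
    return lo + (idx + 0.5) * (hi - lo) / n_bins

x = np.array([[-0.9, 0.0, 0.9]])
idx = dimensionwise_quantize(x)
x_hat = dimensionwise_dequantize(idx)
assert idx.shape == x.shape
assert np.all(np.abs(x_hat - x) <= 2.0 / 16)  # error bounded by the bin width
```

The payoff of the per-dimension view is that each dimension becomes an ordinary categorical prediction over `n_bins` classes, instead of a single prediction over an exponentially large joint codebook.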
arXiv Detail & Related papers (2025-03-20T17:59:59Z)
- Improving Autoregressive Image Generation through Coarse-to-Fine Token Prediction [4.900334213807624]
We show how to enjoy the benefits of large codebooks without making autoregressive modeling more difficult. Our framework consists of two stages: (1) an autoregressive model that sequentially predicts coarse labels for each token in the sequence, and (2) an auxiliary model that simultaneously predicts fine-grained labels for all tokens conditioned on their coarse labels.
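One way to realize such a coarse/fine split is to factor each codebook index arithmetically, so a large codebook is addressed by a small coarse label (predicted autoregressively) plus a small fine label (predicted by the auxiliary model). A toy sketch, assuming the factorization index = coarse * n_fine + fine; the paper's actual label construction may differ:

```python
def split_token(index, n_fine):
    # Factor a large-codebook index into (coarse, fine) labels.
    return index // n_fine, index % n_fine

def merge_token(coarse, fine, n_fine):
    # Recover the original codebook index from its two labels.
    return coarse * n_fine + fine

# A 16384-entry codebook split into 256 coarse x 64 fine labels:
# the AR model only ever predicts over 256 classes per step.
c, f = split_token(10000, n_fine=64)
assert merge_token(c, f, n_fine=64) == 10000
assert 0 <= c < 256 and 0 <= f < 64
```

Under this factorization the hard sequential problem shrinks from a 16384-way to a 256-way prediction, while the 64-way fine labels can be filled in for all positions at once.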
arXiv Detail & Related papers (2025-03-20T14:41:29Z)
- Autoregressive Image Generation with Randomized Parallel Decoding [23.714192351237628]
ARPG is a novel visual autoregressive model that enables randomized parallel generation. Our approach attains an FID of 1.94 with only 64 sampling steps, achieving over a 20-fold increase in throughput.
arXiv Detail & Related papers (2025-03-13T17:19:51Z)
- Neighboring Autoregressive Modeling for Efficient Visual Generation [19.486745219466666]
Neighboring Autoregressive Modeling (NAR) is a novel paradigm that formulates autoregressive visual generation as a progressive outpainting procedure. To enable parallel prediction of multiple adjacent tokens in the spatial-temporal space, we introduce a set of dimension-oriented decoding heads. Experiments on ImageNet 256x256 and UCF101 demonstrate that NAR achieves 2.4x and 8.6x higher throughput respectively.
arXiv Detail & Related papers (2025-03-12T05:52:27Z)
- Robust Latent Matters: Boosting Image Generation with Sampling Error Synthesis [57.7367843129838]
Recent image generation schemes typically capture the image distribution in a pre-constructed latent space, relying on a frozen image tokenizer. We propose a novel plug-and-play tokenizer training scheme to facilitate latent space construction.
arXiv Detail & Related papers (2025-03-11T12:09:11Z)
- TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation [26.29803524047736]
TokenFlow is a novel unified image tokenizer that bridges the gap between multimodal understanding and generation. We demonstrate for the first time that discrete visual input can surpass LLaVA-1.5 13B in understanding performance. We also establish state-of-the-art performance in autoregressive image generation with a GenEval score of 0.55 at 256x256 resolution.
arXiv Detail & Related papers (2024-12-04T06:46:55Z)
- MaskBit: Embedding-free Image Generation via Bit Tokens [54.827480008982185]
We present an empirical and systematic examination of VQGANs, leading to a modernized VQGAN. Our second contribution demonstrates that embedding-free image generation using bit tokens achieves a new state-of-the-art FID of 1.52 on the ImageNet 256x256 benchmark.
arXiv Detail & Related papers (2024-09-24T16:12:12Z)
- Object Recognition as Next Token Prediction [99.40793702627396]
We present an approach to pose object recognition as next token prediction.
The idea is to apply a language decoder that auto-regressively predicts the text tokens from image embeddings to form labels.
arXiv Detail & Related papers (2023-12-04T18:58:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.