Towards Sequence Modeling Alignment between Tokenizer and Autoregressive Model
- URL: http://arxiv.org/abs/2506.05289v2
- Date: Fri, 10 Oct 2025 16:20:43 GMT
- Title: Towards Sequence Modeling Alignment between Tokenizer and Autoregressive Model
- Authors: Pingyu Wu, Kai Zhu, Yu Liu, Longxiang Tang, Jian Yang, Yansong Peng, Wei Zhai, Yang Cao, Zheng-Jun Zha
- Abstract summary: AliTok is a novel Aligned Tokenizer that alters the dependency structure of the token sequence. Building upon AliTok, a standard decoder-only autoregressive model with just 177M parameters achieves a gFID of 1.44 and an IS of 319.5 on the ImageNet-256 benchmark.
- Score: 69.79418000132995
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Autoregressive image generation aims to predict the next token based on previous ones. However, this process is challenged by the bidirectional dependencies inherent in conventional image tokenizations, which creates a fundamental misalignment with the unidirectional nature of autoregressive models. To resolve this, we introduce AliTok, a novel Aligned Tokenizer that alters the dependency structure of the token sequence. AliTok employs a bidirectional encoder constrained by a causal decoder, a design that compels the encoder to produce a token sequence with both semantic richness and forward-dependency. Furthermore, by incorporating prefix tokens and employing a two-stage tokenizer training process to enhance reconstruction performance, AliTok achieves high fidelity and predictability simultaneously. Building upon AliTok, a standard decoder-only autoregressive model with just 177M parameters achieves a gFID of 1.44 and an IS of 319.5 on the ImageNet-256 benchmark. Scaling up to 662M parameters, our model reaches a gFID of 1.28, surpassing the state-of-the-art diffusion method while achieving a 10x faster sampling speed. The code and weights are available at https://github.com/ali-vilab/alitok.
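The encoder/decoder asymmetry described in the abstract (a bidirectional encoder constrained by a causal decoder, plus globally visible prefix tokens) can be illustrated with attention masks. A minimal sketch, using a boolean convention where True means "may attend"; the function names are illustrative, not from the AliTok codebase:

```python
import numpy as np

def encoder_mask(n):
    # Bidirectional encoder: every token may attend to every other token.
    return np.ones((n, n), dtype=bool)

def causal_decoder_mask(n, n_prefix=0):
    # Causal decoder: position i attends only to positions <= i, so the
    # reconstruction of token i depends only on earlier tokens. This is the
    # constraint that pushes the encoder toward forward-dependent sequences.
    # Prefix tokens (if any) stay visible to all positions.
    mask = np.tril(np.ones((n, n), dtype=bool))
    mask[:, :n_prefix] = True
    return mask

# 4 prefix tokens followed by 8 content tokens.
m = causal_decoder_mask(12, n_prefix=4)
assert m[0, 3]       # the first position already sees all prefix tokens
assert not m[4, 5]   # a content token cannot see a future content token
assert m[11].all()   # the last position sees everything
```

The key contrast: `encoder_mask` places no ordering constraint, while the causal mask on the decoder forces the token sequence it reconstructs from to carry its information in left-to-right order, matching a decoder-only autoregressive generator.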
Related papers
- BIGFix: Bidirectional Image Generation with Token Fixing [21.40682276355247]
We propose a method for self-correcting image generation by iteratively refining sampled tokens. We achieve this with a novel training scheme that injects random tokens in the context, improving robustness and enabling token fixing during sampling. We evaluate our approach on image generation using the ImageNet-256 and CIFAR-10 datasets, as well as on video generation with UCF-101 and NuScenes, demonstrating substantial improvements across both modalities.
arXiv Detail & Related papers (2025-10-14T07:34:44Z)
- REAR: Rethinking Visual Autoregressive Models via Generator-Tokenizer Consistency Regularization [130.46612643194973]
reAR is a simple training strategy introducing a token-wise regularization objective. On ImageNet, it reduces gFID from 3.02 to 1.86 and improves IS to 316.9 using a standardization-based tokenizer. When applied to advanced tokenizers, it achieves a gFID of 1.42 with only 177M parameters, matching the performance of larger state-of-the-art diffusion models (675M).
arXiv Detail & Related papers (2025-10-06T02:48:13Z)
- NextStep-1: Toward Autoregressive Image Generation with Continuous Tokens at Scale [101.57871281101747]
NextStep-1 achieves state-of-the-art performance for autoregressive models in text-to-image generation tasks. Our method shows strong performance in image editing, highlighting the power and versatility of our unified approach.
arXiv Detail & Related papers (2025-08-14T14:54:22Z)
- HMAR: Efficient Hierarchical Masked Auto-Regressive Image Generation [91.08481618973111]
Visual Auto-Regressive modeling (VAR) has shown promise in bridging the speed and quality gap between autoregressive image models and diffusion models. We introduce Hierarchical Masked Auto-Regressive modeling (HMAR) to generate high-quality images with fast sampling. HMAR reformulates next-scale prediction as a Markovian process, wherein the prediction of each resolution scale is conditioned only on tokens in its immediate predecessor.
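The Markovian next-scale conditioning described in the HMAR summary can be sketched as a block attention mask in which each scale attends only to the scale immediately before it. A toy illustration (the helper name is hypothetical, and the coarsest scale is simply left unconditioned here for brevity):

```python
import numpy as np

def markov_scale_mask(scale_sizes):
    # Assign each token its scale index, then allow token i to attend to
    # token j iff j belongs to the scale immediately preceding i's scale.
    # This is the Markov property: no attention to scales further back.
    scales = np.concatenate([np.full(s, k) for k, s in enumerate(scale_sizes)])
    return scales[:, None] - 1 == scales[None, :]

# Three resolution scales with 1, 4 and 16 tokens.
m = markov_scale_mask([1, 4, 16])
assert m[1, 0]                   # scale-1 tokens condition on the scale-0 token
assert m[5, 1] and not m[5, 0]   # scale-2 sees scale-1, but not scale-0
```

Compared with conditioning on all previous scales, this mask keeps the attention cost of each scale proportional to the size of one predecessor scale, which is where the sampling-speed benefit comes from.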
arXiv Detail & Related papers (2025-06-04T20:08:07Z)
- D-AR: Diffusion via Autoregressive Models [21.03363985989625]
Diffusion via Autoregressive models (D-AR) is a new paradigm recasting the image diffusion process as a vanilla autoregressive procedure. Our method achieves 2.09 FID using a 775M Llama backbone with 256 discrete tokens.
arXiv Detail & Related papers (2025-05-29T17:09:25Z)
- Token-Shuffle: Towards High-Resolution Image Generation with Autoregressive Models [92.18057318458528]
Token-Shuffle is a novel method that reduces the number of image tokens in the Transformer. Our strategy requires no additional pretrained text encoder and enables MLLMs to support extremely high-resolution image synthesis. On the GenAI benchmark, our 2.7B model achieves a 0.77 overall score on hard prompts, outperforming the AR model LlamaGen by 0.18 and the diffusion model LDM by 0.15.
arXiv Detail & Related papers (2025-04-24T17:59:56Z)
- Bridging Continuous and Discrete Tokens for Autoregressive Visual Generation [85.82112629564942]
We propose TokenBridge, which maintains the strong representation capacity of continuous tokens while preserving the modeling simplicity of discrete tokens. We introduce a dimension-wise quantization strategy that independently discretizes each feature dimension, paired with a lightweight autoregressive prediction mechanism. Our approach achieves reconstruction and generation quality on par with continuous methods while using standard categorical prediction.
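Dimension-wise quantization of the kind TokenBridge describes can be sketched with uniform per-dimension binning: each feature dimension is discretized on its own, rather than vector-quantizing whole feature vectors against a joint codebook. A toy illustration, assuming a fixed value range and uniform bins (the paper's actual quantizer may differ):

```python
import numpy as np

def dimensionwise_quantize(x, n_bins=16, lo=-1.0, hi=1.0):
    # Discretize every feature dimension independently into uniform bins.
    # Interior bin edges split [lo, hi] into n_bins equal-width cells.
    edges = np.linspace(lo, hi, n_bins + 1)[1:-1]
    return np.digitize(np.clip(x, lo, hi), edges)

def dimensionwise_dequantize(idx, n_bins=16, lo=-1.0, hi=1.0):
    # Map each bin index back to its bin center.
    return lo + (idx + 0.5) * (hi - lo) / n_bins

x = np.array([[-0.9, 0.0, 0.9]])
idx = dimensionwise_quantize(x)
x_hat = dimensionwise_dequantize(idx)
assert idx.shape == x.shape
assert np.all(np.abs(x_hat - x) <= 2.0 / 16)  # error bounded by the bin width
```

The payoff of the per-dimension view is that each dimension becomes an ordinary categorical prediction over `n_bins` classes, instead of a single prediction over an exponentially large joint codebook.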
arXiv Detail & Related papers (2025-03-20T17:59:59Z)
- Improving Autoregressive Image Generation through Coarse-to-Fine Token Prediction [4.900334213807624]
We show how to enjoy the benefits of large codebooks without making autoregressive modeling more difficult. Our framework consists of two stages: (1) an autoregressive model that sequentially predicts coarse labels for each token in the sequence, and (2) an auxiliary model that simultaneously predicts fine-grained labels for all tokens conditioned on their coarse labels.
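One way to realize such a coarse/fine split is to factor each codebook index arithmetically, so a large codebook is addressed by a small coarse label (predicted autoregressively) plus a small fine label (predicted by the auxiliary model). A toy sketch, assuming the factorization index = coarse * n_fine + fine; the paper's actual label construction may differ:

```python
def split_token(index, n_fine):
    # Factor a large-codebook index into (coarse, fine) labels.
    return index // n_fine, index % n_fine

def merge_token(coarse, fine, n_fine):
    # Recover the original codebook index from its two labels.
    return coarse * n_fine + fine

# A 16384-entry codebook split into 256 coarse x 64 fine labels:
# the AR model only ever predicts over 256 classes per step.
c, f = split_token(10000, n_fine=64)
assert merge_token(c, f, n_fine=64) == 10000
assert 0 <= c < 256 and 0 <= f < 64
```

Under this factorization the hard sequential problem shrinks from a 16384-way to a 256-way prediction, while the 64-way fine labels can be filled in for all positions at once.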
arXiv Detail & Related papers (2025-03-20T14:41:29Z)
- Autoregressive Image Generation with Randomized Parallel Decoding [23.714192351237628]
ARPG is a novel visual autoregressive model that enables randomized parallel generation. Our approach attains an FID of 1.94 with only 64 sampling steps, achieving over a 20-fold increase in throughput.
arXiv Detail & Related papers (2025-03-13T17:19:51Z)
- Neighboring Autoregressive Modeling for Efficient Visual Generation [19.486745219466666]
Neighboring Autoregressive Modeling (NAR) is a novel paradigm that formulates autoregressive visual generation as a progressive outpainting procedure. To enable parallel prediction of multiple adjacent tokens in the spatial-temporal space, we introduce a set of dimension-oriented decoding heads. Experiments on ImageNet 256x256 and UCF101 demonstrate that NAR achieves 2.4x and 8.6x higher throughput respectively.
arXiv Detail & Related papers (2025-03-12T05:52:27Z)
- Robust Latent Matters: Boosting Image Generation with Sampling Error Synthesis [57.7367843129838]
Recent image generation schemes typically capture the image distribution in a pre-constructed latent space, relying on a frozen image tokenizer. We propose a novel plug-and-play tokenizer training scheme to facilitate latent space construction.
arXiv Detail & Related papers (2025-03-11T12:09:11Z)
- TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation [26.29803524047736]
TokenFlow is a novel unified image tokenizer that bridges the gap between multimodal understanding and generation. We demonstrate for the first time that discrete visual input can surpass LLaVA-1.5 13B in understanding performance. We also establish state-of-the-art performance in autoregressive image generation with a GenEval score of 0.55 at 256x256 resolution.
arXiv Detail & Related papers (2024-12-04T06:46:55Z)
- MaskBit: Embedding-free Image Generation via Bit Tokens [54.827480008982185]
We present an empirical and systematic examination of VQGANs, leading to a modernized VQGAN. Our second contribution demonstrates that embedding-free image generation using bit tokens achieves a new state-of-the-art FID of 1.52 on the ImageNet 256x256 benchmark.
arXiv Detail & Related papers (2024-09-24T16:12:12Z)
- Object Recognition as Next Token Prediction [99.40793702627396]
We present an approach to pose object recognition as next token prediction.
The idea is to apply a language decoder that auto-regressively predicts the text tokens from image embeddings to form labels.
arXiv Detail & Related papers (2023-12-04T18:58:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.