Bridging Continuous and Discrete Tokens for Autoregressive Visual Generation
- URL: http://arxiv.org/abs/2503.16430v3
- Date: Fri, 29 Aug 2025 02:25:47 GMT
- Title: Bridging Continuous and Discrete Tokens for Autoregressive Visual Generation
- Authors: Yuqing Wang, Zhijie Lin, Yao Teng, Yuanzhi Zhu, Shuhuai Ren, Jiashi Feng, Xihui Liu,
- Abstract summary: We propose TokenBridge, which maintains the strong representation capacity of continuous tokens while preserving the modeling simplicity of discrete tokens.<n>We introduce a dimension-wise quantization strategy that independently discretizes each feature dimension, paired with a lightweight autoregressive prediction mechanism.<n>Our approach achieves reconstruction and generation quality on par with continuous methods while using standard categorical prediction.
- Score: 85.82112629564942
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Autoregressive visual generation models typically rely on tokenizers to compress images into tokens that can be predicted sequentially. A fundamental dilemma exists in token representation: discrete tokens enable straightforward modeling with standard cross-entropy loss, but suffer from information loss and tokenizer training instability; continuous tokens better preserve visual details, but require complex distribution modeling, complicating the generation pipeline. In this paper, we propose TokenBridge, which bridges this gap by maintaining the strong representation capacity of continuous tokens while preserving the modeling simplicity of discrete tokens. To achieve this, we decouple discretization from the tokenizer training process through post-training quantization that directly obtains discrete tokens from continuous representations. Specifically, we introduce a dimension-wise quantization strategy that independently discretizes each feature dimension, paired with a lightweight autoregressive prediction mechanism that efficiently model the resulting large token space. Extensive experiments show that our approach achieves reconstruction and generation quality on par with continuous methods while using standard categorical prediction. This work demonstrates that bridging discrete and continuous paradigms can effectively harness the strengths of both approaches, providing a promising direction for high-quality visual generation with simple autoregressive modeling. Project page: https://yuqingwang1029.github.io/TokenBridge.
Related papers
- Token Maturation: Autoregressive Language Generation via Continuous Token Dynamics [0.7252027234425333]
We introduce a continuous autoregressive formulation of language generation in which tokens are represented as continuous vectors that emphmature over multiple update steps before being discretized.<n>We show that this maturation process alone is sufficient to produce coherent and diverse text using deterministic decoding (argmax)<n>Additional perturbations, such as dynamics or history smoothing, can be incorporated naturally but are not required for the model to function.
arXiv Detail & Related papers (2026-01-08T11:44:34Z) - ResTok: Learning Hierarchical Residuals in 1D Visual Tokenizers for Autoregressive Image Generation [64.84095852784714]
Residual Tokenizer (ResTok) is a 1D visual tokenizer that builds hierarchical residuals for both image tokens and latent tokens.<n>We show that restoring hierarchical residual priors in visual tokenization significantly improves AR image generation, achieving a gFID of 2.34 on ImageNet-256 with only 9 sampling steps.
arXiv Detail & Related papers (2026-01-07T14:09:18Z) - BIGFix: Bidirectional Image Generation with Token Fixing [21.40682276355247]
We propose a method for self-correcting image generation by iteratively refining sampled tokens.<n>We achieve this with a novel training scheme that injects random tokens in the context, improving robustness and enabling token fixing during sampling.<n>We evaluate our approach on image generation using the ImageNet-256 and CIFAR-10 datasets, as well as on video generation with UCF-101 and NuScenes, demonstrating substantial improvements across both modalities.
arXiv Detail & Related papers (2025-10-14T07:34:44Z) - Speculative Jacobi-Denoising Decoding for Accelerating Autoregressive Text-to-image Generation [110.28291466364784]
Speculative Jacobi-Denoising Decoding (SJD2) is a framework that incorporates the denoising process into Jacobi to enable parallel token generation in autoregressive models.<n>Our method introduces a next-clean-token prediction paradigm that enables the pre-trained autoregressive models to accept noise-perturbed token embeddings.
arXiv Detail & Related papers (2025-10-10T04:30:45Z) - NextStep-1: Toward Autoregressive Image Generation with Continuous Tokens at Scale [101.57871281101747]
NextStep-1 achieves state-of-the-art performance for autoregressive models in text-to-image generation tasks.<n>Our method shows strong performance in image editing, highlighting the power and versatility of our unified approach.
arXiv Detail & Related papers (2025-08-14T14:54:22Z) - Rethinking Discrete Tokens: Treating Them as Conditions for Continuous Autoregressive Image Synthesis [79.98107530577576]
DisCon is a novel framework that reinterprets discrete tokens as conditional signals rather than generation targets.<n>DisCon achieves a gFID score of 1.38 on ImageNet 256$times $256 generation, outperforming state-of-the-art autoregressive approaches by a clear margin.
arXiv Detail & Related papers (2025-07-02T14:33:52Z) - D2C: Unlocking the Potential of Continuous Autoregressive Image Generation with Discrete Tokens [80.75893450536577]
We propose D2C, a novel two-stage method to enhance model generation capacity.
In the first stage, the discrete-valued tokens representing coarse-grained image features are sampled by employing a small discrete-valued generator.
In the second stage, the continuous-valued tokens representing fine-grained image features are learned conditioned on the discrete token sequence.
arXiv Detail & Related papers (2025-03-21T13:58:49Z) - Improving Autoregressive Image Generation through Coarse-to-Fine Token Prediction [4.900334213807624]
We show how to enjoy the benefits of large codebooks without making autoregressive modeling more difficult.<n>Our framework consists of two stages: (1) an autoregressive model that sequentially predicts coarse labels for each token in the sequence, and (2) an auxiliary model that simultaneously predicts fine-grained labels for all tokens conditioned on their coarse labels.
arXiv Detail & Related papers (2025-03-20T14:41:29Z) - "Principal Components" Enable A New Language of Images [79.45806370905775]
We introduce a novel visual tokenization framework that embeds a provable PCA-like structure into the latent token space.
Our approach achieves state-of-the-art reconstruction performance and enables better interpretability to align with the human vision system.
arXiv Detail & Related papers (2025-03-11T17:59:41Z) - Frequency Autoregressive Image Generation with Continuous Tokens [31.833852108014312]
We introduce the frequency progressive autoregressive (textbfFAR) paradigm and instantiate FAR with the continuous tokenizer.<n>We demonstrate the efficacy of FAR through comprehensive experiments on the ImageNet dataset.
arXiv Detail & Related papers (2025-03-07T10:34:04Z) - Faster Language Models with Better Multi-Token Prediction Using Tensor Decomposition [5.575078692353885]
We propose a new model for multi-token prediction in transformers, aiming to enhance sampling efficiency without compromising accuracy.
By generalizing it to a rank-$r$ canonical probability decomposition, we develop an improved model that predicts multiple tokens simultaneously.
arXiv Detail & Related papers (2024-10-23T11:06:36Z) - Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion [61.03681839276652]
Diffusion Forcing is a new training paradigm where a diffusion model is trained to denoise a set of tokens with independent per-token noise levels.<n>We apply Diffusion Forcing to sequence generative modeling by training a causal next-token prediction model to generate one or several future tokens.
arXiv Detail & Related papers (2024-07-01T15:43:25Z) - Non-autoregressive Sequence-to-Sequence Vision-Language Models [59.445765313094434]
We propose a parallel decoding sequence-to-sequence vision-language model that marginalizes over multiple inference paths in the decoder.<n>The model achieves performance on-par with its state-of-the-art autoregressive counterpart, but is faster at inference time.
arXiv Detail & Related papers (2024-03-04T17:34:59Z) - ChiroDiff: Modelling chirographic data with Diffusion Models [132.5223191478268]
We introduce a powerful model-class namely "Denoising Diffusion Probabilistic Models" or DDPMs for chirographic data.
Our model named "ChiroDiff", being non-autoregressive, learns to capture holistic concepts and therefore remains resilient to higher temporal sampling rate.
arXiv Detail & Related papers (2023-04-07T15:17:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.