Beyond Next-Token: Next-X Prediction for Autoregressive Visual Generation
- URL: http://arxiv.org/abs/2502.20388v2
- Date: Thu, 20 Mar 2025 18:15:30 GMT
- Title: Beyond Next-Token: Next-X Prediction for Autoregressive Visual Generation
- Authors: Sucheng Ren, Qihang Yu, Ju He, Xiaohui Shen, Alan Yuille, Liang-Chieh Chen
- Abstract summary: Autoregressive (AR) modeling underpins state-of-the-art language and visual generative models. Traditionally, a ``token'' is treated as the smallest prediction unit, often a discrete symbol in language or a quantized patch in vision. We propose xAR, a framework that extends the notion of a token to an entity X.
- Score: 34.112157859384645
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Autoregressive (AR) modeling, known for its next-token prediction paradigm, underpins state-of-the-art language and visual generative models. Traditionally, a ``token'' is treated as the smallest prediction unit, often a discrete symbol in language or a quantized patch in vision. However, the optimal token definition for 2D image structures remains an open question. Moreover, AR models suffer from exposure bias, where teacher forcing during training leads to error accumulation at inference. In this paper, we propose xAR, a generalized AR framework that extends the notion of a token to an entity X, which can represent an individual patch token, a cell (a $k\times k$ grouping of neighboring patches), a subsample (a non-local grouping of distant patches), a scale (coarse-to-fine resolution), or even a whole image. Additionally, we reformulate discrete token classification as continuous entity regression, leveraging flow-matching methods at each AR step. This approach conditions training on noisy entities instead of ground truth tokens, leading to Noisy Context Learning, which effectively alleviates exposure bias. As a result, xAR offers two key advantages: (1) it enables flexible prediction units that capture different contextual granularity and spatial structures, and (2) it mitigates exposure bias by avoiding reliance on teacher forcing. On ImageNet-256 generation benchmark, our base model, xAR-B (172M), outperforms DiT-XL/SiT-XL (675M) while achieving 20$\times$ faster inference. Meanwhile, xAR-H sets a new state-of-the-art with an FID of 1.24, running 2.2$\times$ faster than the previous best-performing model without relying on vision foundation modules (e.g., DINOv2) or advanced guidance interval sampling.
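The abstract's key move, replacing discrete token classification with continuous entity regression trained on noisy entities, can be sketched with a standard linear flow-matching path. This is an illustrative reconstruction under common flow-matching conventions, not the paper's implementation; the function name `noisy_entity` and the cell shape below are hypothetical.

```python
import numpy as np

def noisy_entity(entity, t, rng):
    """Interpolate a clean entity toward Gaussian noise along a linear
    flow-matching path. Conditioning training on these noisy entities
    instead of ground-truth tokens is the Noisy Context Learning idea."""
    noise = rng.standard_normal(entity.shape)
    x_t = (1.0 - t) * noise + t * entity   # t=0: pure noise, t=1: clean entity
    velocity = entity - noise              # regression target at this AR step
    return x_t, velocity

rng = np.random.default_rng(0)
# A hypothetical "cell" entity X: a 2x2 grouping of 16-dim patch embeddings.
cell = rng.standard_normal((2, 2, 16))
x_t, v = noisy_entity(cell, t=0.5, rng=rng)
```

Because the model only ever sees interpolated (noisy) entities as context, it never depends on teacher-forced ground truth, which is how the framework mitigates exposure bias.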
Related papers
- Token-Shuffle: Towards High-Resolution Image Generation with Autoregressive Models [92.18057318458528]
Token-Shuffle is a novel method that reduces the number of image tokens in Transformer.
Our strategy requires no additional pretrained text-encoder and enables MLLMs to support extremely high-resolution image synthesis.
In GenAI-benchmark, our 2.7B model achieves 0.77 overall score on hard prompts, outperforming AR models LlamaGen by 0.18 and diffusion models LDM by 0.15.
arXiv Detail & Related papers (2025-04-24T17:59:56Z) - Neighboring Autoregressive Modeling for Efficient Visual Generation [19.486745219466666]
Neighboring Autoregressive Modeling (NAR) is a novel paradigm that formulates autoregressive visual generation as a progressive outpainting procedure.
To enable parallel prediction of multiple adjacent tokens in the spatial-temporal space, we introduce a set of dimension-oriented decoding heads.
Experiments on ImageNet $256\times 256$ and UCF101 demonstrate that NAR achieves 2.4$\times$ and 8.6$\times$ higher throughput respectively.
arXiv Detail & Related papers (2025-03-12T05:52:27Z) - ARINAR: Bi-Level Autoregressive Feature-by-Feature Generative Models [37.65992612575692]
ARINAR-B with 213M parameters achieves an FID of 2.75, which is comparable to the state-of-the-art MAR-B model (FID=2.31), while five times faster than the latter.
arXiv Detail & Related papers (2025-03-04T18:59:56Z) - Next Patch Prediction for Autoregressive Visual Generation [58.73461205369825]
We propose a novel Next Patch Prediction (NPP) paradigm for autoregressive image generation. Our key idea is to group and aggregate image tokens into patch tokens containing high information density. With patch tokens as a shorter input sequence, the autoregressive model is trained to predict the next patch, thereby significantly reducing the computational cost.
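The grouping step described above can be sketched in a few lines. Mean-pooling each neighborhood is an assumption for illustration; NPP's actual aggregation scheme may differ.

```python
import numpy as np

def group_to_patch_tokens(tokens, k):
    """Aggregate an (H, W, D) grid of image-token embeddings into an
    (H//k, W//k, D) grid of patch tokens by mean-pooling each k x k
    neighborhood, shortening the autoregressive sequence by k*k."""
    H, W, D = tokens.shape
    assert H % k == 0 and W % k == 0, "grid must divide evenly into k x k cells"
    return tokens.reshape(H // k, k, W // k, k, D).mean(axis=(1, 3))

grid = np.ones((4, 4, 8))                   # 16 image tokens
patches = group_to_patch_tokens(grid, k=2)  # 4 patch tokens: a 4x shorter sequence
```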
arXiv Detail & Related papers (2024-12-19T18:59:36Z) - FlowAR: Scale-wise Autoregressive Image Generation Meets Flow Matching [34.112157859384645]
We introduce FlowAR, a next-scale prediction method featuring a streamlined scale design. This eliminates the need for VAR's intricate multi-scale residual tokenizer. We validate the effectiveness of FlowAR on the challenging ImageNet-256 benchmark.
arXiv Detail & Related papers (2024-12-19T18:59:31Z) - RandAR: Decoder-only Autoregressive Visual Generation in Random Orders [54.49937384788739]
RandAR is a decoder-only visual autoregressive (AR) model capable of generating images in arbitrary token orders. Our design enables random order by inserting a "position instruction token" before each image token to be predicted. RandAR supports inpainting, outpainting and resolution inference in a zero-shot manner.
arXiv Detail & Related papers (2024-12-02T18:59:53Z) - Sample- and Parameter-Efficient Auto-Regressive Image Models [15.501863812794209]
We introduce XTRA, a vision model pre-trained with a novel auto-regressive objective.
XTRA employs a Block Causal Mask, where each Block represents $k \times k$ tokens rather than relying on a standard causal mask.
By reconstructing pixel values block by block, XTRA captures higher-level structural patterns over larger image regions.
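A generic block-causal mask consistent with this description is simple to construct: attention is bidirectional within a block and causal across blocks. The row-major flattening and block layout here are assumptions for illustration; XTRA's exact masking may differ.

```python
import numpy as np

def block_causal_mask(num_blocks, tokens_per_block):
    """Boolean attention mask over a flattened token sequence: a token may
    attend to any token in its own block (bidirectional within the block)
    and to all tokens in earlier blocks, but never to later blocks."""
    n = num_blocks * tokens_per_block
    block_idx = np.arange(n) // tokens_per_block
    # mask[i, j] is True iff token i may attend to token j.
    return block_idx[:, None] >= block_idx[None, :]

# 3 blocks of 4 tokens each, e.g. k=2 so each block covers a 2x2 region.
mask = block_causal_mask(num_blocks=3, tokens_per_block=4)
```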
arXiv Detail & Related papers (2024-11-23T20:40:46Z) - Randomized Autoregressive Visual Generation [26.195148077398223]
This paper presents Randomized AutoRegressive modeling (RAR) for visual generation.
RAR sets a new state-of-the-art performance on the image generation task while maintaining full compatibility with language modeling frameworks.
On the ImageNet-256 benchmark, RAR achieves an FID score of 1.48, not only surpassing prior state-of-the-art autoregressive image generators but also outperforming leading diffusion-based and masked transformer-based methods.
arXiv Detail & Related papers (2024-11-01T17:59:58Z) - Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction [33.57820997288788]
We present a new generation paradigm that redefines autoregressive learning on images as coarse-to-fine "next-scale prediction".
Visual AutoRegressive modeling makes GPT-like AR models surpass diffusion transformers in image generation.
We have released all models and codes to promote the exploration of AR/token models for visual generation and unified learning.
arXiv Detail & Related papers (2024-04-03T17:59:53Z) - Centroid-centered Modeling for Efficient Vision Transformer Pre-training [44.24223088955106]
Masked Image Modeling (MIM) is a new self-supervised vision pre-training paradigm using a Vision Transformer (ViT).
Our proposed centroid-based approach, CCViT, leverages k-means clustering to obtain centroids for image modeling without supervised training of the tokenizer model.
Our approach achieves competitive results with recent baselines without external supervision and distillation training from other models.
arXiv Detail & Related papers (2023-03-08T15:34:57Z) - Paraformer: Fast and Accurate Parallel Transformer for Non-autoregressive End-to-End Speech Recognition [62.83832841523525]
We propose a fast and accurate parallel transformer, termed Paraformer.
It accurately predicts the number of output tokens and extracts hidden variables.
It can attain comparable performance to the state-of-the-art AR transformer, with more than 10x speedup.
arXiv Detail & Related papers (2022-06-16T17:24:14Z) - NP-DRAW: A Non-Parametric Structured Latent Variable Model for Image Generation [139.8037697822064]
We present a non-parametric structured latent variable model for image generation, called NP-DRAW.
It sequentially draws on a latent canvas in a part-by-part fashion and then decodes the image from the canvas.
arXiv Detail & Related papers (2021-06-25T05:17:55Z) - TSNAT: Two-Step Non-Autoregressive Transformer Models for Speech Recognition [69.68154370877615]
The non-autoregressive (NAR) models can get rid of the temporal dependency between the output tokens and predict all output tokens in as few as one step.
To address these two problems, we propose a new model named the two-step non-autoregressive transformer (TSNAT).
The results show that the TSNAT can achieve a competitive performance with the AR model and outperform many complicated NAR models.
arXiv Detail & Related papers (2021-04-04T02:34:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.