Neighboring Autoregressive Modeling for Efficient Visual Generation
- URL: http://arxiv.org/abs/2503.10696v1
- Date: Wed, 12 Mar 2025 05:52:27 GMT
- Title: Neighboring Autoregressive Modeling for Efficient Visual Generation
- Authors: Yefei He, Yuanyu He, Shaoxuan He, Feng Chen, Hong Zhou, Kaipeng Zhang, Bohan Zhuang,
- Abstract summary: Neighboring Autoregressive Modeling (NAR) is a novel paradigm that formulates autoregressive visual generation as a progressive outpainting procedure.<n>To enable parallel prediction of multiple adjacent tokens in the spatial-temporal space, we introduce a set of dimension-oriented decoding heads.<n>Experiments on ImageNet$256times 256$ and UCF101 demonstrate that NAR achieves 2.4$times$ and 8.6$times$ higher throughput respectively.
- Score: 19.486745219466666
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual autoregressive models typically adhere to a raster-order ``next-token prediction" paradigm, which overlooks the spatial and temporal locality inherent in visual content. Specifically, visual tokens exhibit significantly stronger correlations with their spatially or temporally adjacent tokens compared to those that are distant. In this paper, we propose Neighboring Autoregressive Modeling (NAR), a novel paradigm that formulates autoregressive visual generation as a progressive outpainting procedure, following a near-to-far ``next-neighbor prediction" mechanism. Starting from an initial token, the remaining tokens are decoded in ascending order of their Manhattan distance from the initial token in the spatial-temporal space, progressively expanding the boundary of the decoded region. To enable parallel prediction of multiple adjacent tokens in the spatial-temporal space, we introduce a set of dimension-oriented decoding heads, each predicting the next token along a mutually orthogonal dimension. During inference, all tokens adjacent to the decoded tokens are processed in parallel, substantially reducing the model forward steps for generation. Experiments on ImageNet$256\times 256$ and UCF101 demonstrate that NAR achieves 2.4$\times$ and 8.6$\times$ higher throughput respectively, while obtaining superior FID/FVD scores for both image and video generation tasks compared to the PAR-4X approach. When evaluating on text-to-image generation benchmark GenEval, NAR with 0.8B parameters outperforms Chameleon-7B while using merely 0.4 of the training data. Code is available at https://github.com/ThisisBillhe/NAR.
Related papers
- Token-Shuffle: Towards High-Resolution Image Generation with Autoregressive Models [92.18057318458528]
Token-Shuffle is a novel method that reduces the number of image tokens in Transformer.
Our strategy requires no additional pretrained text-encoder and enables MLLMs to support extremely high-resolution image synthesis.
In GenAI-benchmark, our 2.7B model achieves 0.77 overall score on hard prompts, outperforming AR models LlamaGen by 0.18 and diffusion models LDM by 0.15.
arXiv Detail & Related papers (2025-04-24T17:59:56Z) - Pseudo-Autoregressive Neural Codec Language Models for Efficient Zero-Shot Text-to-Speech Synthesis [64.12708207721276]
We introduce a novel pseudo-autoregressive (PAR) language modeling approach that unifies AR and NAR modeling.
Building on PAR, we propose PALLE, a two-stage TTS system that leverages PAR for initial generation followed by NAR refinement.
Experiments demonstrate that PALLE, trained on LibriTTS, outperforms state-of-the-art systems trained on large-scale data.
arXiv Detail & Related papers (2025-04-14T16:03:21Z) - Beyond Next-Token: Next-X Prediction for Autoregressive Visual Generation [34.112157859384645]
Autoregressive (AR) modeling underpins state-of-the-art language and visual generative models.<n>Traditionally, a token'' is treated as the smallest prediction unit, often a discrete symbol in language or a quantized patch in vision.<n>We propose xAR, a framework that extends the notion of a token to an entity X.
arXiv Detail & Related papers (2025-02-27T18:59:08Z) - Parallelized Autoregressive Visual Generation [65.9579525736345]
We propose a simple yet effective approach for parallelized autoregressive visual generation.<n>Our method achieves a 3.6x speedup with comparable quality and up to 9.5x speedup with minimal quality degradation across both image and video generation tasks.
arXiv Detail & Related papers (2024-12-19T17:59:54Z) - ZipAR: Accelerating Auto-regressive Image Generation through Spatial Locality [19.486745219466666]
ZipAR is a training-free, plug-and-play parallel decoding framework for auto-regressive (AR) visual generation.<n>ZipAR can reduce the number of model forward passes by up to 91% on the Emu3-Gen model without requiring any additional retraining.
arXiv Detail & Related papers (2024-12-05T10:57:08Z) - FIRP: Faster LLM inference via future intermediate representation prediction [54.897493351694195]
FIRP generates multiple tokens instead of one at each decoding step.
We conduct extensive experiments, showing a speedup ratio of 1.9x-3x in several models and datasets.
arXiv Detail & Related papers (2024-10-27T15:53:49Z) - TokenUnify: Scalable Autoregressive Visual Pre-training with Mixture Token Prediction [61.295716741720284]
TokenUnify is a novel pretraining method that integrates random token prediction, next-token prediction, and next-all token prediction.
Cooperated with TokenUnify, we have assembled a large-scale electron microscopy (EM) image dataset with ultra-high resolution.
This dataset includes over 120 million annotated voxels, making it the largest neuron segmentation dataset to date.
arXiv Detail & Related papers (2024-05-27T05:45:51Z) - SA$^2$VP: Spatially Aligned-and-Adapted Visual Prompt [59.280491260635266]
Methods for visual prompt tuning follow the sequential modeling paradigm stemming from NLP.
Mymodel model learns a two-dimensional prompt token map with equal (or scaled) size to the image token map.
Our model can conduct individual prompting for different image tokens in a fine-grained manner.
arXiv Detail & Related papers (2023-12-16T08:23:43Z) - Paraformer: Fast and Accurate Parallel Transformer for
Non-autoregressive End-to-End Speech Recognition [62.83832841523525]
We propose a fast and accurate parallel transformer, termed Paraformer.
It accurately predicts the number of output tokens and extract hidden variables.
It can attain comparable performance to the state-of-the-art AR transformer, with more than 10x speedup.
arXiv Detail & Related papers (2022-06-16T17:24:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.