TensorAR: Refinement is All You Need in Autoregressive Image Generation
- URL: http://arxiv.org/abs/2505.16324v1
- Date: Thu, 22 May 2025 07:27:25 GMT
- Title: TensorAR: Refinement is All You Need in Autoregressive Image Generation
- Authors: Cheng Cheng, Lin Song, Yicheng Xiao, Yuxin Chen, Xuchong Zhang, Hongbin Sun, Ying Shan
- Abstract summary: Autoregressive (AR) image generators offer a language-model-friendly approach to image generation by predicting discrete image tokens in a causal sequence. Unlike diffusion models, AR models lack a mechanism to refine previous predictions, limiting their generation quality. In this paper, we introduce a new AR paradigm that reformulates image generation from next-token prediction to next-tensor prediction.
- Score: 45.38495724606076
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Autoregressive (AR) image generators offer a language-model-friendly approach to image generation by predicting discrete image tokens in a causal sequence. However, unlike diffusion models, AR models lack a mechanism to refine previous predictions, limiting their generation quality. In this paper, we introduce TensorAR, a new AR paradigm that reformulates image generation from next-token prediction to next-tensor prediction. By generating overlapping windows of image patches (tensors) in a sliding fashion, TensorAR enables iterative refinement of previously generated content. To prevent information leakage during training, we propose a discrete tensor noising scheme, which perturbs input tokens via codebook-indexed noise. TensorAR is implemented as a plug-and-play module compatible with existing AR models. Extensive experiments on LlamaGEN, Open-MAGVIT2, and RAR demonstrate that TensorAR significantly improves the generation performance of autoregressive models.
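To make the abstract's two ideas concrete, here is a minimal sketch of a sliding next-tensor decoding loop and a discrete noising step, assuming a toy predict_window stand-in for the AR model; the window size, the overwrite-on-overlap rule, and all names are illustrative, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, K, L = 1024, 4, 16  # codebook size, tensor (window) size, token count

def predict_window(prefix: np.ndarray) -> np.ndarray:
    """Placeholder for the AR model: returns K token ids given the prefix."""
    return rng.integers(0, VOCAB, size=K)

def noise_tokens(tokens: np.ndarray, p: float = 0.3) -> np.ndarray:
    """Discrete tensor noising (training-time): replace tokens with random
    codebook indices with probability p, so ground-truth tokens inside a
    window are not leaked to the model through its own input."""
    mask = rng.random(tokens.shape) < p
    return np.where(mask, rng.integers(0, VOCAB, size=tokens.shape), tokens)

def generate(length: int = L) -> np.ndarray:
    canvas = np.full(length, -1, dtype=np.int64)  # -1 = not yet generated
    for t in range(length):
        window = predict_window(canvas[:t])        # predict tokens t..t+K-1
        end = min(t + K, length)
        # Overlapping windows: each position is re-predicted by up to K
        # windows, so later steps refine earlier, provisional tokens.
        canvas[t:end] = window[: end - t]
    return canvas

tokens = generate()
print(tokens)                 # final refined sequence
print(noise_tokens(tokens))   # what a noised training input could look like
```

Because each position falls inside up to K overlapping windows, it is re-predicted up to K times; that repeated overwrite is the refinement mechanism the abstract describes.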
Related papers
- Rethinking Discrete Tokens: Treating Them as Conditions for Continuous Autoregressive Image Synthesis [79.98107530577576]
DisCon is a novel framework that reinterprets discrete tokens as conditional signals rather than generation targets. DisCon achieves a gFID score of 1.38 on ImageNet 256×256 generation, outperforming state-of-the-art autoregressive approaches by a clear margin.
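A hedged sketch of what "tokens as conditions" could look like: discrete indices select an embedding that conditions a continuous AR head, instead of serving as the prediction targets. Shapes and names below are invented for exposition, not DisCon's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 16, 8  # number of patches, continuous latent dim

discrete_tokens = rng.integers(0, 1024, size=N)   # condition, not target
token_embed = rng.standard_normal((1024, D))

def predict_latent(prev: list, cond: np.ndarray) -> np.ndarray:
    """Stand-in for a continuous AR head: emits the next patch latent."""
    return cond + 0.1 * rng.standard_normal(D)

latents = []
for i in range(N):
    cond = token_embed[discrete_tokens[i]]        # discrete token as signal
    latents.append(predict_latent(latents, cond)) # continuous prediction
print(np.stack(latents).shape)
```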
arXiv Detail & Related papers (2025-07-02T14:33:52Z)
- AR-RAG: Autoregressive Retrieval Augmentation for Image Generation [35.008697736838194]
We introduce Autoregressive Retrieval Augmentation (AR-RAG), a novel paradigm that enhances image generation by autoregressively incorporating k-nearest neighbor retrievals at the patch level. We validate the effectiveness of AR-RAG on widely adopted benchmarks, including Midjourney-30K, GenEval and DPG-Bench.
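A minimal sketch of patch-level k-nearest-neighbor retrieval, the core mechanism the summary names; the datastore layout, cosine-similarity search, and fusion comment are assumptions for exposition, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N, K = 64, 10_000, 5  # embedding dim, datastore size, neighbors

datastore_keys = rng.standard_normal((N, D)).astype(np.float32)
datastore_patches = rng.integers(0, 1024, size=N)  # token id per stored patch

def retrieve(query: np.ndarray, k: int = K) -> np.ndarray:
    """Return token ids of the k nearest stored patches (cosine similarity)."""
    q = query / np.linalg.norm(query)
    keys = datastore_keys / np.linalg.norm(datastore_keys, axis=1, keepdims=True)
    sims = keys @ q
    return datastore_patches[np.argsort(-sims)[:k]]

# At each AR step, the retrieved neighbor tokens would be injected as extra
# conditioning (e.g., cross-attended) before predicting the next patch token.
context_embedding = rng.standard_normal(D).astype(np.float32)
print(retrieve(context_embedding))
```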
arXiv Detail & Related papers (2025-06-08T01:33:05Z)
- HMAR: Efficient Hierarchical Masked Auto-Regressive Image Generation [91.08481618973111]
Visual Auto-Regressive modeling (VAR) has shown promise in bridging the speed and quality gap between autoregressive image models and diffusion models. We introduce Hierarchical Masked Auto-Regressive modeling (HMAR) to generate high-quality images with fast sampling. HMAR reformulates next-scale prediction as a Markovian process, wherein the prediction of each resolution scale is conditioned only on tokens in its immediate predecessor.
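A toy rendering of the Markovian next-scale loop described above: each scale is predicted from its immediate predecessor only. predict_scale is a placeholder for the masked AR model, and the scale schedule is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
SCALES = [1, 2, 4, 8]  # token-map side lengths, coarse to fine

def upsample(tokens: np.ndarray, side: int) -> np.ndarray:
    """Nearest-neighbor upsample of a square token map to side x side."""
    reps = side // tokens.shape[0]
    return tokens.repeat(reps, axis=0).repeat(reps, axis=1)

def predict_scale(cond: np.ndarray) -> np.ndarray:
    """Placeholder model: predicts this scale's tokens from the previous one."""
    return rng.integers(0, 1024, size=cond.shape)

tokens = rng.integers(0, 1024, size=(1, 1))  # coarsest scale
for side in SCALES[1:]:
    cond = upsample(tokens, side)       # condition ONLY on the previous scale
    tokens = predict_scale(cond)        # Markov property: no deeper history
print(tokens.shape)
```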
arXiv Detail & Related papers (2025-06-04T20:08:07Z)
- Revealing the Implicit Noise-based Imprint of Generative Models [71.94916898756684]
This paper presents a novel framework that leverages the noise-based, model-specific imprint for the detection task. By aggregating imprints from various generative models, imprints of future models can be extrapolated to expand training data. Our approach achieves state-of-the-art performance across three public benchmarks including GenImage, Synthbuster and Chameleon.
arXiv Detail & Related papers (2025-03-12T12:04:53Z)
- Frequency Autoregressive Image Generation with Continuous Tokens [31.833852108014312]
We introduce the frequency progressive autoregressive (FAR) paradigm and instantiate FAR with a continuous tokenizer. We demonstrate the efficacy of FAR through comprehensive experiments on the ImageNet dataset.
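A toy decomposition illustrating the frequency-progressive idea, assuming an FFT-based band split (the paper's actual tokenizer and schedule may differ): coarse frequency bands come first and finer bands are added on top, so the bands together reconstruct the full image.

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.standard_normal((32, 32))
spec = np.fft.fftshift(np.fft.fft2(img))

def band_mask(side: int, radius: int) -> np.ndarray:
    """Disk mask selecting spatial frequencies within `radius` of DC."""
    yy, xx = np.mgrid[:side, :side]
    c = side // 2
    return np.hypot(yy - c, xx - c) <= radius

recon = np.zeros_like(img)
prev = np.zeros_like(spec)
for radius in (2, 4, 8, 16, 32):            # coarse-to-fine frequency stages
    cur = np.where(band_mask(32, radius), spec, 0)
    recon += np.fft.ifft2(np.fft.ifftshift(cur - prev)).real  # add new band
    prev = cur
print(np.allclose(recon, img))              # all bands together = full image
```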
arXiv Detail & Related papers (2025-03-07T10:34:04Z)
- Autoregressive Video Generation without Vector Quantization [90.87907377618747]
We reformulate the video generation problem as non-quantized autoregressive modeling of temporal frame-by-frame prediction. With the proposed approach, we train a novel video autoregressive model without vector quantization, termed NOVA. Our results demonstrate that NOVA surpasses prior autoregressive video models in data efficiency, inference speed, visual fidelity, and video fluency, even with a much smaller model capacity.
arXiv Detail & Related papers (2024-12-18T18:59:53Z)
- RandAR: Decoder-only Autoregressive Visual Generation in Random Orders [54.49937384788739]
RandAR is a decoder-only visual autoregressive (AR) model capable of generating images in arbitrary token orders. Our design enables random order by inserting a "position instruction token" before each image token to be predicted. RandAR supports inpainting, outpainting and resolution inference in a zero-shot manner.
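A short sketch of the interleaving described above: a position instruction token announces which spatial location is predicted next, which is what makes arbitrary orders possible. The vocabulary layout is an invented convention.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_POS, IMG_VOCAB = 16, 1024
POS_OFFSET = IMG_VOCAB  # position tokens live after image tokens in the vocab

order = rng.permutation(NUM_POS)               # arbitrary generation order
image_tokens = rng.integers(0, IMG_VOCAB, NUM_POS)

interleaved = []
for pos in order:
    interleaved.append(POS_OFFSET + pos)       # "predict location `pos` next"
    interleaved.append(int(image_tokens[pos])) # the image token for that spot
print(interleaved[:8])
```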
arXiv Detail & Related papers (2024-12-02T18:59:53Z)
- Randomized Autoregressive Visual Generation [26.195148077398223]
This paper presents Randomized AutoRegressive modeling (RAR) for visual generation.
RAR sets a new state-of-the-art performance on the image generation task while maintaining full compatibility with language modeling frameworks.
On the ImageNet-256 benchmark, RAR achieves an FID score of 1.48, not only surpassing prior state-of-the-art autoregressive image generators but also outperforming leading diffusion-based and masked transformer-based methods.
arXiv Detail & Related papers (2024-11-01T17:59:58Z)
- LANTERN: Accelerating Visual Autoregressive Models with Relaxed Speculative Decoding [30.630803933771865]
Experimental results demonstrate the efficacy of our method in providing a substantial speed-up over speculative decoding. LANTERN increases speed-ups by 1.75× and 1.82×, as compared to greedy decoding and random sampling.
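A minimal sketch of speculative decoding with a relaxed acceptance test, gesturing at the idea of accepting draft tokens that are interchangeable with high-probability ones; the neighbor sets and threshold are illustrative assumptions, not LANTERN's actual criterion.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 32

def accept(draft_tok, p_target, p_draft, neighbors, tau=0.9):
    """Standard test: accept with prob min(1, p_t/p_d). Relaxation: also
    accept if enough target mass sits on tokens 'interchangeable' with it."""
    if rng.random() < min(1.0, p_target[draft_tok] / p_draft[draft_tok]):
        return True
    return p_target[neighbors[draft_tok]].sum() >= tau  # relaxed criterion

p_draft = rng.dirichlet(np.ones(VOCAB))
p_target = rng.dirichlet(np.ones(VOCAB))
# Hypothetical neighbor sets: tokens decoding to visually similar patches.
neighbors = {t: np.array([t, (t + 1) % VOCAB]) for t in range(VOCAB)}
tok = rng.choice(VOCAB, p=p_draft)
print(accept(tok, p_target, p_draft, neighbors))
```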
arXiv Detail & Related papers (2024-10-04T12:21:03Z)
- Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction [33.57820997288788]
We present a new generation paradigm that redefines autoregressive learning on images as coarse-to-fine "next-scale prediction".
Visual AutoRegressive modeling makes GPT-like AR models surpass diffusion transformers in image generation.
We have released all models and code to promote the exploration of AR/token models for visual generation and unified learning.
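A toy next-scale prediction loop matching the coarse-to-fine description above; unlike the Markovian HMAR sketch earlier, each scale here conditions on all coarser scales. model is a placeholder for the transformer, and the scale schedule is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def model(history: list, side: int) -> np.ndarray:
    """Stand-in for the transformer: all tokens at `side` predicted at once,
    conditioned on every coarser token map in `history`."""
    return rng.integers(0, 4096, size=(side, side))

history, sides = [], [1, 2, 4, 8, 16]
for side in sides:
    history.append(model(history, side))  # coarse-to-fine, full-history cond.
final_tokens = history[-1]
print(final_tokens.shape)
```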
arXiv Detail & Related papers (2024-04-03T17:59:53Z)
- NP-DRAW: A Non-Parametric Structured Latent Variable Model for Image Generation [139.8037697822064]
We present a non-parametric structured latent variable model for image generation, called NP-DRAW.
It sequentially draws on a latent canvas in a part-by-part fashion and then decodes the image from the canvas.
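A minimal sketch of the part-by-part canvas loop the summary describes; the part sampler and decoder below are placeholders, not the paper's non-parametric model.

```python
import numpy as np

rng = np.random.default_rng(0)
H = W = 16
canvas = np.zeros((H, W))

def sample_part():
    """Placeholder: a random 4x4 part and a location to stamp it."""
    part = rng.standard_normal((4, 4))
    y, x = rng.integers(0, H - 4), rng.integers(0, W - 4)
    return part, y, x

for _ in range(8):                     # draw part-by-part onto the canvas
    part, y, x = sample_part()
    canvas[y:y + 4, x:x + 4] = part

def decode(c: np.ndarray) -> np.ndarray:
    """Placeholder decoder from latent canvas to image space."""
    return np.tanh(c)

image = decode(canvas)
print(image.shape)
```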
arXiv Detail & Related papers (2021-06-25T05:17:55Z)