TensorAR: Refinement is All You Need in Autoregressive Image Generation
- URL: http://arxiv.org/abs/2505.16324v1
- Date: Thu, 22 May 2025 07:27:25 GMT
- Title: TensorAR: Refinement is All You Need in Autoregressive Image Generation
- Authors: Cheng Cheng, Lin Song, Yicheng Xiao, Yuxin Chen, Xuchong Zhang, Hongbin Sun, Ying Shan
- Abstract summary: Autoregressive (AR) image generators offer a language-model-friendly approach to image generation by predicting discrete image tokens in a causal sequence. Unlike diffusion models, AR models lack a mechanism to refine previous predictions, limiting their generation quality. In this paper, we introduce a new AR paradigm that reformulates image generation from next-token prediction to next-tensor prediction.
- Score: 45.38495724606076
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Autoregressive (AR) image generators offer a language-model-friendly approach to image generation by predicting discrete image tokens in a causal sequence. However, unlike diffusion models, AR models lack a mechanism to refine previous predictions, limiting their generation quality. In this paper, we introduce TensorAR, a new AR paradigm that reformulates image generation from next-token prediction to next-tensor prediction. By generating overlapping windows of image patches (tensors) in a sliding fashion, TensorAR enables iterative refinement of previously generated content. To prevent information leakage during training, we propose a discrete tensor noising scheme, which perturbs input tokens via codebook-indexed noise. TensorAR is implemented as a plug-and-play module compatible with existing AR models. Extensive experiments on LlamaGEN, Open-MAGVIT2, and RAR demonstrate that TensorAR significantly improves the generation performance of autoregressive models.
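To make the abstract's two ideas concrete, here is a minimal sketch of a sliding next-tensor decoding loop and a discrete noising step, assuming a toy predict_window stand-in for the AR model; the window size, the overwrite-on-overlap rule, and all names are illustrative, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, K, L = 1024, 4, 16  # codebook size, tensor (window) size, token count

def predict_window(prefix: np.ndarray) -> np.ndarray:
    """Placeholder for the AR model: returns K token ids given the prefix."""
    return rng.integers(0, VOCAB, size=K)

def noise_tokens(tokens: np.ndarray, p: float = 0.3) -> np.ndarray:
    """Discrete tensor noising (training-time): replace tokens with random
    codebook indices with probability p, so ground-truth tokens inside a
    window are not leaked to the model through its own input."""
    mask = rng.random(tokens.shape) < p
    return np.where(mask, rng.integers(0, VOCAB, size=tokens.shape), tokens)

def generate(length: int = L) -> np.ndarray:
    canvas = np.full(length, -1, dtype=np.int64)  # -1 = not yet generated
    for t in range(length):
        window = predict_window(canvas[:t])        # predict tokens t..t+K-1
        end = min(t + K, length)
        # Overlapping windows: each position is re-predicted by up to K
        # windows, so later steps refine earlier, provisional tokens.
        canvas[t:end] = window[: end - t]
    return canvas

tokens = generate()
print(tokens)                 # final refined sequence
print(noise_tokens(tokens))   # what a noised training input could look like
```

Because each position falls inside up to K overlapping windows, it is re-predicted up to K times; that repeated overwrite is the refinement mechanism the abstract describes.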
Related papers
- Rethinking Discrete Tokens: Treating Them as Conditions for Continuous Autoregressive Image Synthesis [79.98107530577576]
DisCon is a novel framework that reinterprets discrete tokens as conditional signals rather than generation targets. DisCon achieves a gFID score of 1.38 on ImageNet 256×256 generation, outperforming state-of-the-art autoregressive approaches by a clear margin.
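A hedged sketch of what "tokens as conditions" could look like: discrete indices select an embedding that conditions a continuous AR head, instead of serving as the prediction targets. Shapes and names below are invented for exposition, not DisCon's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 16, 8  # number of patches, continuous latent dim

discrete_tokens = rng.integers(0, 1024, size=N)   # condition, not target
token_embed = rng.standard_normal((1024, D))

def predict_latent(prev: list, cond: np.ndarray) -> np.ndarray:
    """Stand-in for a continuous AR head: emits the next patch latent."""
    return cond + 0.1 * rng.standard_normal(D)

latents = []
for i in range(N):
    cond = token_embed[discrete_tokens[i]]        # discrete token as signal
    latents.append(predict_latent(latents, cond)) # continuous prediction
print(np.stack(latents).shape)
```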
arXiv Detail & Related papers (2025-07-02T14:33:52Z)
- AR-RAG: Autoregressive Retrieval Augmentation for Image Generation [35.008697736838194]
We introduce Autoregressive Retrieval Augmentation (AR-RAG), a novel paradigm that enhances image generation by autoregressively incorporating k-nearest neighbor retrievals at the patch level. We validate the effectiveness of AR-RAG on widely adopted benchmarks, including Midjourney-30K, GenEval and DPG-Bench.
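A minimal sketch of patch-level k-nearest-neighbor retrieval, the core mechanism the summary names; the datastore layout, cosine-similarity search, and fusion comment are assumptions for exposition, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N, K = 64, 10_000, 5  # embedding dim, datastore size, neighbors

datastore_keys = rng.standard_normal((N, D)).astype(np.float32)
datastore_patches = rng.integers(0, 1024, size=N)  # token id per stored patch

def retrieve(query: np.ndarray, k: int = K) -> np.ndarray:
    """Return token ids of the k nearest stored patches (cosine similarity)."""
    q = query / np.linalg.norm(query)
    keys = datastore_keys / np.linalg.norm(datastore_keys, axis=1, keepdims=True)
    sims = keys @ q
    return datastore_patches[np.argsort(-sims)[:k]]

# At each AR step, the retrieved neighbor tokens would be injected as extra
# conditioning (e.g., cross-attended) before predicting the next patch token.
context_embedding = rng.standard_normal(D).astype(np.float32)
print(retrieve(context_embedding))
```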
arXiv Detail & Related papers (2025-06-08T01:33:05Z)
- HMAR: Efficient Hierarchical Masked Auto-Regressive Image Generation [91.08481618973111]
Visual Auto-Regressive modeling (VAR) has shown promise in bridging the speed and quality gap between autoregressive image models and diffusion models. We introduce Hierarchical Masked Auto-Regressive modeling (HMAR) to generate high-quality images with fast sampling. HMAR reformulates next-scale prediction as a Markovian process, wherein the prediction of each resolution scale is conditioned only on tokens in its immediate predecessor.
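A toy rendering of the Markovian next-scale loop described above: each scale is predicted from its immediate predecessor only. predict_scale is a placeholder for the masked AR model, and the scale schedule is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
SCALES = [1, 2, 4, 8]  # token-map side lengths, coarse to fine

def upsample(tokens: np.ndarray, side: int) -> np.ndarray:
    """Nearest-neighbor upsample of a square token map to side x side."""
    reps = side // tokens.shape[0]
    return tokens.repeat(reps, axis=0).repeat(reps, axis=1)

def predict_scale(cond: np.ndarray) -> np.ndarray:
    """Placeholder model: predicts this scale's tokens from the previous one."""
    return rng.integers(0, 1024, size=cond.shape)

tokens = rng.integers(0, 1024, size=(1, 1))  # coarsest scale
for side in SCALES[1:]:
    cond = upsample(tokens, side)       # condition ONLY on the previous scale
    tokens = predict_scale(cond)        # Markov property: no deeper history
print(tokens.shape)
```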
arXiv Detail & Related papers (2025-06-04T20:08:07Z)
- Revealing the Implicit Noise-based Imprint of Generative Models [71.94916898756684]
This paper presents a novel framework that leverages the noise-based, model-specific imprint for the detection task. By aggregating imprints from various generative models, imprints of future models can be extrapolated to expand training data. Our approach achieves state-of-the-art performance across three public benchmarks including GenImage, Synthbuster and Chameleon.
arXiv Detail & Related papers (2025-03-12T12:04:53Z)
- Frequency Autoregressive Image Generation with Continuous Tokens [31.833852108014312]
We introduce the frequency progressive autoregressive (FAR) paradigm and instantiate FAR with a continuous tokenizer. We demonstrate the efficacy of FAR through comprehensive experiments on the ImageNet dataset.
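A toy decomposition illustrating the frequency-progressive idea, assuming an FFT-based band split (the paper's actual tokenizer and schedule may differ): coarse frequency bands come first and finer bands are added on top, so the bands together reconstruct the full image.

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.standard_normal((32, 32))
spec = np.fft.fftshift(np.fft.fft2(img))

def band_mask(side: int, radius: int) -> np.ndarray:
    """Disk mask selecting spatial frequencies within `radius` of DC."""
    yy, xx = np.mgrid[:side, :side]
    c = side // 2
    return np.hypot(yy - c, xx - c) <= radius

recon = np.zeros_like(img)
prev = np.zeros_like(spec)
for radius in (2, 4, 8, 16, 32):            # coarse-to-fine frequency stages
    cur = np.where(band_mask(32, radius), spec, 0)
    recon += np.fft.ifft2(np.fft.ifftshift(cur - prev)).real  # add new band
    prev = cur
print(np.allclose(recon, img))              # all bands together = full image
```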
arXiv Detail & Related papers (2025-03-07T10:34:04Z)
- Autoregressive Video Generation without Vector Quantization [90.87907377618747]
We reformulate the video generation problem as non-quantized autoregressive modeling of temporal frame-by-frame prediction. With the proposed approach, we train a novel video autoregressive model without vector quantization, termed NOVA. Our results demonstrate that NOVA surpasses prior autoregressive video models in data efficiency, inference speed, visual fidelity, and video fluency, even with a much smaller model capacity.
arXiv Detail & Related papers (2024-12-18T18:59:53Z)
- RandAR: Decoder-only Autoregressive Visual Generation in Random Orders [54.49937384788739]
RandAR is a decoder-only visual autoregressive (AR) model capable of generating images in arbitrary token orders. Our design enables random order by inserting a "position instruction token" before each image token to be predicted. RandAR supports inpainting, outpainting and resolution inference in a zero-shot manner.
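A short sketch of the interleaving described above: a position instruction token announces which spatial location is predicted next, which is what makes arbitrary orders possible. The vocabulary layout is an invented convention.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_POS, IMG_VOCAB = 16, 1024
POS_OFFSET = IMG_VOCAB  # position tokens live after image tokens in the vocab

order = rng.permutation(NUM_POS)               # arbitrary generation order
image_tokens = rng.integers(0, IMG_VOCAB, NUM_POS)

interleaved = []
for pos in order:
    interleaved.append(POS_OFFSET + pos)       # "predict location `pos` next"
    interleaved.append(int(image_tokens[pos])) # the image token for that spot
print(interleaved[:8])
```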
arXiv Detail & Related papers (2024-12-02T18:59:53Z)
- Randomized Autoregressive Visual Generation [26.195148077398223]
This paper presents Randomized AutoRegressive modeling (RAR) for visual generation.
RAR sets a new state-of-the-art performance on the image generation task while maintaining full compatibility with language modeling frameworks.
On the ImageNet-256 benchmark, RAR achieves an FID score of 1.48, not only surpassing prior state-of-the-art autoregressive image generators but also outperforming leading diffusion-based and masked transformer-based methods.
arXiv Detail & Related papers (2024-11-01T17:59:58Z)
- LANTERN: Accelerating Visual Autoregressive Models with Relaxed Speculative Decoding [30.630803933771865]
Experimental results demonstrate the efficacy of our method in providing a substantial speed-up over speculative decoding. LANTERN increases speed-ups by 1.75× and 1.82×, as compared to greedy decoding and random sampling.
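A minimal sketch of speculative decoding with a relaxed acceptance test, gesturing at the idea of accepting draft tokens that are interchangeable with high-probability ones; the neighbor sets and threshold are illustrative assumptions, not LANTERN's actual criterion.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 32

def accept(draft_tok, p_target, p_draft, neighbors, tau=0.9):
    """Standard test: accept with prob min(1, p_t/p_d). Relaxation: also
    accept if enough target mass sits on tokens 'interchangeable' with it."""
    if rng.random() < min(1.0, p_target[draft_tok] / p_draft[draft_tok]):
        return True
    return p_target[neighbors[draft_tok]].sum() >= tau  # relaxed criterion

p_draft = rng.dirichlet(np.ones(VOCAB))
p_target = rng.dirichlet(np.ones(VOCAB))
# Hypothetical neighbor sets: tokens decoding to visually similar patches.
neighbors = {t: np.array([t, (t + 1) % VOCAB]) for t in range(VOCAB)}
tok = rng.choice(VOCAB, p=p_draft)
print(accept(tok, p_target, p_draft, neighbors))
```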
arXiv Detail & Related papers (2024-10-04T12:21:03Z)
- Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction [33.57820997288788]
We present a new generation paradigm that redefines autoregressive learning on images as coarse-to-fine "next-scale prediction".
Visual AutoRegressive modeling makes GPT-like AR models surpass diffusion transformers in image generation.
We have released all models and code to promote the exploration of AR/token models for visual generation and unified learning.
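A toy next-scale prediction loop matching the coarse-to-fine description above; unlike the Markovian HMAR sketch earlier, each scale here conditions on all coarser scales. model is a placeholder for the transformer, and the scale schedule is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def model(history: list, side: int) -> np.ndarray:
    """Stand-in for the transformer: all tokens at `side` predicted at once,
    conditioned on every coarser token map in `history`."""
    return rng.integers(0, 4096, size=(side, side))

history, sides = [], [1, 2, 4, 8, 16]
for side in sides:
    history.append(model(history, side))  # coarse-to-fine, full-history cond.
final_tokens = history[-1]
print(final_tokens.shape)
```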
arXiv Detail & Related papers (2024-04-03T17:59:53Z)
- NP-DRAW: A Non-Parametric Structured Latent Variable Model for Image Generation [139.8037697822064]
We present a non-parametric structured latent variable model for image generation, called NP-DRAW.
It sequentially draws on a latent canvas in a part-by-part fashion and then decodes the image from the canvas.
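A minimal sketch of the part-by-part canvas loop the summary describes; the part sampler and decoder below are placeholders, not the paper's non-parametric model.

```python
import numpy as np

rng = np.random.default_rng(0)
H = W = 16
canvas = np.zeros((H, W))

def sample_part():
    """Placeholder: a random 4x4 part and a location to stamp it."""
    part = rng.standard_normal((4, 4))
    y, x = rng.integers(0, H - 4), rng.integers(0, W - 4)
    return part, y, x

for _ in range(8):                     # draw part-by-part onto the canvas
    part, y, x = sample_part()
    canvas[y:y + 4, x:x + 4] = part

def decode(c: np.ndarray) -> np.ndarray:
    """Placeholder decoder from latent canvas to image space."""
    return np.tanh(c)

image = decode(canvas)
print(image.shape)
```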
arXiv Detail & Related papers (2021-06-25T05:17:55Z)