RandAR: Decoder-only Autoregressive Visual Generation in Random Orders
- URL: http://arxiv.org/abs/2412.01827v1
- Date: Mon, 02 Dec 2024 18:59:53 GMT
- Title: RandAR: Decoder-only Autoregressive Visual Generation in Random Orders
- Authors: Ziqi Pang, Tianyuan Zhang, Fujun Luan, Yunze Man, Hao Tan, Kai Zhang, William T. Freeman, Yu-Xiong Wang
- Abstract summary: RandAR is a decoder-only visual autoregressive (AR) model capable of generating images in arbitrary token orders.
Our design enables random order by inserting a "position instruction token" before each image token to be predicted.
RandAR supports inpainting, outpainting and resolution extrapolation in a zero-shot manner.
- Score: 54.49937384788739
- Abstract: We introduce RandAR, a decoder-only visual autoregressive (AR) model capable of generating images in arbitrary token orders. Unlike previous decoder-only AR models that rely on a predefined generation order, RandAR removes this inductive bias, unlocking new capabilities in decoder-only generation. Our essential design enables random order by inserting a "position instruction token" before each image token to be predicted, representing the spatial location of the next image token. Trained on randomly permuted token sequences -- a more challenging task than fixed-order generation -- RandAR achieves performance comparable to its conventional raster-order counterpart. More importantly, decoder-only transformers trained on random orders acquire new capabilities. To address the efficiency bottleneck of AR models, RandAR adopts parallel decoding with KV-Cache at inference time, enjoying a 2.5x acceleration without sacrificing generation quality. Additionally, RandAR supports inpainting, outpainting and resolution extrapolation in a zero-shot manner. We hope RandAR inspires new directions for decoder-only visual generation models and broadens their applications across diverse scenarios. Our project page is at https://rand-ar.github.io/.
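The position-instruction mechanism is concrete enough to sketch. Below is a minimal, hypothetical illustration (not the authors' code) of how a training sequence might be assembled: a position token announcing the next spatial location is interleaved before each image token of a randomly permuted sequence. The codebook size V, grid size N, and the POS_BASE id layout are assumptions.

```python
import torch

V = 16384            # image-token codebook size (assumed)
N = 256              # tokens in a 16x16 latent grid (assumed)
POS_BASE = V         # position-instruction ids placed after image ids (assumed layout)

def build_random_order_sequence(image_tokens: torch.Tensor) -> torch.Tensor:
    """image_tokens: (N,) raster-order ids -> (2N,) interleaved training sequence."""
    order = torch.randperm(image_tokens.numel())   # random generation order
    seq = torch.empty(2 * image_tokens.numel(), dtype=torch.long)
    seq[0::2] = POS_BASE + order                   # "the next token sits at location p"
    seq[1::2] = image_tokens[order]                # the image token at location p
    return seq

tokens = torch.randint(0, V, (N,))
seq = build_random_order_sequence(tokens)
# Training would use standard next-token cross-entropy, with the loss applied
# only at the image-token slots (odd indices); position tokens are given inputs.
```

Zero-shot inpainting then falls out naturally: known tokens are placed first in the order, and the model fills the remaining locations.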
Related papers
- Distilled Decoding 1: One-step Sampling of Image Auto-regressive Models with Flow Matching [12.985270202599814]
Autoregressive (AR) models have achieved state-of-the-art performance in text and image generation but suffer from slow generation due to the token-by-token process.
We ask an ambitious question: can a pre-trained AR model be adapted to generate outputs in just one or two steps?
We propose Distilled Decoding (DD), which uses flow matching to create a deterministic mapping from Gaussian distribution to the output distribution of the pre-trained AR model.
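A hedged sketch of the one-step idea described above: flow matching distills the AR model into a deterministic map from Gaussian noise to its output distribution, so a full sequence comes from a single forward pass. `distilled_map` is a hypothetical stand-in for the learned network.

```python
import torch

def one_step_sample(distilled_map, num_tokens=256, dim=64):
    z = torch.randn(num_tokens, dim)   # draw from the Gaussian source distribution
    return distilled_map(z)            # deterministic map to the AR model's outputs

# toy usage with an untrained stand-in network:
out = one_step_sample(torch.nn.Linear(64, 64))
```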
arXiv Detail & Related papers (2024-12-22T20:21:54Z)
- DiCoDe: Diffusion-Compressed Deep Tokens for Autoregressive Video Generation with Language Models [72.24305287508474]
We introduce DiCoDe, a novel approach to generate videos with a language model in an autoregressive manner.
By treating videos as temporal sequences, DiCoDe fully harnesses the capabilities of language models for autoregressive generation.
We evaluate DiCoDe both quantitatively and qualitatively, demonstrating that it performs comparably to existing methods in terms of quality.
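A minimal sketch, under assumptions, of the pipeline summarized above: a video is compressed into a short temporal sequence of "deep tokens" and a language model extends that sequence autoregressively; a diffusion decoder (not shown) would map tokens back to frames. `lm_step` is a hypothetical stand-in.

```python
def generate_deep_tokens(lm_step, prompt_tokens, horizon=16):
    tokens = list(prompt_tokens)
    for _ in range(horizon):
        tokens.append(lm_step(tokens))   # predict the next deep token from all prior ones
    return tokens

# toy usage with a dummy "language model" that just counts:
clip_tokens = generate_deep_tokens(lambda t: len(t), [0, 1])
```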
arXiv Detail & Related papers (2024-12-05T18:57:06Z)
- ZipAR: Accelerating Auto-regressive Image Generation through Spatial Locality [19.486745219466666]
ZipAR is a training-free, plug-and-play parallel decoding framework for auto-regressive (AR) visual generation.
ZipAR can reduce the number of model forward passes by up to 91% on the Emu3-Gen model without requiring any additional retraining.
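A sketch of the spatial-locality idea, under assumptions rather than ZipAR's actual schedule: once a token's nearby left context within a window is ready, it can be decoded, so several rows advance in parallel like a wavefront instead of strict raster order. The window size here is illustrative.

```python
H, W, window = 16, 16, 4   # latent grid and an assumed locality window

def wavefront_schedule(H, W, window):
    """Yield lists of (row, col) positions that can be decoded in the same step."""
    step = 0
    while True:
        batch = [(r, step - r * window) for r in range(H)
                 if 0 <= step - r * window < W]
        if not batch:
            return
        yield batch
        step += 1

steps = list(wavefront_schedule(H, W, window))
print(len(steps), "parallel steps instead of", H * W)   # 76 vs. 256 here
```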
arXiv Detail & Related papers (2024-12-05T10:57:08Z)
- Collaborative Decoding Makes Visual Auto-Regressive Modeling Efficient [52.96232442322824]
Collaborative Decoding (CoDe) is a novel efficient decoding strategy tailored for the Visual Auto-Regressive (VAR) framework.
CoDe capitalizes on two critical observations: the substantially reduced parameter demands at larger scales and the exclusive generation patterns across different scales.
CoDe achieves a 1.7x speedup, slashes memory usage by around 50%, and preserves image quality with only a negligible FID increase from 1.95 to 1.98.
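A hedged sketch of the collaborative split described above: a large model drafts the early, coarse scales (where parameter count matters most) and a small model handles the remaining fine scales. The callables and the draft cutoff are illustrative assumptions.

```python
def collaborative_decode(large_step, small_step, num_scales=10, draft=6):
    """large_step/small_step: callables (scale_idx, tokens_so_far) -> scale tokens."""
    seq = []
    for k in range(num_scales):
        step = large_step if k < draft else small_step   # big model only on coarse scales
        seq.append(step(k, seq))
    return seq

# toy usage with dummy models:
out = collaborative_decode(lambda k, s: f"L{k}", lambda k, s: f"S{k}")
```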
arXiv Detail & Related papers (2024-11-26T15:13:15Z)
- LANTERN: Accelerating Visual Autoregressive Models with Relaxed Speculative Decoding [30.630803933771865]
Experimental results demonstrate the efficacy of our method in providing a substantial speed-up over speculative decoding.
LANTERN increases speed-ups by 1.75x and 1.82x, as compared to greedy decoding and random sampling, respectively.
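A minimal sketch, under assumptions, of relaxed speculative decoding: a cheap drafter proposes tokens, the target model verifies them, and the acceptance test is relaxed to accept drafts the target merely considers close enough (here, within its top-k), trading a little fidelity for longer accepted runs. This top-k test is illustrative, not LANTERN's exact criterion.

```python
import torch

def relaxed_verify(target_logits, draft_tokens, k=5):
    """target_logits: (T, V); draft_tokens: (T,). Returns accepted prefix length."""
    topk = target_logits.topk(k, dim=-1).indices          # (T, k) candidate ids
    ok = (topk == draft_tokens.unsqueeze(-1)).any(-1)     # draft in target's top-k?
    return int(ok.long().cumprod(0).sum())                # longest all-accepted prefix

accepted = relaxed_verify(torch.randn(8, 1024), torch.randint(0, 1024, (8,)))
```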
arXiv Detail & Related papers (2024-10-04T12:21:03Z)
- Open-MAGVIT2: An Open-Source Project Toward Democratizing Auto-regressive Visual Generation [74.15447383432262]
The Open-MAGVIT2 project produces an open-source replication of Google's MAGVIT-v2 tokenizer.
We provide a tokenizer pre-trained on large-scale data, significantly outperforming Cosmos on zero-shot benchmarks.
We produce a family of auto-regressive image generation models ranging from 300M to 1.5B.
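A hedged sketch of lookup-free quantization (LFQ), the core of the MAGVIT-v2-style tokenizer replicated here: each latent channel is binarized by sign and the bits are read off as an integer token id, so no codebook lookup is needed. Simplified, and not the project's actual code.

```python
import torch

def lfq(z: torch.Tensor) -> torch.Tensor:
    """z: (..., d) real latents -> (...,) integer token ids in [0, 2**d)."""
    bits = (z > 0).long()                        # binarize each channel by sign
    weights = 2 ** torch.arange(z.shape[-1])     # binary place values per channel
    return (bits * weights).sum(-1)

ids = lfq(torch.randn(4, 18))   # 18 channels -> a 262,144-way vocabulary
```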
arXiv Detail & Related papers (2024-09-06T17:14:53Z)
- Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction [33.57820997288788]
We present a new generation paradigm that redefines autoregressive learning on images as coarse-to-fine "next-scale prediction".
Visual AutoRegressive modeling makes GPT-like AR models surpass diffusion transformers in image generation.
We have released all models and codes to promote the exploration of AR/token models for visual generation and unified learning.
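A minimal sketch, assuming the coarse-to-fine loop described above: the model autoregresses over scales rather than tokens, predicting the whole token map at each resolution conditioned on all coarser maps. `predict_scale` is a hypothetical stand-in for the transformer.

```python
def next_scale_generate(predict_scale, scales=(1, 2, 4, 8, 16)):
    """predict_scale: callable (s, coarser_maps) -> an s x s token map."""
    maps = []
    for s in scales:                       # coarse-to-fine, one scale per AR step
        maps.append(predict_scale(s, maps))
    return maps[-1]                        # the finest map decodes to the image

# toy usage with a dummy predictor:
final = next_scale_generate(lambda s, m: [[0] * s for _ in range(s)])
```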
arXiv Detail & Related papers (2024-04-03T17:59:53Z)
- Paraformer: Fast and Accurate Parallel Transformer for Non-autoregressive End-to-End Speech Recognition [62.83832841523525]
We propose a fast and accurate parallel transformer, termed Paraformer.
It accurately predicts the number of output tokens and extracts hidden variables.
It can attain comparable performance to the state-of-the-art AR transformer, with more than 10x speedup.
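A hedged sketch of the non-autoregressive recipe summarized above: per-frame weights predict how many output tokens there are, acoustic frames are pooled into that many token-level hidden variables, and all tokens are then emitted in one parallel pass. This is a loose, simplified integrate-and-fire-style pooling, not Paraformer's exact predictor.

```python
import torch

def cif_aggregate(frames, alpha, beta=1.0):
    """frames: (T, d); alpha: (T,) per-frame weights. Fires one token-level
    hidden variable each time the accumulated weight crosses beta, so the
    token count is predicted by sum(alpha)."""
    tokens, acc = [], 0.0
    buf = torch.zeros(frames.shape[1])
    for f, a in zip(frames, alpha):
        acc += float(a)
        buf = buf + float(a) * f
        if acc >= beta:                  # enough evidence accumulated: fire a token
            tokens.append(buf / acc)
            acc = 0.0
            buf = torch.zeros(frames.shape[1])
    return torch.stack(tokens) if tokens else buf.unsqueeze(0)

hidden = cif_aggregate(torch.randn(50, 8), torch.full((50,), 0.2))  # ~10 tokens
```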
arXiv Detail & Related papers (2022-06-16T17:24:14Z)
- NP-DRAW: A Non-Parametric Structured Latent Variable Model for Image Generation [139.8037697822064]
We present a non-parametric structured latent variable model for image generation, called NP-DRAW.
It sequentially draws on a latent canvas in a part-by-part fashion and then decodes the image from the canvas.
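A minimal sketch, under assumptions, of the part-by-part canvas idea: at each step the model picks a part and a location, stamps it onto a latent canvas, and a decoder (not shown) maps the finished canvas to an image. `sample_part` is a hypothetical stand-in for the learned prior.

```python
import numpy as np

def draw_on_canvas(sample_part, steps=8, size=32, patch=8):
    canvas = np.zeros((size, size))
    for _ in range(steps):
        part, (y, x) = sample_part(canvas)        # decide what to draw and where
        canvas[y:y + patch, x:x + patch] += part  # stamp the part onto the canvas
    return canvas                                 # a decoder would render this to pixels

rng = np.random.default_rng(0)
toy_prior = lambda c: (rng.random((8, 8)), tuple(rng.integers(0, 24, size=2)))
canvas = draw_on_canvas(toy_prior)
```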
arXiv Detail & Related papers (2021-06-25T05:17:55Z)
- Match What Matters: Generative Implicit Feature Replay for Continual Learning [0.0]
We propose GenIFeR (Generative Implicit Feature Replay) for class-incremental learning.
The main idea is to train a generative adversarial network (GAN) to generate images that contain realistic features.
We empirically show that GenIFeR is superior to both conventional generative image and feature replay.
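A hedged sketch of the replay loop implied above: a GAN generator stands in for past data, and the classifier trains on real new-task batches mixed with generated old-class samples so earlier classes are not forgotten. All names are illustrative assumptions, not GenIFeR's API.

```python
def train_increment(classifier, generator, new_data, loss_fn, opt):
    """Mix real new-task batches with GAN-generated replay of old classes."""
    for x_new, y_new in new_data:
        x_old, y_old = generator.sample(len(x_new))   # replayed pseudo-examples
        loss = loss_fn(classifier(x_new), y_new) \
             + loss_fn(classifier(x_old), y_old)      # rehearse old classes
        opt.zero_grad()
        loss.backward()
        opt.step()
```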
arXiv Detail & Related papers (2021-06-09T19:29:41Z)