Related papers: M-VAR: Decoupled Scale-wise Autoregressive Modeling for High-Quality Image Generation

URL: http://arxiv.org/abs/2411.10433v1
Date: Fri, 15 Nov 2024 18:54:42 GMT
Title: M-VAR: Decoupled Scale-wise Autoregressive Modeling for High-Quality Image Generation
Authors: Sucheng Ren, Yaodong Yu, Nataniel Ruiz, Feng Wang, Alan Yuille, Cihang Xie
Abstract summary: We show that this scale-wise autoregressive framework can be effectively decoupled into intra-scale modeling and inter-scale modeling. For inter-scale modeling, we apply linear-complexity mechanisms like Mamba to substantially reduce computational overhead. Experiments demonstrate that our method outperforms existing models in both image quality and generation speed.
Score: 39.97174784206976
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: There exists recent work in computer vision, named VAR, that proposes a new autoregressive paradigm for image generation. Diverging from vanilla next-token prediction, VAR structurally reformulates image generation into a coarse-to-fine next-scale prediction. In this paper, we show that this scale-wise autoregressive framework can be effectively decoupled into intra-scale modeling, which captures local spatial dependencies within each scale, and inter-scale modeling, which models cross-scale relationships progressively from coarse to fine scales. This decoupled structure allows us to rebuild VAR in a more computationally efficient manner. Specifically, for intra-scale modeling, which is crucial for generating high-fidelity images, we retain the original bidirectional self-attention design to ensure comprehensive modeling; for inter-scale modeling, which semantically connects different scales but is computationally intensive, we apply linear-complexity mechanisms like Mamba to substantially reduce computational overhead. We term this new framework M-VAR. Extensive experiments demonstrate that our method outperforms existing models in both image quality and generation speed. For example, our 1.5B model, with fewer parameters and faster inference speed, outperforms the largest VAR-d30-2B. Moreover, our largest model M-VAR-d32 impressively registers 1.78 FID on ImageNet 256×256 and outperforms the prior-art autoregressive models LlamaGen/VAR by 0.4/0.19 and popular diffusion models LDM/DiT by 1.82/0.49, respectively. Code is available at https://github.com/OliverRensu/MVAR.
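
As a rough illustration of why the decoupling described in the abstract can reduce compute, the sketch below counts query-key pairs for attention over all scales versus attention restricted to each scale. It is a back-of-the-envelope count, not the authors' implementation. The scale schedule (side lengths 1 through 16) is an assumption chosen for illustration and is not stated in the abstract, all function names are hypothetical, and the count ignores constants, hidden size, and the real cost of a Mamba layer.

```python
from itertools import accumulate

# Assumed side lengths of the token maps, coarse to fine (illustrative only;
# this schedule is not given in the abstract).
SCALES = (1, 2, 3, 4, 5, 6, 8, 10, 13, 16)


def tokens_per_scale(scales):
    """Number of tokens in each square token map."""
    return [s * s for s in scales]


def block_causal_pairs(counts):
    """Query-key pairs when every scale attends to itself and all coarser scales."""
    return sum(n * seen for n, seen in zip(counts, accumulate(counts)))


def intra_scale_pairs(counts):
    """Query-key pairs when attention is restricted to tokens of the same scale."""
    return sum(n * n for n in counts)


def linear_inter_scale_steps(counts):
    """Steps for a linear-complexity pass over the whole multi-scale sequence."""
    return sum(counts)


if __name__ == "__main__":
    counts = tokens_per_scale(SCALES)
    print("tokens in total:        ", sum(counts))
    print("all-scale attention:    ", block_causal_pairs(counts))
    print("intra-scale attention:  ", intra_scale_pairs(counts))
    print("linear inter-scale pass:", linear_inter_scale_steps(counts))
```

Under this assumed schedule, the cross-scale pairs account for most of the attention cost, which is consistent with the abstract's description of inter-scale modeling as the computationally intensive part.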

Related papers

FlexVAR: Flexible Visual Autoregressive Modeling without Residual Prediction [91.09318592542509]
This work challenges the residual prediction paradigm in visual autoregressive modeling. It presents a new Flexible Visual AutoRegressive image generation paradigm. This simple, intuitive approach swiftly learns visual distributions and makes the generation process more flexible and adaptable.
arXiv Detail & Related papers (2025-02-27T17:39:17Z)
Collaborative Decoding Makes Visual Auto-Regressive Modeling Efficient [52.96232442322824]
Collaborative Decoding (CoDe) is a novel efficient decoding strategy tailored for the Visual Auto-Regressive (VAR) framework. CoDe capitalizes on two critical observations: the substantially reduced parameter demands at larger scales and the exclusive generation patterns across different scales. CoDe achieves a 1.7x speedup, slashes memory usage by around 50%, and preserves image quality with only a negligible FID increase from 1.95 to 1.98.
arXiv Detail & Related papers (2024-11-26T15:13:15Z)
Randomized Autoregressive Visual Generation [26.195148077398223]
This paper presents Randomized AutoRegressive modeling (RAR) for visual generation. RAR sets a new state-of-the-art performance on the image generation task while maintaining full compatibility with language modeling frameworks. On the ImageNet-256 benchmark, RAR achieves an FID score of 1.48, not only surpassing prior state-of-the-art autoregressive image generators but also outperforming leading diffusion-based and masked transformer-based methods.
arXiv Detail & Related papers (2024-11-01T17:59:58Z)
Stabilize the Latent Space for Image Autoregressive Modeling: A Unified Perspective [52.778766190479374]
Latent-based image generative models have achieved notable success in image generation tasks. Despite sharing the same latent space, autoregressive models significantly lag behind LDMs and MIMs in image generation. We propose a simple but effective discrete image tokenizer to stabilize the latent space for image generative modeling.
arXiv Detail & Related papers (2024-10-16T12:13:17Z)
Meissonic: Revitalizing Masked Generative Transformers for Efficient High-Resolution Text-to-Image Synthesis [62.06970466554273]
We present Meissonic, which elevates non-autoregressive masked image modeling (MIM) text-to-image to a level comparable with state-of-the-art diffusion models like SDXL. We leverage high-quality training data, integrate micro-conditions informed by human preference scores, and employ feature compression layers to further enhance image fidelity and resolution. Our model not only matches but often exceeds the performance of existing models like SDXL in generating high-quality, high-resolution images.
arXiv Detail & Related papers (2024-10-10T17:59:17Z)
Scalable Autoregressive Image Generation with Mamba [23.027439743155192]
We introduce AiM, an autoregressive (AR) image generative model based on the Mamba architecture. Mamba is a novel state-space model characterized by its exceptional performance for long-sequence modeling with linear time complexity. We provide AiM models in various scales, with parameter counts ranging from 148M to 1.3B.
arXiv Detail & Related papers (2024-08-22T09:27:49Z)
Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation [52.509092010267665]
We introduce LlamaGen, a new family of image generation models that apply the original "next-token prediction" paradigm of large language models to the visual generation domain. It is an affirmative answer to whether vanilla autoregressive models, e.g., Llama, without inductive biases on visual signals can achieve state-of-the-art image generation performance if scaled properly.
arXiv Detail & Related papers (2024-06-10T17:59:52Z)
A-SDM: Accelerating Stable Diffusion through Redundancy Removal and Performance Optimization [54.113083217869516]
In this work, we first explore the computationally redundant parts of the network. We then prune the redundant blocks of the model while maintaining the network performance. Thirdly, we propose a global-regional interactive (GRI) attention to speed up the computationally intensive attention part.
arXiv Detail & Related papers (2023-12-24T15:37:47Z)
Normalizing Flows with Multi-Scale Autoregressive Priors [131.895570212956]
We introduce channel-wise dependencies in the latent space of flow-based models through multi-scale autoregressive priors (mAR). Our mAR prior for models with split coupling flow layers (mAR-SCF) can better capture dependencies in complex multimodal data. We show that mAR-SCF allows for improved image generation quality, with gains in FID and Inception scores compared to state-of-the-art flow-based models.
arXiv Detail & Related papers (2020-04-08T09:07:11Z)

This list is automatically generated from the titles and abstracts of the papers on this site.
