HMAR: Efficient Hierarchical Masked Auto-Regressive Image Generation
- URL: http://arxiv.org/abs/2506.04421v1
- Date: Wed, 04 Jun 2025 20:08:07 GMT
- Title: HMAR: Efficient Hierarchical Masked Auto-Regressive Image Generation
- Authors: Hermann Kumbong, Xian Liu, Tsung-Yi Lin, Ming-Yu Liu, Xihui Liu, Ziwei Liu, Daniel Y. Fu, Christopher Ré, David W. Romero
- Abstract summary: Visual Auto-Regressive modeling (VAR) has shown promise in bridging the speed and quality gap between autoregressive image models and diffusion models. We introduce Hierarchical Masked Auto-Regressive modeling (HMAR) to generate high-quality images with fast sampling. HMAR reformulates next-scale prediction as a Markovian process, wherein the prediction of each resolution scale is conditioned only on tokens in its immediate predecessor.
- Score: 91.08481618973111
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visual Auto-Regressive modeling (VAR) has shown promise in bridging the speed and quality gap between autoregressive image models and diffusion models. VAR reformulates autoregressive modeling by decomposing an image into successive resolution scales. During inference, an image is generated by predicting all the tokens in the next (higher-resolution) scale, conditioned on all tokens in all previous (lower-resolution) scales. However, this formulation suffers from reduced image quality due to the parallel generation of all tokens in a resolution scale; has sequence lengths scaling superlinearly in image resolution; and requires retraining to change the sampling schedule. We introduce Hierarchical Masked Auto-Regressive modeling (HMAR), a new image generation algorithm that alleviates these issues using next-scale prediction and masked prediction to generate high-quality images with fast sampling. HMAR reformulates next-scale prediction as a Markovian process, wherein the prediction of each resolution scale is conditioned only on tokens in its immediate predecessor instead of the tokens in all predecessor resolutions. When predicting a resolution scale, HMAR uses a controllable multi-step masked generation procedure to generate a subset of the tokens in each step. On ImageNet 256x256 and 512x512 benchmarks, HMAR models match or outperform parameter-matched VAR, diffusion, and autoregressive baselines. We develop efficient IO-aware block-sparse attention kernels that allow HMAR to achieve faster training and inference times over VAR by over 2.5x and 1.75x respectively, as well as over 3x lower inference memory footprint. Finally, HMAR yields additional flexibility over VAR; its sampling schedule can be changed without further training, and it can be applied to image editing tasks in a zero-shot manner.
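The abstract describes two mechanisms that work together: a Markovian next-scale factorization, in which each resolution scale is predicted from its immediate predecessor only, and a multi-step masked refinement within each scale. The sketch below is a minimal, illustrative reading of that sampling loop, not the authors' implementation; the scale schedule, step count, codebook size, and the confidence-based (MaskGIT-style) token selection used for re-masking are assumptions made for the purpose of the example.

```python
import numpy as np

# Hypothetical resolution schedule: side lengths of the token map at each scale
# (VAR-style, final scale 16x16 tokens for a 256x256 image). The exact schedule,
# step count, and codebook size are illustrative assumptions, not the paper's values.
SCALES = [1, 2, 4, 8, 16]
STEPS_PER_SCALE = 4     # masked refinement steps within a scale; a tunable inference-time knob
VOCAB = 4096            # assumed VQ codebook size


def predict_logits(prev_scale_tokens, current_tokens, mask, scale_idx):
    """Stand-in for the HMAR transformer.

    Markov property: the prediction for scale k is conditioned only on the token
    map of scale k-1 (plus the already-committed tokens of scale k), never on
    earlier scales. Here we just return random logits of the right shape.
    """
    side = SCALES[scale_idx]
    return np.random.randn(side * side, VOCAB)


def sample_hmar(rng=np.random.default_rng(0)):
    prev = np.zeros((1, 1), dtype=np.int64)        # class/start context standing in for "scale 0"
    for k, side in enumerate(SCALES):
        n = side * side
        tokens = np.full(n, -1, dtype=np.int64)    # -1 marks a still-masked position
        mask = np.ones(n, dtype=bool)
        for step in range(STEPS_PER_SCALE):
            logits = predict_logits(prev, tokens, mask, k)
            probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
            probs /= probs.sum(axis=-1, keepdims=True)
            sampled = np.array([rng.choice(VOCAB, p=p) for p in probs])
            conf = probs[np.arange(n), sampled]
            # Cosine schedule (MaskGIT-style assumption): fraction of the scale
            # still masked after this step; by the last step everything is committed.
            frac_masked = np.cos(np.pi / 2 * (step + 1) / STEPS_PER_SCALE)
            target_fixed = n - int(np.floor(frac_masked * n))
            n_new = min(int(mask.sum()), max(1, target_fixed - int((~mask).sum())))
            # Commit the n_new most confident still-masked positions; the rest stay masked.
            newly_fixed = np.argsort(-np.where(mask, conf, -np.inf))[:n_new]
            tokens[newly_fixed] = sampled[newly_fixed]
            mask[newly_fixed] = False
            if not mask.any():
                break
        prev = tokens.reshape(side, side)           # only the finished scale is carried forward
    return prev                                     # final token map, decoded to pixels by the VQ decoder


if __name__ == "__main__":
    print(sample_hmar().shape)   # (16, 16)
```

Because the number of refinement steps per scale is only an inference-time loop bound in a sketch like this, the sampling schedule can be varied without retraining, matching the flexibility the abstract claims for HMAR; the Markov structure (each scale attending only to itself and its predecessor) is also what makes a block-sparse attention pattern, and hence the IO-aware kernels mentioned above, possible.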
Related papers
- AliTok: Towards Sequence Modeling Alignment between Tokenizer and Autoregressive Model [59.065471969232284]
We propose a novel Aligned Tokenizer (AliTok) to align the tokenizer and autoregressive model. On the ImageNet-256 benchmark, using a standard decoder-only autoregressive model as the generator, AliTok achieves a gFID score of 1.50 and an IS of 305.9. When the parameter count is increased to 662M, AliTok achieves a gFID score of 1.35, surpassing the state-of-the-art diffusion method with 10x faster sampling speed.
arXiv Detail & Related papers (2025-06-05T17:45:10Z) - Hierarchical Masked Autoregressive Models with Low-Resolution Token Pivots [103.48424042986271]
We introduce a new autoregressive design to model a hierarchy from a few low-resolution image tokens to the typical dense image tokens. We present Hierarchical Masked Autoregressive models (Hi-MAR) that pivot on low-resolution image tokens to trigger hierarchical autoregressive modeling.
arXiv Detail & Related papers (2025-05-26T17:59:07Z) - Token-Shuffle: Towards High-Resolution Image Generation with Autoregressive Models [92.18057318458528]
Token-Shuffle is a novel method that reduces the number of image tokens in the Transformer. Our strategy requires no additional pretrained text encoder and enables MLLMs to support extremely high-resolution image synthesis. On the GenAI benchmark, our 2.7B model achieves a 0.77 overall score on hard prompts, outperforming the AR model LlamaGen by 0.18 and the diffusion model LDM by 0.15.
arXiv Detail & Related papers (2025-04-24T17:59:56Z) - GigaTok: Scaling Visual Tokenizers to 3 Billion Parameters for Autoregressive Image Generation [62.77721499671665]
We introduce GigaTok, the first approach to improve image reconstruction, generation, and representation learning when scaling visual tokenizers. We identify the growing complexity of latent space as the key factor behind the reconstruction vs. generation dilemma. By scaling to 3 billion parameters, GigaTok achieves state-of-the-art performance in reconstruction, downstream AR generation, and downstream AR representation quality.
arXiv Detail & Related papers (2025-04-11T17:59:58Z) - Frequency Autoregressive Image Generation with Continuous Tokens [31.833852108014312]
We introduce the frequency progressive autoregressive (FAR) paradigm and instantiate FAR with the continuous tokenizer. We demonstrate the efficacy of FAR through comprehensive experiments on the ImageNet dataset.
arXiv Detail & Related papers (2025-03-07T10:34:04Z) - FlowAR: Scale-wise Autoregressive Image Generation Meets Flow Matching [34.112157859384645]
We introduce FlowAR, a next-scale prediction method featuring a streamlined scale design. This eliminates the need for VAR's intricate multi-scale residual tokenizer. We validate the effectiveness of FlowAR on the challenging ImageNet-256 benchmark.
arXiv Detail & Related papers (2024-12-19T18:59:31Z) - MMAR: Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling [64.09238330331195]
We propose a novel Multi-Modal Auto-Regressive (MMAR) probabilistic modeling framework. Unlike discretization-based methods, MMAR takes in continuous-valued image tokens to avoid information loss in an efficient way. We also propose a theoretically proven technique that addresses the numerical stability issue and a training strategy that balances the generation and understanding task goals.
arXiv Detail & Related papers (2024-10-14T17:57:18Z) - Meissonic: Revitalizing Masked Generative Transformers for Efficient High-Resolution Text-to-Image Synthesis [62.57727062920458]
We present Meissonic, which elevates non-autoregressive masked image modeling (MIM) text-to-image synthesis to a level comparable with state-of-the-art diffusion models like SDXL. We leverage high-quality training data, integrate micro-conditions informed by human preference scores, and employ feature compression layers. Our model not only matches but often exceeds the performance of existing models like SDXL in generating high-quality, high-resolution images.
arXiv Detail & Related papers (2024-10-10T17:59:17Z) - Progressive Text-to-Image Generation [40.09326229583334]
We present a progressive model for high-fidelity text-to-image generation.
The proposed method works by creating new image tokens from coarse to fine based on the existing context.
The resulting coarse-to-fine hierarchy makes the image generation process intuitive and interpretable.
arXiv Detail & Related papers (2022-10-05T14:27:20Z)