DC-AR: Efficient Masked Autoregressive Image Generation with Deep Compression Hybrid Tokenizer
- URL: http://arxiv.org/abs/2507.04947v1
- Date: Mon, 07 Jul 2025 12:45:23 GMT
- Title: DC-AR: Efficient Masked Autoregressive Image Generation with Deep Compression Hybrid Tokenizer
- Authors: Yecheng Wu, Junyu Chen, Zhuoyang Zhang, Enze Xie, Jincheng Yu, Junsong Chen, Jinyi Hu, Yao Lu, Song Han, Han Cai
- Abstract summary: DC-AR is a novel masked autoregressive (AR) text-to-image generation framework. It delivers superior image generation quality with exceptional computational efficiency.
- Score: 32.64616770377737
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce DC-AR, a novel masked autoregressive (AR) text-to-image generation framework that delivers superior image generation quality with exceptional computational efficiency. Due to the tokenizers' limitations, prior masked AR models have lagged behind diffusion models in terms of quality or efficiency. We overcome this limitation by introducing DC-HT - a deep compression hybrid tokenizer for AR models that achieves a 32x spatial compression ratio while maintaining high reconstruction fidelity and cross-resolution generalization ability. Building upon DC-HT, we extend MaskGIT and create a new hybrid masked autoregressive image generation framework that first produces the structural elements through discrete tokens and then applies refinements via residual tokens. DC-AR achieves state-of-the-art results with a gFID of 5.49 on MJHQ-30K and an overall score of 0.69 on GenEval, while offering 1.5-7.9x higher throughput and 2.0-3.5x lower latency compared to prior leading diffusion and autoregressive models.
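To give a sense of why the 32x spatial compression ratio matters for efficiency, the sketch below counts latent tokens for a square image. It assumes that "32x" means each spatial side is downsampled by a factor of 32. The function name `latent_tokens`, the 512-pixel image size, and the 8x and 16x comparison ratios are illustrative choices, not values from the abstract.

```python
def latent_tokens(image_size: int, compression: int) -> int:
    """Count latent tokens for a square image.

    Assumption: `compression` is the per-side downsampling factor,
    so an image of side S maps to an (S / compression)^2 token grid.
    """
    if image_size % compression != 0:
        raise ValueError("image_size must be divisible by compression")
    side = image_size // compression
    return side * side


if __name__ == "__main__":
    # Illustrative comparison at 512x512; only the 32x ratio comes from the abstract.
    for ratio in (8, 16, 32):
        print(f"{ratio}x -> {latent_tokens(512, ratio)} tokens")
```

Under this assumption, going from 16x to 32x compression cuts the number of tokens the generator has to predict by a factor of four at a fixed resolution.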
Related papers
- LAFR: Efficient Diffusion-based Blind Face Restoration via Latent Codebook Alignment Adapter [52.93785843453579]
Blind face restoration from low-quality (LQ) images is a challenging task that requires high-fidelity image reconstruction and the preservation of facial identity. We propose LAFR, a novel codebook-based latent space adapter that aligns the latent distribution of LQ images with that of HQ counterparts. We show that lightweight finetuning of diffusion prior on just 0.9% of FFHQ dataset is sufficient to achieve results comparable to state-of-the-art methods.
arXiv Detail & Related papers (2025-05-29T14:11:16Z) - DetailFlow: 1D Coarse-to-Fine Autoregressive Image Generation via Next-Detail Prediction [47.483590046908844]
This paper presents DetailFlow, a coarse-to-fine 1D autoregressive (AR) image generation method. By learning a resolution-aware token sequence supervised with progressively degraded images, DetailFlow enables the generation process to start from the global structure. Our method achieves high-quality image synthesis with significantly fewer tokens than previous approaches.
arXiv Detail & Related papers (2025-05-27T17:45:21Z) - GigaTok: Scaling Visual Tokenizers to 3 Billion Parameters for Autoregressive Image Generation [62.77721499671665]
We introduce GigaTok, the first approach to improve image reconstruction, generation, and representation learning when scaling visual tokenizers. We identify the growing complexity of latent space as the key factor behind the reconstruction vs. generation dilemma. By scaling to 3 billion parameters, GigaTok achieves state-of-the-art performance in reconstruction, downstream AR generation, and downstream AR representation quality.
arXiv Detail & Related papers (2025-04-11T17:59:58Z) - ZipIR: Latent Pyramid Diffusion Transformer for High-Resolution Image Restoration [75.0053551643052]
We introduce ZipIR, a novel framework that enhances efficiency, scalability, and long-range modeling for high-res image restoration. ZipIR employs a highly compressed latent representation that compresses image 32x, effectively reducing the number of spatial tokens. ZipIR surpasses existing diffusion-based methods, offering unmatched speed and quality in restoring high-resolution images from severely degraded inputs.
arXiv Detail & Related papers (2025-04-11T14:49:52Z) - Robust Latent Matters: Boosting Image Generation with Sampling Error Synthesis [57.7367843129838]
Recent image generation schemes typically capture image distribution in a pre-constructed latent space relying on a frozen image tokenizer. We propose a novel plug-and-play tokenizer training scheme to facilitate latent space construction.
arXiv Detail & Related papers (2025-03-11T12:09:11Z) - Masked Autoencoders Are Effective Tokenizers for Diffusion Models [56.08109308294133]
MAETok is an autoencoder that learns semantically rich latent space while maintaining reconstruction fidelity. MAETok achieves significant practical improvements, enabling a gFID of 1.69 with 76x faster training and 31x higher inference throughput for 512x512 generation.
arXiv Detail & Related papers (2025-02-05T18:42:04Z) - Collaborative Decoding Makes Visual Auto-Regressive Modeling Efficient [52.96232442322824]
Collaborative Decoding (CoDe) is a novel efficient decoding strategy tailored for the Visual Auto-Regressive (VAR) framework. CoDe capitalizes on two critical observations: the substantially reduced parameter demands at larger scales and the exclusive generation patterns across different scales. CoDe achieves a 1.7x speedup, slashes memory usage by around 50%, and preserves image quality with only a negligible FID increase from 1.95 to 1.98.
arXiv Detail & Related papers (2024-11-26T15:13:15Z) - HART: Efficient Visual Generation with Hybrid Autoregressive Transformer [33.97880303341509]
We introduce Hybrid Autoregressive Transformer (HART), an autoregressive (AR) visual generation model capable of directly generating 1024x1024 images.
Our approach improves reconstruction FID from 2.11 to 0.30 on MJHQ-30K, leading to a 31% generation FID improvement from 7.85 to 5.38.
HART also outperforms state-of-the-art diffusion models in both FID and CLIP score, with 4.5-7.7x higher throughput and 6.9-13.4x lower MACs.
arXiv Detail & Related papers (2024-10-14T17:59:42Z) - Meissonic: Revitalizing Masked Generative Transformers for Efficient High-Resolution Text-to-Image Synthesis [62.57727062920458]
We present Meissonic, which elevates non-autoregressive masked image modeling (MIM) text-to-image to a level comparable with state-of-the-art diffusion models like SDXL. We leverage high-quality training data, integrate micro-conditions informed by human preference scores, and employ feature compression layers. Our model not only matches but often exceeds the performance of existing models like SDXL in generating high-quality, high-resolution images.
arXiv Detail & Related papers (2024-10-10T17:59:17Z) - EdgeRunner: Auto-regressive Auto-encoder for Artistic Mesh Generation [36.69567056569989]
We propose an Auto-regressive Auto-encoder (ArAE) model capable of generating high-quality 3D meshes with up to 4,000 faces at a spatial resolution of $512^3$.
We introduce a novel mesh tokenization algorithm that efficiently compresses triangular meshes into 1D token sequences, significantly enhancing training efficiency.
Our model compresses variable-length triangular meshes into a fixed-length latent space, enabling training latent diffusion models for better generalization.
arXiv Detail & Related papers (2024-09-26T17:55:02Z) - Dual-former: Hybrid Self-attention Transformer for Efficient Image Restoration [6.611849560359801]
We present Dual-former, which combines the powerful global modeling ability of self-attention modules and the local modeling ability of convolutions in an overall architecture.
Experiments demonstrate that Dual-former achieves a 1.91dB gain over the state-of-the-art MAXIM method on the Indoor dataset for single image dehazing.
For single image deraining, it exceeds the SOTA method by 0.1dB PSNR on the average results of five datasets with only 21.5% GFLOPs.
arXiv Detail & Related papers (2022-10-03T16:39:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.