HART: Efficient Visual Generation with Hybrid Autoregressive Transformer
- URL: http://arxiv.org/abs/2410.10812v1
- Date: Mon, 14 Oct 2024 17:59:42 GMT
- Title: HART: Efficient Visual Generation with Hybrid Autoregressive Transformer
- Authors: Haotian Tang, Yecheng Wu, Shang Yang, Enze Xie, Junsong Chen, Junyu Chen, Zhuoyang Zhang, Han Cai, Yao Lu, Song Han
- Abstract summary: We introduce Hybrid Autoregressive Transformer (HART), an autoregressive (AR) visual generation model capable of directly generating 1024x1024 images.
Our approach improves reconstruction FID from 2.11 to 0.30 on MJHQ-30K, leading to a 31% generation FID improvement from 7.85 to 5.38.
HART also outperforms state-of-the-art diffusion models in both FID and CLIP score, with 4.5-7.7x higher throughput and 6.9-13.4x lower MACs.
- Score: 33.97880303341509
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce Hybrid Autoregressive Transformer (HART), an autoregressive (AR) visual generation model capable of directly generating 1024x1024 images, rivaling diffusion models in image generation quality. Existing AR models face limitations due to the poor image reconstruction quality of their discrete tokenizers and the prohibitive training costs associated with generating 1024px images. To address these challenges, we present the hybrid tokenizer, which decomposes the continuous latents from the autoencoder into two components: discrete tokens representing the big picture and continuous tokens representing the residual components that cannot be represented by the discrete tokens. The discrete component is modeled by a scalable-resolution discrete AR model, while the continuous component is learned with a lightweight residual diffusion module with only 37M parameters. Compared with the discrete-only VAR tokenizer, our hybrid approach improves reconstruction FID from 2.11 to 0.30 on MJHQ-30K, leading to a 31% generation FID improvement from 7.85 to 5.38. HART also outperforms state-of-the-art diffusion models in both FID and CLIP score, with 4.5-7.7x higher throughput and 6.9-13.4x lower MACs. Our code is open sourced at https://github.com/mit-han-lab/hart.
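The decomposition at the heart of the hybrid tokenizer can be sketched in a few lines. The following is a minimal illustration under our own assumptions (a single VQ lookup stands in for HART's multi-scale quantizer; all names and shapes are illustrative, not the authors' code):

```python
import torch
import torch.nn as nn

class HybridTokenizerSketch(nn.Module):
    """Illustrative decomposition: continuous latent -> discrete + residual."""

    def __init__(self, codebook_size: int = 4096, dim: int = 32):
        super().__init__()
        # Hypothetical single codebook; HART itself uses a multi-scale
        # VAR-style quantizer, collapsed here into one VQ step for brevity.
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, z: torch.Tensor):
        # z: (B, N, dim) continuous latents from the autoencoder encoder.
        codes = self.codebook.weight.unsqueeze(0).expand(z.size(0), -1, -1)
        ids = torch.cdist(z, codes).argmin(dim=-1)   # (B, N) discrete tokens
        z_discrete = self.codebook(ids)              # "big picture" component
        residual = z - z_discrete                    # what VQ cannot express
        # ids      -> modeled by the scalable-resolution discrete AR transformer
        # residual -> modeled by the lightweight (~37M-param) residual diffusion
        #             module; the decoder sees z_discrete + residual
        return ids, residual
```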
Related papers
- Improving the Diffusability of Autoencoders [54.920783089085035]
Latent diffusion models have emerged as the leading approach for generating high-quality images and videos.
We perform a spectral analysis of modern autoencoders and identify inordinate high-frequency components in their latent spaces.
We hypothesize that this high-frequency component interferes with the coarse-to-fine nature of the diffusion synthesis process and hinders the generation quality.
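For intuition, the kind of spectral check described above can be reproduced in a few lines; this sketch is our construction, not the paper's code, and assumes latents are available as NumPy arrays:

```python
import numpy as np

def radial_power_spectrum(latent: np.ndarray) -> np.ndarray:
    # latent: (C, H, W) latent map from an autoencoder.
    spec = np.abs(np.fft.fftshift(np.fft.fft2(latent), axes=(-2, -1))) ** 2
    spec = spec.mean(axis=0)                       # average over channels
    h, w = spec.shape
    yy, xx = np.indices((h, w))
    r = np.hypot(yy - h // 2, xx - w // 2).astype(int)
    # Mean power per integer-radius frequency bin; a heavy tail at large
    # radii indicates the excess high-frequency energy discussed above.
    counts = np.bincount(r.ravel())
    return np.bincount(r.ravel(), spec.ravel()) / np.maximum(counts, 1)
```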
arXiv Detail & Related papers (2025-02-20T18:45:44Z)
- EQ-VAE: Equivariance Regularized Latent Space for Improved Generative Image Modeling [11.075247758198762]
Latent generative models rely on an autoencoder to compress images into a latent space, followed by a generative model to learn the latent distribution.
We propose EQ-VAE, a simple regularization approach that enforces equivariance in the latent space, reducing its complexity without degrading reconstruction quality.
We enhance the performance of several state-of-the-art generative models, including DiT, SiT, REPA, and MaskGIT, achieving a 7x speedup on DiT-XL/2 with only five epochs of SD-VAE fine-tuning.
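Read from this summary alone, the regularizer might be sketched as follows, using 2x downscaling as the example transform; the paper's exact objective and transform set may differ:

```python
import torch
import torch.nn.functional as F

def equivariance_loss(encoder, decoder, x: torch.Tensor) -> torch.Tensor:
    # Decoding a spatially transformed latent should reproduce the same
    # transform of the image: decode(T(encode(x))) ~ T(x).
    z = encoder(x)                                            # (B, C, h, w)
    z_t = F.interpolate(z, scale_factor=0.5, mode="bilinear")  # T in latent space
    x_t = F.interpolate(x, scale_factor=0.5, mode="bilinear")  # T in pixel space
    return F.mse_loss(decoder(z_t), x_t)
```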
arXiv Detail & Related papers (2025-02-13T17:21:51Z)
- Masked Autoencoders Are Effective Tokenizers for Diffusion Models [56.08109308294133]
MAETok is an autoencoder that learns a semantically rich latent space while maintaining reconstruction fidelity.
MAETok achieves significant practical improvements, enabling a gFID of 1.69 with 76x faster training and 31x higher inference throughput for 512x512 generation.
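As a hedged illustration of the masked-autoencoding recipe the title suggests (our assumption, not MAETok's actual training step; `decoder(visible, idx)` is a hypothetical signature that must place visible tokens at their positions):

```python
import torch
import torch.nn.functional as F

def masked_ae_step(encoder, decoder, x: torch.Tensor, mask_ratio: float = 0.6):
    tokens = encoder(x)                        # (B, N, D) latent tokens
    B, N, D = tokens.shape
    keep = int(N * (1 - mask_ratio))
    # Random per-sample subset of visible token indices.
    idx = torch.rand(B, N, device=tokens.device).argsort(dim=1)[:, :keep]
    visible = tokens.gather(1, idx.unsqueeze(-1).expand(-1, -1, D))
    recon = decoder(visible, idx)              # hypothetical position-aware decoder
    return F.mse_loss(recon, x)                # reconstruct the full image
```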
arXiv Detail & Related papers (2025-02-05T18:42:04Z)
- Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models [34.15905637499148]
We propose aligning the latent space with pre-trained vision foundation models when training the visual tokenizers.
Our proposed VA-VAE significantly expands the reconstruction-generation frontier of latent diffusion models.
We build an enhanced DiT baseline with improved training strategies and architecture designs, termed LightningDiT.
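A minimal sketch of such an alignment term, assuming a frozen foundation encoder (e.g., DINOv2) and a learned projection; VA-VAE's actual losses differ in detail:

```python
import torch
import torch.nn.functional as F

def alignment_loss(latents: torch.Tensor, foundation_feats: torch.Tensor,
                   proj: torch.nn.Linear) -> torch.Tensor:
    # latents: (B, N, Dz) from the tokenizer being trained.
    # foundation_feats: (B, N, Df) from a frozen vision foundation model.
    # proj: learned Dz -> Df projection so the two spaces are comparable.
    z = F.normalize(proj(latents), dim=-1)
    f = F.normalize(foundation_feats, dim=-1)
    return 1.0 - (z * f).sum(dim=-1).mean()   # mean cosine distance
```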
arXiv Detail & Related papers (2025-01-02T18:59:40Z)
- 3D representation in 512-Byte: Variational tokenizer is the key for autoregressive 3D generation [4.221298212125194]
Variational Tokenizer (VAT) transforms unordered 3D data into compact latent tokens with an implicit hierarchy.
VAT enables scalable and efficient 3D generation, outperforming existing methods in quality, efficiency, and generalization.
arXiv Detail & Related papers (2024-12-03T06:31:25Z)
- Boosting Latent Diffusion with Perceptual Objectives [29.107038084215514]
Latent diffusion models (LDMs) power state-of-the-art high-resolution generative image models.
We propose to leverage the internal features of the decoder to define a latent perceptual loss (LPL).
This loss encourages the models to create sharper and more realistic images.
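One way to realize such a loss, sketched under our own assumptions rather than the paper's formulation: run predicted and reference latents through the frozen decoder stages and match intermediate activations.

```python
import torch
import torch.nn.functional as F

def latent_perceptual_loss(decoder_blocks, z_pred, z_ref):
    # decoder_blocks: sequence of frozen decoder stages (callables).
    # Accumulate feature-matching losses at each intermediate depth.
    loss, h_pred, h_ref = 0.0, z_pred, z_ref
    for block in decoder_blocks:
        h_pred, h_ref = block(h_pred), block(h_ref)
        loss = loss + F.mse_loss(h_pred, h_ref.detach())
    return loss
```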
arXiv Detail & Related papers (2024-11-06T16:28:21Z)
- MaskBit: Embedding-free Image Generation via Bit Tokens [54.827480008982185]
We present an empirical and systematic examination of VQGANs, leading to a modernized VQGAN.
A second contribution demonstrates that embedding-free image generation using bit tokens achieves a new state-of-the-art FID of 1.52 on the ImageNet 256x256 benchmark.
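As a hedged illustration of bit tokens, in the spirit of lookup-free binary quantization (not necessarily MaskBit's exact scheme): each latent vector is binarized per dimension, so the token is its own K-bit code and no learned embedding table is needed.

```python
import torch

def to_bit_tokens(z: torch.Tensor):
    # z: (B, N, K) continuous latents; sign of each dimension -> K bits.
    bits = (z > 0).long()                               # (B, N, K) in {0, 1}
    weights = 2 ** torch.arange(z.size(-1), device=z.device)
    ids = (bits * weights).sum(-1)                      # integer token ids
    embeddings = bits.float() * 2 - 1                   # {-1,+1}, used directly
    return ids, embeddings                              # no embedding lookup
```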
arXiv Detail & Related papers (2024-09-24T16:12:12Z)
- An Image is Worth 32 Tokens for Reconstruction and Generation [54.24414696392026]
Transformer-based 1-Dimensional Tokenizer (TiTok) is an innovative approach that tokenizes images into 1D latent sequences.
TiTok achieves competitive performance to state-of-the-art approaches.
Our best-performing variant significantly surpasses DiT-XL/2 (gFID 2.13 vs. 3.04) while still generating high-quality samples 74x faster.
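A minimal sketch of the 1D-tokenization mechanism as we read it: a small set of learnable latent queries attends over 2D patch embeddings, compressing the image into a short 1D sequence. TiTok's actual architecture adds quantization and a dedicated decoder, omitted here:

```python
import torch
import torch.nn as nn

class OneDTokenizerSketch(nn.Module):
    def __init__(self, dim: int = 512, num_latents: int = 32, heads: int = 8):
        super().__init__()
        # Learnable 1D latent queries (e.g., 32 tokens for a whole image).
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (B, P, dim) 2D patch embeddings -> (B, num_latents, dim).
        q = self.latents.unsqueeze(0).expand(patch_feats.size(0), -1, -1)
        out, _ = self.attn(q, patch_feats, patch_feats)
        return out
```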
arXiv Detail & Related papers (2024-06-11T17:59:56Z)
- One-step Diffusion with Distribution Matching Distillation [54.723565605974294]
We introduce Distribution Matching Distillation (DMD), a procedure to transform a diffusion model into a one-step image generator.
We enforce the one-step image generator to match the diffusion model at the distribution level by minimizing an approximate KL divergence.
Our method outperforms all published few-step diffusion approaches, reaching 2.62 FID on ImageNet 64x64 and 11.49 FID on zero-shot COCO-30k.
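The distribution-level matching can be sketched as follows, assuming access to two score estimators (one trained on real data, one continually fitted to generator samples); this simplification omits DMD's weighting and regression terms:

```python
import torch

def dmd_surrogate_loss(generator, score_real, score_fake, z, sigma=1.0):
    x = generator(z)                              # one-step generation
    x_t = x + sigma * torch.randn_like(x)         # perturb (toy noise schedule)
    with torch.no_grad():
        # Score difference approximates the gradient of the KL divergence
        # between the generator's sample distribution and the data distribution.
        grad = score_fake(x_t, sigma) - score_real(x_t, sigma)
    # Surrogate whose gradient w.r.t. generator params follows `grad * dx/dθ`.
    return (x * grad).mean()
```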
arXiv Detail & Related papers (2023-11-30T18:59:20Z)
- StraIT: Non-autoregressive Generation with Stratified Image Transformer [63.158996766036736]
Stratified Image Transformer (StraIT) is a pure non-autoregressive (NAR) generative model.
Our experiments demonstrate that StraIT significantly improves NAR generation and outperforms existing diffusion models and AR methods.
arXiv Detail & Related papers (2023-03-01T18:59:33Z)
- Dual-former: Hybrid Self-attention Transformer for Efficient Image Restoration [6.611849560359801]
We present Dual-former, which combines the powerful global modeling ability of self-attention modules and the local modeling ability of convolutions in an overall architecture.
Experiments demonstrate that Dual-former achieves a 1.91 dB gain over the state-of-the-art MAXIM method on the Indoor dataset for single-image dehazing.
For single-image deraining, it exceeds the SOTA method by 0.1 dB PSNR on average across five datasets while using only 21.5% of the GFLOPs.
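A minimal sketch of one such hybrid block under our assumptions (the real Dual-former arranges and gates these branches differently):

```python
import torch
import torch.nn as nn

class HybridBlockSketch(nn.Module):
    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        # Local branch: depthwise convolution for fine detail.
        self.local = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        # Global branch: self-attention over all spatial positions.
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        seq = x.flatten(2).transpose(1, 2)            # (B, H*W, C)
        g, _ = self.attn(seq, seq, seq)               # global context
        g = g.transpose(1, 2).reshape(b, c, h, w)
        return x + self.local(x) + g                  # residual fusion
```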
arXiv Detail & Related papers (2022-10-03T16:39:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.