HART: Efficient Visual Generation with Hybrid Autoregressive Transformer
- URL: http://arxiv.org/abs/2410.10812v1
- Date: Mon, 14 Oct 2024 17:59:42 GMT
- Title: HART: Efficient Visual Generation with Hybrid Autoregressive Transformer
- Authors: Haotian Tang, Yecheng Wu, Shang Yang, Enze Xie, Junsong Chen, Junyu Chen, Zhuoyang Zhang, Han Cai, Yao Lu, Song Han
- Abstract summary: We introduce Hybrid Autoregressive Transformer (HART), an autoregressive (AR) visual generation model capable of directly generating 1024x1024 images.
Our approach improves reconstruction FID from 2.11 to 0.30 on MJHQ-30K, leading to a 31% generation FID improvement from 7.85 to 5.38.
HART also outperforms state-of-the-art diffusion models in both FID and CLIP score, with 4.5-7.7x higher throughput and 6.9-13.4x lower MACs.
- Score: 33.97880303341509
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce Hybrid Autoregressive Transformer (HART), an autoregressive (AR) visual generation model capable of directly generating 1024x1024 images, rivaling diffusion models in image generation quality. Existing AR models face limitations due to the poor image reconstruction quality of their discrete tokenizers and the prohibitive training costs associated with generating 1024px images. To address these challenges, we present the hybrid tokenizer, which decomposes the continuous latents from the autoencoder into two components: discrete tokens representing the big picture and continuous tokens representing the residual components that cannot be represented by the discrete tokens. The discrete component is modeled by a scalable-resolution discrete AR model, while the continuous component is learned with a lightweight residual diffusion module with only 37M parameters. Compared with the discrete-only VAR tokenizer, our hybrid approach improves reconstruction FID from 2.11 to 0.30 on MJHQ-30K, leading to a 31% generation FID improvement from 7.85 to 5.38. HART also outperforms state-of-the-art diffusion models in both FID and CLIP score, with 4.5-7.7x higher throughput and 6.9-13.4x lower MACs. Our code is open sourced at https://github.com/mit-han-lab/hart.
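As a rough, minimal sketch (not the authors' released implementation), the decomposition performed by the hybrid tokenizer can be illustrated as follows; the class name, latent dimension, and codebook size are assumptions chosen for clarity, following the abstract's description of splitting the autoencoder latent into a discrete part and a continuous residual.

```python
import torch
import torch.nn as nn


class HybridTokenizerSketch(nn.Module):
    """Illustrative sketch of the hybrid tokenization idea from the abstract.

    The continuous autoencoder latents are split into (a) discrete tokens from
    a codebook, capturing the coarse "big picture", and (b) a continuous
    residual that the codebook cannot represent. All names and sizes here are
    illustrative assumptions, not HART's released code.
    """

    def __init__(self, latent_dim: int = 32, codebook_size: int = 4096):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, latent_dim)

    def forward(self, z: torch.Tensor):
        # z: continuous latents from the autoencoder, shape (B, N, latent_dim).
        b, n, d = z.shape
        flat = z.reshape(b * n, d)
        # Nearest-codebook-entry quantization yields the discrete component.
        dists = torch.cdist(flat, self.codebook.weight)   # (B*N, codebook_size)
        indices = dists.argmin(dim=-1).reshape(b, n)       # discrete token ids
        z_discrete = self.codebook(indices)                 # quantized latents
        # The continuous component is whatever quantization could not capture.
        z_residual = z - z_discrete
        return indices, z_discrete, z_residual


# Per the abstract, generation then proceeds in two parts: a discrete AR model
# predicts the token ids, and a small residual diffusion module samples the
# continuous residual conditioned on them before decoding.
```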
Related papers
- Token-Shuffle: Towards High-Resolution Image Generation with Autoregressive Models [92.18057318458528]
Token-Shuffle is a novel method that reduces the number of image tokens in the Transformer.
Our strategy requires no additional pretrained text encoder and enables MLLMs to support extremely high-resolution image synthesis.
On the GenAI benchmark, our 2.7B model achieves a 0.77 overall score on hard prompts, outperforming the AR model LlamaGen by 0.18 and the diffusion model LDM by 0.15.
arXiv Detail & Related papers (2025-04-24T17:59:56Z) - D2C: Unlocking the Potential of Continuous Autoregressive Image Generation with Discrete Tokens [80.75893450536577]
We propose D2C, a novel two-stage method to enhance model generation capacity.
In the first stage, the discrete-valued tokens representing coarse-grained image features are sampled by employing a small discrete-valued generator.
In the second stage, the continuous-valued tokens representing fine-grained image features are learned conditioned on the discrete token sequence.
arXiv Detail & Related papers (2025-03-21T13:58:49Z) - Robust Latent Matters: Boosting Image Generation with Sampling Error Synthesis [57.7367843129838]
Recent image generation schemes typically capture the image distribution in a pre-constructed latent space, relying on a frozen image tokenizer.
We propose a novel plug-and-play tokenizer training scheme to facilitate latent space construction.
arXiv Detail & Related papers (2025-03-11T12:09:11Z) - Improving the Diffusability of Autoencoders [54.920783089085035]
Latent diffusion models have emerged as the leading approach for generating high-quality images and videos.
We perform a spectral analysis of modern autoencoders and identify inordinate high-frequency components in their latent spaces.
We hypothesize that this high-frequency component interferes with the coarse-to-fine nature of the diffusion synthesis process and hinders the generation quality.
arXiv Detail & Related papers (2025-02-20T18:45:44Z) - EQ-VAE: Equivariance Regularized Latent Space for Improved Generative Image Modeling [11.075247758198762]
Latent generative models rely on an autoencoder to compress images into a latent space, followed by a generative model to learn the latent distribution.
We propose EQ-VAE, a simple regularization approach that enforces equivariance in the latent space, reducing its complexity without degrading reconstruction quality.
We enhance the performance of several state-of-the-art generative models, including DiT, SiT, REPA, and MaskGIT, achieving a 7x speedup on DiT-XL/2 with only five epochs of SD-VAE fine-tuning.
arXiv Detail & Related papers (2025-02-13T17:21:51Z) - Masked Autoencoders Are Effective Tokenizers for Diffusion Models [56.08109308294133]
MAETok is an autoencoder that learns a semantically rich latent space while maintaining reconstruction fidelity.
MAETok achieves significant practical improvements, enabling a gFID of 1.69 with 76x faster training and 31x higher inference throughput for 512x512 generation.
arXiv Detail & Related papers (2025-02-05T18:42:04Z) - 3D representation in 512-Byte: Variational tokenizer is the key for autoregressive 3D generation [4.221298212125194]
Variational Tokenizer (VAT) transforms unordered 3D data into compact latent tokens with an implicit hierarchy.
VAT enables scalable and efficient 3D generation, outperforming existing methods in quality, efficiency, and generalization.
arXiv Detail & Related papers (2024-12-03T06:31:25Z) - Boosting Latent Diffusion with Perceptual Objectives [29.107038084215514]
Latent diffusion models (LDMs) power state-of-the-art high-resolution generative image models.
We propose to leverage the internal features of the decoder to define a latent perceptual loss (LPL).
This loss encourages the models to create sharper and more realistic images.
arXiv Detail & Related papers (2024-11-06T16:28:21Z) - Diff-Instruct*: Towards Human-Preferred One-step Text-to-image Generative Models [8.352666876052616]
We introduce Diff-Instruct* (DI*), a data-free approach for building one-step text-to-image generative models.
With Stable Diffusion V1.5 as the reference diffusion model, DI* outperforms all previously leading models by a large margin.
arXiv Detail & Related papers (2024-10-28T10:26:19Z) - Meissonic: Revitalizing Masked Generative Transformers for Efficient High-Resolution Text-to-Image Synthesis [62.06970466554273]
We present Meissonic, which elevates non-autoregressive masked image modeling (MIM) for text-to-image synthesis to a level comparable with state-of-the-art diffusion models like SDXL.
We leverage high-quality training data, integrate micro-conditions informed by human preference scores, and employ feature compression layers to further enhance image fidelity and resolution.
Our model not only matches but often exceeds the performance of existing models like SDXL in generating high-quality, high-resolution images.
arXiv Detail & Related papers (2024-10-10T17:59:17Z) - MaskBit: Embedding-free Image Generation via Bit Tokens [54.827480008982185]
We present an empirical and systematic examination of VQGANs, leading to a modernized VQGAN.
A novel embedding-free generation network operating directly on bit tokens achieves a new state-of-the-art FID of 1.52 on the ImageNet 256x256 benchmark, with a compact generator model of mere 305M parameters.
arXiv Detail & Related papers (2024-09-24T16:12:12Z) - LinFusion: 1 GPU, 1 Minute, 16K Image [71.44735417472043]
We introduce a low-rank approximation of a wide spectrum of popular linear token mixers.
We find that the distilled model, termed LinFusion, achieves performance on par with or superior to the original SD.
Experiments on SD-v1.5, SD-v2.1, and SD-XL demonstrate that LinFusion enables satisfactory and efficient zero-shot cross-resolution generation.
arXiv Detail & Related papers (2024-09-03T17:54:39Z) - An Image is Worth 32 Tokens for Reconstruction and Generation [54.24414696392026]
Transformer-based 1-Dimensional Tokenizer (TiTok) is an innovative approach that tokenizes images into 1D latent sequences.
TiTok achieves competitive performance to state-of-the-art approaches.
Our best-performing variant significantly surpasses DiT-XL/2 (gFID 2.13 vs. 3.04) while generating high-quality samples 74x faster.
arXiv Detail & Related papers (2024-06-11T17:59:56Z) - One-step Diffusion with Distribution Matching Distillation [54.723565605974294]
We introduce Distribution Matching Distillation (DMD), a procedure to transform a diffusion model into a one-step image generator.
We enforce the one-step image generator to match the diffusion model at the distribution level by minimizing an approximate KL divergence.
Our method outperforms all published few-step diffusion approaches, reaching 2.62 FID on ImageNet 64x64 and 11.49 FID on zero-shot COCO-30k.
arXiv Detail & Related papers (2023-11-30T18:59:20Z) - StraIT: Non-autoregressive Generation with Stratified Image Transformer [63.158996766036736]
Stratified Image Transformer (StraIT) is a pure non-autoregressive (NAR) generative model.
Our experiments demonstrate that StraIT significantly improves NAR generation and outperforms existing diffusion models (DMs) and AR methods.
arXiv Detail & Related papers (2023-03-01T18:59:33Z) - Dual-former: Hybrid Self-attention Transformer for Efficient Image Restoration [6.611849560359801]
We present Dual-former, which combines the powerful global modeling ability of self-attention modules and the local modeling ability of convolutions in an overall architecture.
Experiments demonstrate that Dual-former achieves a 1.91 dB gain over the state-of-the-art MAXIM method on the Indoor dataset for single image dehazing.
For single image deraining, it exceeds the SOTA method by 0.1 dB PSNR averaged over five datasets while using only 21.5% of the GFLOPs.
arXiv Detail & Related papers (2022-10-03T16:39:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.