LiteVAE: Lightweight and Efficient Variational Autoencoders for Latent Diffusion Models
- URL: http://arxiv.org/abs/2405.14477v2
- Date: Tue, 21 Jan 2025 17:15:10 GMT
- Title: LiteVAE: Lightweight and Efficient Variational Autoencoders for Latent Diffusion Models
- Authors: Seyedmorteza Sadat, Jakob Buhmann, Derek Bradley, Otmar Hilliges, Romann M. Weber
- Abstract summary: We introduce LiteVAE, a new autoencoder design for latent diffusion models (LDMs).
LiteVAE uses the 2D discrete wavelet transform to enhance scalability and computational efficiency over standard variational autoencoders (VAEs) with no sacrifice in output quality.
- Score: 27.795088366122297
- Abstract: Advances in latent diffusion models (LDMs) have revolutionized high-resolution image generation, but the design space of the autoencoder that is central to these systems remains underexplored. In this paper, we introduce LiteVAE, a new autoencoder design for LDMs, which leverages the 2D discrete wavelet transform to enhance scalability and computational efficiency over standard variational autoencoders (VAEs) with no sacrifice in output quality. We investigate the training methodologies and the decoder architecture of LiteVAE and propose several enhancements that improve the training dynamics and reconstruction quality. Our base LiteVAE model matches the quality of the established VAEs in current LDMs with a six-fold reduction in encoder parameters, leading to faster training and lower GPU memory requirements, while our larger model outperforms VAEs of comparable complexity across all evaluated metrics (rFID, LPIPS, PSNR, and SSIM).
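The abstract names the 2D discrete wavelet transform as the core ingredient. As a rough illustration of what that transform does, the following is a minimal sketch of a single-level 2D DWT on a grayscale image. It assumes the Haar wavelet and plain Python lists to stay dependency-free. The function names `haar_dwt2` and `haar_idwt2` are hypothetical, and nothing here reflects the wavelet family, number of levels, or network wiring actually used in LiteVAE, none of which is stated in this listing.

```python
def haar_dwt2(image):
    """Single-level 2D Haar DWT of an even-sized 2D list of numbers.

    Returns four sub-bands (ll, lh, hl, hh), each half the input size in
    both dimensions. Sub-band naming conventions differ across libraries.
    """
    rows, cols = len(image), len(image[0])
    if rows % 2 or cols % 2:
        raise ValueError("height and width must both be even")
    ll, lh, hl, hh = [], [], [], []
    for i in range(0, rows, 2):
        ll_row, lh_row, hl_row, hh_row = [], [], [], []
        for j in range(0, cols, 2):
            # One 2x2 block of pixels: a b / c d
            a, b = image[i][j], image[i][j + 1]
            c, d = image[i + 1][j], image[i + 1][j + 1]
            ll_row.append((a + b + c + d) / 2)  # local average (coarse image)
            lh_row.append((a + b - c - d) / 2)  # top row minus bottom row
            hl_row.append((a - b + c - d) / 2)  # left column minus right column
            hh_row.append((a - b - c + d) / 2)  # diagonal difference
        ll.append(ll_row)
        lh.append(lh_row)
        hl.append(hl_row)
        hh.append(hh_row)
    return ll, lh, hl, hh


def haar_idwt2(ll, lh, hl, hh):
    """Invert haar_dwt2, rebuilding the full-resolution image."""
    image = []
    for i in range(len(ll)):
        top, bottom = [], []
        for j in range(len(ll[0])):
            s, v, h, g = ll[i][j], lh[i][j], hl[i][j], hh[i][j]
            top += [(s + v + h + g) / 2, (s + v - h - g) / 2]
            bottom += [(s - v + h - g) / 2, (s - v - h + g) / 2]
        image.append(top)
        image.append(bottom)
    return image


if __name__ == "__main__":
    img = [[1, 2, 3, 4],
           [5, 6, 7, 8],
           [9, 10, 11, 12],
           [13, 14, 15, 16]]
    bands = haar_dwt2(img)
    print("LL sub-band:", bands[0])
    print("round trip ok:", haar_idwt2(*bands) == img)
```

The transform has no learned parameters and is exactly invertible, and each sub-band has half the input's height and width, so it reduces spatial resolution without discarding information.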
Related papers
- EQ-VAE: Equivariance Regularized Latent Space for Improved Generative Image Modeling [11.075247758198762]
Latent generative models rely on an autoencoder to compress images into a latent space, followed by a generative model to learn the latent distribution.
We propose EQ-VAE, a simple regularization approach that enforces equivariance in the latent space, reducing its complexity without degrading reconstruction quality.
We enhance the performance of several state-of-the-art generative models, including DiT, SiT, REPA and MaskGIT, achieving a 7x speedup on DiT-XL/2 with only five epochs of SD-VAE fine-tuning.
arXiv Detail & Related papers (2025-02-13T17:21:51Z)
- EVEv2: Improved Baselines for Encoder-Free Vision-Language Models [72.07868838411474]
Existing encoder-free vision-language models (VLMs) are narrowing the performance gap with their encoder-based counterparts.
We develop efficient strategies for encoder-free VLMs that rival mainstream encoder-based ones.
We show that properly and hierarchically associating vision and language within a unified model reduces interference between modalities.
arXiv Detail & Related papers (2025-02-10T18:59:58Z)
- Masked Autoencoders Are Effective Tokenizers for Diffusion Models [56.08109308294133]
MAETok is an autoencoder that learns a semantically rich latent space while maintaining reconstruction fidelity.
MAETok achieves significant practical improvements, enabling a gFID of 1.69 with 76x faster training and 31x higher inference throughput for 512x512 generation.
arXiv Detail & Related papers (2025-02-05T18:42:04Z)
- Low Resource Video Super-resolution using Memory and Residual Deformable Convolutions [3.018928786249079]
Transformer-based video super-resolution (VSR) models have set new benchmarks in recent years, but their substantial computational demands make most of them unsuitable for deployment on resource-constrained devices.
We propose a novel lightweight, parameter-efficient deep residual deformable convolution network for VSR.
With just 2.3 million parameters, our model achieves state-of-the-art SSIM of 0.9175 on the REDS4 dataset.
arXiv Detail & Related papers (2025-02-03T20:46:15Z)
- Exploring Representation-Aligned Latent Space for Better Generation [86.45670422239317]
We introduce ReaLS, which integrates semantic priors to improve generation performance.
We show that fundamental DiT and SiT trained on ReaLS can achieve a 15% improvement in the FID metric.
The enhanced semantic latent space enables more perceptual downstream tasks, such as segmentation and depth estimation.
arXiv Detail & Related papers (2025-02-01T07:42:12Z)
- Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models [34.15905637499148]
We propose aligning the latent space with pre-trained vision foundation models when training the visual tokenizers.
Our proposed VA-VAE significantly expands the reconstruction-generation frontier of latent diffusion models.
We build an enhanced DiT baseline with improved training strategies and architecture designs, termed LightningDiT.
arXiv Detail & Related papers (2025-01-02T18:59:40Z)
- p-MoD: Building Mixture-of-Depths MLLMs via Progressive Ratio Decay [18.958138693220704]
We propose to build efficient multimodal large language models (MLLMs) by leveraging the Mixture-of-Depths (MoD) mechanism.
We adapt the MoD module with two novel designs: tanh-gated weight normalization (TanhNorm) and symmetric token reweighting (STRing).
Our model, p-MoD, matches or even surpasses the performance of the baseline models, with only 55.6% TFLOPs and 53.8% KV cache storage during inference, and 77.7% GPU hours during training.
arXiv Detail & Related papers (2024-12-05T18:58:03Z)
- ARLON: Boosting Diffusion Transformers with Autoregressive Models for Long Video Generation [83.62931466231898]
This paper presents ARLON, a framework that boosts diffusion Transformers with autoregressive models for long video generation.
A latent Vector Quantized Variational Autoencoder (VQ-VAE) compresses the input latent space of the DiT model into compact visual tokens.
An adaptive norm-based semantic injection module integrates the coarse discrete visual units from the AR model into the DiT model.
arXiv Detail & Related papers (2024-10-27T16:28:28Z)
- LinFusion: 1 GPU, 1 Minute, 16K Image [71.44735417472043]
We introduce a low-rank approximation of a wide spectrum of popular linear token mixers.
We find that the distilled model, termed LinFusion, achieves performance on par with or superior to the original SD.
Experiments on SD-v1.5, SD-v2.1, and SD-XL demonstrate that LinFusion enables satisfactory and efficient zero-shot cross-resolution generation.
arXiv Detail & Related papers (2024-09-03T17:54:39Z)
- Attentive VQ-VAE [0.0]
We present a novel approach to enhance the capabilities of VQ-VAE models through the integration of a Residual encoder and a Residual Pixel Attention layer, named Attentive Residual (AREN).
The AREN is designed to operate effectively at multiple levels, accommodating diverse architectural complexities.
arXiv Detail & Related papers (2023-09-20T21:11:36Z)
- Complexity Matters: Rethinking the Latent Space for Generative Modeling [65.64763873078114]
In generative modeling, numerous successful approaches leverage a low-dimensional latent space, e.g., Stable Diffusion.
In this study, we aim to shed light on this under-explored topic by rethinking the latent space from the perspective of model complexity.
arXiv Detail & Related papers (2023-07-17T07:12:29Z)