LeanVAE: An Ultra-Efficient Reconstruction VAE for Video Diffusion Models
- URL: http://arxiv.org/abs/2503.14325v1
- Date: Tue, 18 Mar 2025 14:58:59 GMT
- Title: LeanVAE: An Ultra-Efficient Reconstruction VAE for Video Diffusion Models
- Authors: Yu Cheng, Fajie Yuan,
- Abstract summary: We propose LeanVAE, a novel and ultra-efficient Video VAE framework.<n>Our model offers up to 50x fewer FLOPs and 44x faster inference speed.<n>Our experiments validate LeanVAE's superiority in video reconstruction and generation.
- Score: 17.29580459404157
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in Latent Video Diffusion Models (LVDMs) have revolutionized video generation by leveraging Video Variational Autoencoders (Video VAEs) to compress intricate video data into a compact latent space.However, as LVDM training scales, the computational overhead of Video VAEs becomes a critical bottleneck, particularly for encoding high-resolution videos. To address this, we propose LeanVAE, a novel and ultra-efficient Video VAE framework that introduces two key innovations: (1) a lightweight architecture based on a Neighborhood-Aware Feedforward (NAF) module and non-overlapping patch operations, drastically reducing computational cost, and (2) the integration of wavelet transforms and compressed sensing techniques to enhance reconstruction quality. Extensive experiments validate LeanVAE's superiority in video reconstruction and generation, particularly in enhancing efficiency over existing Video VAEs.Our model offers up to 50x fewer FLOPs and 44x faster inference speed while maintaining competitive reconstruction quality, providing insights for scalable, efficient video generation.Our models and code are available at https://github.com/westlake-repl/LeanVAE.
Related papers
- Plug-and-Play Versatile Compressed Video Enhancement [57.62582951699999]
Video compression effectively reduces the size of files, making it possible for real-time cloud computing.
However, it comes at the cost of visual quality, challenges the robustness of downstream vision models.
We present a versatile-aware enhancement framework that adaptively enhance videos under different compression settings.
arXiv Detail & Related papers (2025-04-21T18:39:31Z) - H3AE: High Compression, High Speed, and High Quality AutoEncoder for Video Diffusion Models [76.1519545010611]
Autoencoder (AE) is the key to the success of latent diffusion models for image and video generation.
In this work, we examine the architecture design choices and optimize the computation distribution to obtain efficient and high-compression video AEs.
Our AE achieves an ultra-high compression ratio and real-time decoding speed on mobile while outperforming prior art in terms of reconstruction metrics.
arXiv Detail & Related papers (2025-04-14T17:59:06Z) - REGEN: Learning Compact Video Embedding with (Re-)Generative Decoder [52.698595889988766]
We present a novel perspective on learning video embedders for generative modeling.<n>Rather than requiring an exact reproduction of an input video, an effective embedder should focus on visually plausible reconstructions.<n>We propose replacing the conventional encoder-decoder video embedder with an encoder-generator framework.
arXiv Detail & Related papers (2025-03-11T17:51:07Z) - Rethinking Video Tokenization: A Conditioned Diffusion-based Approach [58.164354605550194]
Diffusion Conditioned-based Gene Tokenizer replaces GAN-based decoder with conditional diffusion model.<n>We trained using only a basic MSE diffusion loss for reconstruction, along with KL term and LPIPS perceptual loss from scratch.<n>Even a scaled-down version of CDT (3$times$ inference speedup) still performs comparably with top baselines.
arXiv Detail & Related papers (2025-03-05T17:59:19Z) - Large Motion Video Autoencoding with Cross-modal Video VAE [52.13379965800485]
Video Variational Autoencoder (VAE) is essential for reducing video redundancy and facilitating efficient video generation.<n>Existing Video VAEs have begun to address temporal compression; however, they often suffer from inadequate reconstruction performance.<n>We present a novel and powerful video autoencoder capable of high-fidelity video encoding.
arXiv Detail & Related papers (2024-12-23T18:58:24Z) - WF-VAE: Enhancing Video VAE by Wavelet-Driven Energy Flow for Latent Video Diffusion Model [15.171544722138806]
Video Variational Autoencoder (VAE) encodes videos into a low-dimensional latent space.<n>VAE is a key component of most Latent Video Diffusion Models (LVDMs)
arXiv Detail & Related papers (2024-11-26T14:23:53Z) - OD-VAE: An Omni-dimensional Video Compressor for Improving Latent Video Diffusion Model [33.766339921655025]
Variational Autoencoder (VAE) compressing videos into latent representations is a crucial component of Latent Video Diffusion Models (LVDMs)
Most LVDMs utilize 2D image VAE, whose compression for videos is only in the spatial dimension and often ignored in the temporal dimension.
We propose an omni-dimension compression VAE, named OD-VAE, which can temporally and spatially compress videos.
arXiv Detail & Related papers (2024-09-02T12:20:42Z) - When Video Coding Meets Multimodal Large Language Models: A Unified Paradigm for Video Coding [118.72266141321647]
Cross-Modality Video Coding (CMVC) is a pioneering approach to explore multimodality representation and video generative models in video coding.<n>During decoding, previously encoded components and video generation models are leveraged to create multiple encoding-decoding modes.<n>Experiments indicate that TT2V achieves effective semantic reconstruction, while IT2V exhibits competitive perceptual consistency.
arXiv Detail & Related papers (2024-08-15T11:36:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.