Related papers: H3AE: High Compression, High Speed, and High Quality AutoEncoder for Video Diffusion Models

URL: http://arxiv.org/abs/2504.10567v2
Date: Wed, 01 Oct 2025 03:41:01 GMT
Title: H3AE: High Compression, High Speed, and High Quality AutoEncoder for Video Diffusion Models
Authors: Yushu Wu, Yanyu Li, Ivan Skorokhodov, Anil Kag, Willi Menapace, Sharath Girish, Aliaksandr Siarohin, Yanzhi Wang, Sergey Tulyakov
Abstract summary: Autoencoder (AE) is the key to the success of latent diffusion models for image and video generation. H3AE achieves ultra-high compression ratios and real-time decoding speed on GPU and mobile.
Score: 97.45170082949552
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Autoencoder (AE) is the key to the success of latent diffusion models for image and video generation, reducing the denoising resolution and improving efficiency. However, the power of AE has long been underexplored in terms of network design, compression ratio, and training strategy. In this work, we systematically examine the architecture design choices and optimize the computation distribution to obtain a series of efficient and high-compression video AEs that can decode in real time even on mobile devices. We also propose an omni-training objective to unify the design of plain Autoencoder and image-conditioned I2V VAE, achieving multifunctionality in a single VAE network but with enhanced quality. In addition, we propose a novel latent consistency loss that provides stable improvements in reconstruction quality. Latent consistency loss outperforms prior auxiliary losses including LPIPS, GAN and DWT in terms of both quality improvements and simplicity. H3AE achieves ultra-high compression ratios and real-time decoding speed on GPU and mobile, and outperforms prior arts in terms of reconstruction metrics by a large margin. We finally validate our AE by training a DiT on its latent space and demonstrate fast, high-quality text-to-video generation capability.

Related papers

FCA2: Frame Compression-Aware Autoencoder for Modular and Fast Compressed Video Super-Resolution [68.77813885751308]
State-of-the-art (SOTA) compressed video super-resolution (CVSR) models face persistent challenges, including prolonged inference time, complex training pipelines, and reliance on auxiliary information. We propose an efficient and scalable solution inspired by the structural and statistical similarities between hyperspectral images (HSI) and video data. Our approach introduces a compression-driven dimensionality reduction strategy that reduces computational complexity, accelerates inference, and enhances the extraction of temporal information across frames.
arXiv Detail & Related papers (2025-06-13T07:59:52Z)
DGAE: Diffusion-Guided Autoencoder for Efficient Latent Representation Learning [42.22785629783251]
Autoencoders empower state-of-the-art image and video generative models by compressing pixels into a latent space through visual tokenization. Recent advances have alleviated the performance degradation of autoencoders under high compression ratios, but training instability caused by GAN remains an open challenge. We propose DGAE, which employs a diffusion model to guide the decoder in recovering informative signals that are not fully decoded from the latent representation.
arXiv Detail & Related papers (2025-06-11T12:01:03Z)
Hi-VAE: Efficient Video Autoencoding with Global and Detailed Motion [23.80254637449824]
Hi-VAE formulates an efficient video autoencoding framework that encodes coarse-to-fine motion representations of video dynamics. We show that Hi-VAE exhibits a high compression factor of 1428×, almost 30× higher than baseline methods.
arXiv Detail & Related papers (2025-06-08T13:30:11Z)
LeanVAE: An Ultra-Efficient Reconstruction VAE for Video Diffusion Models [17.29580459404157]
We propose LeanVAE, a novel and ultra-efficient Video VAE framework. Our model offers up to 50x fewer FLOPs and 44x faster inference speed. Our experiments validate LeanVAE's superiority in video reconstruction and generation.
arXiv Detail & Related papers (2025-03-18T14:58:59Z)
Pathology Image Compression with Pre-trained Autoencoders [52.208181380986524]
Whole Slide Images in digital histopathology pose significant storage, transmission, and computational efficiency challenges. Standard compression methods, such as JPEG, reduce file sizes but fail to preserve fine-grained phenotypic details critical for downstream tasks. In this work, we repurpose autoencoders (AEs) designed for Latent Diffusion Models as an efficient learned compression framework for pathology images.
arXiv Detail & Related papers (2025-03-14T17:01:17Z)
REGEN: Learning Compact Video Embedding with (Re-)Generative Decoder [52.698595889988766]
We present a novel perspective on learning video embedders for generative modeling. Rather than requiring an exact reproduction of an input video, an effective embedder should focus on visually plausible reconstructions. We propose replacing the conventional encoder-decoder video embedder with an encoder-generator framework.
arXiv Detail & Related papers (2025-03-11T17:51:07Z)
Rethinking Video Tokenization: A Conditioned Diffusion-based Approach [58.164354605550194]
A new tokenizer, the Conditioned Diffusion-based Tokenizer (CDT), replaces the GAN-based decoder with a conditional diffusion model. We trained it from scratch using only a basic MSE diffusion loss for reconstruction, along with a KL term and an LPIPS perceptual loss. Even a scaled-down version of CDT (3× inference speedup) still performs comparably with top baselines.
arXiv Detail & Related papers (2025-03-05T17:59:19Z)
Large Motion Video Autoencoding with Cross-modal Video VAE [52.13379965800485]
Video Variational Autoencoder (VAE) is essential for reducing video redundancy and facilitating efficient video generation. Existing Video VAEs have begun to address temporal compression; however, they often suffer from inadequate reconstruction performance. We present a novel and powerful video autoencoder capable of high-fidelity video encoding.
arXiv Detail & Related papers (2024-12-23T18:58:24Z)
Deep Compression Autoencoder for Efficient High-Resolution Diffusion Models [38.84567900296605]
Deep Compression Autoencoder (DC-AE) is a new family of autoencoder models for accelerating high-resolution diffusion models. Applying our DC-AE to latent diffusion models, we achieve significant speedup without accuracy drop.
arXiv Detail & Related papers (2024-10-14T17:15:07Z)
Compression-Realized Deep Structural Network for Video Quality Enhancement [78.13020206633524]
This paper focuses on the task of quality enhancement for compressed videos. Most of the existing methods lack a structured design to optimally leverage the priors within compression codecs. A new paradigm is urgently needed for a more "conscious" process of quality enhancement.
arXiv Detail & Related papers (2024-05-10T09:18:17Z)
Data-Model-Circuit Tri-Design for Ultra-Light Video Intelligence on Edge Devices [90.30316433184414]
We propose a data-model-hardware tri-design framework for high-throughput, low-cost, and high-accuracy MOT on HD video stream. Compared to the state-of-the-art MOT baseline, our tri-design approach can achieve 12.5x latency reduction, 20.9x effective frame rate improvement, 5.83x lower power, and 9.78x better energy efficiency, without much accuracy drop.
arXiv Detail & Related papers (2022-10-16T16:21:40Z)
A Unified End-to-End Framework for Efficient Deep Image Compression [35.156677716140635]
We propose a unified framework called Efficient Deep Image Compression (EDIC) based on three new technologies. Specifically, we design an auto-encoder style network for learning-based image compression. Our EDIC method can also be readily incorporated with the Deep Video Compression (DVC) framework to further improve the video compression performance.
arXiv Detail & Related papers (2020-02-09T14:21:08Z)

This list is automatically generated from the titles and abstracts of the papers on this site.