Deep Compression Autoencoder for Efficient High-Resolution Diffusion Models
- URL: http://arxiv.org/abs/2410.10733v6
- Date: Wed, 26 Feb 2025 17:56:06 GMT
- Title: Deep Compression Autoencoder for Efficient High-Resolution Diffusion Models
- Authors: Junyu Chen, Han Cai, Junsong Chen, Enze Xie, Shang Yang, Haotian Tang, Muyang Li, Yao Lu, Song Han
- Abstract summary: Deep Compression Autoencoder (DC-AE) is a new family of autoencoder models for accelerating high-resolution diffusion models. Applying our DC-AE to latent diffusion models, we achieve significant speedup without accuracy drop.
- Score: 38.84567900296605
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present Deep Compression Autoencoder (DC-AE), a new family of autoencoder models for accelerating high-resolution diffusion models. Existing autoencoder models have demonstrated impressive results at a moderate spatial compression ratio (e.g., 8x), but fail to maintain satisfactory reconstruction accuracy at high spatial compression ratios (e.g., 64x). We address this challenge by introducing two key techniques: (1) Residual Autoencoding, where we design our models to learn residuals based on the space-to-channel transformed features, alleviating the optimization difficulty of high spatial-compression autoencoders; (2) Decoupled High-Resolution Adaptation, an efficient decoupled three-phase training strategy that mitigates the generalization penalty of high spatial-compression autoencoders. With these designs, we raise the autoencoder's spatial compression ratio up to 128x while maintaining reconstruction quality. Applying DC-AE to latent diffusion models, we achieve significant speedup with no drop in accuracy. For example, on ImageNet 512x512, DC-AE provides a 19.1x inference speedup and a 17.9x training speedup on an H100 GPU for UViT-H while achieving a better FID than the widely used SD-VAE-f8 autoencoder. Our code is available at https://github.com/mit-han-lab/efficientvit.
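To make the Residual Autoencoding idea concrete, here is a minimal PyTorch sketch of a downsampling block whose learned path only models the residual on top of a space-to-channel shortcut. It is a toy under stated assumptions, not the released implementation: the module name `ResidualDownsample`, the strided-conv learned path, and the group-averaged shortcut channels are illustrative choices; the official repository above is authoritative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualDownsample(nn.Module):
    """Hypothetical downsampling block in the spirit of Residual
    Autoencoding: the conv path learns a residual on top of a
    non-parametric space-to-channel shortcut."""

    def __init__(self, in_ch: int, out_ch: int, factor: int = 2):
        super().__init__()
        assert (in_ch * factor * factor) % out_ch == 0
        self.factor, self.out_ch = factor, out_ch
        # Learned path: strided conv that downsamples by `factor`.
        self.conv = nn.Conv2d(in_ch, out_ch, 3, stride=factor, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Shortcut: fold each factor x factor patch into channels, then
        # group-average the in_ch * factor^2 channels down to out_ch.
        s = F.pixel_unshuffle(x, self.factor)
        b, c, h, w = s.shape
        s = s.reshape(b, self.out_ch, c // self.out_ch, h, w).mean(dim=2)
        return self.conv(x) + s
```

Stacking such blocks compounds the spatial compression: six factor-2 stages give 64x and seven give 128x, while the shortcut keeps each stage's optimization target close to a lossless rearrangement of the input.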
Related papers
- H3AE: High Compression, High Speed, and High Quality AutoEncoder for Video Diffusion Models [76.1519545010611]
The autoencoder (AE) is key to the success of latent diffusion models for image and video generation.
In this work, we examine the architecture design choices and optimize the computation distribution to obtain efficient and high-compression video AEs.
Our AE achieves an ultra-high compression ratio and real-time decoding speed on mobile while outperforming prior art in terms of reconstruction metrics.
arXiv Detail & Related papers (2025-04-14T17:59:06Z) - REGEN: Learning Compact Video Embedding with (Re-)Generative Decoder [52.698595889988766]
We present a novel perspective on learning video embedders for generative modeling.
Rather than requiring an exact reproduction of an input video, an effective embedder should focus on visually plausible reconstructions.
We propose replacing the conventional encoder-decoder video embedder with an encoder-generator framework.
arXiv Detail & Related papers (2025-03-11T17:51:07Z) - Representing 3D Shapes With 64 Latent Vectors for 3D Diffusion Models [21.97308739556984]
COD-VAE encodes 3D shapes into a COmpact set of 1D latent vectors without sacrificing quality.
COD-VAE achieves 16x compression compared to the baseline while maintaining quality.
This enables 20.8x speedup in generation, highlighting that a large number of latent vectors is not a prerequisite for high-quality reconstruction and generation.
arXiv Detail & Related papers (2025-03-11T06:29:39Z) - Toward Lightweight and Fast Decoders for Diffusion Models in Image and Video Generation [0.0]
Large Variational Autoencoder decoders can slow down generation and consume considerable GPU memory.
We propose custom-trained decoders using lightweight Vision Transformer and Taming Transformer architectures.
Experiments show up to 15% overall speed-ups for image generation on COCO 2017 and up to 20 times faster decoding in the sub-module, with additional gains on UCF-101 for video tasks.
arXiv Detail & Related papers (2025-03-06T16:21:49Z) - Towards Practical Real-Time Neural Video Compression [60.390180067626396]
We introduce a practical real-time neural video codec (NVC) designed to deliver a high compression ratio, low latency, and broad versatility.
Experiments show that our proposed DCVC-RT achieves an impressive average encoding/decoding speed of 125.2/112.8 frames per second for 1080p video, while saving an average of 21% in bitrate compared to H.266/VTM.
arXiv Detail & Related papers (2025-02-28T06:32:23Z) - Improving the Diffusability of Autoencoders [54.920783089085035]
Latent diffusion models have emerged as the leading approach for generating high-quality images and videos.
We perform a spectral analysis of modern autoencoders and identify inordinate high-frequency components in their latent spaces.
We hypothesize that these high-frequency components interfere with the coarse-to-fine nature of the diffusion synthesis process and hinder generation quality.
arXiv Detail & Related papers (2025-02-20T18:45:44Z) - Accelerating Learned Video Compression via Low-Resolution Representation Learning [18.399027308582596]
We introduce an efficiency-optimized framework for learned video compression that focuses on low-resolution representation learning.
Our method achieves performance on par with the low-delay P configuration of the H.266 reference software VTM.
arXiv Detail & Related papers (2024-07-23T12:02:57Z) - Hierarchical Patch Diffusion Models for High-Resolution Video Generation [50.42746357450949]
We develop deep context fusion, which propagates context information from low-scale to high-scale patches in a hierarchical manner.
We also propose adaptive computation, which allocates more network capacity and computation towards coarse image details.
The resulting model sets a new state-of-the-art FVD score of 66.32 and Inception Score of 87.68 in class-conditional video generation.
arXiv Detail & Related papers (2024-06-12T01:12:53Z) - Extreme Encoder Output Frame Rate Reduction: Improving Computational Latencies of Large End-to-End Models [59.57732929473519]
We apply multiple frame reduction layers in the encoder to compress encoder outputs into a small number of output frames.
We demonstrate that we can generate one encoder output frame for every 2.56 sec of input speech, without significantly affecting word error rate on a large-scale voice search task.
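A frame-reduction layer of this kind can be sketched as stacking adjacent encoder frames along the feature axis and projecting back down; the module name, the linear projection, and the padding policy below are assumptions for illustration, not the paper's exact layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameReduction(nn.Module):
    """Illustrative layer that collapses every `ratio` consecutive
    encoder frames into one by concatenation plus projection."""

    def __init__(self, dim: int, ratio: int):
        super().__init__()
        self.ratio = ratio
        self.proj = nn.Linear(dim * ratio, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape                     # (batch, frames, features)
        pad = (-t) % self.ratio               # right-pad to a multiple
        if pad:
            x = F.pad(x, (0, 0, 0, pad))
        x = x.reshape(b, -1, d * self.ratio)  # stack adjacent frames
        return self.proj(x)                   # back to `dim` features
```

Cascading several such layers multiplies the reduction ratio, which is how an encoder can reach output rates on the order of one frame per several seconds of speech.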
arXiv Detail & Related papers (2024-02-27T03:40:44Z) - Low-complexity Deep Video Compression with A Distributed Coding Architecture [4.5885672744218]
Prevalent predictive coding-based video compression methods rely on a heavy encoder to reduce temporal redundancy.
Traditional distributed coding methods suffer from a substantial performance gap relative to predictive coding ones.
We propose the first end-to-end distributed deep video compression framework to improve rate-distortion performance.
arXiv Detail & Related papers (2023-03-21T05:34:04Z) - A Unified Cascaded Encoder ASR Model for Dynamic Model Sizes [54.83802872236367]
We propose a dynamic cascaded encoder Automatic Speech Recognition (ASR) model, which unifies models for different deployment scenarios.
The proposed large-medium model is 30% smaller and reduces power consumption by 33%, compared to the baseline cascaded encoder model.
The triple-size model that unifies the large, medium, and small models achieves 37% total size reduction with minimal quality loss.
arXiv Detail & Related papers (2022-04-13T04:15:51Z) - Instance-Adaptive Video Compression: Improving Neural Codecs by Training on the Test Set [14.89208053104896]
We introduce a video compression algorithm based on instance-adaptive learning.
On each video sequence to be transmitted, we finetune a pretrained compression model.
We show that it enables a competitive performance even after reducing the network size by 70%.
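The core loop is easy to sketch: copy a pretrained codec and finetune it on the one sequence being transmitted under a rate-distortion objective. The `codec(frame) -> (reconstruction, rate_bits)` interface, the MSE distortion, and all hyperparameters below are assumptions for illustration.

```python
import copy
import torch

def adapt_to_sequence(codec, frames, lmbda=0.01, steps=200, lr=1e-4):
    """Finetune a copy of a pretrained neural codec on a single video
    sequence (the test set itself), minimizing distortion + lambda * rate.
    The resulting weight update, suitably compressed, is signaled to the
    decoder alongside the bitstream."""
    adapted = copy.deepcopy(codec)
    opt = torch.optim.Adam(adapted.parameters(), lr=lr)
    for _ in range(steps):
        for frame in frames:
            recon, rate_bits = adapted(frame)
            distortion = torch.mean((recon - frame) ** 2)
            loss = distortion + lmbda * rate_bits
            opt.zero_grad()
            loss.backward()
            opt.step()
    return adapted
```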
arXiv Detail & Related papers (2021-11-19T16:25:34Z) - Towards Compact CNNs via Collaborative Compression [166.86915086497433]
We propose a Collaborative Compression scheme that jointly applies channel pruning and tensor decomposition to compress CNN models; a toy sketch of the decomposition half follows below.
We achieve 52.9% FLOPs reduction by removing 48.4% parameters on ResNet-50 with only a Top-1 accuracy drop of 0.56% on ImageNet 2012.
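The tensor-decomposition half of such a scheme can be illustrated by replacing a layer with a truncated-SVD factorization. The helper below is a hypothetical sketch for a linear layer only; the paper couples decomposition with channel pruning and targets convolutions as well.

```python
import torch
import torch.nn as nn

def low_rank_factorize(layer: nn.Linear, rank: int) -> nn.Sequential:
    """Replace a linear layer with a rank-`rank` two-layer factorization
    built from a truncated SVD of its weight matrix."""
    W = layer.weight.data                       # (out_features, in_features)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    down = nn.Linear(W.shape[1], rank, bias=False)
    up = nn.Linear(rank, W.shape[0], bias=layer.bias is not None)
    down.weight.data = (torch.diag(S[:rank]) @ Vh[:rank]).contiguous()
    up.weight.data = U[:, :rank].contiguous()
    if layer.bias is not None:
        up.bias.data = layer.bias.data.clone()
    return nn.Sequential(down, up)              # y ~= U_r S_r V_r^T x + b
```

Parameters drop from `out * in` to `rank * (out + in)`, so the FLOPs saving follows directly from the chosen rank.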
arXiv Detail & Related papers (2021-05-24T12:07:38Z) - Conditional Entropy Coding for Efficient Video Compression [82.35389813794372]
We propose a very simple and efficient video compression framework that only focuses on modeling the conditional entropy between frames.
We first show that a simple architecture modeling the entropy between the image latent codes is competitive with other neural video compression works and standard video codecs.
We then propose a novel internal learning extension on top of this architecture that brings an additional 10% savings without trading off decoding speed.
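A conditional entropy model of this flavor can be sketched as a small network that predicts a per-element Gaussian over the current latent from the previous one and scores it in bits. The architecture and the continuous-Gaussian likelihood below are assumptions; practical codecs use discretized likelihoods and an arithmetic coder.

```python
import math
import torch
import torch.nn as nn

class ConditionalEntropyModel(nn.Module):
    """Illustrative model of p(z_t | z_prev): predict a mean and scale
    per latent element, then measure the code length in bits."""

    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(dim, 2 * dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(2 * dim, 2 * dim, 3, padding=1))

    def forward(self, z_t: torch.Tensor, z_prev: torch.Tensor) -> torch.Tensor:
        mean, log_scale = self.net(z_prev).chunk(2, dim=1)
        dist = torch.distributions.Normal(mean, log_scale.exp())
        nats = -dist.log_prob(z_t).sum()   # -log p(z_t | z_prev)
        return nats / math.log(2.0)        # approximate bits per frame
```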
arXiv Detail & Related papers (2020-08-20T20:01:59Z) - Video compression with low complexity CNN-based spatial resolution adaptation [15.431248645312309]
Spatial resolution adaptation can be integrated within video compression to improve overall coding performance.
A novel framework is proposed which supports the flexible allocation of complexity between the encoder and decoder.
arXiv Detail & Related papers (2020-07-29T10:20:36Z) - A Unified End-to-End Framework for Efficient Deep Image Compression [35.156677716140635]
We propose a unified framework called Efficient Deep Image Compression (EDIC) based on three new technologies.
Specifically, we design an auto-encoder style network for learning-based image compression.
Our EDIC method can also be readily incorporated with the Deep Video Compression (DVC) framework to further improve the video compression performance.
arXiv Detail & Related papers (2020-02-09T14:21:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site.