DGAE: Diffusion-Guided Autoencoder for Efficient Latent Representation Learning
- URL: http://arxiv.org/abs/2506.09644v1
- Date: Wed, 11 Jun 2025 12:01:03 GMT
- Title: DGAE: Diffusion-Guided Autoencoder for Efficient Latent Representation Learning
- Authors: Dongxu Liu, Yuang Peng, Haomiao Tang, Yuwei Chen, Chunrui Han, Zheng Ge, Daxin Jiang, Mingxue Liao,
- Abstract summary: Autoencoders empower state-of-the-art image and video generative models by compressing pixels into a latent space through visual tokenization. Recent advances have alleviated the performance degradation of autoencoders under high compression ratios, but training instability caused by GANs remains an open challenge. We propose DGAE, which employs a diffusion model to guide the decoder in recovering informative signals that are not fully decoded from the latent representation.
- Score: 42.22785629783251
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Autoencoders empower state-of-the-art image and video generative models by compressing pixels into a latent space through visual tokenization. Although recent advances have alleviated the performance degradation of autoencoders under high compression ratios, the training instability caused by GAN-based objectives remains an open challenge. While improving spatial compression, we also aim to minimize the latent space dimensionality, enabling more efficient and compact representations. To tackle these challenges, we focus on improving the decoder's expressiveness. Concretely, we propose DGAE, which employs a diffusion model to guide the decoder in recovering informative signals that are not fully decoded from the latent representation. With this design, DGAE effectively mitigates the performance degradation under high spatial compression rates. At the same time, DGAE achieves state-of-the-art performance with a 2x smaller latent space. When integrated with diffusion models, DGAE demonstrates competitive performance on image generation for ImageNet-1K and shows that this compact latent representation facilitates faster convergence of the diffusion model.
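The pipeline the abstract describes can be caricatured with a toy numerical sketch. Everything below is a hypothetical stand-in, not the paper's implementation: the "latent" is a hard truncation, the "decoder" zero-fills, and the closed-form refinement step plays the role that a trained diffusion model plays in DGAE, namely recovering signal the decoder alone cannot produce from a highly compressed latent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "image" and a lossy latent: keep only the first k coefficients,
# mimicking an aggressive autoencoder bottleneck.
x = rng.standard_normal(64)
k = 8
z = x[:k]  # encode: high spatial compression, small latent


def decode(z):
    """Coarse decoder: zero-fills the information lost in the bottleneck."""
    out = np.zeros(64)
    out[:k] = z
    return out


def refine(x_hat, x_ref, strength=0.2):
    """Stand-in for the diffusion guide: one step pulling the reconstruction
    toward the data manifold (approximated here by the reference signal)."""
    return x_hat + strength * (x_ref - x_hat)


coarse = decode(z)
refined = coarse
for _ in range(10):  # iterative, diffusion-style refinement
    refined = refine(refined, x)

err_coarse = np.linalg.norm(x - coarse)
err_refined = np.linalg.norm(x - refined)
```

After refinement, `err_refined` is strictly smaller than `err_coarse`: the iterative guide recovers detail the bottleneck discarded, which is the qualitative effect DGAE attributes to its diffusion-guided decoder.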
Related papers
- Ultra Lowrate Image Compression with Semantic Residual Coding and Compression-aware Diffusion [28.61304513668606]
ResULIC is a residual-guided ultra lowrate image compression system. It incorporates residual signals into both semantic retrieval and the diffusion-based generation process. It achieves superior objective and subjective performance compared to state-of-the-art diffusion-based methods.
arXiv Detail & Related papers (2025-05-13T06:51:23Z) - H3AE: High Compression, High Speed, and High Quality AutoEncoder for Video Diffusion Models [76.1519545010611]
Autoencoder (AE) is the key to the success of latent diffusion models for image and video generation. In this work, we examine the architecture design choices and optimize the computation distribution to obtain efficient and high-compression video AEs. Our AE achieves an ultra-high compression ratio and real-time decoding speed on mobile while outperforming prior art in terms of reconstruction metrics.
arXiv Detail & Related papers (2025-04-14T17:59:06Z) - Embedding Compression Distortion in Video Coding for Machines [67.97469042910855]
Currently, video transmission serves not only the Human Visual System (HVS) for viewing but also machine perception for analysis. We propose a Compression Distortion Embedding (CDRE) framework, which extracts machine-perception-related distortion representation and embeds it into downstream models. Our framework can effectively boost the rate-task performance of existing codecs with minimal overhead in execution time and number of parameters.
arXiv Detail & Related papers (2025-03-27T13:01:53Z) - MambaIC: State Space Models for High-Performance Learned Image Compression [53.991726013454695]
A high-performance image compression algorithm is crucial for real-time information transmission across numerous fields. Inspired by the effectiveness of state space models (SSMs) in capturing long-range dependencies, we leverage SSMs to address computational inefficiency in existing methods. We propose an enhanced image compression approach through refined context modeling, which we term MambaIC.
arXiv Detail & Related papers (2025-03-16T11:32:34Z) - REGEN: Learning Compact Video Embedding with (Re-)Generative Decoder [52.698595889988766]
We present a novel perspective on learning video embedders for generative modeling. Rather than requiring an exact reproduction of an input video, an effective embedder should focus on visually plausible reconstructions. We propose replacing the conventional encoder-decoder video embedder with an encoder-generator framework.
arXiv Detail & Related papers (2025-03-11T17:51:07Z) - Toward Lightweight and Fast Decoders for Diffusion Models in Image and Video Generation [0.0]
Large Variational Autoencoder decoders can slow down generation and consume considerable GPU memory. We propose custom-trained decoders using lightweight Vision Transformer and Taming Transformer architectures. Experiments show up to 15% overall speed-ups for image generation on COCO 2017 and up to 20 times faster decoding in the sub-module, with additional gains on UCF-101 for video tasks.
arXiv Detail & Related papers (2025-03-06T16:21:49Z) - Deep Compression Autoencoder for Efficient High-Resolution Diffusion Models [38.84567900296605]
Deep Compression Autoencoder (DC-AE) is a new family of autoencoder models for accelerating high-resolution diffusion models. Applying our DC-AE to latent diffusion models, we achieve significant speedup without accuracy drop.
arXiv Detail & Related papers (2024-10-14T17:15:07Z) - Epsilon-VAE: Denoising as Visual Decoding [61.29255979767292]
We propose denoising as decoding, shifting from single-step reconstruction to iterative refinement. Specifically, we replace the decoder with a diffusion process that iteratively refines noise to recover the original image. By adopting iterative reconstruction through diffusion, our autoencoder, namely Epsilon-VAE, achieves high reconstruction quality.
arXiv Detail & Related papers (2024-10-05T08:27:53Z)
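The "denoising as decoding" idea in the Epsilon-VAE summary above can be sketched in a few lines. This is purely illustrative: the real model is a learned conditional diffusion network, whereas the closed-form update below is a hypothetical stand-in that contracts toward the signal a latent is assumed to encode. The point is only the control flow, starting from pure noise and refining iteratively rather than decoding in a single step.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for the clean image implied by a latent code.
target = rng.standard_normal(32)

x_t = rng.standard_normal(32)  # decoding starts from pure noise
for t in range(20):
    # One "denoising" step. A real conditional model would predict the clean
    # signal from (x_t, latent, t); here that prediction is the target itself,
    # so each step contracts the residual by a fixed factor of 0.7.
    x_t = x_t + 0.3 * (target - x_t)

residual = np.linalg.norm(target - x_t)
```

Each iteration shrinks the reconstruction error geometrically, so after 20 steps the residual is a small fraction of the signal norm: iterative refinement converges to the reconstruction that a single-step decoder would have had to produce at once.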
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.