DC-AE 1.5: Accelerating Diffusion Model Convergence with Structured Latent Space
- URL: http://arxiv.org/abs/2508.00413v1
- Date: Fri, 01 Aug 2025 08:11:07 GMT
- Title: DC-AE 1.5: Accelerating Diffusion Model Convergence with Structured Latent Space
- Authors: Junyu Chen, Dongyun Zou, Wenkun He, Junsong Chen, Enze Xie, Song Han, Han Cai
- Abstract summary: We present DC-AE 1.5, a new family of deep compression autoencoders for high-resolution diffusion models. We introduce two key innovations to address this challenge: Structured Latent Space and Augmented Diffusion Training. On ImageNet 512x512, DC-AE-1.5-f64c128 delivers better image generation quality than DC-AE-f32c32 while being 4x faster.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present DC-AE 1.5, a new family of deep compression autoencoders for high-resolution diffusion models. Increasing the autoencoder's latent channel number is a highly effective approach for improving its reconstruction quality. However, it results in slow convergence for diffusion models, leading to poorer generation quality despite better reconstruction quality. This issue limits the quality upper bound of latent diffusion models and hinders the adoption of autoencoders with higher spatial compression ratios. We introduce two key innovations to address this challenge: i) Structured Latent Space, a training-based approach to impose a desired channel-wise structure on the latent space, with front latent channels capturing object structures and later latent channels capturing image details; ii) Augmented Diffusion Training, an augmented diffusion training strategy with additional diffusion training objectives on object latent channels to accelerate convergence. With these techniques, DC-AE 1.5 delivers faster convergence and better diffusion scaling results than DC-AE. On ImageNet 512x512, DC-AE-1.5-f64c128 delivers better image generation quality than DC-AE-f32c32 while being 4x faster. Code: https://github.com/dc-ai-projects/DC-Gen.
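The Augmented Diffusion Training idea described above can be illustrated with a minimal numpy sketch: on top of the standard denoising objective over all latent channels, an auxiliary objective is applied to the front "object" channels only. Note that `eps_pred_fn`, the channel counts, the noise schedule, and the unit loss weighting here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def augmented_diffusion_loss(z0, eps_pred_fn, n_object_channels, snr=0.7):
    """Sketch of an augmented diffusion objective (hypothetical API).

    z0: clean latent of shape (C, H, W); the first `n_object_channels`
    channels are assumed to encode coarse object structure, the
    remaining channels fine image detail.
    """
    eps = rng.standard_normal(z0.shape)
    # Forward noising at a fixed signal-to-noise ratio (toy schedule).
    zt = np.sqrt(snr) * z0 + np.sqrt(1.0 - snr) * eps
    eps_hat = eps_pred_fn(zt)

    # Standard epsilon-prediction objective over all latent channels.
    full_loss = np.mean((eps_hat - eps) ** 2)
    # Auxiliary objective on the object channels only, intended to
    # accelerate convergence on coarse structure.
    k = n_object_channels
    object_loss = np.mean((eps_hat[:k] - eps[:k]) ** 2)
    return full_loss + object_loss
```

In a real training loop the two terms would typically carry separate weights and a sampled timestep; the point of the sketch is only that the object channels receive an extra supervision signal.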
Related papers
- LCUDiff: Latent Capacity Upgrade Diffusion for Faithful Human Body Restoration [23.264518366939825]
Existing methods for restoring degraded human-centric images often struggle with insufficient fidelity. We propose LCUDiff, a stable one-step framework that upgrades a pre-trained latent diffusion model. Experiments on synthetic and real-world datasets show competitive results with higher fidelity and fewer artifacts.
arXiv Detail & Related papers (2026-02-04T10:37:46Z) - DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation [93.6273078684831]
We propose a frequency-DeCoupled pixel diffusion framework to pursue a more efficient pixel diffusion paradigm. With the intuition to decouple the generation of high and low frequency components, we leverage a lightweight pixel decoder to generate high-frequency details conditioned on semantic guidance. Experiments show that DeCo achieves superior performance among pixel diffusion models, attaining FID of 1.62 (256x256) and 2.22 (512x512) on ImageNet.
arXiv Detail & Related papers (2025-11-24T17:59:06Z) - DC-Gen: Post-Training Diffusion Acceleration with Deeply Compressed Latent Space [49.28906188484785]
Existing text-to-image diffusion models excel at generating high-quality images, but face significant efficiency challenges when scaled to high resolutions. This paper introduces DC-Gen, a framework that accelerates text-to-image diffusion models by leveraging a deeply compressed latent space. Specifically, DC-Gen-FLUX reduces the latency of 4K image generation by 53x on the NVIDIA H100 GPU.
arXiv Detail & Related papers (2025-09-29T17:59:25Z) - CHORDS: Diffusion Sampling Accelerator with Multi-core Hierarchical ODE Solvers [72.23291099555459]
Diffusion-based generative models have become dominant generators of high-fidelity images and videos but remain limited by their computationally expensive inference procedures. This paper explores a general, training-free, and model-agnostic acceleration strategy via multi-core parallelism. CHORDS significantly accelerates sampling across diverse large-scale image and video diffusion models, yielding up to 2.1x speedup with four cores, improving by 50% over baselines, and 2.9x speedup with eight cores, all without quality degradation.
arXiv Detail & Related papers (2025-07-21T05:48:47Z) - DGAE: Diffusion-Guided Autoencoder for Efficient Latent Representation Learning [42.22785629783251]
Autoencoders empower state-of-the-art image and video generative models by compressing pixels into a latent space through visual tokenization. Recent advances have alleviated the performance degradation of autoencoders under high compression ratios, but training instability caused by GAN remains an open challenge. We propose DGAE, which employs a diffusion model to guide the decoder in recovering informative signals that are not fully decoded from the latent representation.
arXiv Detail & Related papers (2025-06-11T12:01:03Z) - DiCo: Revitalizing ConvNets for Scalable and Efficient Diffusion Modeling [53.33281984430122]
Diffusion Transformer (DiT) is a promising diffusion model for visual generation but incurs significant computational overhead. In this paper, we revisit convolution as an alternative building block for constructing efficient and expressive diffusion models. We introduce Diffusion ConvNet (DiCo), a family of diffusion models built entirely from standard ConvNet modules.
arXiv Detail & Related papers (2025-05-16T12:54:04Z) - H3AE: High Compression, High Speed, and High Quality AutoEncoder for Video Diffusion Models [76.1519545010611]
Autoencoder (AE) is the key to the success of latent diffusion models for image and video generation. In this work, we examine the architecture design choices and optimize the computation distribution to obtain efficient and high-compression video AEs. Our AE achieves an ultra-high compression ratio and real-time decoding speed on mobile while outperforming prior art in terms of reconstruction metrics.
arXiv Detail & Related papers (2025-04-14T17:59:06Z) - One-Step Diffusion Model for Image Motion-Deblurring [85.76149042561507]
We propose a one-step diffusion model for deblurring (OSDD), a novel framework that reduces the denoising process to a single step. To tackle fidelity loss in diffusion models, we introduce an enhanced variational autoencoder (eVAE), which improves structural restoration. Our method achieves strong performance on both full-reference and no-reference metrics.
arXiv Detail & Related papers (2025-03-09T09:39:57Z) - Improving the Diffusability of Autoencoders [54.920783089085035]
Latent diffusion models have emerged as the leading approach for generating high-quality images and videos. We perform a spectral analysis of modern autoencoders and identify inordinate high-frequency components in their latent spaces. We hypothesize that this high-frequency component interferes with the coarse-to-fine nature of the diffusion synthesis process and hinders the generation quality.
arXiv Detail & Related papers (2025-02-20T18:45:44Z) - Deep Compression Autoencoder for Efficient High-Resolution Diffusion Models [38.84567900296605]
Deep Compression Autoencoder (DC-AE) is a new family of autoencoder models for accelerating high-resolution diffusion models. Applying our DC-AE to latent diffusion models, we achieve significant speedup without accuracy drop.
arXiv Detail & Related papers (2024-10-14T17:15:07Z) - Latent Denoising Diffusion GAN: Faster sampling, Higher image quality [0.0]
Latent Denoising Diffusion GAN employs pre-trained autoencoders to compress images into a compact latent space.
Compared to its predecessors, DiffusionGAN and Wavelet Diffusion, our model shows remarkable improvements in all evaluation metrics.
arXiv Detail & Related papers (2024-06-17T16:32:23Z) - DeepCache: Accelerating Diffusion Models for Free [65.02607075556742]
DeepCache is a training-free paradigm that accelerates diffusion models from the perspective of model architecture.
DeepCache capitalizes on the inherent temporal redundancy observed in the sequential denoising steps of diffusion models.
Under the same throughput, DeepCache effectively achieves comparable or even marginally improved results with DDIM or PLMS.
arXiv Detail & Related papers (2023-12-01T17:01:06Z)
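The DeepCache entry above describes reusing features across adjacent denoising steps. A minimal sketch of that caching pattern, with entirely illustrative `deep_block`/`shallow_block` names standing in for the expensive high-level and cheap low-level parts of a U-Net, might look like this:

```python
import numpy as np

def run_sampler(num_steps, deep_block, shallow_block, cache_interval=3):
    """Sketch of DeepCache-style feature reuse (hypothetical API).

    The expensive `deep_block` is re-evaluated only every
    `cache_interval` denoising steps; in between, its cached output is
    reused, exploiting the temporal redundancy between adjacent steps.
    """
    x = np.zeros(4)          # toy state standing in for the noisy sample
    cached = None
    deep_calls = 0
    for step in range(num_steps):
        if step % cache_interval == 0 or cached is None:
            cached = deep_block(x)   # full pass through the deep layers
            deep_calls += 1
        x = shallow_block(x, cached) # cheap shallow layers run each step
    return x, deep_calls
```

With `num_steps=10` and `cache_interval=3`, the deep layers run only at steps 0, 3, 6, and 9, i.e. 4 times instead of 10, which is the source of the speedup at equal step count.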
This list is automatically generated from the titles and abstracts of the papers on this site.