DC-Gen: Post-Training Diffusion Acceleration with Deeply Compressed Latent Space
- URL: http://arxiv.org/abs/2509.25180v2
- Date: Wed, 01 Oct 2025 02:18:37 GMT
- Title: DC-Gen: Post-Training Diffusion Acceleration with Deeply Compressed Latent Space
- Authors: Wenkun He, Yuchao Gu, Junyu Chen, Dongyun Zou, Yujun Lin, Zhekai Zhang, Haocheng Xi, Muyang Li, Ligeng Zhu, Jincheng Yu, Junsong Chen, Enze Xie, Song Han, Han Cai,
- Abstract summary: Existing text-to-image diffusion models excel at generating high-quality images, but face significant efficiency challenges when scaled to high resolutions. This paper introduces DC-Gen, a framework that accelerates text-to-image diffusion models by leveraging a deeply compressed latent space. Specifically, DC-Gen-FLUX reduces the latency of 4K image generation by 53x on the NVIDIA H100 GPU.
- Score: 49.28906188484785
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Existing text-to-image diffusion models excel at generating high-quality images, but face significant efficiency challenges when scaled to high resolutions, like 4K image generation. While previous research accelerates diffusion models in various aspects, it seldom handles the inherent redundancy within the latent space. To bridge this gap, this paper introduces DC-Gen, a general framework that accelerates text-to-image diffusion models by leveraging a deeply compressed latent space. Rather than a costly training-from-scratch approach, DC-Gen uses an efficient post-training pipeline to preserve the quality of the base model. A key challenge in this paradigm is the representation gap between the base model's latent space and a deeply compressed latent space, which can lead to instability during direct fine-tuning. To overcome this, DC-Gen first bridges the representation gap with a lightweight embedding alignment training. Once the latent embeddings are aligned, only a small amount of LoRA fine-tuning is needed to unlock the base model's inherent generation quality. We verify DC-Gen's effectiveness on SANA and FLUX.1-Krea. The resulting DC-Gen-SANA and DC-Gen-FLUX models achieve quality comparable to their base models but with a significant speedup. Specifically, DC-Gen-FLUX reduces the latency of 4K image generation by 53x on the NVIDIA H100 GPU. When combined with NVFP4 SVDQuant, DC-Gen-FLUX generates a 4K image in just 3.5 seconds on a single NVIDIA 5090 GPU, achieving a total latency reduction of 138x compared to the base FLUX.1-Krea model. Code: https://github.com/dc-ai-projects/DC-Gen.
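The two-stage recipe described in the abstract (lightweight embedding alignment, then a small amount of LoRA fine-tuning) can be sketched in miniature. The shapes, the linear alignment map, and the rank-2 LoRA factors below are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
d_base, d_deep, rank = 8, 4, 2

# Stage 1 (embedding alignment): a lightweight map lifts deeply compressed
# latents into the base model's latent space. Here it is a single linear
# layer; in practice it would be trained to close the representation gap.
z_deep = rng.normal(size=(16, d_deep))        # deeply compressed latents
W_align = rng.normal(size=(d_deep, d_base))   # alignment map (stage-1 weights)
z_aligned = z_deep @ W_align                  # latents in base-model space

# Stage 2 (LoRA fine-tuning): the frozen base weight W is adapted with a
# low-rank correction B @ A instead of retraining W from scratch.
W = rng.normal(size=(d_base, d_base))         # frozen base-model weight
A = rng.normal(size=(rank, d_base)) * 0.01    # LoRA down/up factors
B = rng.normal(size=(d_base, rank)) * 0.01
out = z_aligned @ (W + B @ A)                 # base path + low-rank correction
```

The key property this illustrates is cost: only `W_align`, `A`, and `B` are trained, which is far fewer parameters than the base model itself.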
Related papers
- One Small Step in Latent, One Giant Leap for Pixels: Fast Latent Upscale Adapter for Your Diffusion Models [45.92038137978053]
We present the Latent Upscaler Adapter (LUA), a lightweight module that performs super-resolution directly on the generator's latent code. LUA integrates as a drop-in component, requiring no modifications to the base model or additional diffusion stages. A shared Swin-style backbone with scale-specific pixel-shuffle heads supports 2x and 4x factors and remains compatible with image-space SR baselines.
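The pixel-shuffle heads mentioned above rely on the standard sub-pixel rearrangement, which trades channels for spatial resolution. A minimal sketch of that operation (generic, not LUA's actual head) in NumPy:

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange (C*r*r, H, W) -> (C, H*r, W*r), the standard sub-pixel op."""
    c_r2, h, w = x.shape
    c = c_r2 // (r * r)
    x = x.reshape(c, r, r, h, w)          # split channels into (c, r1, r2)
    x = x.transpose(0, 3, 1, 4, 2)        # interleave: (c, h, r1, w, r2)
    return x.reshape(c, h * r, w * r)

# A 4-channel 3x3 "latent" upscales to a single-channel 6x6 map at r=2.
latent = np.arange(4 * 3 * 3, dtype=float).reshape(4, 3, 3)
up = pixel_shuffle(latent, 2)
```

Because the upscaling happens before decoding, the expensive pixel-space work is avoided entirely; only the cheap latent rearrangement and a learned head run at generation time.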
arXiv Detail & Related papers (2025-11-13T18:54:18Z)
- DC-VideoGen: Efficient Video Generation with Deep Compression Video Autoencoder [55.26098043655325]
DC-VideoGen can be applied to any pre-trained video diffusion model. It can be adapted to a deep compression latent space with lightweight fine-tuning.
arXiv Detail & Related papers (2025-09-29T17:59:31Z)
- DC-AE 1.5: Accelerating Diffusion Model Convergence with Structured Latent Space [31.531194096383896]
We present DC-AE 1.5, a new family of deep compression autoencoders for high-resolution diffusion models. We introduce two key innovations to address this challenge: Structured Latent Space and Augmented Diffusion Training. On ImageNet 512x512, DC-AE-1.5-f64c128 delivers better image generation quality than DC-AE-f32c32 while being 4x faster.
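Assuming the fXcY naming follows the DC-AE convention (spatial downsampling factor X, Y latent channels), the 4x speedup quoted above is consistent with simple token-count arithmetic:

```python
def latent_tokens(resolution, downsample_factor):
    """Number of spatial latent tokens for a square image."""
    return (resolution // downsample_factor) ** 2

# For a 512x512 image:
f32_tokens = latent_tokens(512, 32)   # DC-AE-f32c32: a 16x16 latent grid
f64_tokens = latent_tokens(512, 64)   # DC-AE-1.5-f64c128: an 8x8 latent grid
speedup_factor = f32_tokens // f64_tokens
```

Doubling the downsampling factor quarters the token count, and diffusion-model cost scales at least linearly in tokens, hence the reported 4x.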
arXiv Detail & Related papers (2025-08-01T08:11:07Z)
- CHORDS: Diffusion Sampling Accelerator with Multi-core Hierarchical ODE Solvers [72.23291099555459]
Diffusion-based generative models have become dominant generators of high-fidelity images and videos but remain limited by their computationally expensive inference procedures. This paper explores a general, training-free, and model-agnostic acceleration strategy via multi-core parallelism. CHORDS significantly accelerates sampling across diverse large-scale image and video diffusion models, yielding up to 2.1x speedup with four cores (a 50% improvement over baselines) and 2.9x speedup with eight cores, all without quality degradation.
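CHORDS' own solver hierarchy is more involved, but the core idea of replacing strictly sequential ODE steps with parallelizable refinement can be illustrated with a toy Picard-style sweep over an entire trajectory (the drift function and step counts below are illustrative, not the paper's method):

```python
import numpy as np

f = lambda x: -x          # stand-in for the learned drift of the sampling ODE
x0, n, dt = 1.0, 20, 0.05

# Sequential Euler baseline: each step waits for the previous one.
seq = [x0]
for _ in range(n):
    seq.append(seq[-1] + dt * f(seq[-1]))

# Parallel-friendly alternative: hold the whole trajectory and refine every
# step simultaneously from the previous sweep's values. Each sweep is one
# vectorized update, so cores can each own a slice of the timesteps.
traj = np.full(n + 1, x0)
for _ in range(50):       # enough sweeps for the fixed point to propagate
    traj[1:] = traj[:-1] + dt * f(traj[:-1])
```

After k sweeps the first k steps are exact, so the iteration converges to the sequential solution while exposing per-step parallelism; real systems stop early once successive sweeps agree.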
arXiv Detail & Related papers (2025-07-21T05:48:47Z)
- DC-AR: Efficient Masked Autoregressive Image Generation with Deep Compression Hybrid Tokenizer [32.64616770377737]
DC-AR is a novel masked autoregressive (AR) text-to-image generation framework. It delivers superior image generation quality with exceptional computational efficiency.
arXiv Detail & Related papers (2025-07-07T12:45:23Z)
- When Worse is Better: Navigating the compression-generation tradeoff in visual tokenization [92.17160980120404]
We introduce Causally Regularized Tokenization (CRT), which uses knowledge of the stage 2 generation modeling procedure to embed useful inductive biases in stage 1 latents. CRT makes stage 1 reconstruction performance worse, but makes stage 2 generation performance better by making the tokens easier to model. We match state-of-the-art discrete autoregressive ImageNet generation (2.18 FID) with less than half the tokens per image.
arXiv Detail & Related papers (2024-12-20T20:32:02Z)
- Supercharged One-step Text-to-Image Diffusion Models with Negative Prompts [19.609393551644562]
We introduce Negative-Away Steer Attention (NASA), an efficient method that integrates negative prompts into one-step diffusion models. NASA operates within the intermediate representation space by leveraging cross-attention mechanisms to suppress undesired visual attributes.
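Reduced to its essence, "steering away" means subtracting a weighted negative-prompt direction from the hidden state inside the network. The update rule and names below are an illustrative simplification, not NASA's actual cross-attention formulation:

```python
import numpy as np

def steer_away(h_pos, h_neg, w=0.5):
    """Push features attended from the positive prompt away from the
    direction attended from the negative prompt, scaled by weight w."""
    return h_pos - w * h_neg

# Toy hidden states: the middle dimension carries the unwanted attribute.
h_pos = np.array([1.0, 2.0, 3.0])   # features from the positive prompt
h_neg = np.array([0.0, 2.0, 0.0])   # features from the negative prompt
h = steer_away(h_pos, h_neg, w=0.5)
```

Because this happens in the intermediate representation rather than by mixing two full denoising passes, it fits one-step models where classical classifier-free negative guidance (which needs multiple steps) does not apply.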
arXiv Detail & Related papers (2024-12-03T18:56:32Z)
- HART: Efficient Visual Generation with Hybrid Autoregressive Transformer [33.97880303341509]
We introduce Hybrid Autoregressive Transformer (HART), an autoregressive (AR) visual generation model capable of directly generating 1024x1024 images.
Our approach improves reconstruction FID from 2.11 to 0.30 on MJHQ-30K, leading to a 31% generation FID improvement from 7.85 to 5.38.
HART also outperforms state-of-the-art diffusion models in both FID and CLIP score, with 4.5-7.7x higher throughput and 6.9-13.4x lower MACs.
arXiv Detail & Related papers (2024-10-14T17:59:42Z)
- LinFusion: 1 GPU, 1 Minute, 16K Image [71.44735417472043]
We introduce a low-rank approximation of a wide spectrum of popular linear token mixers.
We find that the distilled model, termed LinFusion, achieves performance on par with or superior to the original SD.
Experiments on SD-v1.5, SD-v2.1, and SD-XL demonstrate that LinFusion enables satisfactory and efficient zero-shot cross-resolution generation.
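LinFusion's mixer is distilled from SD's attention, but the linear-complexity family it belongs to shares one trick: reassociating the attention product so cost grows linearly rather than quadratically in token count. A generic sketch (the feature map `phi` and shapes are illustrative):

```python
import numpy as np

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0.0) + 1e-6):
    """Kernelized attention: computing phi(K)^T V first costs O(N * d^2),
    versus O(N^2 * d) for forming the full N x N attention matrix."""
    KV = phi(K).T @ V                       # (d, d_v) summary of keys/values
    Z = phi(K).sum(axis=0)                  # (d,) normalizer accumulator
    return (phi(Q) @ KV) / (phi(Q) @ Z)[:, None]

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(16, 8)) for _ in range(3))
out = linear_attention(Q, K, V)
```

This reassociation is what makes 16K-resolution generation tractable: at such resolutions the token count N dwarfs the head dimension d, so O(N) mixing dominates the savings.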
arXiv Detail & Related papers (2024-09-03T17:54:39Z)
- Efficient Deformable ConvNets: Rethinking Dynamic and Sparse Operator for Vision Applications [108.44482683870888]
We introduce Deformable Convolution v4 (DCNv4), a highly efficient and effective operator designed for a broad spectrum of vision applications.
DCNv4 addresses the limitations of its predecessor, DCNv3, with two key enhancements.
It demonstrates exceptional performance across various tasks, including image classification, instance and semantic segmentation, and notably, image generation.
arXiv Detail & Related papers (2024-01-11T14:53:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of the listed information and accepts no responsibility for any consequences of its use.