DC-VideoGen: Efficient Video Generation with Deep Compression Video Autoencoder
- URL: http://arxiv.org/abs/2509.25182v1
- Date: Mon, 29 Sep 2025 17:59:31 GMT
- Title: DC-VideoGen: Efficient Video Generation with Deep Compression Video Autoencoder
- Authors: Junyu Chen, Wenkun He, Yuchao Gu, Yuyang Zhao, Jincheng Yu, Junsong Chen, Dongyun Zou, Yujun Lin, Zhekai Zhang, Muyang Li, Haocheng Xi, Ligeng Zhu, Enze Xie, Song Han, Han Cai
- Abstract summary: DC-VideoGen can be applied to any pre-trained video diffusion model. It can be adapted to a deep compression latent space with lightweight fine-tuning.
- Score: 55.26098043655325
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We introduce DC-VideoGen, a post-training acceleration framework for efficient video generation. DC-VideoGen can be applied to any pre-trained video diffusion model, improving efficiency by adapting it to a deep compression latent space with lightweight fine-tuning. The framework builds on two key innovations: (i) a Deep Compression Video Autoencoder with a novel chunk-causal temporal design that achieves 32x/64x spatial and 4x temporal compression while preserving reconstruction quality and generalization to longer videos; and (ii) AE-Adapt-V, a robust adaptation strategy that enables rapid and stable transfer of pre-trained models into the new latent space. Adapting the pre-trained Wan-2.1-14B model with DC-VideoGen requires only 10 GPU days on the NVIDIA H100 GPU. The accelerated models achieve up to 14.8x lower inference latency than their base counterparts without compromising quality, and further enable 2160x3840 video generation on a single GPU. Code: https://github.com/dc-ai-projects/DC-VideoGen.
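The compression ratios quoted in the abstract (32x/64x spatial, 4x temporal) determine how small the latent tensor is that the diffusion model actually operates on. The sketch below is illustrative only, not the paper's code; the latent channel count (`channels=16`) is an assumed value the abstract does not state.

```python
def latent_shape(frames, height, width, spatial=32, temporal=4, channels=16):
    """Hypothetical latent tensor shape for a deep-compression video
    autoencoder with the stated downsampling factors.

    `channels` is an assumption for illustration; the abstract only
    specifies the spatial (32x/64x) and temporal (4x) ratios.
    """
    assert height % spatial == 0 and width % spatial == 0
    assert frames % temporal == 0
    # (latent channels, compressed frames, compressed height, compressed width)
    return (channels, frames // temporal, height // spatial, width // spatial)

# 16 frames at 1024x1024 with a typical 8x spatial VAE vs. 32x deep compression:
base = latent_shape(16, 1024, 1024, spatial=8)   # (16, 4, 128, 128)
deep = latent_shape(16, 1024, 1024, spatial=32)  # (16, 4, 32, 32)
```

Going from 8x to 32x spatial compression shrinks the per-frame token count by (32/8)^2 = 16x, which is where the large inference-latency reductions come from.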
Related papers
- Helios: Real Real-Time Long Video Generation Model [33.34372252025333]
Helios is a 14B autoregressive diffusion model with a unified input representation that supports T2V, I2V, and V2V tasks. Helios consistently outperforms prior methods on both short- and long-video generation. We plan to release the code, base model, and distilled model to support further development by the community.
arXiv Detail & Related papers (2026-03-04T18:45:21Z) - DC-Gen: Post-Training Diffusion Acceleration with Deeply Compressed Latent Space [49.28906188484785]
Existing text-to-image diffusion models excel at generating high-quality images, but face significant efficiency challenges when scaled to high resolutions. This paper introduces DC-Gen, a framework that accelerates text-to-image diffusion models by leveraging a deeply compressed latent space. Specifically, DC-Gen-FLUX reduces the latency of 4K image generation by 53x on the NVIDIA H100 GPU.
arXiv Detail & Related papers (2025-09-29T17:59:25Z) - SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer [116.17385614259574]
We introduce SANA-Video, a small diffusion model that can efficiently generate videos up to 720x1280 resolution and minute-length duration. Two core designs ensure our efficient, effective and long video generation. Given its low cost, SANA-Video achieves competitive performance compared to modern state-of-the-art small diffusion models.
arXiv Detail & Related papers (2025-09-29T12:28:09Z) - GSVR: 2D Gaussian-based Video Representation for 800+ FPS with Hybrid Deformation Field [7.977026024810772]
Implicit neural representations for video have been recognized as a novel and promising video representation. We propose GSVR, a novel 2D Gaussian-based video representation, which achieves 800+ FPS and 35+ PSNR on Bunny. Our method converges much faster than existing methods and also has 10x faster decoding speed compared to other methods.
arXiv Detail & Related papers (2025-07-08T02:13:12Z) - GaussianVideo: Efficient Video Representation and Compression by Gaussian Splatting [10.568851068989973]
Implicit Neural Representation for Videos (NeRV) has introduced a novel paradigm for video representation and compression. We propose a new video representation and compression method based on 2D Gaussian Splatting to efficiently handle video data. Our method reduces memory usage by up to 78.4%, and significantly expedites video processing, achieving 5.5x faster training and 12.5x faster decoding.
arXiv Detail & Related papers (2025-03-06T11:31:08Z) - LTX-Video: Realtime Video Latent Diffusion [4.7789714048042775]
LTX-Video is a transformer-based latent diffusion model. It seamlessly integrates the Video-VAE and the denoising transformer. It achieves faster-than-real-time generation, producing 5 seconds of 24 fps video at 768x512 resolution in just 2 seconds on an Nvidia H100 GPU.
arXiv Detail & Related papers (2024-12-30T19:00:25Z) - Large Motion Video Autoencoding with Cross-modal Video VAE [52.13379965800485]
Video Variational Autoencoder (VAE) is essential for reducing video redundancy and facilitating efficient video generation. Existing Video VAEs have begun to address temporal compression; however, they often suffer from inadequate reconstruction performance. We present a novel and powerful video autoencoder capable of high-fidelity video encoding.
arXiv Detail & Related papers (2024-12-23T18:58:24Z) - REDUCIO! Generating 1K Video within 16 Seconds using Extremely Compressed Motion Latents [110.41795676048835]
One crucial obstacle for large-scale applications is the expensive training and inference cost. We argue that videos contain significantly more redundant information than images, allowing them to be encoded with very few motion latents. We design an image-conditioned VAE that projects videos into extremely compressed latent space and decodes them based on content images.
arXiv Detail & Related papers (2024-11-20T18:59:52Z) - MagicVideo: Efficient Video Generation With Latent Diffusion Models [76.95903791630624]
We present an efficient text-to-video generation framework based on latent diffusion models, termed MagicVideo.
Due to a novel and efficient 3D U-Net design and modeling video distributions in a low-dimensional space, MagicVideo can synthesize video clips with 256x256 spatial resolution on a single GPU card.
We conduct extensive experiments and demonstrate that MagicVideo can generate high-quality video clips with either realistic or imaginary content.
arXiv Detail & Related papers (2022-11-20T16:40:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.