Conditional Video Generation for High-Efficiency Video Compression
- URL: http://arxiv.org/abs/2507.15269v1
- Date: Mon, 21 Jul 2025 06:16:27 GMT
- Title: Conditional Video Generation for High-Efficiency Video Compression
- Authors: Fangqiu Yi, Jingyu Xu, Jiawei Shao, Chi Zhang, Xuelong Li
- Abstract summary: We propose a video compression framework that leverages conditional diffusion models for perceptually optimized reconstruction. Specifically, we reframe video compression as a conditional generation task, where a generative model synthesizes video from sparse yet informative signals.
- Score: 47.011087624381524
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Perceptual studies demonstrate that conditional diffusion models excel at reconstructing video content aligned with human visual perception. Building on this insight, we propose a video compression framework that leverages conditional diffusion models for perceptually optimized reconstruction. Specifically, we reframe video compression as a conditional generation task, where a generative model synthesizes video from sparse yet informative signals. Our approach introduces three key modules: (1) Multi-granular conditioning that captures both static scene structure and dynamic spatio-temporal cues; (2) Compact representations designed for efficient transmission without sacrificing semantic richness; (3) Multi-condition training with modality dropout and role-aware embeddings, which prevent over-reliance on any single modality and enhance robustness. Extensive experiments show that our method significantly outperforms both traditional and neural codecs on perceptual quality metrics such as Fréchet Video Distance (FVD) and LPIPS, especially under high compression ratios.
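The modality dropout mentioned in module (3) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not name the conditioning modalities or the dropout rate, so the keys `keyframe`, `motion`, and `caption`, the function name `apply_modality_dropout`, the default `drop_prob`, and the rule of always keeping at least one modality are all assumptions made for illustration.

```python
import random


def apply_modality_dropout(conditions, drop_prob=0.3, rng=None):
    """Randomly replace conditioning signals with None for one training sample.

    conditions: dict mapping a (hypothetical) modality name to its signal.
    drop_prob:  chance of dropping each modality independently (assumed value).
    At least one modality is always kept, which is an assumption of this
    sketch rather than something stated in the abstract.
    """
    rng = rng or random.Random()
    names = list(conditions)
    # Decide independently for each modality whether it survives.
    kept = {name: rng.random() >= drop_prob for name in names}
    # If everything was dropped, restore one modality chosen at random.
    if names and not any(kept.values()):
        kept[rng.choice(names)] = True
    return {name: (conditions[name] if kept[name] else None) for name in names}


# Hypothetical conditioning signals for one training clip.
sample = {"keyframe": [0.1, 0.2], "motion": [0.3], "caption": "a street at night"}
print(apply_modality_dropout(sample, drop_prob=0.5, rng=random.Random(0)))
```

A model trained on samples like these sees every subset of conditions, which is one plausible way to discourage over-reliance on any single modality.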
Related papers
- REGEN: Learning Compact Video Embedding with (Re-)Generative Decoder [52.698595889988766]
We present a novel perspective on learning video embedders for generative modeling. Rather than requiring an exact reproduction of an input video, an effective embedder should focus on visually plausible reconstructions. We propose replacing the conventional encoder-decoder video embedder with an encoder-generator framework.
arXiv Detail & Related papers (2025-03-11T17:51:07Z)
- Rethinking Video Tokenization: A Conditioned Diffusion-based Approach [58.164354605550194]
A new tokenizer, the Conditioned Diffusion-based Tokenizer (CDT), replaces the GAN-based decoder with a conditional diffusion model. It is trained from scratch using only a basic MSE diffusion loss for reconstruction, along with a KL term and an LPIPS perceptual loss. Even a scaled-down version of CDT (3× inference speedup) still performs comparably with top baselines.
arXiv Detail & Related papers (2025-03-05T17:59:19Z)
- Progressive Growing of Video Tokenizers for Temporally Compact Latent Spaces [20.860632218272094]
Video tokenizers are essential for latent video diffusion models, converting raw video data into latent spaces for efficient training. We propose an alternative approach to enhance temporal compression. We develop a bootstrapped high-temporal-compression model that progressively trains high-compression blocks atop well-trained lower-compression models.
arXiv Detail & Related papers (2025-01-09T18:55:15Z)
- M3-CVC: Controllable Video Compression with Multimodal Generative Models [17.49397141459785]
M3-CVC is a controllable video compression framework incorporating generative models. We show that M3-CVC significantly outperforms the state-of-the-art VVC standard in ultra-low bitrate scenarios.
arXiv Detail & Related papers (2024-11-24T11:56:59Z)
- Improved Video VAE for Latent Video Diffusion Model [55.818110540710215]
Variational Autoencoder (VAE) aims to compress pixel data into low-dimensional latent space, playing an important role in OpenAI's Sora.
Most existing video VAEs inflate a pretrained image VAE into a 3D causal structure for temporal-spatial compression.
We propose a new KTC architecture and a group causal convolution (GCConv) module to further improve video VAE (IV-VAE).
arXiv Detail & Related papers (2024-11-10T12:43:38Z)
- When Video Coding Meets Multimodal Large Language Models: A Unified Paradigm for Video Coding [118.72266141321647]
Cross-Modality Video Coding (CMVC) is a pioneering approach to explore multimodality representation and video generative models in video coding. During decoding, previously encoded components and video generation models are leveraged to create multiple encoding-decoding modes. Experiments indicate that TT2V achieves effective semantic reconstruction, while IT2V exhibits competitive perceptual consistency.
arXiv Detail & Related papers (2024-08-15T11:36:18Z)
- Compression-Realized Deep Structural Network for Video Quality Enhancement [78.13020206633524]
This paper focuses on the task of quality enhancement for compressed videos.
Most existing methods lack a structured design to optimally leverage the priors within compression codecs.
A new paradigm is urgently needed for a more "conscious" process of quality enhancement.
arXiv Detail & Related papers (2024-05-10T09:18:17Z)
- Scene Matters: Model-based Deep Video Compression [13.329074811293292]
We propose a model-based video compression (MVC) framework that regards scenes as the fundamental units for video sequences.
Our proposed MVC directly models novel intensity variation of the entire video sequence in one scene, seeking non-redundant representations instead of reducing redundancy.
Our method achieves up to a 20% bitrate reduction compared to the latest video coding standard, H.266, and is more efficient in decoding than existing video coding strategies.
arXiv Detail & Related papers (2023-03-08T13:15:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.