Free-GVC: Towards Training-Free Extreme Generative Video Compression with Temporal Coherence
- URL: http://arxiv.org/abs/2602.09868v1
- Date: Tue, 10 Feb 2026 15:12:51 GMT
- Title: Free-GVC: Towards Training-Free Extreme Generative Video Compression with Temporal Coherence
- Authors: Xiaoyue Ling, Chuqin Zhou, Chunyi Li, Yunuo Chen, Yuan Tian, Guo Lu, Wenjun Zhang
- Abstract summary: Free-GVC is a training-free generative video compression framework. Our method operates at the group-of-pictures level, encoding video segments into a compact latent space. Experiments show that Free-GVC achieves an average of 93.29% BD-Rate reduction in DISTS over the latest neural codec DCVC-RT.
- Score: 30.812937732503457
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Building on recent advances in video generation, generative video compression has emerged as a new paradigm for achieving visually pleasing reconstructions. However, existing methods exhibit limited exploitation of temporal correlations, causing noticeable flicker and degraded temporal coherence at ultra-low bitrates. In this paper, we propose Free-GVC, a training-free generative video compression framework that reformulates video coding as latent trajectory compression guided by a video diffusion prior. Our method operates at the group-of-pictures (GOP) level, encoding video segments into a compact latent space and progressively compressing them along the diffusion trajectory. To ensure perceptually consistent reconstruction across GOPs, we introduce an Adaptive Quality Control module that dynamically constructs an online rate-perception surrogate model to predict the optimal diffusion step for each GOP. In addition, an Inter-GOP Alignment module establishes frame overlap and performs latent fusion between adjacent groups, thereby mitigating flicker and enhancing temporal coherence. Experiments show that Free-GVC achieves an average of 93.29% BD-Rate reduction in DISTS over the latest neural codec DCVC-RT, and a user study further confirms its superior perceptual quality and temporal coherence at ultra-low bitrates.
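The abstract names three mechanisms: GOP-level encoding into a compact latent space, an Adaptive Quality Control module whose online rate-perception surrogate picks a diffusion step per GOP, and an Inter-GOP Alignment module that overlaps and fuses latent frames across groups. Below is a minimal runnable sketch of how these pieces could fit together; the toy encoder, the surrogate's functional form, and the linear blending are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal runnable sketch of the Free-GVC pipeline described above: GOP-level
# latent encoding, a surrogate-driven choice of diffusion step per GOP
# (Adaptive Quality Control), and overlap-based latent fusion between adjacent
# GOPs (Inter-GOP Alignment). Everything here is a toy stand-in.
import numpy as np

GOP, OVERLAP, LATENT = 16, 2, 64  # frames per GOP, shared frames, latent dim

def encode_gop(frames: np.ndarray) -> np.ndarray:
    """Stand-in for the GOP encoder: project frames into a compact latent."""
    return frames.reshape(frames.shape[0], -1)[:, :LATENT].copy()

def surrogate_perception(bits: float, step: int) -> float:
    """Toy online rate-perception surrogate: predicted DISTS-like distortion.
    More bits and more diffusion steps -> lower predicted distortion."""
    return 1.0 / (1.0 + 0.02 * bits) + 0.5 / (1.0 + step)

def pick_diffusion_step(bit_budget: float, max_step: int = 50) -> int:
    """Adaptive Quality Control: pick the step minimizing predicted distortion."""
    return min(range(1, max_step + 1),
               key=lambda s: surrogate_perception(bit_budget, s))

def fuse_overlap(prev_tail: np.ndarray, cur_head: np.ndarray) -> np.ndarray:
    """Inter-GOP Alignment: linearly blend the overlapping latent frames."""
    w = np.linspace(0.0, 1.0, prev_tail.shape[0])[:, None]
    return (1.0 - w) * prev_tail + w * cur_head

rng = np.random.default_rng(0)
video = rng.standard_normal((3 * (GOP - OVERLAP) + OVERLAP, 8, 8))  # toy frames

decoded, prev = [], None
for start in range(0, video.shape[0] - OVERLAP, GOP - OVERLAP):
    z = encode_gop(video[start:start + GOP])
    step = pick_diffusion_step(bit_budget=200.0)
    # (progressive compression along the diffusion trajectory would happen here)
    if prev is not None:
        z[:OVERLAP] = fuse_overlap(prev[-OVERLAP:], z[:OVERLAP])
    decoded.append(z if prev is None else z[OVERLAP:])
    prev = z
    print(f"GOP @ frame {start}: diffusion step {step}")

print(np.concatenate(decoded).shape)  # (44, 64): one latent per input frame
```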
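For context on the headline number, BD-Rate reports the average bitrate change at equal quality by fitting log-rate as a cubic in quality and integrating over the shared quality range. The sketch below applies the standard Bjontegaard computation to fabricated rate-quality points; the 93.29% figure above comes from the paper's own experiments, not from this example.

```python
# Hedged illustration of the BD-Rate metric quoted above. The rate-quality
# points are fabricated for demonstration; only the procedure (cubic fit of
# log-rate over the shared quality range, then integrate) is the standard
# Bjontegaard computation.
import numpy as np

def bd_rate(rate_anchor, q_anchor, rate_test, q_test) -> float:
    """Average % bitrate change of `test` vs. `anchor` at equal quality."""
    pa = np.polyfit(q_anchor, np.log(rate_anchor), 3)  # log-rate vs. quality
    pt = np.polyfit(q_test, np.log(rate_test), 3)
    lo = max(min(q_anchor), min(q_test))
    hi = min(max(q_anchor), max(q_test))
    ia = np.polyval(np.polyint(pa), hi) - np.polyval(np.polyint(pa), lo)
    it = np.polyval(np.polyint(pt), hi) - np.polyval(np.polyint(pt), lo)
    return (np.exp((it - ia) / (hi - lo)) - 1.0) * 100.0

# Fabricated points; quality is expressed as 1 - DISTS so higher is better.
anchor_rate = [50.0, 100.0, 200.0, 400.0]   # kbps, reference codec
anchor_q    = [0.80, 0.85, 0.90, 0.94]
test_rate   = [5.0, 10.0, 20.0, 40.0]       # ~10x fewer bits at equal quality
test_q      = [0.80, 0.85, 0.90, 0.94]
print(f"BD-Rate: {bd_rate(anchor_rate, anchor_q, test_rate, test_q):+.2f}%")
# -> -90.00%, i.e. 90% fewer bits on average for the same quality
```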
Related papers
- MTC-VAE: Multi-Level Temporal Compression with Content Awareness [54.85288415164888]
Latent Video Diffusion Models (LVDMs) rely on Variational Autoencoders (VAEs) to compress videos into compact latent representations.
We present a technique to convert fixed compression rate VAEs into models that support multi-level temporal compression.
arXiv Detail & Related papers (2026-02-01T17:08:02Z) - Generative Neural Video Compression via Video Diffusion Prior [33.164111717707414]
We present the first DiT-based generative neural video compression framework, built upon an advanced video generation foundation model.
arXiv Detail & Related papers (2025-12-04T17:27:32Z) - Generative Latent Video Compression [26.99743586846841]
We present Generative Latent Video Compression (GLVC), an effective framework for perceptual video compression.
GLVC employs a pretrained continuous tokenizer to project video frames into a perceptually aligned latent space.
We show GLVC achieves state-of-the-art performance in terms of DISTS and LPIPS metrics.
arXiv Detail & Related papers (2025-10-11T03:28:49Z) - REGEN: Learning Compact Video Embedding with (Re-)Generative Decoder [52.698595889988766]
We present a novel perspective on learning video embedders for generative modeling.
Rather than requiring an exact reproduction of an input video, an effective embedder should focus on visually plausible reconstructions.
We propose replacing the conventional encoder-decoder video embedder with an encoder-generator framework.
arXiv Detail & Related papers (2025-03-11T17:51:07Z) - Rethinking Video Tokenization: A Conditioned Diffusion-based Approach [58.164354605550194]
A new tokenizer, the Conditioned Diffusion-based Tokenizer (CDT), replaces the GAN-based decoder with a conditional diffusion model.
It is trained from scratch using only a basic MSE diffusion loss for reconstruction, along with a KL term and an LPIPS perceptual loss.
Even a scaled-down version of CDT (3x inference speedup) still performs comparably with top baselines.
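A hedged sketch of the composite objective this summary describes: a basic epsilon-prediction MSE for the conditional diffusion decoder, a KL term on the tokenizer posterior, and an LPIPS perceptual loss. The loss weights, tensor shapes, and the stand-in perceptual function are assumptions; the real `lpips` pip package can be substituted as noted in the comments.

```python
# Hedged sketch of the composite objective: epsilon-prediction MSE for the
# conditional diffusion decoder, a KL term on the tokenizer posterior, and an
# LPIPS perceptual loss. Weights, shapes, and the stand-in perceptual function
# are assumptions; swap in the real `lpips` package for the true metric.
import torch
import torch.nn.functional as F

def kl_to_standard_normal(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """KL(q(z|x) || N(0, I)), summed over latent dims, averaged over the batch."""
    return 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(dim=[1, 2, 3]).mean()

def total_loss(eps_pred, eps_true, mu, logvar, x_rec, x, perceptual,
               w_kl=1e-6, w_lpips=1.0):
    """MSE diffusion loss + weighted KL + weighted perceptual loss."""
    return (F.mse_loss(eps_pred, eps_true)
            + w_kl * kl_to_standard_normal(mu, logvar)
            + w_lpips * perceptual(x_rec, x).mean())

# Stand-in perceptual term; the real one would be:
#   import lpips; perceptual = lpips.LPIPS(net="vgg")
perceptual = lambda a, b: F.mse_loss(a, b, reduction="none").mean(dim=[1, 2, 3])

x = torch.rand(2, 3, 64, 64)                      # target frames
x_rec = torch.rand_like(x)                        # decoded frames
mu, logvar = torch.randn(2, 4, 8, 8), torch.randn(2, 4, 8, 8)
eps_true = torch.randn(2, 4, 8, 8)                # injected noise
eps_pred = eps_true + 0.1 * torch.randn_like(eps_true)  # decoder's prediction
print(total_loss(eps_pred, eps_true, mu, logvar, x_rec, x, perceptual).item())
```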
arXiv Detail & Related papers (2025-03-05T17:59:19Z) - Improved Video VAE for Latent Video Diffusion Model [55.818110540710215]
The video Variational Autoencoder (VAE) aims to compress pixel data into a low-dimensional latent space, playing an important role in models such as OpenAI's Sora.
Most existing VAEs inflate a pretrained image VAE into a 3D causal structure for temporal-spatial compression.
We propose a new KTC architecture and a group causal convolution (GCConv) module to further improve the video VAE (IV-VAE).
arXiv Detail & Related papers (2024-11-10T12:43:38Z) - Learned Video Compression via Heterogeneous Deformable Compensation Network [78.72508633457392]
We propose a learned video compression framework with a heterogeneous deformable compensation strategy (HDCVC) to tackle unstable compression performance.
More specifically, the proposed algorithm extracts features from the two adjacent frames to estimate content-neighborhood heterogeneous deformable (HetDeform) kernel offsets.
Experimental results indicate that HDCVC achieves superior performance compared with recent state-of-the-art learned video compression approaches.
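As a rough illustration of the deformable compensation idea (not HDCVC's actual architecture), the sketch below predicts per-position kernel offsets from the features of two adjacent frames and applies torchvision's deform_conv2d primitive; the tiny convolution layers and all shapes are assumptions.

```python
# Rough illustration of deformable compensation (not HDCVC's architecture):
# features from two adjacent frames predict per-position kernel offsets, and a
# deformable convolution compensates the reference features. The tiny conv
# layers and all shapes are assumptions; deform_conv2d is torchvision's real op.
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

C, K = 16, 3  # feature channels, deformable kernel size

feat_extract = nn.Conv2d(3, C, 3, padding=1)              # per-frame features
offset_pred = nn.Conv2d(2 * C, 2 * K * K, 3, padding=1)   # (dy, dx) per kernel tap
weight = torch.randn(C, C, K, K) * 0.01                   # deformable conv kernel

prev_frame = torch.rand(1, 3, 32, 32)  # reference frame
cur_frame = torch.rand(1, 3, 32, 32)   # frame to be predicted

f_prev, f_cur = feat_extract(prev_frame), feat_extract(cur_frame)
offsets = offset_pred(torch.cat([f_prev, f_cur], dim=1))  # content-conditioned
compensated = deform_conv2d(f_prev, offsets, weight, padding=1)
print(compensated.shape)  # torch.Size([1, 16, 32, 32]): compensated features
```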
arXiv Detail & Related papers (2022-07-11T02:31:31Z) - Perceptual Learned Video Compression with Recurrent Conditional GAN [158.0726042755]
We propose a Perceptual Learned Video Compression (PLVC) approach with a recurrent conditional generative adversarial network.
PLVC learns to compress video towards good perceptual quality at low bit-rates.
The user study further validates the outstanding perceptual performance of PLVC in comparison with the latest learned video compression approaches.
arXiv Detail & Related papers (2021-09-07T13:36:57Z) - Decomposition, Compression, and Synthesis (DCS)-based Video Coding: A Neural Exploration via Resolution-Adaptive Learning [30.54722074562783]
We decompose the input video into spatial texture frames (STF) at its native spatial resolution and temporal motion frames (TMF) at a lower resolution.
Then, we compress them together using any popular video coder.
Finally, we synthesize decoded STFs and TMFs for high-quality video reconstruction at the same resolution as its native input.
arXiv Detail & Related papers (2020-12-01T17:23:53Z)
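A minimal sketch of the decompose/compress/synthesize pipeline summarized in the last entry, with the conventional codec step stubbed out; the STF sampling interval, the TMF downscaling factor, and the naive paste-back synthesis are illustrative assumptions.

```python
# Minimal sketch of decompose/compress/synthesize: keep sparse spatial texture
# frames (STF) at native resolution, downscale the frames into a temporal
# motion (TMF) stream, and synthesize full-resolution output. The sampling
# interval, scale factor, and paste-back synthesis are illustrative
# assumptions; the conventional codec step is stubbed out.
import torch
import torch.nn.functional as F

def decompose(video: torch.Tensor, stf_every: int = 8, tmf_scale: float = 0.25):
    """Split a T x C x H x W clip into native-res STFs and downscaled TMFs."""
    stf_idx = list(range(0, video.shape[0], stf_every))
    stf = video[stf_idx]
    tmf = F.interpolate(video, scale_factor=tmf_scale, mode="bilinear",
                        align_corners=False)  # all frames downscaled, for simplicity
    return stf, stf_idx, tmf

def synthesize(stf, stf_idx, tmf, size):
    """Naive synthesis: upscale the TMFs, then paste the exact STFs back in."""
    out = F.interpolate(tmf, size=size, mode="bilinear", align_corners=False)
    out[stf_idx] = stf
    return out

video = torch.rand(32, 3, 64, 64)  # toy clip
stf, idx, tmf = decompose(video)
# ... both streams would be encoded with any popular video coder here ...
recon = synthesize(stf, idx, tmf, size=(64, 64))
print(recon.shape, len(idx))  # torch.Size([32, 3, 64, 64]) 4
```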