Related papers: Progressive Growing of Video Tokenizers for Highly Compressed Latent Spaces

Progressive Growing of Video Tokenizers for Highly Compressed Latent Spaces

URL: http://arxiv.org/abs/2501.05442v1
Date: Thu, 09 Jan 2025 18:55:15 GMT
Title: Progressive Growing of Video Tokenizers for Highly Compressed Latent Spaces
Authors: Aniruddha Mahapatra, Long Mai, Yitian Zhang, David Bourgin, Feng Liu,
Abstract summary: Video tokenizers are essential for latent video diffusion models, converting raw video data into latent spaces for efficient training.<n>We propose an alternative approach to enhance temporal compression.<n>We develop a bootstrapped high-temporal-compression model that progressively trains high-compression blocks atop well-trained lower-compression models.
Score: 20.860632218272094
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Video tokenizers are essential for latent video diffusion models, converting raw video data into spatiotemporally compressed latent spaces for efficient training. However, extending state-of-the-art video tokenizers to achieve a temporal compression ratio beyond 4x without increasing channel capacity poses significant challenges. In this work, we propose an alternative approach to enhance temporal compression. We find that the reconstruction quality of temporally subsampled videos from a low-compression encoder surpasses that of high-compression encoders applied to original videos. This indicates that high-compression models can leverage representations from lower-compression models. Building on this insight, we develop a bootstrapped high-temporal-compression model that progressively trains high-compression blocks atop well-trained lower-compression models. Our method includes a cross-level feature-mixing module to retain information from the pretrained low-compression model and guide higher-compression blocks to capture the remaining details from the full video sequence. Evaluation of video benchmarks shows that our method significantly improves reconstruction quality while increasing temporal compression compared to direct extensions of existing video tokenizers. Furthermore, the resulting compact latent space effectively trains a video diffusion model for high-quality video generation with a reduced token budget.

Related papers

Conditional Video Generation for High-Efficiency Video Compression [47.011087624381524]
We propose a video compression framework that leverages conditional diffusion models for perceptually optimized reconstruction.<n>Specifically, we reframe video compression as a conditional generation task, where a generative model synthesizes video from sparse, yet informative signals.
arXiv Detail & Related papers (2025-07-21T06:16:27Z)
Embedding Compression Distortion in Video Coding for Machines [67.97469042910855]
Currently, video transmission serves not only the Human Visual System (HVS) for viewing but also machine perception for analysis. We propose a Compression Distortion Embedding (CDRE) framework, which extracts machine-perception-related distortion representation and embeds it into downstream models. Our framework can effectively boost the rate-task performance of existing codecs with minimal overhead in terms of execution time, and number of parameters.
arXiv Detail & Related papers (2025-03-27T13:01:53Z)
REGEN: Learning Compact Video Embedding with (Re-)Generative Decoder [52.698595889988766]
We present a novel perspective on learning video embedders for generative modeling. Rather than requiring an exact reproduction of an input video, an effective embedder should focus on visually plausible reconstructions. We propose replacing the conventional encoder-decoder video embedder with an encoder-generator framework.
arXiv Detail & Related papers (2025-03-11T17:51:07Z)
Spatial Degradation-Aware and Temporal Consistent Diffusion Model for Compressed Video Super-Resolution [13.103621878352314]
Video super-resolution (VSR) is an efficient technique to enhance video, but relatively VSR methods focus on compressed videos. We propose a novel Spatial Degradation-Aware and Temporal Consistent (ATC) diffusion model for compressed VSR.
arXiv Detail & Related papers (2025-02-11T08:57:45Z)
Large Motion Video Autoencoding with Cross-modal Video VAE [52.13379965800485]
Video Variational Autoencoder (VAE) is essential for reducing video redundancy and facilitating efficient video generation.<n>Existing Video VAEs have begun to address temporal compression; however, they often suffer from inadequate reconstruction performance.<n>We present a novel and powerful video autoencoder capable of high-fidelity video encoding.
arXiv Detail & Related papers (2024-12-23T18:58:24Z)
Learned Video Compression via Heterogeneous Deformable Compensation Network [78.72508633457392]
We propose a learned video compression framework via heterogeneous deformable compensation strategy (HDCVC) to tackle the problems of unstable compression performance. More specifically, the proposed algorithm extracts features from the two adjacent frames to estimate content-Neighborhood heterogeneous deformable (HetDeform) kernel offsets. Experimental results indicate that HDCVC achieves superior performance than the recent state-of-the-art learned video compression approaches.
arXiv Detail & Related papers (2022-07-11T02:31:31Z)
Leveraging Bitstream Metadata for Fast, Accurate, Generalized Compressed Video Quality Enhancement [74.1052624663082]
We develop a deep learning architecture capable of restoring detail to compressed videos. We show that this improves restoration accuracy compared to prior compression correction methods. We condition our model on quantization data which is readily available in the bitstream.
arXiv Detail & Related papers (2022-01-31T18:56:04Z)
COMISR: Compression-Informed Video Super-Resolution [76.94152284740858]
Most videos on the web or mobile devices are compressed, and the compression can be severe when the bandwidth is limited. We propose a new compression-informed video super-resolution model to restore high-resolution content without introducing artifacts caused by compression.
arXiv Detail & Related papers (2021-05-04T01:24:44Z)
Feedback Recurrent Autoencoder for Video Compression [14.072596106425072]
We propose a new network architecture for learned video compression operating in low latency mode. Our method yields state of the art MS-SSIM/rate performance on the high-resolution UVG dataset.
arXiv Detail & Related papers (2020-04-09T02:58:07Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.