Related papers: Improved Video VAE for Latent Video Diffusion Model

Improved Video VAE for Latent Video Diffusion Model

URL: http://arxiv.org/abs/2411.06449v1
Date: Sun, 10 Nov 2024 12:43:38 GMT
Title: Improved Video VAE for Latent Video Diffusion Model
Authors: Pingyu Wu, Kai Zhu, Yu Liu, Liming Zhao, Wei Zhai, Yang Cao, Zheng-Jun Zha,
Abstract summary: Video Autoencoder (VAE) aims to compress pixel data into low-dimensional latent space, playing an important role in OpenAI's Sora. Most of existing VAEs inflate a pretrained image VAE into the 3D causal structure for temporal-spatial compression. We propose a new KTC architecture and a group causal convolution (GCConv) module to further improve video VAE (IV-VAE)
Score: 55.818110540710215
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Variational Autoencoder (VAE) aims to compress pixel data into low-dimensional latent space, playing an important role in OpenAI's Sora and other latent video diffusion generation models. While most of existing video VAEs inflate a pretrained image VAE into the 3D causal structure for temporal-spatial compression, this paper presents two astonishing findings: (1) The initialization from a well-trained image VAE with the same latent dimensions suppresses the improvement of subsequent temporal compression capabilities. (2) The adoption of causal reasoning leads to unequal information interactions and unbalanced performance between frames. To alleviate these problems, we propose a keyframe-based temporal compression (KTC) architecture and a group causal convolution (GCConv) module to further improve video VAE (IV-VAE). Specifically, the KTC architecture divides the latent space into two branches, in which one half completely inherits the compression prior of keyframes from a lower-dimension image VAE while the other half involves temporal compression through 3D group causal convolution, reducing temporal-spatial conflicts and accelerating the convergence speed of video VAE. The GCConv in above 3D half uses standard convolution within each frame group to ensure inter-frame equivalence, and employs causal logical padding between groups to maintain flexibility in processing variable frame video. Extensive experiments on five benchmarks demonstrate the SOTA video reconstruction and generation capabilities of the proposed IV-VAE (https://wpy1999.github.io/IV-VAE/).

Related papers

4DGC: Rate-Aware 4D Gaussian Compression for Efficient Streamable Free-Viewpoint Video [56.04182926886754]
3D Gaussian Splatting (3DGS) has substantial potential for enabling photorealistic Free-Viewpoint Video (FVV) experiences. Existing methods typically handle dynamic 3DGS representation and compression separately, motion information and the rate-distortion trade-off during training. We propose 4DGC, a rate-aware 4D Gaussian compression framework that significantly reduces storage size while maintaining superior RD performance for FVV.
arXiv Detail & Related papers (2025-03-24T08:05:27Z)
DLFR-VAE: Dynamic Latent Frame Rate VAE for Video Generation [16.216254819711327]
We propose the Dynamic Latent Frame Rate VAE (DLFR-VAE), a training-free paradigm that can make use of adaptive temporal compression in latent space. Our simple but effective DLFR-VAE can function as a plug-and-play module, seamlessly integrating with existing video generation models.
arXiv Detail & Related papers (2025-02-17T15:22:31Z)
Large Motion Video Autoencoding with Cross-modal Video VAE [52.13379965800485]
Video Variational Autoencoder (VAE) is essential for reducing video redundancy and facilitating efficient video generation. Existing Video VAEs have begun to address temporal compression; however, they often suffer from inadequate reconstruction performance. We present a novel and powerful video autoencoder capable of high-fidelity video encoding.
arXiv Detail & Related papers (2024-12-23T18:58:24Z)
CV-VAE: A Compatible Video VAE for Latent Generative Video Models [45.702473834294146]
Variationalencoders (VAE) plays a crucial role in OpenAI's Auto-temporal compression of videos. Currently, there lacks of a commonly used continuous video (3D) VAE for latent diffusion-based video models. We propose a method for training a video VAE of latent video models, namely CV-VAE, whose latent space is compatible with that of a given image VAE.
arXiv Detail & Related papers (2024-05-30T17:33:10Z)
IBVC: Interpolation-driven B-frame Video Compression [68.18440522300536]
B-frame video compression aims to adopt bi-directional motion estimation and motion compensation (MEMC) coding for middle frame reconstruction. Previous learned approaches often directly extend neural P-frame codecs to B-frame relying on bi-directional optical-flow estimation. We propose a simple yet effective structure called Interpolation-B-frame Video Compression (IBVC) to address these issues.
arXiv Detail & Related papers (2023-09-25T02:45:51Z)
VNVC: A Versatile Neural Video Coding Framework for Efficient Human-Machine Vision [59.632286735304156]
It is more efficient to enhance/analyze the coded representations directly without decoding them into pixels. We propose a versatile neural video coding (VNVC) framework, which targets learning compact representations to support both reconstruction and direct enhancement/analysis.
arXiv Detail & Related papers (2023-06-19T03:04:57Z)
You Can Ground Earlier than See: An Effective and Efficient Pipeline for Temporal Sentence Grounding in Compressed Videos [56.676761067861236]
Given an untrimmed video, temporal sentence grounding aims to locate a target moment semantically according to a sentence query. Previous respectable works have made decent success, but they only focus on high-level visual features extracted from decoded frames. We propose a new setting, compressed-domain TSG, which directly utilizes compressed videos rather than fully-decompressed frames as the visual input.
arXiv Detail & Related papers (2023-03-14T12:53:27Z)
Learned Video Compression via Heterogeneous Deformable Compensation Network [78.72508633457392]
We propose a learned video compression framework via heterogeneous deformable compensation strategy (HDCVC) to tackle the problems of unstable compression performance. More specifically, the proposed algorithm extracts features from the two adjacent frames to estimate content-Neighborhood heterogeneous deformable (HetDeform) kernel offsets. Experimental results indicate that HDCVC achieves superior performance than the recent state-of-the-art learned video compression approaches.
arXiv Detail & Related papers (2022-07-11T02:31:31Z)
Perceptual Learned Video Compression with Recurrent Conditional GAN [158.0726042755]
We propose a Perceptual Learned Video Compression (PLVC) approach with recurrent conditional generative adversarial network. PLVC learns to compress video towards good perceptual quality at low bit-rate. The user study further validates the outstanding perceptual performance of PLVC in comparison with the latest learned video compression approaches.
arXiv Detail & Related papers (2021-09-07T13:36:57Z)
FVC: A New Framework towards Deep Video Compression in Feature Space [21.410266039564803]
We propose a feature-space video coding network (FVC) by performing all major operations (i.e., motion estimation, motion compression, motion compensation and residual compression) in the feature space. The proposed framework achieves the state-of-the-art performance on four benchmark datasets including HEVC, UVG, VTL and MCL-JCV.
arXiv Detail & Related papers (2021-05-20T08:55:32Z)

This list is automatically generated from the titles and abstracts of the papers in this site.