OD-VAE: An Omni-dimensional Video Compressor for Improving Latent Video Diffusion Model
- URL: http://arxiv.org/abs/2409.01199v2
- Date: Mon, 9 Sep 2024 13:49:53 GMT
- Title: OD-VAE: An Omni-dimensional Video Compressor for Improving Latent Video Diffusion Model
- Authors: Liuhan Chen, Zongjian Li, Bin Lin, Bin Zhu, Qian Wang, Shenghai Yuan, Xing Zhou, Xinhua Cheng, Li Yuan
- Abstract summary: Variational Autoencoder (VAE), which compresses videos into latent representations, is a crucial component of Latent Video Diffusion Models (LVDMs).
Most LVDMs utilize a 2D image VAE, which compresses videos only in the spatial dimension and leaves the temporal dimension uncompressed.
We propose an omni-dimension compression VAE, named OD-VAE, which can compress videos both temporally and spatially.
- Score: 33.766339921655025
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Variational Autoencoder (VAE), which compresses videos into latent representations, is a crucial preceding component of Latent Video Diffusion Models (LVDMs). At the same reconstruction quality, the more thoroughly the VAE compresses videos, the more efficient the LVDMs become. However, most LVDMs utilize a 2D image VAE, which compresses videos only in the spatial dimension and ignores the temporal dimension. How to perform temporal compression in a VAE to obtain more concise latent representations while preserving accurate reconstruction remains largely unexplored. To fill this gap, we propose an omni-dimension compression VAE, named OD-VAE, which can compress videos both temporally and spatially. Although OD-VAE's stronger compression makes video reconstruction more challenging, it still achieves high reconstruction accuracy thanks to our careful design. To obtain a better trade-off between video reconstruction quality and compression speed, four variants of OD-VAE are introduced and analyzed. In addition, a novel tail initialization is designed to train OD-VAE more efficiently, and a novel inference strategy is proposed to enable OD-VAE to handle videos of arbitrary length with limited GPU memory. Comprehensive experiments on video reconstruction and LVDM-based video generation demonstrate the effectiveness and efficiency of our proposed methods.
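The two capabilities the abstract names, temporal-plus-spatial compression and an inference strategy for arbitrary-length videos under limited GPU memory, can be pictured with a short sketch. The following is a minimal PyTorch illustration under our own assumptions, not the authors' code: causal 3D convolutions are one common way to realize such compression, and the names (CausalConv3d, TinyODEncoder, encode_in_chunks), channel widths, and compression ratios here are all hypothetical.

```python
# Minimal sketch (assumed design, not OD-VAE's released code): a VAE encoder
# built from causal 3D convolutions that downsample in both time and space,
# plus a chunked encoding loop that bounds memory for long videos.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv3d(nn.Module):
    """3D conv padded only toward the past, so frame t never sees frame t+1."""
    def __init__(self, in_ch, out_ch, kernel=3, t_stride=1, s_stride=1):
        super().__init__()
        self.t_pad = kernel - 1  # causal: all temporal padding on the past side
        self.conv = nn.Conv3d(
            in_ch, out_ch, kernel_size=kernel,
            stride=(t_stride, s_stride, s_stride),
            padding=(0, kernel // 2, kernel // 2),
        )

    def forward(self, x):  # x: (B, C, T, H, W)
        x = F.pad(x, (0, 0, 0, 0, self.t_pad, 0))  # pad T on the past side only
        return self.conv(x)

class TinyODEncoder(nn.Module):
    """Toy encoder: 4x temporal and 8x spatial downsampling via strided causal convs."""
    def __init__(self, latent_ch=4):
        super().__init__()
        self.net = nn.Sequential(
            CausalConv3d(3, 32, t_stride=1, s_stride=2), nn.SiLU(),
            CausalConv3d(32, 64, t_stride=2, s_stride=2), nn.SiLU(),
            CausalConv3d(64, 128, t_stride=2, s_stride=2), nn.SiLU(),
            CausalConv3d(128, latent_ch, t_stride=1, s_stride=1),
        )

    def forward(self, video):
        return self.net(video)

def encode_in_chunks(encoder, video, chunk=16):
    """Encode an arbitrarily long video with bounded memory by splitting along
    time; the paper's actual strategy treats chunk boundaries more carefully."""
    zs = [encoder(video[:, :, t:t + chunk]) for t in range(0, video.shape[2], chunk)]
    return torch.cat(zs, dim=2)

if __name__ == "__main__":
    enc = TinyODEncoder()
    clip = torch.randn(1, 3, 32, 64, 64)   # (B, C, T, H, W)
    z = encode_in_chunks(enc, clip, chunk=16)
    print(z.shape)                          # (1, 4, 8, 8, 8): 4x temporal, 8x spatial
```

Causality is what makes the chunked loop reasonable: since no output depends on future frames, encoding the video in temporal pieces approximates encoding it whole, at a fraction of the peak memory.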
Related papers
- SeedVR: Seeding Infinity in Diffusion Transformer Towards Generic Video Restoration [73.70209718408641]
SeedVR is a diffusion transformer designed to handle real-world video restoration with arbitrary length and resolution.
It achieves highly competitive performance on both synthetic and real-world benchmarks, as well as on AI-generated videos.
arXiv Detail & Related papers (2025-01-02T16:19:48Z)
- Large Motion Video Autoencoding with Cross-modal Video VAE [52.13379965800485]
Video Variational Autoencoder (VAE) is essential for reducing video redundancy and facilitating efficient video generation.
Existing Video VAEs have begun to address temporal compression; however, they often suffer from inadequate reconstruction performance.
We present a novel and powerful video autoencoder capable of high-fidelity video encoding.
arXiv Detail & Related papers (2024-12-23T18:58:24Z)
- VRVVC: Variable-Rate NeRF-Based Volumetric Video Compression [59.14355576912495]
NeRF-based video has revolutionized visual media by delivering photorealistic Free-Viewpoint Video (FVV) experiences.
The substantial data volumes pose significant challenges for storage and transmission.
We propose VRVVC, a novel end-to-end joint variable-rate framework for NeRF-based volumetric video compression.
arXiv Detail & Related papers (2024-12-16T01:28:04Z)
- WF-VAE: Enhancing Video VAE by Wavelet-Driven Energy Flow for Latent Video Diffusion Model [15.171544722138806]
Video Variational Autoencoder (VAE) encodes videos into a low-dimensional latent space.
It is a key component of most Latent Video Diffusion Models (LVDMs).
arXiv Detail & Related papers (2024-11-26T14:23:53Z)
- Improved Video VAE for Latent Video Diffusion Model [55.818110540710215]
Video Variational Autoencoder (VAE) aims to compress pixel data into a low-dimensional latent space, playing an important role in OpenAI's Sora.
Most existing VAEs inflate a pretrained image VAE into a 3D causal structure for temporal-spatial compression.
We propose a new KTC architecture and a group causal convolution (GCConv) module to further improve video VAE (IV-VAE).
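The "inflation" this entry mentions, like the tail initialization named in the OD-VAE abstract above, can be sketched as copying pretrained 2D image-VAE kernels into a causal 3D convolution. Below is a hedged illustration under our own assumptions (inflate_2d_to_causal_3d is a hypothetical helper, not code from either paper):

```python
# Sketch of inflating a pretrained 2D conv into a causal 3D conv. Placing the
# 2D kernel at the last temporal slice and zeroing the rest ("tail"-style
# initialization) makes the 3D conv initially reproduce the image VAE on the
# newest frame. Assumes the 2D conv uses ordinary tuple padding.
import torch
import torch.nn as nn

def inflate_2d_to_causal_3d(conv2d: nn.Conv2d, t_kernel: int = 3) -> nn.Conv3d:
    conv3d = nn.Conv3d(
        conv2d.in_channels, conv2d.out_channels,
        kernel_size=(t_kernel, *conv2d.kernel_size),
        stride=(1, *conv2d.stride),
        padding=(0, *conv2d.padding),   # temporal padding is applied causally outside
        bias=conv2d.bias is not None,
    )
    with torch.no_grad():
        conv3d.weight.zero_()
        conv3d.weight[:, :, -1] = conv2d.weight  # 2D kernel at the tail (newest frame)
        if conv2d.bias is not None:
            conv3d.bias.copy_(conv2d.bias)
    return conv3d
```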
arXiv Detail & Related papers (2024-11-10T12:43:38Z)
- CV-VAE: A Compatible Video VAE for Latent Generative Video Models [45.702473834294146]
Variational Autoencoders (VAE) play a crucial role in the spatio-temporal compression of videos, as in OpenAI's Sora.
Currently, there is no commonly used continuous video (3D) VAE for latent diffusion-based video models.
We propose a method for training a video VAE for latent video models, namely CV-VAE, whose latent space is compatible with that of a given image VAE.
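The latent-space compatibility CV-VAE describes can be pictured as an alignment penalty during training: keep the video VAE's per-frame latents close to those of the frozen image VAE, so that components built on the image latent space still apply. The sketch below is our own assumption-laden illustration, not CV-VAE's actual regularization; the encode interfaces and the t_ratio attribute are hypothetical.

```python
# Hypothetical latent-alignment loss: compare video-VAE latents with frozen
# image-VAE latents on the frames that survive temporal compression.
import torch
import torch.nn.functional as F

def latent_alignment_loss(video_vae, image_vae, video):
    # video: (B, C, T, H, W)
    z_video = video_vae.encode(video)               # assumed (B, c, T', h, w)
    with torch.no_grad():
        frames = video[:, :, ::video_vae.t_ratio]   # frames aligned with z_video's T'
        B, C, T, H, W = frames.shape
        flat = frames.permute(0, 2, 1, 3, 4).reshape(B * T, C, H, W)
        z_img = image_vae.encode(flat)              # assumed (B*T, c, h, w)
        z_img = z_img.reshape(B, T, *z_img.shape[1:]).permute(0, 2, 1, 3, 4)
    return F.mse_loss(z_video, z_img)
```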
arXiv Detail & Related papers (2024-05-30T17:33:10Z)
- Towards Scalable Neural Representation for Diverse Videos [68.73612099741956]
Implicit neural representations (INR) have gained increasing attention in representing 3D scenes and images.
Existing INR-based methods are limited to encoding a handful of short videos with redundant visual content.
This paper focuses on developing neural representations for encoding long and/or a large number of videos with diverse visual content.
arXiv Detail & Related papers (2023-03-24T16:32:19Z)
- Learned Video Compression via Heterogeneous Deformable Compensation Network [78.72508633457392]
We propose a learned video compression framework via a heterogeneous deformable compensation strategy (HDCVC) to tackle the problem of unstable compression performance.
More specifically, the proposed algorithm extracts features from the two adjacent frames to estimate content-Neighborhood Heterogeneous Deformable (HetDeform) kernel offsets.
Experimental results indicate that HDCVC outperforms recent state-of-the-art learned video compression approaches.
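As a rough picture of deformable compensation (a generic sketch, not HDCVC's actual architecture): a small network predicts per-position kernel offsets from the two adjacent frames, and a deformable convolution warps the reference frame with them. DeformCompensation and its layer sizes are our own illustrative assumptions.

```python
# Generic deformable-compensation sketch using torchvision's deform_conv2d.
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class DeformCompensation(nn.Module):
    def __init__(self, ch=3, k=3):
        super().__init__()
        self.k = k
        # Predict 2 (x, y) offsets per kernel tap from the concatenated frame pair.
        self.offset_net = nn.Conv2d(2 * ch, 2 * k * k, 3, padding=1)
        self.weight = nn.Parameter(torch.randn(ch, ch, k, k) * 0.1)

    def forward(self, ref, cur):  # both (B, C, H, W)
        offsets = self.offset_net(torch.cat([ref, cur], dim=1))  # (B, 2*k*k, H, W)
        return deform_conv2d(ref, offsets, self.weight, padding=self.k // 2)

pred = DeformCompensation()(torch.randn(1, 3, 32, 32), torch.randn(1, 3, 32, 32))
```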
arXiv Detail & Related papers (2022-07-11T02:31:31Z)
- Foveation-based Deep Video Compression without Motion Search [43.70396515286677]
Foveation protocols are desirable since only a small portion of a video viewed in VR may be visible as a user gazes in any given direction.
We implement foveation by introducing a Foveation Generator Unit (FGU) that generates foveation masks which direct the allocation of bits.
Our new compression model, which we call the Foveated MOtionless VIdeo Codec (Foveated MOVI-Codec), is able to efficiently compress videos without computing motion.
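A foveation mask of the kind the FGU would learn to produce can be pictured with a fixed analytic stand-in: weight decays with distance from the gaze point, so more bits go to the fovea. This Gaussian version is purely illustrative; the paper's FGU generates masks with a learned network.

```python
# Illustrative (non-learned) foveation mask: 1 at the gaze point, falling off
# toward the periphery; a codec can scale its bit allocation by this weight.
import torch

def foveation_mask(h, w, gaze_xy, sigma=0.15):
    ys, xs = torch.meshgrid(
        torch.linspace(0, 1, h), torch.linspace(0, 1, w), indexing="ij"
    )
    gx, gy = gaze_xy
    d2 = (xs - gx) ** 2 + (ys - gy) ** 2
    return torch.exp(-d2 / (2 * sigma ** 2))

mask = foveation_mask(64, 64, gaze_xy=(0.5, 0.5))  # (64, 64) weights in (0, 1]
```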
arXiv Detail & Related papers (2022-03-30T17:30:17Z)