DeCo-VAE: Learning Compact Latents for Video Reconstruction via Decoupled Representation
- URL: http://arxiv.org/abs/2511.14530v1
- Date: Tue, 18 Nov 2025 14:34:20 GMT
- Title: DeCo-VAE: Learning Compact Latents for Video Reconstruction via Decoupled Representation
- Authors: Xiangchen Yin, Jiahui Yuan, Zhangchi Hu, Wenzhang Sun, Jie Chen, Xiaozhen Qiao, Hao Li, Xiaoyan Sun
- Abstract summary: We propose decoupled VAE (DeCo-VAE) to achieve compact latent representation. We design dedicated encoders for each decoupled component and adopt a shared 3D decoder to maintain consistency during reconstruction.
- Score: 14.242798717551471
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Existing video Variational Autoencoders (VAEs) generally overlook the similarity between frame contents, leading to redundant latent modeling. In this paper, we propose the decoupled VAE (DeCo-VAE) to achieve a compact latent representation. Instead of encoding RGB pixels directly, we decompose video content into distinct components via explicit decoupling: keyframe, motion, and residual, and learn a dedicated latent representation for each. To avoid cross-component interference, we design dedicated encoders for each decoupled component and adopt a shared 3D decoder to maintain spatiotemporal consistency during reconstruction. We further employ a decoupled adaptation strategy that freezes some encoders while training the others sequentially, ensuring stable training and accurate learning of both static and dynamic features. Extensive quantitative and qualitative experiments demonstrate that DeCo-VAE achieves superior video reconstruction performance.
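To make the decoupled design concrete, below is a minimal PyTorch sketch of the pipeline the abstract describes: three dedicated encoders for the keyframe, motion, and residual components feeding a shared 3D decoder, followed by the sequential freeze-and-train adaptation loop. The decomposition rule, layer shapes, and all names (`ComponentEncoder`, `DeCoVAE`, `decouple`) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ComponentEncoder(nn.Module):
    """Dedicated encoder for one decoupled component (hypothetical design)."""
    def __init__(self, latent_ch: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(3, 64, 3, stride=(1, 2, 2), padding=1),
            nn.SiLU(),
            nn.Conv3d(64, 2 * latent_ch, 3, stride=(1, 2, 2), padding=1),
        )

    def forward(self, x):
        mu, logvar = self.net(x).chunk(2, dim=1)
        return mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize

class DeCoVAE(nn.Module):
    def __init__(self, latent_ch: int = 4):
        super().__init__()
        self.enc_key = ComponentEncoder(latent_ch)     # static keyframe content
        self.enc_motion = ComponentEncoder(latent_ch)  # frame-to-frame dynamics
        self.enc_res = ComponentEncoder(latent_ch)     # what the other two miss
        # Shared 3D decoder fuses the three latents so reconstruction stays
        # spatiotemporally consistent across components.
        self.dec = nn.Sequential(
            nn.ConvTranspose3d(3 * latent_ch, 64, (3, 4, 4), stride=(1, 2, 2), padding=1),
            nn.SiLU(),
            nn.ConvTranspose3d(64, 3, (3, 4, 4), stride=(1, 2, 2), padding=1),
        )

    @staticmethod
    def decouple(video):
        # video: (B, 3, T, H, W). One simple explicit decomposition (an assumption):
        key = video[:, :, :1].expand_as(video)                     # broadcast keyframe
        prev = torch.cat([video[:, :, :1], video[:, :, :-1]], dim=2)
        motion = video - prev                                      # temporal differences
        residual = video - key - motion                            # leftover detail
        return key, motion, residual

    def forward(self, video):
        k, m, r = self.decouple(video)
        z = torch.cat([self.enc_key(k), self.enc_motion(m), self.enc_res(r)], dim=1)
        return self.dec(z)

# Decoupled adaptation (sketch): train one component encoder at a time with the
# others frozen, keeping the shared decoder trainable throughout.
model = DeCoVAE()
for active in ("enc_key", "enc_motion", "enc_res"):
    for name, p in model.named_parameters():
        p.requires_grad = name.startswith((active, "dec"))
    # ... run one reconstruction-loss training stage here ...
```

On a `(1, 3, 8, 64, 64)` clip, `model(video)` returns a tensor of the same shape; the combined latent has `3 * latent_ch` channels at 1/4 spatial resolution, which is where a compactness gain over direct per-frame RGB encoding would come from in this sketch.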
Related papers
- DCDM: Divide-and-Conquer Diffusion Models for Consistency-Preserving Video Generation [77.89090846233906]
We propose a system-level framework, termed the Divide-and-Conquer Diffusion Model (DCDM). DCDM decomposes video consistency modeling into three dedicated components while sharing a unified video generation backbone. We validate our framework on the test set of the CVM Competition at AAAI'26, and the results demonstrate that the proposed strategies effectively address these challenges.
arXiv Detail & Related papers (2026-02-14T07:02:36Z)
- Autoregressive Video Autoencoder with Decoupled Temporal and Spatial Context [8.458436768725212]
Video autoencoders compress videos into compact latent representations for efficient reconstruction. We propose the Autoregressive Video Autoencoder (ARVAE), which compresses and reconstructs each frame conditioned on its predecessor in an autoregressive manner. ARVAE achieves superior reconstruction quality with extremely lightweight models and small-scale training data (a hedged sketch of this autoregressive scheme appears after this list).
arXiv Detail & Related papers (2025-12-12T05:40:01Z)
- Conditional Video Generation for High-Efficiency Video Compression [48.32125957038998]
We propose a video compression framework that leverages conditional diffusion models for perceptually optimized reconstruction. Specifically, we reframe video compression as a conditional generation task, where a generative model synthesizes video from sparse, yet informative signals.
arXiv Detail & Related papers (2025-07-21T06:16:27Z)
- REGEN: Learning Compact Video Embedding with (Re-)Generative Decoder [52.698595889988766]
We present a novel perspective on learning video embedders for generative modeling. Rather than requiring an exact reproduction of an input video, an effective embedder should focus on visually plausible reconstructions. We propose replacing the conventional encoder-decoder video embedder with an encoder-generator framework.
arXiv Detail & Related papers (2025-03-11T17:51:07Z)
- DynFocus: Dynamic Cooperative Network Empowers LLMs with Video Understanding [18.312501339046296]
We observe that redundancy occurs in both repeated and answer-irrelevant frames, and the corresponding frames vary with different questions. This suggests the possibility of adopting dynamic encoding to balance detailed video information preservation with token budget reduction.
arXiv Detail & Related papers (2024-11-19T09:16:54Z)
- Improved Video VAE for Latent Video Diffusion Model [55.818110540710215]
Variational Autoencoder (VAE) aims to compress pixel data into a low-dimensional latent space, playing an important role in OpenAI's Sora.
Most existing VAEs inflate a pretrained image VAE into a 3D causal structure for temporal-spatial compression.
We propose a new keyframe-based temporal compression (KTC) architecture and a group causal convolution (GCConv) module to further improve video VAE (IV-VAE).
arXiv Detail & Related papers (2024-11-10T12:43:38Z)
- Low-Light Video Enhancement via Spatial-Temporal Consistent Decomposition [52.89441679581216]
Low-Light Video Enhancement (LLVE) seeks to restore dynamic or static scenes plagued by severe invisibility and noise. We present an innovative video decomposition strategy that incorporates view-independent and view-dependent components. Our framework consistently outperforms existing methods, establishing a new SOTA performance.
arXiv Detail & Related papers (2024-05-24T15:56:40Z)
- RAVEN: Rethinking Adversarial Video Generation with Efficient Tri-plane Networks [93.18404922542702]
We present a novel video generative model designed to address long-term spatial and temporal dependencies.
Our approach incorporates a hybrid explicit-implicit tri-plane representation inspired by 3D-aware generative frameworks.
Our model synthesizes high-fidelity video clips at a resolution of $256 \times 256$ pixels, with durations extending to more than $5$ seconds at a frame rate of 30 fps.
arXiv Detail & Related papers (2024-01-11T16:48:44Z)
- VNVC: A Versatile Neural Video Coding Framework for Efficient Human-Machine Vision [59.632286735304156]
It is more efficient to enhance/analyze the coded representations directly without decoding them into pixels.
We propose a versatile neural video coding (VNVC) framework, which targets learning compact representations to support both reconstruction and direct enhancement/analysis.
arXiv Detail & Related papers (2023-06-19T03:04:57Z)
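As noted in the ARVAE entry above, here is a minimal sketch of the autoregressive frame-conditioned scheme its summary describes: each frame is encoded and decoded conditioned on the previous reconstruction. The class name `ARFrameAE`, the layer choices, and conditioning by channel-stacking the previous frame are all guesses at one plausible realization, not the paper's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ARFrameAE(nn.Module):
    """Hypothetical frame-level autoencoder conditioned on its predecessor."""
    def __init__(self, latent_ch: int = 8):
        super().__init__()
        # Encoder sees the current frame stacked with the previous reconstruction.
        self.enc = nn.Sequential(
            nn.Conv2d(6, 64, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(64, latent_ch, 3, stride=2, padding=1),
        )
        # Decoder is conditioned the same way: the previous reconstruction is
        # resized to the latent resolution and concatenated onto the latent.
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(latent_ch + 3, 64, 4, stride=2, padding=1), nn.SiLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1),
        )

    def forward(self, video):
        # video: (B, 3, T, H, W); encode/decode frame by frame, feeding each
        # reconstruction forward as the temporal context for the next frame.
        B, _, T, H, W = video.shape
        prev = torch.zeros(B, 3, H, W, device=video.device)
        frames = []
        for t in range(T):
            z = self.enc(torch.cat([video[:, :, t], prev], dim=1))
            cond = F.interpolate(prev, size=z.shape[-2:])
            prev = self.dec(torch.cat([z, cond], dim=1))
            frames.append(prev)
        return torch.stack(frames, dim=2)
```

The per-frame loop trades parallelism for a small model and a short temporal context, which is consistent with the summary's emphasis on extremely lightweight models and small-scale training data.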