DiViD: Disentangled Video Diffusion for Static-Dynamic Factorization
- URL: http://arxiv.org/abs/2507.13934v1
- Date: Fri, 18 Jul 2025 14:09:18 GMT
- Title: DiViD: Disentangled Video Diffusion for Static-Dynamic Factorization
- Authors: Marzieh Gheisari, Auguste Genovesio
- Abstract summary: We introduce DiViD, the first end-to-end video diffusion framework for explicit static-dynamic factorization. DiViD extracts a global static token from the first frame and per-frame dynamic tokens, explicitly removing static content from the motion code. We evaluate DiViD on real-world benchmarks using swap-based accuracy and cross-leakage metrics.
- Score: 2.7194314957925094
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Unsupervised disentanglement of static appearance and dynamic motion in video remains a fundamental challenge, often hindered by information leakage and blurry reconstructions in existing VAE- and GAN-based approaches. We introduce DiViD, the first end-to-end video diffusion framework for explicit static-dynamic factorization. DiViD's sequence encoder extracts a global static token from the first frame and per-frame dynamic tokens, explicitly removing static content from the motion code. Its conditional DDPM decoder incorporates three key inductive biases: a shared-noise schedule for temporal consistency, a time-varying KL-based bottleneck that tightens at early timesteps (compressing static information) and relaxes later (enriching dynamics), and cross-attention that routes the global static token to all frames while keeping dynamic tokens frame-specific. An orthogonality regularizer further prevents residual static-dynamic leakage. We evaluate DiViD on real-world benchmarks using swap-based accuracy and cross-leakage metrics. DiViD outperforms state-of-the-art sequential disentanglement methods: it achieves the highest swap-based joint accuracy, preserves static fidelity while improving dynamic transfer, and reduces average cross-leakage.
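Two mechanisms named in the abstract lend themselves to a short worked sketch: the orthogonality regularizer between the global static token and the per-frame dynamic tokens, and the time-varying KL weight that is tight at early diffusion timesteps and relaxes later. The snippet below is a minimal, hypothetical illustration of both, not the authors' implementation; the tensor shapes, the linear form of the schedule, and the function names are assumptions for exposition.
```python
# Hypothetical sketch (not the authors' code) of two DiViD ingredients from the
# abstract: an orthogonality penalty that discourages static-dynamic leakage,
# and a KL weight that is large (tight bottleneck) at early diffusion timesteps
# and smaller (relaxed) at later ones. Shapes and the linear schedule are
# illustrative assumptions.
import torch
import torch.nn.functional as F


def orthogonality_loss(static_tok: torch.Tensor, dyn_toks: torch.Tensor) -> torch.Tensor:
    """Penalize cosine alignment between the global static token (B, D)
    and each per-frame dynamic token (B, T, D)."""
    s = F.normalize(static_tok, dim=-1).unsqueeze(1)  # (B, 1, D)
    d = F.normalize(dyn_toks, dim=-1)                 # (B, T, D)
    cos = (s * d).sum(dim=-1)                         # (B, T) cosine similarities
    return cos.pow(2).mean()                          # zero when tokens are orthogonal


def kl_weight(t: torch.Tensor, t_max: int, w_min: float = 0.1, w_max: float = 1.0) -> torch.Tensor:
    """Assumed monotone schedule: high KL pressure at early timesteps
    (compress static content), lower pressure later (enrich dynamics)."""
    frac = t.float() / t_max                          # 0 = earliest, 1 = latest timestep
    return w_max - (w_max - w_min) * frac


if __name__ == "__main__":
    B, T, D = 4, 16, 256                              # batch, frames, token dim (made up)
    static_tok = torch.randn(B, D)
    dyn_toks = torch.randn(B, T, D)
    loss_orth = orthogonality_loss(static_tok, dyn_toks)
    w = kl_weight(torch.randint(0, 1000, (B,)), t_max=1000)
    print(loss_orth.item(), w.shape)
```
In a full training loop these two terms would be added to the diffusion reconstruction objective with their own weights; the shared-noise schedule and the cross-attention routing of the static token to all frames are omitted from this sketch.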
Related papers
- HAIF-GS: Hierarchical and Induced Flow-Guided Gaussian Splatting for Dynamic Scene [11.906835503107189]
We propose HAIF-GS, a unified framework that enables structured and consistent dynamic modeling through sparse anchor-driven deformation. We show that HAIF-GS significantly outperforms prior dynamic 3DGS methods in rendering quality, temporal coherence, and reconstruction efficiency.
arXiv Detail & Related papers (2025-06-11T08:45:08Z) - Motion-Aware Concept Alignment for Consistent Video Editing [57.08108545219043]
We introduce MoCA-Video (Motion-Aware Concept Alignment in Video), a training-free framework bridging the gap between image-domain semantic mixing and video. Given a generated video and a user-provided reference image, MoCA-Video injects the semantic features of the reference image into a specific object within the video. We evaluate MoCA-Video's performance using standard SSIM, image-level LPIPS, and temporal LPIPS, and introduce a novel metric, CASS (Conceptual Alignment Shift Score), to evaluate the consistency and effectiveness of the visual shifts between the source prompt and the modified video frames.
arXiv Detail & Related papers (2025-06-01T13:28:04Z) - StPR: Spatiotemporal Preservation and Routing for Exemplar-Free Video Class-Incremental Learning [51.003833566279006]
Class-Incremental Learning (CIL) seeks to develop models that continuously learn new action categories over time without forgetting previously acquired knowledge. Existing approaches either rely on exemplar rehearsal, raising concerns over memory and privacy, or adapt static image-based methods that neglect temporal modeling. We propose a unified and exemplar-free VCIL framework that explicitly disentangles and preserves spatiotemporal information.
arXiv Detail & Related papers (2025-05-20T06:46:51Z) - STATIC : Surface Temporal Affine for TIme Consistency in Video Monocular Depth Estimation [14.635179908525389]
Video monocular depth estimation is essential for applications such as autonomous driving, AR/VR, and robotics. Recent transformer-based single-image monocular depth estimation models perform well on single images but struggle with depth consistency across video frames. We propose STATIC, a novel model that learns temporal consistency in static and dynamic areas without additional information.
arXiv Detail & Related papers (2024-12-02T03:53:33Z) - Event-boosted Deformable 3D Gaussians for Dynamic Scene Reconstruction [50.873820265165975]
We introduce the first approach combining event cameras, which capture high-temporal-resolution, continuous motion data, with deformable 3D-GS for dynamic scene reconstruction. We propose a GS-Threshold Joint Modeling strategy, creating a mutually reinforcing process that greatly improves both 3D reconstruction and threshold modeling. We contribute the first event-inclusive 4D benchmark with synthetic and real-world dynamic scenes, on which our method achieves state-of-the-art performance.
arXiv Detail & Related papers (2024-11-25T08:23:38Z) - DATAP-SfM: Dynamic-Aware Tracking Any Point for Robust Structure from Motion in the Wild [85.03973683867797]
This paper proposes a concise, elegant, and robust pipeline to estimate smooth camera trajectories and obtain dense point clouds for casual videos in the wild.
We show that the proposed method achieves state-of-the-art performance in camera pose estimation, even in complex and challenging dynamic scenes.
arXiv Detail & Related papers (2024-11-20T13:01:16Z) - DualAD: Disentangling the Dynamic and Static World for End-to-End Driving [11.379456277711379]
State-of-the-art approaches for autonomous driving integrate multiple sub-tasks of the overall driving task into a single pipeline.
We propose dedicated representations to disentangle dynamic agents and static scene elements.
Our method, titled DualAD, outperforms independently trained single-task networks.
arXiv Detail & Related papers (2024-06-10T13:46:07Z) - Low-Light Video Enhancement via Spatial-Temporal Consistent Decomposition [52.89441679581216]
Low-Light Video Enhancement (LLVE) seeks to restore dynamic or static scenes plagued by severe invisibility and noise. We present an innovative video decomposition strategy that incorporates view-independent and view-dependent components. Our framework consistently outperforms existing methods, establishing a new SOTA performance.
arXiv Detail & Related papers (2024-05-24T15:56:40Z) - Modelling Latent Dynamics of StyleGAN using Neural ODEs [52.03496093312985]
We learn the trajectory of independently inverted latent codes from GANs.
The learned continuous trajectory allows us to perform infinite frame interpolation and consistent video manipulation.
Our method achieves state-of-the-art performance but with much less computation.
arXiv Detail & Related papers (2022-08-23T21:20:38Z) - STVGFormer: Spatio-Temporal Video Grounding with Static-Dynamic Cross-Modal Understanding [68.96574451918458]
We propose a framework named STVGFormer, which models visual-linguistic dependencies with a static branch and a dynamic branch.
Both the static and dynamic branches are designed as cross-modal transformers.
Our proposed method achieved 39.6% vIoU and won first place in the HC-STVG track of the Person in Context Challenge.
arXiv Detail & Related papers (2022-07-06T15:48:58Z) - Towards Robust Unsupervised Disentanglement of Sequential Data -- A Case Study Using Music Audio [17.214062755082065]
Disentangled sequential autoencoders (DSAEs) represent a class of probabilistic graphical models.
We show that the vanilla DSAE suffers from being sensitive to the choice of model architecture and capacity of the dynamic latent variables.
We propose TS-DSAE, a two-stage training framework that first learns sequence-level prior distributions.
arXiv Detail & Related papers (2022-05-12T04:11:25Z)