Controllable Video Generation with Provable Disentanglement
- URL: http://arxiv.org/abs/2502.02690v1
- Date: Tue, 04 Feb 2025 20:10:20 GMT
- Title: Controllable Video Generation with Provable Disentanglement
- Authors: Yifan Shen, Peiyuan Zhu, Zijian Li, Shaoan Xie, Zeyu Tang, Namrata Deka, Zongfang Liu, Guangyi Chen, Kun Zhang
- Abstract summary: We propose Controllable Video Generative Adversarial Networks (CoVoGAN) to disentangle video concepts.
To enforce the minimal change principle and sufficient change property, we minimize the dimensionality of latent dynamic variables.
Our method significantly improves generation quality and controllability across diverse real-world scenarios.
- Score: 15.139698184254469
- License:
- Abstract: Controllable video generation remains a significant challenge, despite recent advances in generating high-quality and consistent videos. Most existing methods for controlling video generation treat the video as a whole, neglecting intricate fine-grained spatiotemporal relationships, which limits both control precision and efficiency. In this paper, we propose Controllable Video Generative Adversarial Networks (CoVoGAN) to disentangle the video concepts, thus facilitating efficient and independent control over individual concepts. Specifically, following the minimal change principle, we first disentangle static and dynamic latent variables. We then leverage the sufficient change property to achieve component-wise identifiability of dynamic latent variables, enabling independent control over motion and identity. To establish the theoretical foundation, we provide a rigorous analysis demonstrating the identifiability of our approach. Building on these theoretical insights, we design a Temporal Transition Module to disentangle latent dynamics. To enforce the minimal change principle and sufficient change property, we minimize the dimensionality of latent dynamic variables and impose temporal conditional independence. To validate our approach, we integrate this module as a plug-in for GANs. Extensive qualitative and quantitative experiments on various video generation benchmarks demonstrate that our method significantly improves generation quality and controllability across diverse real-world scenarios.
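The abstract's core mechanism — a latent code split into a static (identity) part and a deliberately low-dimensional dynamic (motion) part evolved by a temporal transition module with component-wise, temporally conditionally independent transitions — could be sketched roughly as follows. All names, dimensions, and the transition form are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def temporal_transition(z_dyn, noise, w):
    # Component-wise (element-wise) transition: each dynamic dimension
    # evolves independently given the past, a simple stand-in for the
    # temporal conditional independence the paper enforces.
    return np.tanh(w * z_dyn) + 0.1 * noise  # w is a vector, not a matrix

def generate_latents(T, static_dim=64, dyn_dim=4):
    # Minimal-change principle: keep dyn_dim small relative to static_dim.
    z_static = rng.standard_normal(static_dim)  # fixed identity code
    w = rng.standard_normal(dyn_dim)            # per-dimension transition weights
    z_dyn = rng.standard_normal(dyn_dim)
    frames = []
    for _ in range(T):
        z_dyn = temporal_transition(z_dyn, rng.standard_normal(dyn_dim), w)
        frames.append(np.concatenate([z_static, z_dyn]))  # per-frame latent
    return np.stack(frames)

latents = generate_latents(T=8)
print(latents.shape)  # (8, 68): static part shared across frames, dynamic part varies
```

A GAN generator would then map each per-frame latent to a frame; editing only `z_static` changes identity while leaving motion untouched, and vice versa.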
Related papers
- Dynamic Concepts Personalization from Single Videos [92.62863918003575]
We introduce Set-and-Sequence, a novel framework for personalizing generative video models with dynamic concepts.
Our approach imposes a spatio-temporal weight space within an architecture that does not explicitly separate spatial and temporal features.
Our framework embeds dynamic concepts into the video model's output domain, enabling unprecedented editability and compositionality.
arXiv Detail & Related papers (2025-02-20T18:53:39Z) - Training-Free Motion-Guided Video Generation with Enhanced Temporal Consistency Using Motion Consistency Loss [35.69606926024434]
We propose a simple yet effective solution that combines an initial-noise-based approach with a novel motion consistency loss.
We then design a motion consistency loss to maintain similar feature correlation patterns in the generated video.
This approach improves temporal consistency across various motion control tasks while preserving the benefits of a training-free setup.
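The motion consistency loss described above — keeping the generated video's feature-correlation pattern close to a reference pattern — could be sketched as follows, with per-frame feature vectors and a mean-squared penalty as illustrative assumptions:

```python
import numpy as np

def correlation_matrix(feats):
    # feats: (T, D) per-frame feature vectors. Center and normalize each
    # frame's features, then take pairwise inner products so the matrix
    # captures how frames correlate over time, not their absolute values.
    f = feats - feats.mean(axis=1, keepdims=True)
    f = f / (np.linalg.norm(f, axis=1, keepdims=True) + 1e-8)
    return f @ f.T  # (T, T) temporal correlation pattern

def motion_consistency_loss(ref_feats, gen_feats):
    # Penalize deviation of the generated video's correlation pattern
    # from the reference video's pattern.
    diff = correlation_matrix(ref_feats) - correlation_matrix(gen_feats)
    return float(np.mean(diff ** 2))

rng = np.random.default_rng(1)
ref = rng.standard_normal((6, 32))
print(motion_consistency_loss(ref, ref))  # 0.0 for identical features
```

Because only the correlation *pattern* is matched, the generated video can differ in appearance while preserving the reference motion structure — consistent with the training-free setup.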
arXiv Detail & Related papers (2025-01-13T18:53:08Z) - ConceptMaster: Multi-Concept Video Customization on Diffusion Transformer Models Without Test-Time Tuning [40.70596166863986]
Multi-Concept Video Customization (MCVC) remains a significant challenge.
We introduce ConceptMaster, an innovative framework that effectively tackles the issues of identity decoupling while maintaining concept fidelity in customized videos.
Specifically, we introduce a novel strategy of learning decoupled multi-concept embeddings that are injected into the diffusion models in a standalone manner.
arXiv Detail & Related papers (2025-01-08T18:59:01Z) - InterDyn: Controllable Interactive Dynamics with Video Diffusion Models [50.38647583839384]
We propose InterDyn, a framework that generates videos of interactive dynamics given an initial frame and a control signal encoding the motion of a driving object or actor.
Our key insight is that large video foundation models can act as both neural renderers and implicit physics simulators by learning interactive dynamics from large-scale video data.
arXiv Detail & Related papers (2024-12-16T13:57:02Z) - DiVE: DiT-based Video Generation with Enhanced Control [23.63288169762629]
We propose the first DiT-based framework specifically designed for generating temporally and multi-view consistent videos.
Specifically, the proposed framework leverages a parameter-free spatial view-inflated attention mechanism to guarantee the cross-view consistency.
arXiv Detail & Related papers (2024-09-03T04:29:59Z) - CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion [58.15403987979496]
CREMA is a generalizable, highly efficient, and modular modality-fusion framework for video reasoning.
We propose a novel progressive multimodal fusion design supported by a lightweight fusion module and modality-sequential training strategy.
We validate our method on 7 video-language reasoning tasks assisted by diverse modalities, including VideoQA and Video-Audio/3D/Touch/Thermal QA.
arXiv Detail & Related papers (2024-02-08T18:27:22Z) - TrackDiffusion: Tracklet-Conditioned Video Generation via Diffusion Models [75.20168902300166]
We propose TrackDiffusion, a novel video generation framework affording fine-grained trajectory-conditioned motion control.
A pivotal component of TrackDiffusion is the instance enhancer, which explicitly ensures inter-frame consistency of multiple objects.
The video sequences generated by TrackDiffusion can be used as training data for visual perception models.
arXiv Detail & Related papers (2023-12-01T15:24:38Z) - Learn the Force We Can: Enabling Sparse Motion Control in Multi-Object Video Generation [26.292052071093945]
We propose an unsupervised method to generate videos from a single frame and a sparse motion input.
Our trained model can generate unseen realistic object-to-object interactions.
We show that YODA is on par with or better than state of the art video generation prior work in terms of both controllability and video quality.
arXiv Detail & Related papers (2023-06-06T19:50:02Z) - Generating Long Videos of Dynamic Scenes [66.56925105992472]
We present a video generation model that reproduces object motion, changes in camera viewpoint, and new content that arises over time.
A common failure case is for content to never change due to over-reliance on inductive biases to provide temporal consistency.
arXiv Detail & Related papers (2022-06-07T16:29:51Z) - Relational Self-Attention: What's Missing in Attention for Video Understanding [52.38780998425556]
We introduce a relational feature transform, dubbed relational self-attention (RSA).
Our experiments and ablation studies show that the RSA network substantially outperforms convolution and self-attention counterparts.
arXiv Detail & Related papers (2021-11-02T15:36:11Z) - NewtonianVAE: Proportional Control and Goal Identification from Pixels via Physical Latent Spaces [9.711378389037812]
We introduce a latent dynamics learning framework that is uniquely designed to induce proportional controllability in the latent space.
We show that our learned dynamics model enables proportional control from pixels, dramatically simplifies and accelerates behavioural cloning of vision-based controllers, and provides interpretable goal discovery when applied to imitation learning of switching controllers from demonstration.
arXiv Detail & Related papers (2020-06-02T21:41:38Z)
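The NewtonianVAE entry's claim — a latent space that admits proportional control — can be illustrated with a toy example. The linear latent dynamics and gain below are hypothetical stand-ins chosen only to show why a simple P-controller, `u = gain * (z_goal - z)`, suffices in such a space:

```python
import numpy as np

def p_control_rollout(z0, z_goal, gain=0.5, steps=50):
    # Toy latent dynamics: z_{t+1} = z_t + u_t. With proportional control
    # u_t = gain * (z_goal - z_t), the error shrinks by (1 - gain) each
    # step, so the state converges to the goal — the behaviour a
    # "proportionally controllable" latent space is designed to support.
    z = np.asarray(z0, dtype=float)
    goal = np.asarray(z_goal, dtype=float)
    for _ in range(steps):
        u = gain * (goal - z)  # P-controller acting in latent space
        z = z + u              # apply action through the latent dynamics
    return z

final = p_control_rollout(z0=[2.0, -1.0], z_goal=[0.0, 0.5])
print(np.round(final, 6))  # converges to the goal [0.0, 0.5]
```

In the actual method, `z` would be the encoder's latent state inferred from pixels rather than a hand-set vector; the point is that control reduces to this simple feedback law once the latent dynamics are suitably structured.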
This list is automatically generated from the titles and abstracts of the papers in this site.