Dysen-VDM: Empowering Dynamics-aware Text-to-Video Diffusion with LLMs
- URL: http://arxiv.org/abs/2308.13812v2
- Date: Tue, 19 Mar 2024 12:29:54 GMT
- Title: Dysen-VDM: Empowering Dynamics-aware Text-to-Video Diffusion with LLMs
- Authors: Hao Fei, Shengqiong Wu, Wei Ji, Hanwang Zhang, Tat-Seng Chua,
- Abstract summary: Text-to-video (T2V) synthesis has gained increasing attention in the community, in which the recently emerged diffusion models (DMs) have promisingly shown stronger performance than the past approaches.
While existing state-of-the-art DMs are competent to achieve high-resolution video generation, they may largely suffer from key limitations (e.g., action occurrence disorders, crude video motions) with respect to the temporal dynamics modeling, one of the crux of video synthesis.
In this work, we investigate strengthening awareness of video dynamics for DMs, for high-quality T2V generation
- Score: 112.39389727164594
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Text-to-video (T2V) synthesis has gained increasing attention in the community, in which the recently emerged diffusion models (DMs) have promisingly shown stronger performance than the past approaches. While existing state-of-the-art DMs are competent to achieve high-resolution video generation, they may largely suffer from key limitations (e.g., action occurrence disorders, crude video motions) with respect to the intricate temporal dynamics modeling, one of the crux of video synthesis. In this work, we investigate strengthening the awareness of video dynamics for DMs, for high-quality T2V generation. Inspired by human intuition, we design an innovative dynamic scene manager (dubbed as Dysen) module, which includes (step-1) extracting from input text the key actions with proper time-order arrangement, (step-2) transforming the action schedules into the dynamic scene graph (DSG) representations, and (step-3) enriching the scenes in the DSG with sufficient and reasonable details. Taking advantage of the existing powerful LLMs (e.g., ChatGPT) via in-context learning, Dysen realizes (nearly) human-level temporal dynamics understanding. Finally, the resulting video DSG with rich action scene details is encoded as fine-grained spatio-temporal features, integrated into the backbone T2V DM for video generating. Experiments on popular T2V datasets suggest that our Dysen-VDM consistently outperforms prior arts with significant margins, especially in scenarios with complex actions. Codes at https://haofei.vip/Dysen-VDM
Related papers
- Enhancing Motion in Text-to-Video Generation with Decomposed Encoding and Conditioning [26.44634685830323]
We propose a novel framework called DEcomposed MOtion (DEMO) to enhance motion synthesis in Text-to-Video (T2V) generation.
Our method includes a content encoder for static elements and a motion encoder for temporal dynamics, alongside separate content and motion conditioning mechanisms.
We demonstrate DEMO's superior ability to produce videos with enhanced motion dynamics while maintaining high visual quality.
arXiv Detail & Related papers (2024-10-31T17:59:53Z) - Enhancing Video-Language Representations with Structural Spatio-Temporal Alignment [130.15775113897553]
Finsta is a fine-grained structural-temporal alignment learning method.
It consistently improves the existing 13 strong-tuning video-language models.
arXiv Detail & Related papers (2024-06-27T15:23:36Z) - VideoTetris: Towards Compositional Text-to-Video Generation [45.395598467837374]
VideoTetris is a framework that enables compositional T2V generation.
We show that VideoTetris achieves impressive qualitative and quantitative results in T2V generation.
arXiv Detail & Related papers (2024-06-06T17:25:33Z) - VSTAR: Generative Temporal Nursing for Longer Dynamic Video Synthesis [18.806249040835624]
We introduce the concept of Generative Temporal Nursing (GTN), where we aim to alter the generative process on the fly during inference to improve control over the temporal dynamics.
We experimentally showcase the superiority of the proposed approach in generating longer, visually appealing videos over existing open-sourced T2V models.
arXiv Detail & Related papers (2024-03-20T10:58:58Z) - Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation [72.90144343056227]
We explore the visual representations produced from a pre-trained text-to-video (T2V) diffusion model for video understanding tasks.
We introduce a novel framework, termed "VD-IT", tailored with dedicatedly designed components built upon a fixed T2V model.
Our VD-IT achieves highly competitive results, surpassing many existing state-of-the-art methods.
arXiv Detail & Related papers (2024-03-18T17:59:58Z) - Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation [49.298187741014345]
Current methods intertwine spatial content and temporal dynamics together, leading to an increased complexity of text-to-video generation (T2V)
We propose HiGen, a diffusion model-based method that improves performance by decoupling the spatial and temporal factors of videos from two perspectives.
arXiv Detail & Related papers (2023-12-07T17:59:07Z) - GD-VDM: Generated Depth for better Diffusion-based Video Generation [18.039417502897486]
This paper proposes GD-VDM, a novel diffusion model for video generation, demonstrating promising results.
We evaluated GD-VDM on the Cityscapes dataset and found that it generates more diverse and complex scenes compared to natural baselines.
arXiv Detail & Related papers (2023-06-19T21:32:10Z) - Align your Latents: High-Resolution Video Synthesis with Latent
Diffusion Models [71.11425812806431]
Latent Diffusion Models (LDMs) enable high-quality image synthesis while avoiding excessive compute demands.
Here, we apply the LDM paradigm to high-resolution generation, a particularly resource-intensive task.
We focus on two relevant real-world applications: Simulation of in-the-wild driving data and creative content creation with text-to-video modeling.
arXiv Detail & Related papers (2023-04-18T08:30:32Z) - Slow-Fast Visual Tempo Learning for Video-based Action Recognition [78.3820439082979]
Action visual tempo characterizes the dynamics and the temporal scale of an action.
Previous methods capture the visual tempo either by sampling raw videos with multiple rates, or by hierarchically sampling backbone features.
We propose a Temporal Correlation Module (TCM) to extract action visual tempo from low-level backbone features at single-layer remarkably.
arXiv Detail & Related papers (2022-02-24T14:20:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.