Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation
- URL: http://arxiv.org/abs/2312.04483v1
- Date: Thu, 7 Dec 2023 17:59:07 GMT
- Title: Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation
- Authors: Zhiwu Qing, Shiwei Zhang, Jiayu Wang, Xiang Wang, Yujie Wei, Yingya
Zhang, Changxin Gao, Nong Sang
- Abstract summary: Current methods intertwine spatial content and temporal dynamics together, leading to an increased complexity of text-to-video generation (T2V)
We propose HiGen, a diffusion model-based method that improves performance by decoupling the spatial and temporal factors of videos from two perspectives.
- Score: 49.298187741014345
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite diffusion models having shown powerful abilities to generate
photorealistic images, generating videos that are realistic and diverse still
remains in its infancy. One of the key reasons is that current methods
intertwine spatial content and temporal dynamics together, leading to a notably
increased complexity of text-to-video generation (T2V). In this work, we
propose HiGen, a diffusion model-based method that improves performance by
decoupling the spatial and temporal factors of videos from two perspectives,
i.e., structure level and content level. At the structure level, we decompose
the T2V task into two steps, including spatial reasoning and temporal
reasoning, using a unified denoiser. Specifically, we generate spatially
coherent priors using text during spatial reasoning and then generate
temporally coherent motions from these priors during temporal reasoning. At the
content level, we extract two subtle cues from the content of the input video
that can express motion and appearance changes, respectively. These two cues
then guide the model's training for generating videos, enabling flexible
content variations and enhancing temporal stability. Through the decoupled
paradigm, HiGen can effectively reduce the complexity of this task and generate
realistic videos with semantics accuracy and motion stability. Extensive
experiments demonstrate the superior performance of HiGen over the
state-of-the-art T2V methods.
Related papers
- BroadWay: Boost Your Text-to-Video Generation Model in a Training-free Way [72.1984861448374]
We present BroadWay, a training-free method to improve the quality of text-to-video generation without introducing additional parameters, augmenting memory or sampling time.
Specifically, BroadWay is composed of two principal components: 1) Temporal Self-Guidance improves the structural plausibility and temporal consistency of generated videos by reducing the disparity between the temporal attention maps across various decoder blocks.
arXiv Detail & Related papers (2024-10-08T17:56:33Z) - Enhancing Video-Language Representations with Structural Spatio-Temporal Alignment [130.15775113897553]
Finsta is a fine-grained structural-temporal alignment learning method.
It consistently improves the existing 13 strong-tuning video-language models.
arXiv Detail & Related papers (2024-06-27T15:23:36Z) - VideoTetris: Towards Compositional Text-to-Video Generation [45.395598467837374]
VideoTetris is a framework that enables compositional T2V generation.
We show that VideoTetris achieves impressive qualitative and quantitative results in T2V generation.
arXiv Detail & Related papers (2024-06-06T17:25:33Z) - S2DM: Sector-Shaped Diffusion Models for Video Generation [2.0270353391739637]
We propose a novel Sector-Shaped Diffusion Model (S2DM) for video generation.
S2DM can generate a group of intrinsically related data sharing the same semantic and intrinsically related features.
We show that, without additional training, our model integrated with another temporal conditions generative model can still achieve comparable performance with existing works.
arXiv Detail & Related papers (2024-03-20T08:50:15Z) - Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World
Video Super-Resolution [65.91317390645163]
Upscale-A-Video is a text-guided latent diffusion framework for video upscaling.
It ensures temporal coherence through two key mechanisms: locally, it integrates temporal layers into U-Net and VAE-Decoder, maintaining consistency within short sequences.
It also offers greater flexibility by allowing text prompts to guide texture creation and adjustable noise levels to balance restoration and generation.
arXiv Detail & Related papers (2023-12-11T18:54:52Z) - Decouple Content and Motion for Conditional Image-to-Video Generation [6.634105805557556]
conditional image-to-video (cI2V) generation is to create a believable new video by beginning with the condition, i.e., one image and text.
Previous cI2V generation methods conventionally perform in RGB pixel space, with limitations in modeling motion consistency and visual continuity.
We propose a novel approach by disentangling the target RGB pixels into two distinct components: spatial content and temporal motions.
arXiv Detail & Related papers (2023-11-24T06:08:27Z) - Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation [55.36617538438858]
We propose a novel approach that strengthens the interaction between spatial and temporal perceptions.
We curate a large-scale and open-source video dataset called HD-VG-130M.
arXiv Detail & Related papers (2023-05-18T11:06:15Z) - Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding [112.3913646778859]
We propose a simple yet effective video-language modeling framework, S-ViLM.
It includes two novel designs, inter-clip spatial grounding and intra-clip temporal grouping, to promote learning region-object alignment and temporal-aware features.
S-ViLM surpasses the state-of-the-art methods substantially on four representative downstream tasks.
arXiv Detail & Related papers (2023-03-28T22:45:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.