VSTAR: Generative Temporal Nursing for Longer Dynamic Video Synthesis
- URL: http://arxiv.org/abs/2403.13501v1
- Date: Wed, 20 Mar 2024 10:58:58 GMT
- Title: VSTAR: Generative Temporal Nursing for Longer Dynamic Video Synthesis
- Authors: Yumeng Li, William Beluch, Margret Keuper, Dan Zhang, Anna Khoreva
- Abstract summary: We introduce the concept of Generative Temporal Nursing (GTN), where we aim to alter the generative process on the fly during inference to improve control over the temporal dynamics.
We experimentally showcase the superiority of the proposed approach in generating longer, visually appealing videos over existing open-sourced T2V models.
- Score: 18.806249040835624
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite tremendous progress in the field of text-to-video (T2V) synthesis, open-sourced T2V diffusion models struggle to generate longer videos with dynamically varying and evolving content. They tend to synthesize quasi-static videos, ignoring the necessary visual change-over-time implied in the text prompt. At the same time, scaling these models to enable longer, more dynamic video synthesis often remains computationally intractable. To address this challenge, we introduce the concept of Generative Temporal Nursing (GTN), where we aim to alter the generative process on the fly during inference to improve control over the temporal dynamics and enable generation of longer videos. We propose a method for GTN, dubbed VSTAR, which consists of two key ingredients: 1) Video Synopsis Prompting (VSP) - automatic generation of a video synopsis based on the original single prompt leveraging LLMs, which gives accurate textual guidance to different visual states of longer videos, and 2) Temporal Attention Regularization (TAR) - a regularization technique to refine the temporal attention units of the pre-trained T2V diffusion models, which enables control over the video dynamics. We experimentally showcase the superiority of the proposed approach in generating longer, visually appealing videos over existing open-sourced T2V models. We additionally analyze the temporal attention maps realized with and without VSTAR, demonstrating the importance of applying our method to mitigate neglect of the desired visual change over time.
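To make the two ingredients concrete, the following is a minimal sketch in PyTorch of how they could be wired at inference time. It assumes the pre-trained T2V model exposes softmax-normalized temporal attention maps and that an LLM client is available as a plain callable; the helper names (make_synopsis_prompts, band_prior, regularize_temporal_attention), the banded-Gaussian prior, and the blending weight alpha are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of the two VSTAR ingredients described in the abstract.
# Assumptions (not taken from the paper's code): the T2V model exposes per-layer
# temporal attention probabilities of shape (..., frames, frames), and an LLM
# client is available as a plain callable str -> str.
import torch


def make_synopsis_prompts(prompt: str, num_segments: int, llm) -> list[str]:
    """Video Synopsis Prompting (VSP), sketched: ask an LLM to expand a single
    prompt into `num_segments` stage descriptions that guide different visual
    states of the longer video. `llm` is a hypothetical callable."""
    instruction = (
        f"Describe, in {num_segments} numbered steps, how the scene "
        f"'{prompt}' visually evolves over time. One short sentence per step."
    )
    reply = llm(instruction)
    lines = [line.strip() for line in reply.splitlines() if line.strip()]
    return lines[:num_segments]


def band_prior(num_frames: int, sigma: float = 2.0) -> torch.Tensor:
    """A banded (Gaussian-over-frame-distance) matrix: frame i is encouraged
    to attend mostly to its temporal neighbours. Rows sum to 1."""
    idx = torch.arange(num_frames)
    dist = (idx[None, :] - idx[:, None]).abs().float()
    prior = torch.exp(-(dist ** 2) / (2 * sigma ** 2))
    return prior / prior.sum(dim=-1, keepdim=True)


def regularize_temporal_attention(attn: torch.Tensor, alpha: float = 0.5,
                                  sigma: float = 2.0) -> torch.Tensor:
    """Temporal Attention Regularization (TAR), sketched: blend the model's
    temporal attention map with the banded prior at inference time.
    attn: (..., frames, frames) softmax-normalized attention probabilities."""
    num_frames = attn.shape[-1]
    prior = band_prior(num_frames, sigma).to(device=attn.device, dtype=attn.dtype)
    out = (1 - alpha) * attn + alpha * prior
    return out / out.sum(dim=-1, keepdim=True)  # keep rows normalized
```

In such a sketch, increasing alpha or decreasing sigma concentrates attention on temporally nearby frames; the exact form and placement of the regularization in VSTAR may differ from this illustration.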
Related papers
- BroadWay: Boost Your Text-to-Video Generation Model in a Training-free Way [72.1984861448374]
We present BroadWay, a training-free method to improve the quality of text-to-video generation without introducing additional parameters or increasing memory or sampling time.
Specifically, BroadWay comprises two principal components; the first, Temporal Self-Guidance, improves the structural plausibility and temporal consistency of generated videos by reducing the disparity between the temporal attention maps across decoder blocks.
arXiv Detail & Related papers (2024-10-08T17:56:33Z)
- VideoTetris: Towards Compositional Text-to-Video Generation [45.395598467837374]
VideoTetris is a framework that enables compositional T2V generation.
We show that VideoTetris achieves impressive qualitative and quantitative results in T2V generation.
arXiv Detail & Related papers (2024-06-06T17:25:33Z)
- Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation [72.90144343056227]
We explore the visual representations produced from a pre-trained text-to-video (T2V) diffusion model for video understanding tasks.
We introduce a novel framework, termed "VD-IT", with dedicated components built on top of a fixed T2V model.
Our VD-IT achieves highly competitive results, surpassing many existing state-of-the-art methods.
arXiv Detail & Related papers (2024-03-18T17:59:58Z)
- ConsistI2V: Enhancing Visual Consistency for Image-to-Video Generation [37.05422543076405]
Image-to-video (I2V) generation aims to use the initial frame (alongside a text prompt) to create a video sequence.
Existing methods often struggle to preserve the integrity of the subject, background, and style from the first frame.
We propose ConsistI2V, a diffusion-based method to enhance visual consistency for I2V generation.
arXiv Detail & Related papers (2024-02-06T19:08:18Z)
- Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation [49.298187741014345]
Current methods intertwine spatial content and temporal dynamics, which increases the complexity of text-to-video (T2V) generation.
We propose HiGen, a diffusion model-based method that improves performance by decoupling the spatial and temporal factors of videos from two perspectives.
arXiv Detail & Related papers (2023-12-07T17:59:07Z)
- FreeNoise: Tuning-Free Longer Video Diffusion via Noise Rescheduling [85.60543452539076]
Existing video generation models are typically trained on a limited number of frames, resulting in the inability to generate high-fidelity long videos during inference.
This study explores the potential of extending the text-driven capability to generate longer videos conditioned on multiple texts.
We propose FreeNoise, a tuning-free and time-efficient paradigm to enhance the generative capabilities of pretrained video diffusion models.
arXiv Detail & Related papers (2023-10-23T17:59:58Z)
- Dysen-VDM: Empowering Dynamics-aware Text-to-Video Diffusion with LLMs [112.39389727164594]
Text-to-video (T2V) synthesis has gained increasing attention in the community, with recently emerged diffusion models (DMs) showing stronger performance than earlier approaches.
While existing state-of-the-art DMs are capable of high-resolution video generation, they can suffer from key limitations (e.g., action occurrence disorders, crude video motions) in temporal dynamics modeling, one of the cruxes of video synthesis.
In this work, we investigate strengthening the awareness of video dynamics in DMs for high-quality T2V generation.
arXiv Detail & Related papers (2023-08-26T08:31:48Z)
- DirecT2V: Large Language Models are Frame-Level Directors for Zero-Shot Text-to-Video Generation [37.25815760042241]
This paper introduces a new framework, dubbed DirecT2V, for zero-shot text-to-video (T2V) generation.
We equip a diffusion model with a novel value mapping method and dual-softmax filtering, which do not require any additional training.
The experimental results validate the effectiveness of our framework in producing visually coherent and storyful videos.
arXiv Detail & Related papers (2023-05-23T17:57:09Z)
- Control-A-Video: Controllable Text-to-Video Diffusion Models with Motion Prior and Reward Feedback Learning [50.60891619269651]
Control-A-Video is a controllable T2V diffusion model that can generate videos conditioned on text prompts and reference control maps like edge and depth maps.
We propose novel strategies to incorporate content and motion priors into the diffusion-based generation process.
Our framework generates higher-quality, more consistent videos compared to existing state-of-the-art methods in controllable text-to-video generation.
arXiv Detail & Related papers (2023-05-23T09:03:19Z)