Related papers: TempoControl: Temporal Attention Guidance for Text-to-Video Models

URL: http://arxiv.org/abs/2510.02226v1
Date: Thu, 02 Oct 2025 17:13:35 GMT
Title: TempoControl: Temporal Attention Guidance for Text-to-Video Models
Authors: Shira Schiber, Ofir Lindenbaum, Idan Schwartz
Abstract summary: We introduce TempoControl, a method that allows for temporal alignment of visual concepts during inference. Our method steers attention using three complementary principles: aligning its temporal shape with a control signal, amplifying it where visibility is needed, and maintaining spatial focus. We demonstrate its effectiveness across various video generation applications, including temporal reordering for single and multiple objects, as well as action and audio-aligned generation.
Score: 18.49685485536669
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent advances in generative video models have enabled the creation of high-quality videos based on natural language prompts. However, these models frequently lack fine-grained temporal control, meaning they do not allow users to specify when particular visual elements should appear within a generated sequence. In this work, we introduce TempoControl, a method that allows for temporal alignment of visual concepts during inference, without requiring retraining or additional supervision. TempoControl utilizes cross-attention maps, a key component of text-to-video diffusion models, to guide the timing of concepts through a novel optimization approach. Our method steers attention using three complementary principles: aligning its temporal shape with a control signal (via correlation), amplifying it where visibility is needed (via energy), and maintaining spatial focus (via entropy). TempoControl allows precise control over timing while ensuring high video quality and diversity. We demonstrate its effectiveness across various video generation applications, including temporal reordering for single and multiple objects, as well as action and audio-aligned generation.
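
The abstract names three principles (correlation, energy, entropy) without giving formulas. The following is a minimal sketch of how such terms could be computed from a single token's cross-attention map, not the authors' implementation. The array layout `(T, H, W)`, the function names, the exact form of each term, and the loss weights are all assumptions made for illustration; in an actual guidance loop the loss would be differentiated with respect to the latents in an autodiff framework rather than evaluated in NumPy.

```python
import numpy as np

EPS = 1e-8  # numerical guard; value is an arbitrary choice


def temporal_profile(attn):
    """Collapse a (T, H, W) attention map to one attention mass per frame."""
    return attn.reshape(attn.shape[0], -1).sum(axis=1)


def correlation_term(attn, control):
    """Pearson correlation between the temporal profile and the control signal."""
    a = temporal_profile(attn)
    a_c = a - a.mean()
    c_c = control - control.mean()
    denom = np.linalg.norm(a_c) * np.linalg.norm(c_c) + EPS
    return float((a_c * c_c).sum() / denom)


def energy_term(attn, control):
    """Control-weighted average attention mass (high where visibility is wanted)."""
    a = temporal_profile(attn)
    weights = control / (control.sum() + EPS)
    return float((a * weights).sum())


def spatial_entropy_term(attn):
    """Mean per-frame entropy of the spatially normalized attention map."""
    flat = attn.reshape(attn.shape[0], -1)
    p = flat / (flat.sum(axis=1, keepdims=True) + EPS)
    return float(-(p * np.log(p + EPS)).sum(axis=1).mean())


def guidance_loss(attn, control, w_corr=1.0, w_energy=1.0, w_entropy=1.0):
    """Hypothetical combined objective: reward correlation and energy,
    penalize diffuse (high-entropy) spatial attention."""
    return (
        -w_corr * correlation_term(attn, control)
        - w_energy * energy_term(attn, control)
        + w_entropy * spatial_entropy_term(attn)
    )


# Toy example: the concept should be visible only in the second half of 8 frames.
control = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)
aligned = np.full((8, 4, 4), 0.01)   # weak, diffuse attention everywhere
aligned[4:] = 0.0
aligned[4:, 1, 2] = 1.0              # strong, focused attention in frames 4..7
misaligned = aligned[::-1].copy()    # same maps, wrong timing

print(guidance_loss(aligned, control) < guidance_loss(misaligned, control))
```

In this toy setup the spatial entropy is identical for both inputs, so the difference in loss comes entirely from the correlation and energy terms, which is the part that encodes timing.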

Related papers

OmniTransfer: All-in-one Framework for Spatio-temporal Video Transfer [38.324957777123664]
We propose an Omni-temporal framework for unified video transfer. It leverages multi-view information across video frames to enhance appearance consistency. It exploits temporal cues to enable fine-grained temporal control.
arXiv Detail & Related papers (2026-01-20T18:58:11Z)
AlcheMinT: Fine-grained Temporal Control for Multi-Reference Consistent Video Generation [58.844504598618094]
We propose AlcheMinT, a unified framework that introduces explicit timestamps conditioning for subject-driven video generation. Our approach introduces a novel positional encoding mechanism that unlocks the encoding of temporal intervals, associated in our case with subject identities. We incorporate subject-descriptive text tokens to strengthen binding between visual identity and video captions, mitigating ambiguity during generation.
arXiv Detail & Related papers (2025-12-11T18:59:34Z)
Tracktention: Leveraging Point Tracking to Attend Videos Faster and Better [61.381599921020175]
Temporal consistency is critical in video prediction to ensure that outputs are coherent and free of artifacts. Traditional methods, such as temporal attention and 3D convolution, may struggle with significant object motion. We propose the Tracktention Layer, a novel architectural component that explicitly integrates motion information using point tracks.
arXiv Detail & Related papers (2025-03-25T17:58:48Z)
Mind the Time: Temporally-Controlled Multi-Event Video Generation [65.05423863685866]
We present MinT, a multi-event video generator with temporal control. Our key insight is to bind each event to a specific period in the generated video, which allows the model to focus on one event at a time. For the first time in the literature, our model offers control over the timing of events in generated videos.
arXiv Detail & Related papers (2024-12-06T18:52:20Z)
Seeing Beyond Views: Multi-View Driving Scene Video Generation with Holistic Attention [61.3281618482513]
We present CogDriving, a novel network designed for synthesizing high-quality multi-view driving videos. CogDriving leverages a Diffusion Transformer architecture with holistic-4D attention modules, enabling simultaneous associations across the dimensions. CogDriving demonstrates strong performance on the nuScenes validation set, achieving an FVD score of 37.8, highlighting its ability to generate realistic driving videos.
arXiv Detail & Related papers (2024-12-04T18:02:49Z)
ByTheWay: Boost Your Text-to-Video Generation Model to Higher Quality in a Training-free Way [72.1984861448374]
ByTheWay is a training-free method to improve the quality of text-to-video generation without introducing additional parameters, augmenting memory or sampling time. It improves structural plausibility and temporal consistency of generated videos by reducing the disparity between the temporal attention maps across various decoder blocks. It enhances the magnitude and richness of motion by amplifying the energy of the map.
arXiv Detail & Related papers (2024-10-08T17:56:33Z)
DiVE: DiT-based Video Generation with Enhanced Control [23.63288169762629]
We propose the first DiT-based framework specifically designed for generating temporally and multi-view consistent videos. Specifically, the proposed framework leverages a parameter-free spatial view-inflated attention mechanism to guarantee the cross-view consistency.
arXiv Detail & Related papers (2024-09-03T04:29:59Z)
VideoComposer: Compositional Video Synthesis with Motion Controllability [52.4714732331632]
VideoComposer allows users to flexibly compose a video with textual conditions, spatial conditions, and more importantly temporal conditions. We introduce the motion vector from compressed videos as an explicit control signal to provide guidance regarding temporal dynamics. In addition, we develop a Spatio-Temporal Condition encoder (STC-encoder) that serves as a unified interface to effectively incorporate the spatial and temporal relations of sequential inputs.
arXiv Detail & Related papers (2023-06-03T06:29:02Z)
Control-A-Video: Controllable Text-to-Video Diffusion Models with Motion Prior and Reward Feedback Learning [50.60891619269651]
Control-A-Video is a controllable T2V diffusion model that can generate videos conditioned on text prompts and reference control maps like edge and depth maps. We propose novel strategies to incorporate content prior and motion prior into the diffusion-based generation process. Our framework generates higher-quality, more consistent videos compared to existing state-of-the-art methods in controllable text-to-video generation.
arXiv Detail & Related papers (2023-05-23T09:03:19Z)

This list is automatically generated from the titles and abstracts of the papers on this site.