Related papers: SwitchCraft: Training-Free Multi-Event Video Generation with Attention Controls

URL: http://arxiv.org/abs/2602.23956v1
Date: Fri, 27 Feb 2026 11:59:06 GMT
Title: SwitchCraft: Training-Free Multi-Event Video Generation with Attention Controls
Authors: Qianxun Xu, Chenxi Song, Yujun Cai, Chi Zhang
Abstract summary: We present SwitchCraft, a training-free framework for multi-event video generation. SwitchCraft steers frame-level attention to align with relevant event prompts. Experiments demonstrate that SwitchCraft substantially improves prompt alignment, event clarity, and scene consistency.
Score: 21.71988638522276
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent advances in text-to-video diffusion models have enabled high-fidelity and temporally coherent video synthesis. However, current models are predominantly optimized for single-event generation. When handling multi-event prompts, without explicit temporal grounding, such models often produce blended or collapsed scenes that break the intended narrative. To address this limitation, we present SwitchCraft, a training-free framework for multi-event video generation. Our key insight is that uniform prompt injection across time ignores the correspondence between events and frames. To this end, we introduce Event-Aligned Query Steering (EAQS), which steers frame-level attention to align with relevant event prompts. Furthermore, we propose Auto-Balance Strength Solver (ABSS), which adaptively balances steering strength to preserve temporal consistency and visual fidelity. Extensive experiments demonstrate that SwitchCraft substantially improves prompt alignment, event clarity, and scene consistency compared with existing baselines, offering a simple yet effective solution for multi-event video generation.
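Illustration: the abstract does not give the EAQS or ABSS formulation, so the following is only a toy sketch of the general idea that each frame should attend mainly to the prompt tokens of its own event. It is not the authors' method. It swaps in a simple additive bias on attention logits rather than steering queries, splits the clip into equal spans per event, and uses a fixed `strength` instead of a solved one. The names `frame_to_event`, `steered_attention`, `token_event`, and `strength` are all hypothetical.

```python
import math


def frame_to_event(num_frames, num_events):
    """Assign each frame to an event by splitting the clip into equal spans (an assumption)."""
    return [f * num_events // num_frames for f in range(num_frames)]


def steered_attention(logits, token_event, active_event, strength):
    """Softmax over prompt tokens after biasing logits toward the active event.

    Tokens of the active event get +strength, tokens of other events get -strength,
    and shared tokens (event label None) are left unchanged.
    """
    biased = []
    for logit, event in zip(logits, token_event):
        if event is None:
            biased.append(logit)
        elif event == active_event:
            biased.append(logit + strength)
        else:
            biased.append(logit - strength)
    peak = max(biased)  # subtract the maximum for numerical stability
    exps = [math.exp(b - peak) for b in biased]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical prompt: one shared token, two tokens for event 0, two for event 1.
token_event = [None, 0, 0, 1, 1]
logits = [0.0] * len(token_event)  # uniform logits model uniform prompt injection

schedule = frame_to_event(num_frames=8, num_events=2)
weights = [steered_attention(logits, token_event, event, strength=1.0) for event in schedule]
```

With uniform logits and no bias every frame would weight all tokens equally; with the bias, early frames put more weight on event 0 tokens and late frames on event 1 tokens. Choosing `strength` adaptively is the role the abstract assigns to ABSS, which this sketch does not attempt.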

Related papers

AlcheMinT: Fine-grained Temporal Control for Multi-Reference Consistent Video Generation [58.844504598618094]
We propose AlcheMinT, a unified framework that introduces explicit timestamps conditioning for subject-driven video generation. Our approach introduces a novel positional encoding mechanism that unlocks the encoding of temporal intervals, associated in our case with subject identities. We incorporate subject-descriptive text tokens to strengthen binding between visual identity and video captions, mitigating ambiguity during generation.
arXiv Detail & Related papers (2025-12-11T18:59:34Z)
EventSTU: Event-Guided Efficient Spatio-Temporal Understanding for Video Large Language Models [56.16721798968254]
We propose an event-guided, training-free framework for efficient understanding, named EventSTU. In the temporal domain, we design a coarse-to-fine sampling algorithm that exploits the change-triggered property of event cameras to eliminate redundant frames. In the spatial domain, we design an adaptive token pruning algorithm that leverages the saliency of events as a zero-cost prior to guide spatial reduction.
arXiv Detail & Related papers (2025-11-24T09:30:02Z)
When and Where do Events Switch in Multi-Event Video Generation? [59.9025263656304]
Existing methods that extend to multi-event generation omit an inspection of the intrinsic factor in event shifting. This work introduces MEve, a self-curated prompt suite for evaluating multi-event text-to-video (T2V) generation.
arXiv Detail & Related papers (2025-10-03T14:31:56Z)
STOP: Integrated Spatial-Temporal Dynamic Prompting for Video Understanding [48.12128042470839]
We propose an integrated Spatial-TempOral dynamic Prompting (STOP) model. It consists of two complementary modules, the intra-frame spatial prompting and inter-frame temporal prompting. STOP consistently achieves superior performance against state-of-the-art methods.
arXiv Detail & Related papers (2025-03-20T09:16:20Z)
Text2Story: Advancing Video Storytelling with Text Guidance [19.901781116843942]
We introduce a novel storytelling framework that achieves this by integrating scene and action prompts through dynamics-inspired prompt mixing. We propose a dynamics-informed prompt weighting mechanism that adaptively balances the influence of scene and action prompts at each diffusion timestep. To further enhance motion continuity, we incorporate a semantic action representation to encode high-level action semantics into the blending process.
arXiv Detail & Related papers (2025-03-08T19:04:36Z)
DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation [54.30327187663316]
DiTCtrl is the first training-free multi-prompt video generation method under MM-DiT architectures. We analyze MM-DiT's attention mechanism, finding that the 3D full attention behaves similarly to that of the cross/self-attention blocks in the UNet-like diffusion models. Based on our careful design, the video generated by DiTCtrl achieves smooth transitions and consistent object motion given multiple sequential prompts.
arXiv Detail & Related papers (2024-12-24T18:51:19Z)
Mind the Time: Temporally-Controlled Multi-Event Video Generation [65.05423863685866]
We present MinT, a multi-event video generator with temporal control. Our key insight is to bind each event to a specific period in the generated video, which allows the model to focus on one event at a time. For the first time in the literature, our model offers control over the timing of events in generated videos.
arXiv Detail & Related papers (2024-12-06T18:52:20Z)

This list is automatically generated from the titles and abstracts of the papers on this site.