ShotAdapter: Text-to-Multi-Shot Video Generation with Diffusion Models
- URL: http://arxiv.org/abs/2505.07652v1
- Date: Mon, 12 May 2025 15:22:28 GMT
- Title: ShotAdapter: Text-to-Multi-Shot Video Generation with Diffusion Models
- Authors: Ozgur Kara, Krishna Kumar Singh, Feng Liu, Duygu Ceylan, James M. Rehg, Tobias Hinz
- Abstract summary: Current diffusion-based text-to-video methods are limited to producing short video clips of a single shot. We propose a framework that includes a dataset collection pipeline and architectural extensions to video diffusion models to enable text-to-multi-shot video generation. Our approach enables generation of multi-shot videos as a single video with full attention across all frames of all shots.
- Score: 37.70850513700251
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Current diffusion-based text-to-video methods are limited to producing short video clips of a single shot and lack the capability to generate multi-shot videos with discrete transitions where the same character performs distinct activities across the same or different backgrounds. To address this limitation, we propose a framework that includes a dataset collection pipeline and architectural extensions to video diffusion models to enable text-to-multi-shot video generation. Our approach enables generation of multi-shot videos as a single video with full attention across all frames of all shots, ensuring character and background consistency, and allows users to control the number, duration, and content of shots through shot-specific conditioning. This is achieved by incorporating a transition token into the text-to-video model to control at which frames a new shot begins, and a local attention masking strategy that controls the transition token's effect and allows shot-specific prompting. To obtain training data, we propose a novel data collection pipeline to construct a multi-shot video dataset from existing single-shot video datasets. Extensive experiments demonstrate that fine-tuning a pre-trained text-to-video model for a few thousand iterations is enough for the model to subsequently generate multi-shot videos with shot-specific control, outperforming the baselines. More details can be found at https://shotadapter.github.io/
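To make the shot-specific conditioning concrete, below is a minimal, hypothetical sketch of how a shot-local attention mask of this kind could be assembled. The token layout (one token per frame, per-shot prompt tokens appended after the video tokens), the function names, and the omission of the transition token embedding itself are assumptions made for illustration; this does not reproduce ShotAdapter's actual implementation.

```python
import torch


def shot_boundary_frames(frames_per_shot):
    """Frame indices at which a new shot begins (where a transition token
    would be injected in the paper's scheme; hypothetical reconstruction)."""
    boundaries, start = [], 0
    for f_len in frames_per_shot[:-1]:
        start += f_len
        boundaries.append(start)
    return boundaries


def build_shot_attention_mask(frames_per_shot, text_tokens_per_shot):
    """Boolean attention mask for a multi-shot token sequence.

    Assumed layout: all video-frame tokens first (one token per frame for
    brevity), followed by each shot's text-prompt tokens. Video tokens attend
    to every video token (full attention across shots, for character and
    background consistency), while each shot's text tokens only interact with
    that shot's frames (shot-specific prompting).
    """
    n_video = sum(frames_per_shot)
    n_text = sum(text_tokens_per_shot)
    n = n_video + n_text
    mask = torch.zeros(n, n, dtype=torch.bool)

    # Full attention among all video-frame tokens, across every shot.
    mask[:n_video, :n_video] = True

    # Restrict each shot's text tokens to that shot's frame tokens.
    f_start, t_start = 0, n_video
    for f_len, t_len in zip(frames_per_shot, text_tokens_per_shot):
        f_slice = slice(f_start, f_start + f_len)
        t_slice = slice(t_start, t_start + t_len)
        mask[f_slice, t_slice] = True  # frames read their own shot's prompt
        mask[t_slice, f_slice] = True  # prompt tokens read their shot's frames
        mask[t_slice, t_slice] = True  # text tokens attend within their prompt
        f_start += f_len
        t_start += t_len
    return mask


# Example: a 3-shot video with 16, 24, and 16 frames and 8 text tokens per shot.
mask = build_shot_attention_mask([16, 24, 16], [8, 8, 8])
print(mask.shape)                          # torch.Size([80, 80])
print(shot_boundary_frames([16, 24, 16]))  # [16, 40]
```

A mask like this could be passed as the attention mask of a standard attention layer; the frame indices returned by the hypothetical `shot_boundary_frames` helper correspond to the positions where the paper's transition token would mark the start of a new shot.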
Related papers
- IPFormer-VideoLLM: Enhancing Multi-modal Video Understanding for Multi-shot Scenes [20.662082715151886]
We introduce a new dataset termed MultiClip-Bench, featuring dense descriptions and instruction-based question-answering pairs tailored for multi-shot scenarios. We then contribute a new model, IPFormer-VideoLLM, which injects instance-level features as instance prompts through an efficient attention-based connector.
arXiv Detail & Related papers (2025-06-26T09:30:57Z) - Long Context Tuning for Video Generation [63.060794860098795]
Long Context Tuning (LCT) is a training paradigm that expands the context window of pre-trained single-shot video diffusion models. Our method expands full attention mechanisms from individual shots to encompass all shots within a scene. Experiments demonstrate coherent multi-shot scenes and emerging capabilities, including compositional generation and interactive shot extension.
arXiv Detail & Related papers (2025-03-13T17:40:07Z) - BlobGEN-Vid: Compositional Text-to-Video Generation with Blob Video Representations [82.94002870060045]
Existing video generation models struggle to follow complex text prompts and synthesize multiple objects. We develop a blob-grounded video diffusion model named BlobGEN-Vid that allows users to control object motions and fine-grained object appearance. We show that our framework is model-agnostic and build BlobGEN-Vid on both U-Net- and DiT-based video diffusion models.
arXiv Detail & Related papers (2025-01-13T19:17:06Z) - Video Diffusion Transformers are In-Context Learners [31.736838809714726]
This paper investigates a solution for enabling in-context capabilities of video diffusion transformers. We propose a simple pipeline to leverage in-context generation by concatenating videos along the spatial or temporal dimension. Our framework presents a valuable tool for the research community and offers critical insights for advancing product-level controllable video generation systems.
arXiv Detail & Related papers (2024-12-14T10:39:55Z) - Mind the Time: Temporally-Controlled Multi-Event Video Generation [65.05423863685866]
We present MinT, a multi-event video generator with temporal control. Our key insight is to bind each event to a specific period in the generated video, which allows the model to focus on one event at a time. For the first time in the literature, our model offers control over the timing of events in generated videos.
arXiv Detail & Related papers (2024-12-06T18:52:20Z) - VideoGen-of-Thought: Step-by-step generating multi-shot video with minimal manual intervention [70.61101071902596]
Current video generation models excel at short clips but fail to produce cohesive multi-shot narratives due to disjointed visual dynamics and fractured storylines. We introduce VideoGen-of-Thought (VGoT), a step-by-step framework that automates multi-shot video synthesis from a single sentence. VGoT generates multi-shot videos that outperform state-of-the-art baselines by 20.4% in within-shot face consistency and 17.4% in style consistency.
arXiv Detail & Related papers (2024-12-03T08:33:50Z) - Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the need to model video dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is both capable of comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z) - VideoStudio: Generating Consistent-Content and Multi-Scene Videos [88.88118783892779]
VideoStudio is a framework for consistent-content and multi-scene video generation.
VideoStudio leverages Large Language Models (LLMs) to convert the input prompt into a comprehensive multi-scene script.
VideoStudio outperforms the SOTA video generation models in terms of visual quality, content consistency, and user preference.
arXiv Detail & Related papers (2024-01-02T15:56:48Z) - MEVG: Multi-event Video Generation with Text-to-Video Models [18.06640097064693]
We introduce a novel diffusion-based video generation method that generates a video showing multiple events given multiple individual sentences from the user.
Our method does not require a large-scale video dataset since it uses a pre-trained text-to-video generative model without fine-tuning.
Our proposed method is superior to other video-generative models in terms of temporal coherency of content and semantics.
arXiv Detail & Related papers (2023-12-07T06:53:25Z) - VideoDreamer: Customized Multi-Subject Text-to-Video Generation with Disen-Mix Finetuning on Language-Video Foundation Models [43.46536102838717]
VideoDreamer is a novel framework for customized multi-subject text-to-video generation. It can generate temporally consistent text-guided videos that faithfully preserve the visual features of the given multiple subjects.
arXiv Detail & Related papers (2023-11-02T04:38:50Z) - VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning [62.51232333352754]
VideoDirectorGPT is a novel framework for consistent multi-scene video generation.
Our proposed framework substantially improves layout and movement control in both single- and multi-scene video generation.
arXiv Detail & Related papers (2023-09-26T17:36:26Z) - SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning [40.556222166309524]
We present SwinBERT, an end-to-end transformer-based model for video captioning.
Our method adopts a video transformer to encode spatial-temporal representations that can adapt to variable lengths of video input.
Based on this model architecture, we show that video captioning can benefit significantly from more densely sampled video frames.
arXiv Detail & Related papers (2021-11-25T18:02:12Z)