Enhancing Scene Transition Awareness in Video Generation via Post-Training
- URL: http://arxiv.org/abs/2507.18046v1
- Date: Thu, 24 Jul 2025 02:50:26 GMT
- Title: Enhancing Scene Transition Awareness in Video Generation via Post-Training
- Authors: Hanwen Shen, Jiajie Lu, Yupeng Cao, Xiaonan Yang
- Abstract summary: We propose the **Transition-Aware Video** (TAV) dataset, which consists of preprocessed video clips with multiple scene transitions. Our experiment shows that post-training on the **TAV** dataset improves prompt-based scene transition understanding, narrows the gap between required and generated scenes, and maintains image quality.
- Score: 0.4199844472131921
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in AI-generated video have shown strong performance on *text-to-video* tasks, particularly for short clips depicting a single scene. However, current models struggle to generate longer videos with coherent scene transitions, primarily because they cannot infer when a transition is needed from the prompt. Most open-source models are trained on datasets consisting of single-scene video clips, which limits their capacity to learn and respond to prompts requiring multiple scenes. Developing scene transition awareness is essential for multi-scene generation, as it allows models to identify and segment videos into distinct clips by accurately detecting transitions. To address this, we propose the **Transition-Aware Video** (TAV) dataset, which consists of preprocessed video clips with multiple scene transitions. Our experiment shows that post-training on the **TAV** dataset improves prompt-based scene transition understanding, narrows the gap between required and generated scenes, and maintains image quality.
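The abstract does not spell out how the TAV clips are preprocessed, but building such a dataset plausibly starts with automatic shot-boundary detection. Below is a minimal sketch of one standard heuristic, HSV-histogram differencing with OpenCV; the threshold and function names are illustrative assumptions, not the authors' pipeline.

```python
# Minimal sketch: split raw footage at scene transitions using HSV
# histogram differences. This is a generic shot-boundary heuristic, not
# the TAV authors' published pipeline; THRESHOLD is an assumed value.
import cv2

THRESHOLD = 0.6  # assumption: correlation below this marks a cut


def frame_histogram(frame):
    """Normalized 2D hue/saturation histogram of one BGR frame."""
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1], None, [50, 60], [0, 180, 0, 256])
    return cv2.normalize(hist, hist).flatten()


def find_transitions(video_path):
    """Return frame indices where a scene transition likely occurs."""
    cap = cv2.VideoCapture(video_path)
    transitions, prev_hist, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hist = frame_histogram(frame)
        # Low correlation between consecutive histograms suggests a cut.
        if prev_hist is not None and \
                cv2.compareHist(prev_hist, hist, cv2.HISTCMP_CORREL) < THRESHOLD:
            transitions.append(idx)
        prev_hist, idx = hist, idx + 1
    cap.release()
    return transitions
```

Clips containing at least one detected transition could then be kept as multi-scene training samples, matching the dataset description above.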
Related papers
- From Long Videos to Engaging Clips: A Human-Inspired Video Editing Framework with Multimodal Narrative Understanding [17.769963004697047]
We propose a human-inspired automatic video editing framework (HIVE). Our approach incorporates character extraction, dialogue analysis, and narrative summarization through multimodal large language models. Our framework consistently outperforms existing baselines across both general and advertisement-oriented editing tasks.
arXiv Detail & Related papers (2025-07-03T16:54:32Z)
- Long Context Tuning for Video Generation [63.060794860098795]
Long Context Tuning (LCT) is a training paradigm that expands the context window of pre-trained single-shot video diffusion models. Our method expands full attention mechanisms from individual shots to encompass all shots within a scene (see the mask sketch below). Experiments demonstrate coherent multi-shot scenes and exhibit emerging capabilities, including compositional generation and interactive shot extension.
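The central change described here, widening attention from per-shot blocks to a whole scene, can be pictured as a boolean attention mask. This is a hedged sketch inferred from the summary alone; the token layout and shot/scene ids are assumptions, not the paper's code.

```python
# Sketch: per-shot block attention vs. scene-wide attention (LCT-style).
# Token layout and shot/scene ids are illustrative assumptions.
import torch


def attention_mask(shot_ids: torch.Tensor, scene_ids: torch.Tensor,
                   scene_wide: bool) -> torch.Tensor:
    """Boolean [T, T] mask; entry (i, j) is True if token i may attend to j.

    shot_ids / scene_ids: [T] tensors mapping each token to its shot/scene.
    """
    if scene_wide:
        # LCT-style: every token attends to all tokens in the same scene.
        return scene_ids[:, None] == scene_ids[None, :]
    # Single-shot baseline: attention stays inside each shot block.
    return shot_ids[:, None] == shot_ids[None, :]


# Three 2-token shots, all in one scene: the baseline mask is block-
# diagonal, while the scene-wide mask is all-True, so shots share context.
shots = torch.tensor([0, 0, 1, 1, 2, 2])
scenes = torch.zeros(6, dtype=torch.long)
print(attention_mask(shots, scenes, scene_wide=False))
print(attention_mask(shots, scenes, scene_wide=True))
```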
arXiv Detail & Related papers (2025-03-13T17:40:07Z)
- Contrastive Sequential-Diffusion Learning: Non-linear and Multi-Scene Instructional Video Synthesis [9.687215124767063]
We propose a contrastive sequential video diffusion method that selects the most suitable previously generated scene to guide and condition the denoising process of the next scene (a selection sketch follows below). Experiments with action-centered data from the real world demonstrate the practicality and improved consistency of our model compared to previous work.
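The selection step reads like a nearest-neighbor lookup in an embedding space. In the sketch below, cosine similarity is an assumed stand-in for the paper's learned contrastive score.

```python
# Sketch: choose the previously generated scene most similar to the next
# prompt, then condition the next denoising run on it. Cosine similarity
# stands in for the paper's learned contrastive score (an assumption).
import torch
import torch.nn.functional as F


def select_conditioning_scene(prev_scene_embs: torch.Tensor,
                              next_prompt_emb: torch.Tensor) -> int:
    """prev_scene_embs: [N, D]; next_prompt_emb: [D]. Returns best index."""
    sims = F.cosine_similarity(prev_scene_embs, next_prompt_emb[None, :], dim=1)
    return int(sims.argmax())


# Instead of always chaining from the most recent scene, the generator
# can branch from whichever earlier scene best matches the new prompt.
scene_embs = torch.randn(4, 512)
prompt_emb = torch.randn(512)
print(select_conditioning_scene(scene_embs, prompt_emb))
```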
arXiv Detail & Related papers (2024-07-16T15:03:05Z)
- Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the difficulty of modeling video dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is capable of both comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z)
- VideoStudio: Generating Consistent-Content and Multi-Scene Videos [88.88118783892779]
VideoStudio is a framework for consistent-content and multi-scene video generation.
VideoStudio leverages Large Language Models (LLM) to convert the input prompt into a comprehensive multi-scene script (sketched after this summary).
VideoStudio outperforms the SOTA video generation models in terms of visual quality, content consistency, and user preference.
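A rough picture of that prompt-to-script step, with `call_llm` as a hypothetical stand-in for any chat-completion API and the scene schema as my own assumption, not VideoStudio's actual format:

```python
# Sketch: prompt -> multi-scene script. `call_llm` is a hypothetical
# stand-in for any chat-completion API, and the scene schema is an
# assumption, not VideoStudio's actual format.
import json

SCRIPT_INSTRUCTION = (
    "Expand the user's prompt into a JSON list of scenes, each with "
    "'description', 'foreground', and 'background' fields."
)


def call_llm(system: str, user: str) -> str:
    # Placeholder: wire up a real chat-completion API here. The canned
    # reply below just mimics the expected JSON shape for demonstration.
    return json.dumps([
        {"description": "A dog wakes up at dawn",
         "foreground": "dog", "background": "bedroom"},
        {"description": "The dog runs along the beach",
         "foreground": "dog", "background": "beach"},
    ])


def prompt_to_script(prompt: str) -> list:
    """Convert a single text prompt into a multi-scene script."""
    return json.loads(call_llm(SCRIPT_INSTRUCTION, prompt))


# Each scene entry can then drive one single-scene generator, while the
# shared foreground entity keeps content consistent across scenes.
print(prompt_to_script("a dog's morning"))
```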
arXiv Detail & Related papers (2024-01-02T15:56:48Z)
- SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction [93.26613503521664]
This paper presents a short-to-long video diffusion model, SEINE, that focuses on generative transition and prediction.
We propose a random-mask video diffusion model to automatically generate transitions based on textual descriptions (a masking sketch follows below).
Our model generates transition videos that ensure coherence and visual quality.
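Read from the abstract, the core trick is hiding intermediate frames so the model learns to inpaint a transition between visible endpoints. A minimal masking sketch follows; tensor shapes and the keep-endpoints rule are assumptions.

```python
# Sketch: random frame masking for transition generation (SEINE-style).
# Tensor shapes and the keep-endpoints rule are illustrative assumptions.
import torch


def random_transition_mask(video: torch.Tensor, keep_prob: float = 0.2):
    """video: [T, C, H, W]. Returns (masked_video, mask), where mask[t]
    is True for frames the model must generate."""
    t = video.shape[0]
    mask = torch.rand(t) > keep_prob   # True => frame is hidden
    mask[0], mask[-1] = False, False   # endpoints stay visible as context
    masked = video.clone()
    masked[mask] = 0.0                 # zero out the frames to generate
    return masked, mask


# A diffusion model would be trained to denoise the hidden frames,
# conditioned on the visible endpoints and the text description.
video = torch.randn(16, 3, 32, 32)
masked, mask = random_transition_mask(video)
print(mask)
```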
arXiv Detail & Related papers (2023-10-31T17:58:17Z)
- HierVL: Learning Hierarchical Video-Language Embeddings [108.77600799637172]
HierVL is a novel hierarchical video-language embedding that simultaneously accounts for both long-term and short-term associations.
We introduce a hierarchical contrastive training objective that encourages text-visual alignment at both the clip level and video level (a loss sketch follows below).
Our hierarchical scheme yields a clip representation that outperforms its single-level counterpart as well as a long-term video representation that achieves state-of-the-art results.
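The two-level objective can be written as a sum of clip-level and video-level contrastive terms. The sketch uses a standard InfoNCE form and mean pooling as assumed stand-ins for the paper's exact formulation.

```python
# Sketch: hierarchical contrastive objective (HierVL-style). InfoNCE and
# mean pooling are assumed stand-ins for the paper's exact formulation.
import torch
import torch.nn.functional as F


def info_nce(a: torch.Tensor, b: torch.Tensor, tau: float = 0.07):
    """Symmetric InfoNCE over matched rows of a and b, both [N, D]."""
    logits = F.normalize(a, dim=1) @ F.normalize(b, dim=1).T / tau
    labels = torch.arange(a.shape[0])
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2


def hierarchical_loss(clip_feats, clip_texts, video_texts, clips_per_video):
    """clip_feats, clip_texts: [N, D]; video_texts: [N / clips_per_video, D]."""
    clip_level = info_nce(clip_feats, clip_texts)
    # Video level: mean-pool each video's clips, align with its summary text.
    video_feats = clip_feats.view(-1, clips_per_video,
                                  clip_feats.shape[1]).mean(dim=1)
    video_level = info_nce(video_feats, video_texts)
    return clip_level + video_level


loss = hierarchical_loss(torch.randn(8, 256), torch.randn(8, 256),
                         torch.randn(2, 256), clips_per_video=4)
print(loss)
```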
arXiv Detail & Related papers (2023-01-05T21:53:19Z)
- AutoTransition: Learning to Recommend Video Transition Effects [20.384463765702417]
We present the premier work on automatic video transition recommendation (VTR).
In VTR, given a sequence of raw video shots and companion audio, the goal is to recommend a video transition for each pair of neighboring shots.
We propose a novel multi-modal matching framework which consists of two parts.
arXiv Detail & Related papers (2022-07-27T12:00:42Z)
- Scene Consistency Representation Learning for Video Scene Segmentation [26.790491577584366]
We propose an effective Self-Supervised Learning (SSL) framework to learn better shot representations from long-term videos.
We present an SSL scheme to achieve scene consistency, while exploring extensive data augmentation and shuffling methods to boost model generalizability.
Our method achieves state-of-the-art performance on the task of Video Scene Segmentation.
arXiv Detail & Related papers (2022-05-11T13:31:15Z)
- Beyond Short Clips: End-to-End Video-Level Learning with Collaborative Memories [56.91664227337115]
We introduce a collaborative memory mechanism that encodes information across multiple sampled clips of a video at each training iteration (sketched after this summary).
This enables the learning of long-range dependencies beyond a single clip.
Our proposed framework is end-to-end trainable and significantly improves the accuracy of video classification at a negligible computational overhead.
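One way to picture the mechanism: pool features from all sampled clips of a video into a shared memory and let each clip's classifier read from it. This is a rough sketch inferred from the abstract, not the paper's architecture.

```python
# Sketch: collaborative memory across sampled clips of one video.
# Mean pooling and the fusion layer are illustrative assumptions.
import torch
import torch.nn as nn


class CollaborativeMemoryHead(nn.Module):
    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.fuse = nn.Linear(2 * dim, dim)
        self.classify = nn.Linear(dim, num_classes)

    def forward(self, clip_feats: torch.Tensor) -> torch.Tensor:
        """clip_feats: [num_clips, D], features of clips from ONE video.
        Every clip's prediction is conditioned on a shared memory."""
        memory = clip_feats.mean(dim=0, keepdim=True)    # video-level memory
        memory = memory.expand(clip_feats.shape[0], -1)  # broadcast to clips
        fused = torch.relu(self.fuse(torch.cat([clip_feats, memory], dim=1)))
        return self.classify(fused)                      # [num_clips, C]


head = CollaborativeMemoryHead(dim=128, num_classes=10)
logits = head(torch.randn(4, 128))  # 4 clips sampled from one video
print(logits.shape)
```

Because the memory is built inside the forward pass, gradients flow across clips, which is what allows dependencies beyond a single clip to be learned end-to-end.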
arXiv Detail & Related papers (2021-04-02T18:59:09Z)