SAGE: Structure-Aware Generative Video Transitions between Diverse Clips
- URL: http://arxiv.org/abs/2510.24667v1
- Date: Tue, 28 Oct 2025 17:35:02 GMT
- Title: SAGE: Structure-Aware Generative Video Transitions between Diverse Clips
- Authors: Mia Kan, Yilin Liu, Niloy Mitra
- Abstract summary: SAGE (Structure-Aware Generative vidEo transitions) is a zeroshot approach that combines structural guidance, provided via line maps and motion flow, with generative synthesis, enabling smooth, semantically consistent transitions between diverse clips without fine-tuning.
- Score: 7.501790515877048
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video transitions aim to synthesize intermediate frames between two clips, but naive approaches such as linear blending introduce artifacts that limit professional use or break temporal coherence. Traditional techniques (cross-fades, morphing, frame interpolation) and recent generative inbetweening methods can produce high-quality plausible intermediates, but they struggle with bridging diverse clips involving large temporal gaps or significant semantic differences, leaving a gap for content-aware and visually coherent transitions. We address this challenge by drawing on artistic workflows, distilling strategies such as aligning silhouettes and interpolating salient features to preserve structure and perceptual continuity. Building on this, we propose SAGE (Structure-Aware Generative vidEo transitions) as a zeroshot approach that combines structural guidance, provided via line maps and motion flow, with generative synthesis, enabling smooth, semantically consistent transitions without fine-tuning. Extensive experiments and comparison with current alternatives, namely [FILM, TVG, DiffMorpher, VACE, GI], demonstrate that SAGE outperforms both classical and generative baselines on quantitative metrics and user studies for producing transitions between diverse clips. Code to be released on acceptance.
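For orientation, the "linear blending" that the abstract names as a naive baseline can be sketched as a plain cross-fade. This is a minimal illustration, not code from the paper: the function name `linear_crossfade`, its parameters, and the toy frames are assumptions made here, and only the last frame of the first clip and the first frame of the second clip are blended.

```python
import numpy as np

def linear_crossfade(last_frame, first_frame, num_frames):
    """Blend two frames with linearly increasing weights (a plain cross-fade).

    Hypothetical helper for illustration only; it is not part of SAGE.
    """
    a = last_frame.astype(np.float32)
    b = first_frame.astype(np.float32)
    # Use interior weights only, so the two input frames are not duplicated.
    alphas = np.linspace(0.0, 1.0, num_frames + 2)[1:-1]
    return [((1.0 - t) * a + t * b).round().astype(np.uint8) for t in alphas]

# Toy 2x2 RGB frames: one black, one white.
black = np.zeros((2, 2, 3), dtype=np.uint8)
white = np.full((2, 2, 3), 255, dtype=np.uint8)
frames = linear_crossfade(black, white, num_frames=3)
```

Because every pixel is averaged independently, structures that do not line up across the two clips show up as ghosted double exposures, which is the kind of artifact the abstract attributes to naive blending.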
Related papers
- STAGE: Storyboard-Anchored Generation for Cinematic Multi-shot Narrative [55.05324155854762]
We introduce a SToryboard-Anchored GEneration workflow to reformulate the STAGE-based video generation task. Instead of using sparses, we propose STEP2 to predict a structural storyboard composed of start-end frame pairs for each shot. We also contribute the large-scale ConStoryBoard dataset, including high-quality movie clips with fine-grained narratives for story progression, cinematic attributes, and human preferences.
arXiv Detail & Related papers (2025-12-13T15:57:29Z)
- Beyond Boundary Frames: Audio-Visual Semantic Guidance for Context-Aware Video Interpolation [14.00347197658315]
BBF is a context-aware video frame interpolation framework guided by audio/visual semantics. We show that BBF outperforms specialized state-of-the-art methods on both generic and audio-visual synchronized tasks.
arXiv Detail & Related papers (2025-12-03T09:22:13Z)
- LoViC: Efficient Long Video Generation with Context Compression [68.22069741704158]
We introduce LoViC, a DiT-based framework trained on million-scale open-domain videos. At the core of our approach is FlexFormer, an expressive autoencoder that jointly compresses video and text into unified latent representations.
arXiv Detail & Related papers (2025-07-17T09:46:43Z)
- SOYO: A Tuning-Free Approach for Video Style Morphing via Style-Adaptive Interpolation in Diffusion Models [54.641809532055916]
We introduce SOYO, a novel diffusion-based framework for video style morphing. Our method employs a pre-trained text-to-image diffusion model without fine-tuning, combining attention injection and AdaIN to preserve structural consistency. To harmonize across video frames, we propose a novel adaptive sampling scheduler between two style images.
arXiv Detail & Related papers (2025-03-10T07:27:01Z)
- Text2Story: Advancing Video Storytelling with Text Guidance [19.901781116843942]
We introduce a novel storytelling framework that achieves this by integrating scene and action prompts through dynamics-inspired prompt mixing. We propose a dynamics-informed prompt weighting mechanism that adaptively balances the influence of scene and action prompts at each diffusion timestep. To further enhance motion continuity, we incorporate a semantic action representation to encode high-level action semantics into the blending process.
arXiv Detail & Related papers (2025-03-08T19:04:36Z)
- RepVideo: Rethinking Cross-Layer Representation for Video Generation [53.701548524818534]
We propose RepVideo, an enhanced representation framework for text-to-video diffusion models. By accumulating features from neighboring layers to form enriched representations, this approach captures more stable semantic information. Our experiments demonstrate that our RepVideo not only significantly enhances the ability to generate accurate spatial appearances, but also improves temporal consistency in video generation.
arXiv Detail & Related papers (2025-01-15T18:20:37Z)
- Generative Inbetweening through Frame-wise Conditions-Driven Video Generation [63.43583844248389]
Generative inbetweening aims to generate intermediate frame sequences by utilizing two key frames as input. We propose a Frame-wise Conditions-driven Video Generation (FCVG) method that significantly enhances the temporal stability of interpolated video frames. Our FCVG demonstrates the capability to generate temporally stable videos using both linear and non-linear curves.
arXiv Detail & Related papers (2024-12-16T13:19:41Z)
- Discrete to Continuous: Generating Smooth Transition Poses from Sign Language Observation [45.214169930573775]
We propose a conditional diffusion model to synthesize contextually smooth transition frames.
Our approach transforms the unsupervised problem of transition frame generation into a supervised training task.
Experiments on the PHOENIX14T, USTC-CSL100, and USTC-500 datasets demonstrate the effectiveness of our method.
arXiv Detail & Related papers (2024-11-25T15:06:49Z)
- TVG: A Training-free Transition Video Generation Method with Diffusion Models [12.037716102326993]
Transition videos play a crucial role in media production, enhancing the flow and coherence of visual narratives.
Recent advances in diffusion model-based video generation offer new possibilities for creating transitions but face challenges such as poor inter-frame relationship modeling and abrupt content changes.
We propose a novel training-free Transition Video Generation (TVG) approach using video-level diffusion models that addresses these limitations without additional training.
arXiv Detail & Related papers (2024-08-24T00:33:14Z)
- MAVIN: Multi-Action Video Generation with Diffusion Models via Transition Video Infilling [19.004339956475498]
MAVIN is designed to generate transition videos that seamlessly connect two given videos, forming a cohesive integrated sequence.
We introduce a new metric, CLIP-RS (CLIP Relative Smoothness), to evaluate temporal coherence and smoothness, complementing traditional quality-based metrics.
Experimental results on horse and tiger scenarios demonstrate MAVIN's superior performance in generating smooth and coherent video transitions.
arXiv Detail & Related papers (2024-05-28T09:46:09Z)
- Training-Free Semantic Video Composition via Pre-trained Diffusion Model [96.0168609879295]
Current approaches, predominantly trained on videos with adjusted foreground color and lighting, struggle to address deep semantic disparities beyond superficial adjustments.
We propose a training-free pipeline employing a pre-trained diffusion model imbued with semantic prior knowledge.
Experimental results reveal that our pipeline successfully ensures the visual harmony and inter-frame coherence of the outputs.
arXiv Detail & Related papers (2024-01-17T13:07:22Z)
- Hierarchical Local-Global Transformer for Temporal Sentence Grounding [58.247592985849124]
This paper studies the multimedia problem of temporal sentence grounding.
It aims to accurately determine the specific video segment in an untrimmed video according to a given sentence query.
arXiv Detail & Related papers (2022-08-31T14:16:56Z)