VidSplice: Towards Coherent Video Inpainting via Explicit Spaced Frame Guidance
- URL: http://arxiv.org/abs/2510.21461v1
- Date: Fri, 24 Oct 2025 13:44:09 GMT
- Title: VidSplice: Towards Coherent Video Inpainting via Explicit Spaced Frame Guidance
- Authors: Ming Xie, Junqiu Yu, Qiaole Dong, Xiangyang Xue, Yanwei Fu
- Abstract summary: VidSplice is a novel framework that guides the inpainting process with spatiotemporal cues. We show that VidSplice achieves competitive performance across diverse video inpainting scenarios.
- Score: 57.57195766748601
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent video inpainting methods often employ image-to-video (I2V) priors to model temporal consistency across masked frames. While effective in moderate cases, these methods struggle under severe content degradation and tend to overlook spatiotemporal stability, resulting in insufficient control over the latter parts of the video. To address these limitations, we decouple video inpainting into two sub-tasks: multi-frame consistent image inpainting and masked area motion propagation. We propose VidSplice, a novel framework that introduces spaced-frame priors to guide the inpainting process with spatiotemporal cues. To enhance spatial coherence, we design a CoSpliced Module that performs a first-frame propagation strategy, diffusing the initial frame content into subsequent reference frames through a splicing mechanism. Additionally, we introduce a delicate context controller module that encodes coherent priors after frame duplication and injects the spliced video into the I2V generative backbone, effectively constraining content distortion during generation. Extensive evaluations demonstrate that VidSplice achieves competitive performance across diverse video inpainting scenarios. Moreover, its design significantly improves both foreground alignment and motion stability, outperforming existing approaches.
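The decomposition above lends itself to a small illustration. Below is a minimal NumPy sketch of the spaced-frame prior construction: the first frame is inpainted, its content is spliced into the masked regions of spaced reference frames, and references are duplicated forward to fill the clip before it is handed to the I2V backbone. The function names and the duplication scheme are illustrative assumptions; the actual CoSpliced Module and context controller are learned components.

```python
import numpy as np

def inpaint_first_frame(frame, mask):
    # Hypothetical stand-in for the image-inpainting sub-task; a real
    # system would call an image inpainting model here.
    return frame * (1.0 - mask[..., None])

def build_spliced_prior(frames, masks, spacing=4):
    """Build a spaced-frame conditioning video from a masked clip.

    frames: (T, H, W, 3) float array; masks: (T, H, W) with 1 = missing.
    Every `spacing`-th frame acts as a reference: the (inpainted) first
    frame's content is spliced into each reference's masked region, then
    references are duplicated so the backbone sees a dense, coherent prior.
    """
    T = frames.shape[0]
    first = inpaint_first_frame(frames[0], masks[0])
    spliced = frames.copy()
    for t in range(0, T, spacing):
        m = masks[t][..., None]
        spliced[t] = (1.0 - m) * frames[t] + m * first   # splice into holes
    for t in range(T):
        # frame duplication: repeat the nearest earlier reference frame
        spliced[t] = spliced[(t // spacing) * spacing]
    return spliced
```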
Related papers
- Zero-Shot Video Translation and Editing with Frame Spatial-Temporal Correspondence [81.82643953694485]
We present FRESCO, which integrates intra-frame correspondence with inter-frame correspondence to formulate a more robust spatial-temporal constraint.
Our method goes beyond attention guidance to explicitly optimize features, achieving high spatial-temporal consistency with the input video.
We validate FRESCO on two zero-shot tasks: video-to-video translation and text-guided video editing.
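For intuition, here is a minimal sketch of the explicit feature-optimization idea using the intra-frame term: gradient-descend on edited features so their spatial self-similarity matches the input frame's. Shapes and hyperparameters are illustrative; this is not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def self_similarity(feat):
    """Cosine self-similarity of spatial features: (B, C, H, W) -> (B, HW, HW)."""
    f = F.normalize(feat.flatten(2), dim=1)        # (B, C, HW), unit channel norm
    return f.transpose(1, 2) @ f

def optimize_features(feat_edit, feat_input, steps=20, lr=0.05):
    """Nudge edited features so their spatial correspondence pattern
    matches the input frame's (the intra-frame constraint, in spirit)."""
    feat = feat_edit.detach().clone().requires_grad_(True)
    target = self_similarity(feat_input).detach()
    opt = torch.optim.Adam([feat], lr=lr)
    for _ in range(steps):
        loss = F.mse_loss(self_similarity(feat), target)
        opt.zero_grad(); loss.backward(); opt.step()
    return feat.detach()
```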
arXiv Detail & Related papers (2025-12-03T15:51:11Z)
- VideoCanvas: Unified Video Completion from Arbitrary Spatiotemporal Patches via In-Context Conditioning [38.89828994130979]
We introduce the task of arbitrary spatiotemporal video completion, where a video is generated from arbitrary, user-specified patches placed at any location and timestamp, akin to painting on a video canvas.
This flexible formulation unifies many existing controllable video generation tasks--including first-frame image-to-video, inpainting, extension, and interpolation--under a single, cohesive paradigm.
We develop VideoCanvas, a novel framework that adapts the In-Context Conditioning (ICC) paradigm to this fine-grained control task with zero new parameters.
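A toy sketch of the task setup, under assumed tensor layouts (not VideoCanvas's actual interface): user patches are pasted onto an empty video canvas together with a mask marking conditioned pixels.

```python
import numpy as np

def make_canvas(T, H, W, patches):
    """patches: list of (t, y, x, rgb_patch), rgb_patch shaped (h, w, 3).

    Returns the canvas video plus a mask marking conditioned pixels; a
    completion model is asked to synthesize everything where mask == 0.
    """
    canvas = np.zeros((T, H, W, 3), dtype=np.float32)
    mask = np.zeros((T, H, W), dtype=np.float32)
    for t, y, x, patch in patches:
        h, w = patch.shape[:2]
        canvas[t, y:y + h, x:x + w] = patch
        mask[t, y:y + h, x:x + w] = 1.0
    return canvas, mask

# First-frame image-to-video falls out as the special case of one
# full-frame patch at t = 0:
#   make_canvas(T, H, W, [(0, 0, 0, first_frame)])
```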
arXiv Detail & Related papers (2025-10-09T17:58:59Z)
- OutDreamer: Video Outpainting with a Diffusion Transformer [37.512451098188635]
We introduce OutDreamer, a DiT-based video outpainting framework.
We propose a mask-driven self-attention layer that dynamically integrates the given mask information.
For long video outpainting, we employ a cross-video-clip refiner to iteratively generate missing content.
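The summary does not spell out the layer design; the following is one plausible way a self-attention layer can dynamically integrate mask information, embedding each token's masked fraction and adding it to the tokens before attending. It is a sketch, not OutDreamer's actual layer.

```python
import torch
import torch.nn as nn

class MaskAwareSelfAttention(nn.Module):
    """Self-attention conditioned on per-token mask information."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mask_embed = nn.Linear(1, dim)   # embeds each token's masked fraction
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens, mask_ratio):
        # tokens: (B, N, dim); mask_ratio: (B, N, 1), fraction of masked
        # pixels covered by each token (0 = fully known, 1 = fully missing)
        x = self.norm(tokens + self.mask_embed(mask_ratio))
        out, _ = self.attn(x, x, x, need_weights=False)
        return tokens + out                   # residual connection
```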
arXiv Detail & Related papers (2025-06-27T15:08:54Z)
- Semantically Consistent Video Inpainting with Conditional Diffusion Models [16.42354856518832]
We present a framework for solving video inpainting problems with conditional video diffusion models.
We introduce inpainting-specific sampling schemes which capture crucial long-range dependencies in the context.
We devise a novel method for conditioning on the known pixels in incomplete frames.
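The paper's conditioning method is its own; for context, here is the standard replacement-based baseline such methods build on (cf. RePaint): at every sampling step, the known pixels are re-noised to the current noise level and pasted into the sample. `model` is an assumed noise-prediction network and `alphas_cumprod` the usual DDPM schedule; both are placeholders.

```python
import torch

@torch.no_grad()
def inpaint_ddim(model, x, known, mask, alphas_cumprod, timesteps):
    """DDIM (eta=0) sampling with replacement-based known-pixel conditioning.

    x: initial Gaussian noise; known: clean video; mask: 1 where pixels are
    known; timesteps: ascending list of ints, e.g. list(range(0, 1000, 20)).
    """
    ts = list(reversed(timesteps))                        # descending
    for t, t_prev in zip(ts, ts[1:] + [None]):
        a = alphas_cumprod[t]
        # re-noise the known pixels to the current noise level and paste them in
        known_t = a.sqrt() * known + (1 - a).sqrt() * torch.randn_like(known)
        x = mask * known_t + (1 - mask) * x
        eps = model(x, t)                                 # predicted noise (assumed)
        x0 = (x - (1 - a).sqrt() * eps) / a.sqrt()        # implied clean estimate
        a_prev = alphas_cumprod[t_prev] if t_prev is not None else x.new_tensor(1.0)
        x = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps
    return mask * known + (1 - mask) * x
```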
arXiv Detail & Related papers (2024-04-30T23:49:26Z)
- MoVideo: Motion-Aware Video Generation with Diffusion Models [97.03352319694795]
We propose a novel motion-aware video generation (MoVideo) framework that takes motion into consideration from two aspects: video depth and optical flow.
MoVideo achieves state-of-the-art results in both text-to-video and image-to-video generation, showing promising prompt consistency, frame consistency and visual quality.
arXiv Detail & Related papers (2023-11-19T13:36:03Z)
- Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation [93.18163456287164]
This paper proposes a novel text-guided video-to-video translation framework to adapt image models to videos.
Our framework achieves global style and local texture temporal consistency at a low cost.
arXiv Detail & Related papers (2023-06-13T17:52:23Z)
- Control-A-Video: Controllable Text-to-Video Diffusion Models with Motion Prior and Reward Feedback Learning [50.60891619269651]
Control-A-Video is a controllable T2V diffusion model that can generate videos conditioned on text prompts and reference control maps like edge and depth maps.
We propose novel strategies to incorporate content prior and motion prior into the diffusion-based generation process.
Our framework generates higher-quality, more consistent videos compared to existing state-of-the-art methods in controllable text-to-video generation.
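The summary does not detail the priors themselves; one common way to realize a motion prior in diffusion sampling, shown below as an illustrative sketch rather than Control-A-Video's exact strategy, is to make per-frame initial noise temporally correlated so that neighboring frames start from similar latents.

```python
import torch

def correlated_noise(T, shape, rho=0.9, generator=None):
    """Per-frame initial noise where frame t shares a component with frame t-1."""
    frames = [torch.randn(shape, generator=generator)]
    for _ in range(T - 1):
        fresh = torch.randn(shape, generator=generator)
        # rho**2 + (1 - rho**2) = 1 keeps unit variance per frame
        frames.append(rho * frames[-1] + (1 - rho ** 2) ** 0.5 * fresh)
    return torch.stack(frames)  # (T, *shape)
```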
arXiv Detail & Related papers (2023-05-23T09:03:19Z)
- TTVFI: Learning Trajectory-Aware Transformer for Video Frame Interpolation [50.49396123016185]
Video frame interpolation (VFI) aims to synthesize an intermediate frame between two consecutive frames.
We propose a novel Trajectory-aware Transformer for Video Frame Interpolation (TTVFI).
Our method outperforms other state-of-the-art methods in four widely-used VFI benchmarks.
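For reference, a minimal flow-based baseline that trajectory-aware attention improves on: backward-warp both input frames to the midpoint with linearly scaled flows and blend the results. The flow `flow_0to1` is assumed precomputed (e.g., by an off-the-shelf estimator); this sketch is not TTVFI itself.

```python
import torch
import torch.nn.functional as F

def backwarp(img, flow):
    """Backward-warp img (B,C,H,W) by flow (B,2,H,W), flow in pixel units."""
    B, _, H, W = img.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    coords = torch.stack((xs, ys), 0).to(img)[None] + flow   # sample positions
    gx = 2 * coords[:, 0] / (W - 1) - 1                      # normalize to [-1, 1]
    gy = 2 * coords[:, 1] / (H - 1) - 1
    return F.grid_sample(img, torch.stack((gx, gy), -1), align_corners=True)

def midpoint_frame(frame0, frame1, flow_0to1):
    # linear-motion approximation: F(0.5->0) ~ -0.5*F(0->1), F(0.5->1) ~ 0.5*F(0->1)
    from_0 = backwarp(frame0, -0.5 * flow_0to1)
    from_1 = backwarp(frame1, 0.5 * flow_0to1)
    return 0.5 * (from_0 + from_1)
```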
arXiv Detail & Related papers (2022-07-19T03:37:49Z)
- Deep Sketch-guided Cartoon Video Inbetweening [24.00033622396297]
We propose a framework to produce cartoon videos by fetching the color information from two input keyframes while following the animated motion guided by a user sketch.
By explicitly considering the correspondence between frames and the sketch, we can achieve higher quality results than other image synthesis methods.
arXiv Detail & Related papers (2020-08-10T14:22:04Z)