GlobalPaint: Spatiotemporal Coherent Video Outpainting with Global Feature Guidance
- URL: http://arxiv.org/abs/2601.06413v1
- Date: Sat, 10 Jan 2026 03:20:26 GMT
- Title: GlobalPaint: Spatiotemporal Coherent Video Outpainting with Global Feature Guidance
- Authors: Yueming Pan, Ruoyu Feng, Jianmin Bao, Chong Luo, Nanning Zheng,
- Abstract summary: Video outpainting requires not only per-frame plausibility but also long-range temporal coherence. We propose GlobalPaint, a diffusion-based framework for spatiotemporally coherent video outpainting.
- Score: 65.1747900492124
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video outpainting extends a video beyond its original boundaries by synthesizing missing border content. Compared with image outpainting, it requires not only per-frame spatial plausibility but also long-range temporal coherence, especially when outpainted content becomes visible across time under camera or object motion. We propose GlobalPaint, a diffusion-based framework for spatiotemporal coherent video outpainting. Our approach adopts a hierarchical pipeline that first outpaints key frames and then completes intermediate frames via an interpolation model conditioned on the completed boundaries, reducing error accumulation in sequential processing. At the model level, we augment a pretrained image inpainting backbone with (i) an Enhanced Spatial-Temporal module featuring 3D windowed attention for stronger spatiotemporal interaction, and (ii) global feature guidance that distills OpenCLIP features from observed regions across all frames into compact global tokens using a dedicated extractor. Comprehensive evaluations on benchmark datasets demonstrate improved reconstruction quality and more natural motion compared to prior methods. Our demo page is https://yuemingpan.github.io/GlobalPaint/
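The abstract's Enhanced Spatial-Temporal module uses 3D windowed attention, i.e. self-attention restricted to non-overlapping spatiotemporal windows over the video token grid. The paper does not publish this code here, so the following is only a minimal numpy sketch of the general technique: the window size, single-head attention, and the absence of window shifting or relative position bias are all simplifying assumptions, not GlobalPaint's actual implementation.

```python
import numpy as np

def window_partition_3d(x, window):
    """Split a (T, H, W, C) token grid into non-overlapping 3D windows.

    Returns an array of shape (num_windows, wt*wh*ww, C); T, H, W must be
    divisible by the corresponding window sizes.
    """
    T, H, W, C = x.shape
    wt, wh, ww = window
    x = x.reshape(T // wt, wt, H // wh, wh, W // ww, ww, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    return x.reshape(-1, wt * wh * ww, C)

def window_reverse_3d(windows, window, shape):
    """Inverse of window_partition_3d: reassemble the (T, H, W, C) grid."""
    T, H, W, C = shape
    wt, wh, ww = window
    x = windows.reshape(T // wt, H // wh, W // ww, wt, wh, ww, C)
    x = x.transpose(0, 3, 1, 4, 2, 5, 6)
    return x.reshape(T, H, W, C)

def windowed_self_attention_3d(x, window=(2, 4, 4)):
    """Single-head scaled dot-product attention within each 3D window."""
    windows = window_partition_3d(x, window)           # (N, L, C)
    scale = windows.shape[-1] ** -0.5
    scores = windows @ windows.transpose(0, 2, 1) * scale
    scores -= scores.max(axis=-1, keepdims=True)       # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)           # softmax over keys
    out = attn @ windows
    return window_reverse_3d(out, window, x.shape)

x = np.random.randn(4, 8, 8, 16)                       # (T, H, W, C) tokens
y = windowed_self_attention_3d(x)
print(y.shape)  # (4, 8, 8, 16)
```

Because attention is confined to windows of size wt*wh*ww rather than the full T*H*W sequence, cost drops from quadratic in the whole video to quadratic only in the window, which is what makes spatiotemporal attention affordable at video resolution.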
Related papers
- VidSplice: Towards Coherent Video Inpainting via Explicit Spaced Frame Guidance [57.57195766748601]
VidSplice is a novel framework that guides the inpainting process with temporal cues. We show that VidSplice achieves competitive performance across diverse video inpainting scenarios.
arXiv Detail & Related papers (2025-10-24T13:44:09Z) - VideoCanvas: Unified Video Completion from Arbitrary Spatiotemporal Patches via In-Context Conditioning [38.89828994130979]
We introduce the task of arbitrary spatiotemporal video completion, where a video is generated from arbitrary, user-specified patches placed at any spatial location and timestamp, akin to painting on a video canvas. This flexible formulation unifies many existing controllable video generation tasks--including first-frame image-to-video, inpainting, extension, and interpolation--under a single paradigm. We develop VideoCanvas, a novel framework that adapts the In-Context Conditioning (ICC) paradigm to this fine-grained control task with zero new parameters.
arXiv Detail & Related papers (2025-10-09T17:58:59Z) - OutDreamer: Video Outpainting with a Diffusion Transformer [37.512451098188635]
We introduce OutDreamer, a DiT-based video outpainting framework.
We propose a mask-driven self-attention layer that dynamically integrates the given mask information.
For long video outpainting, we employ a cross-video-clip refiner to iteratively generate missing content.
arXiv Detail & Related papers (2025-06-27T15:08:54Z) - Hierarchical Masked 3D Diffusion Model for Video Outpainting [20.738731220322176]
We introduce a masked 3D diffusion model for video outpainting.
This allows us to use multiple guide frames to connect the results of multiple video clip inferences.
We also introduce a hybrid coarse-to-fine inference pipeline to alleviate the artifact accumulation problem.
arXiv Detail & Related papers (2023-09-05T10:52:21Z) - Unsupervised Video Interpolation by Learning Multilayered 2.5D Motion Fields [75.81417944207806]
This paper presents a self-supervised approach to video frame interpolation that requires only a single video.
We parameterize the video motions by solving an ordinary differential equation (ODE) defined on a time-varying motion field.
This implicit neural representation learns the video as a space-time continuum, allowing frame interpolation at any temporal resolution.
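Parameterizing motion as an ODE means a point's trajectory is obtained by integrating dp/dt = v(p, t) through a time-varying velocity field, so positions can be queried at any continuous time. The sketch below is only an illustration of that idea with forward-Euler integration and a toy rotation field; the field, step count, and integrator are assumptions, not the paper's learned 2.5D motion fields.

```python
import numpy as np

def integrate_motion_ode(p0, velocity, t0, t1, steps=2000):
    """Forward-Euler integration of dp/dt = velocity(p, t).

    Returns the point's position at time t1; smaller steps (larger
    `steps`) give a more accurate trajectory.
    """
    p = np.asarray(p0, dtype=float)
    dt = (t1 - t0) / steps
    t = t0
    for _ in range(steps):
        p = p + dt * velocity(p, t)  # Euler step along the motion field
        t += dt
    return p

# Toy time-varying motion field: rigid rotation about the origin,
# dp/dt = (-y, x), whose exact flow rotates points at unit angular speed.
def rotation_field(p, t):
    x, y = p
    return np.array([-y, x])

# Integrating (1, 0) for a quarter period should land near (0, 1).
p1 = integrate_motion_ode([1.0, 0.0], rotation_field, 0.0, np.pi / 2)
print(p1)  # approximately [0, 1]
```

Because the time variable is continuous, the same integration can stop at any intermediate t, which is what lets such a representation synthesize frames at arbitrary temporal resolution.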
arXiv Detail & Related papers (2022-04-21T06:17:05Z) - Occlusion-Aware Video Object Inpainting [72.38919601150175]
This paper presents occlusion-aware video object inpainting, which recovers both the complete shape and appearance for occluded objects in videos.
Our technical contribution VOIN jointly performs video object shape completion and occluded texture generation.
For more realistic results, VOIN is optimized using both a T-PatchGAN and a new spatio-temporal attention-based multi-class discriminator.
arXiv Detail & Related papers (2021-08-15T15:46:57Z) - Short-Term and Long-Term Context Aggregation Network for Video Inpainting [126.06302824297948]
Video inpainting aims to restore missing regions of a video and has many applications such as video editing and object removal.
We present a novel context aggregation network to effectively exploit both short-term and long-term frame information for video inpainting.
Experiments show that it outperforms state-of-the-art methods with better inpainting results and fast inpainting speed.
arXiv Detail & Related papers (2020-09-12T03:50:56Z) - Learning Joint Spatial-Temporal Transformations for Video Inpainting [58.939131620135235]
We propose to learn a joint Spatial-Temporal Transformer Network (STTN) for video inpainting.
We simultaneously fill missing regions in all input frames by self-attention, and propose to optimize STTN by a spatial-temporal adversarial loss.
arXiv Detail & Related papers (2020-07-20T16:35:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.