Unified Long Video Inpainting and Outpainting via Overlapping High-Order Co-Denoising
- URL: http://arxiv.org/abs/2511.03272v1
- Date: Wed, 05 Nov 2025 08:09:03 GMT
- Title: Unified Long Video Inpainting and Outpainting via Overlapping High-Order Co-Denoising
- Authors: Shuangquan Lyu, Steven Mao, Yue Ma
- Abstract summary: We introduce a novel and unified approach for long video inpainting and outpainting that extends text-to-video diffusion models. Our method leverages LoRA to efficiently fine-tune a large pre-trained video diffusion model, such as Alibaba's Wan 2.1, for masked-region video synthesis. In contrast to prior work that struggles with fixed-length clips or exhibits stitching artifacts, our system enables arbitrarily long video generation and editing without noticeable seams or drift.
- Score: 3.6045678816599387
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generating long videos remains a fundamental challenge, and achieving high controllability in video inpainting and outpainting is particularly demanding. To address both challenges simultaneously, we introduce a novel and unified approach that extends text-to-video diffusion models to generate arbitrarily long, spatially edited videos with high fidelity. Our method leverages LoRA to efficiently fine-tune a large pre-trained video diffusion model, such as Alibaba's Wan 2.1, for masked-region video synthesis, and employs an overlap-and-blend temporal co-denoising strategy with high-order solvers to maintain consistency across long sequences. In contrast to prior work that struggles with fixed-length clips or exhibits stitching artifacts, our system enables arbitrarily long video generation and editing without noticeable seams or drift. We validate our approach on challenging inpainting/outpainting tasks, including editing or adding objects over hundreds of frames, and demonstrate superior performance to baselines such as the Wan 2.1 model and VACE in quality (PSNR/SSIM) and perceptual realism (LPIPS). Our method enables practical long-range video editing with minimal overhead, striking a balance between parameter efficiency and performance.
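The overlap-and-blend co-denoising strategy described above can be illustrated with a short sketch. The following is a minimal NumPy illustration of the general technique under stated assumptions, not the authors' code: `denoise_window`, the window length, and the linear blend ramp are hypothetical stand-ins, and in the paper each window step would be taken by a high-order diffusion solver over the LoRA-fine-tuned model.

```python
import numpy as np

def blend_ramp(overlap: int) -> np.ndarray:
    """Strictly positive linear ramp in (0, 1); the ramps of two
    adjacent windows sum to exactly 1 over their shared frames."""
    return np.arange(1, overlap + 1) / (overlap + 1)

def window_weights(window_len: int, overlap: int) -> np.ndarray:
    """Per-frame blend weights: ramp up, plateau at 1, ramp down."""
    w = np.ones(window_len)
    r = blend_ramp(overlap)
    w[:overlap] = r
    w[-overlap:] = r[::-1]
    return w

def co_denoise_step(latents, denoise_window, window_len=16, overlap=4):
    """One denoising timestep over an arbitrarily long latent sequence.

    latents:        (T, C, H, W) noisy latents for all T frames.
    denoise_window: callable taking a (window_len, C, H, W) clip and
                    returning its denoised estimate at this timestep --
                    a hypothetical hook standing in for one step of a
                    high-order solver over the fine-tuned model.
    """
    T = latents.shape[0]
    acc = np.zeros_like(latents)
    norm = np.zeros((T, 1, 1, 1))
    stride = window_len - overlap
    starts = list(range(0, max(T - window_len, 0) + 1, stride))
    if starts[-1] + window_len < T:           # make sure the tail is covered
        starts.append(T - window_len)
    w = window_weights(window_len, overlap)[:, None, None, None]
    for s in starts:
        pred = denoise_window(latents[s:s + window_len])
        acc[s:s + window_len] += w * pred     # weighted accumulation
        norm[s:s + window_len] += w
    return acc / np.maximum(norm, 1e-8)       # weight-normalized blend
```

Because the blend runs inside every solver step rather than stitching finished clips afterwards, overlapping windows are repeatedly pulled toward agreement over the course of denoising, which is what suppresses seams and drift at clip boundaries.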
Related papers
- VidSplice: Towards Coherent Video Inpainting via Explicit Spaced Frame Guidance [57.57195766748601]
VidSplice is a novel framework that guides the inpainting process with temporal cues. We show that VidSplice achieves competitive performance across diverse video inpainting scenarios.
arXiv Detail & Related papers (2025-10-24T13:44:09Z) - OutDreamer: Video Outpainting with a Diffusion Transformer [37.512451098188635]
We introduce OutDreamer, a DiT-based video outpainting framework. We propose a mask-driven self-attention layer that dynamically integrates the given mask information. For long video outpainting, we employ a cross-video-clip refiner to iteratively generate missing content.
arXiv Detail & Related papers (2025-06-27T15:08:54Z) - VideoPainter: Any-length Video Inpainting and Editing with Plug-and-Play Context Control [47.34885131252508]
Video inpainting aims to restore corrupted video content. We propose a novel dual-stream paradigm, VideoPainter, to process masked videos. We also introduce a novel target region ID resampling technique that enables any-length video inpainting.
arXiv Detail & Related papers (2025-03-07T17:59:46Z) - UniPaint: Unified Space-time Video Inpainting via Mixture-of-Experts [20.955898491009652]
UniPaint is a generative space-time video inpainting framework that enables spatial-temporal inpainting and synthesis. UniPaint produces high-quality and aesthetically pleasing results, achieving the best quantitative results across various tasks and scale setups.
arXiv Detail & Related papers (2024-12-09T09:45:14Z) - FreeLong: Training-Free Long Video Generation with SpectralBlend Temporal Attention [57.651429116402554]
This paper investigates a straightforward and training-free approach to extend an existing short video diffusion model for consistent long video generation.
We find that directly applying the short video diffusion model to generate long videos can lead to severe video quality degradation.
Motivated by this, we propose a novel solution named FreeLong to balance the frequency distribution of long video features during the denoising process (a brief sketch of this style of frequency blending follows the list below).
arXiv Detail & Related papers (2024-07-29T11:52:07Z) - Raformer: Redundancy-Aware Transformer for Video Wire Inpainting [77.41727407673066]
Video Wire Inpainting (VWI) is a prominent application in video inpainting, aimed at flawlessly removing wires in films or TV series.
Wire removal poses greater challenges due to the wires being longer and slimmer than objects typically targeted in general video inpainting tasks.
We introduce a new VWI dataset with a novel mask generation strategy, namely Wire Removal Video dataset 2 (WRV2) and Pseudo Wire-Shaped (PWS) Masks.
The WRV2 dataset comprises over 4,000 videos with an average length of 80 frames, designed to facilitate the development of effective inpainting models.
arXiv Detail & Related papers (2024-04-24T11:02:13Z) - Towards Online Real-Time Memory-based Video Inpainting Transformers [95.90235034520167]
Inpainting tasks have seen significant improvements in recent years with the rise of deep neural networks and, in particular, vision transformers.
We propose a framework to adapt existing inpainting transformers to online, real-time constraints by memorizing and refining redundant computations.
Using this framework with some of the most recent inpainting models, we demonstrate strong online results with a consistent throughput above 20 frames per second.
arXiv Detail & Related papers (2024-03-24T14:02:25Z) - Be-Your-Outpainter: Mastering Video Outpainting through Input-Specific Adaptation [44.92712228326116]
Video outpainting is a challenging task, aiming at generating video content outside the viewport of the input video.
We introduce MOTIA (Mastering Video Outpainting Through Input-Specific Adaptation).
MOTIA comprises two main phases: input-specific adaptation and pattern-aware outpainting.
arXiv Detail & Related papers (2024-03-20T16:53:45Z) - AVID: Any-Length Video Inpainting with Diffusion Model [30.860927136236374]
We introduce Any-Length Video Inpainting with Diffusion Model, dubbed AVID.
Our model is equipped with effective motion modules and adjustable structure guidance for fixed-length video inpainting.
Our experiments show that our model can robustly handle various inpainting types across different video duration ranges with high quality.
arXiv Detail & Related papers (2023-12-06T18:56:14Z) - Latent Video Diffusion Models for High-Fidelity Long Video Generation [58.346702410885236]
We introduce lightweight video diffusion models using a low-dimensional 3D latent space.
We also propose hierarchical diffusion in the latent space such that longer videos with more than one thousand frames can be produced.
Our framework generates more realistic and longer videos than previous strong baselines.
arXiv Detail & Related papers (2022-11-23T18:58:39Z)