MTV-Inpaint: Multi-Task Long Video Inpainting
- URL: http://arxiv.org/abs/2503.11412v1
- Date: Fri, 14 Mar 2025 13:54:10 GMT
- Title: MTV-Inpaint: Multi-Task Long Video Inpainting
- Authors: Shiyuan Yang, Zheng Gu, Liang Hou, Xin Tao, Pengfei Wan, Xiaodong Chen, Jing Liao
- Abstract summary: Video inpainting involves modifying local regions within a video, ensuring spatial and temporal consistency. Recent advancements in text-to-video (T2V) diffusion models pave the way for text-guided video inpainting. We propose MTV-Inpaint, a unified multi-task video inpainting framework capable of handling both traditional scene completion and novel object insertion tasks.
- Score: 30.963300199975656
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video inpainting involves modifying local regions within a video, ensuring spatial and temporal consistency. Most existing methods focus primarily on scene completion (i.e., filling missing regions) and lack the capability to insert new objects into a scene in a controllable manner. Fortunately, recent advancements in text-to-video (T2V) diffusion models pave the way for text-guided video inpainting. However, directly adapting T2V models for inpainting remains limited in unifying completion and insertion tasks, lacks input controllability, and struggles with long videos, thereby restricting their applicability and flexibility. To address these challenges, we propose MTV-Inpaint, a unified multi-task video inpainting framework capable of handling both traditional scene completion and novel object insertion tasks. To unify these distinct tasks, we design a dual-branch spatial attention mechanism in the T2V diffusion U-Net, enabling seamless integration of scene completion and object insertion within a single framework. In addition to textual guidance, MTV-Inpaint supports multimodal control by integrating various image inpainting models through our proposed image-to-video (I2V) inpainting mode. Additionally, we propose a two-stage pipeline that combines keyframe inpainting with in-between frame propagation, enabling MTV-Inpaint to effectively handle long videos with hundreds of frames. Extensive experiments demonstrate that MTV-Inpaint achieves state-of-the-art performance in both scene completion and object insertion tasks. Furthermore, it demonstrates versatility in derived applications such as multi-modal inpainting, object editing, removal, image object brush, and the ability to handle long videos. Project page: https://mtv-inpaint.github.io/.
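The abstract describes the architecture only at a high level. As a rough illustration of the dual-branch spatial attention idea, the PyTorch-style sketch below routes each frame's spatial tokens through one of two task-specific self-attention branches inside an otherwise shared U-Net block; the class name, task flag, and residual wiring are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn


class DualBranchSpatialAttention(nn.Module):
    """Hypothetical sketch: one spatial self-attention branch per inpainting task."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        # Separate spatial attention branches for the two tasks the paper unifies.
        self.completion_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.insertion_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor, task: str) -> torch.Tensor:
        # x: (batch * frames, tokens, dim) -- spatial tokens of each frame.
        h = self.norm(x)
        if task == "completion":    # fill missing scene content
            out, _ = self.completion_attn(h, h, h)
        elif task == "insertion":   # synthesize a new object in the masked region
            out, _ = self.insertion_attn(h, h, h)
        else:
            raise ValueError(f"unknown task: {task!r}")
        return x + out              # residual connection back into the shared U-Net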
Related papers
- UniPaint: Unified Space-time Video Inpainting via Mixture-of-Experts [20.955898491009656]
UniPaint is a generative space-time video inpainting framework that enables spatial-temporal inpainting.
UniPaint produces high-quality and aesthetically pleasing results, achieving the best results across various tasks and scale setups.
arXiv Detail & Related papers (2024-12-09T09:45:14Z)
- MVInpainter: Learning Multi-View Consistent Inpainting to Bridge 2D and 3D Editing [90.30646271720919]
Novel View Synthesis (NVS) and 3D generation have recently made significant progress.
We propose MVInpainter, re-formulating the 3D editing as a multi-view 2D inpainting task.
MVInpainter partially inpaints multi-view images with the reference guidance rather than intractably generating an entirely novel view from scratch.
arXiv Detail & Related papers (2024-08-15T07:57:28Z)
- InVi: Object Insertion In Videos Using Off-the-Shelf Diffusion Models [46.587906540660455]
We introduce InVi, an approach for inserting or replacing objects within videos using off-the-shelf, text-to-image latent diffusion models.
InVi achieves realistic object insertion with consistent blending and coherence across frames, outperforming existing methods.
arXiv Detail & Related papers (2024-07-15T17:55:09Z)
- Raformer: Redundancy-Aware Transformer for Video Wire Inpainting [77.41727407673066]
Video Wire Inpainting (VWI) is a prominent application in video inpainting, aimed at flawlessly removing wires in films or TV series.
Wire removal poses greater challenges due to the wires being longer and slimmer than objects typically targeted in general video inpainting tasks.
We introduce a new VWI dataset with a novel mask generation strategy, namely Wire Removal Video dataset 2 (WRV2) and Pseudo Wire-Shaped (PWS) Masks.
The WRV2 dataset comprises over 4,000 videos with an average length of 80 frames, designed to support the development and evaluation of inpainting models.
arXiv Detail & Related papers (2024-04-24T11:02:13Z)
- Towards Language-Driven Video Inpainting via Multimodal Large Language Models [116.22805434658567]
We introduce a new task -- language-driven video inpainting.
It uses natural language instructions to guide the inpainting process.
We present the Remove Objects from Videos by Instructions dataset.
arXiv Detail & Related papers (2024-01-18T18:59:13Z)
- AVID: Any-Length Video Inpainting with Diffusion Model [30.860927136236374]
We introduce Any-Length Video Inpainting with Diffusion Model, dubbed as AVID.
Our model is equipped with effective motion modules and adjustable structure guidance for fixed-length video inpainting.
Our experiments show our model can robustly deal with various inpainting types at different video duration ranges, with high quality.
arXiv Detail & Related papers (2023-12-06T18:56:14Z)
- A Task is Worth One Word: Learning with Task Prompts for High-Quality Versatile Image Inpainting [38.53807472111521]
We introduce PowerPaint, the first high-quality and versatile inpainting model that excels in multiple inpainting tasks.
We demonstrate the versatility of the task prompt in PowerPaint by showcasing its effectiveness as a negative prompt for object removal.
We leverage prompt techniques to enable controllable shape-guided object inpainting, enhancing the model's applicability in shape-guided applications.
arXiv Detail & Related papers (2023-12-06T16:34:46Z)
- Occlusion-Aware Video Object Inpainting [72.38919601150175]
This paper presents occlusion-aware video object inpainting, which recovers both the complete shape and appearance for occluded objects in videos.
Our technical contribution, VOIN, jointly performs video object shape completion and occluded texture generation.
For more realistic results, VOIN is optimized using both T-PatchGAN and a new spatio-temporal attention-based multi-class discriminator.
arXiv Detail & Related papers (2021-08-15T15:46:57Z)
- Learning Joint Spatial-Temporal Transformations for Video Inpainting [58.939131620135235]
We propose to learn a joint Spatial-Temporal Transformer Network (STTN) for video inpainting.
We simultaneously fill missing regions in all input frames via self-attention and optimize STTN with a spatial-temporal adversarial loss.
arXiv Detail & Related papers (2020-07-20T16:35:48Z)