DiTPainter: Efficient Video Inpainting with Diffusion Transformers
- URL: http://arxiv.org/abs/2504.15661v3
- Date: Mon, 19 May 2025 09:52:55 GMT
- Title: DiTPainter: Efficient Video Inpainting with Diffusion Transformers
- Authors: Xian Wu, Chang Liu
- Abstract summary: We present DiTPainter, an end-to-end video inpainting model based on Diffusion Transformer (DiT). DiTPainter uses an efficient transformer network designed for video inpainting, trained from scratch rather than initialized from any large pretrained model. Experiments show that DiTPainter outperforms existing video inpainting algorithms with higher quality and better spatial-temporal consistency.
- Score: 35.1896530415315
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Many existing video inpainting algorithms use optical flow to build correspondence maps and then propagate pixels from adjacent frames into the missing areas. Despite the effectiveness of this propagation mechanism, they may produce blur and inconsistencies when the optical flow is inaccurate or the masks are large. Recently, the Diffusion Transformer (DiT) has emerged as a revolutionary technique for video generation tasks. However, pretrained DiT models for video generation all contain a large number of parameters, which makes them time-consuming to apply to video inpainting tasks. In this paper, we present DiTPainter, an end-to-end video inpainting model based on the Diffusion Transformer (DiT). DiTPainter uses an efficient transformer network designed for video inpainting, trained from scratch rather than initialized from any large pretrained model. DiTPainter can handle videos of arbitrary length and can be applied to video decaptioning and video completion tasks at an acceptable time cost. Experiments show that DiTPainter outperforms existing video inpainting algorithms with higher quality and better spatial-temporal consistency.
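To make the masked-video conditioning idea concrete, the sketch below shows, in simplified PyTorch, one common way a DiT-style denoiser can be conditioned for inpainting: the noisy video latent, the masked-video latent, and the binary mask are concatenated along the channel axis, split into spatio-temporal patch tokens, and processed by plain transformer blocks that predict the noise. This is a hypothetical illustration, not the authors' released code; the module names (`InpaintDiTSketch`, `proj_in`, etc.) are invented, and positional embeddings and other practical details are omitted.

```python
# Hypothetical sketch of a DiT-style video inpainting denoiser (not the
# authors' code). Noisy latent, masked-video latent, and binary mask are
# concatenated channel-wise, tokenized into spatio-temporal patches, and
# fed through transformer blocks that predict the noise to remove.
import torch
import torch.nn as nn


class InpaintDiTSketch(nn.Module):
    def __init__(self, latent_ch=4, dim=512, depth=8, heads=8, patch=2):
        super().__init__()
        self.patch = patch
        in_ch = 2 * latent_ch + 1                       # noisy + masked latent + mask
        self.proj_in = nn.Linear(in_ch * patch * patch, dim)
        self.time_mlp = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.proj_out = nn.Linear(dim, latent_ch * patch * patch)

    def forward(self, z_t, z_masked, mask, t):
        # z_t, z_masked: (B, C, T, H, W) latents; mask: (B, 1, T, H, W); t: (B,)
        x = torch.cat([z_t, z_masked, mask], dim=1)
        B, C, T, H, W = x.shape
        p = self.patch
        # split every frame into non-overlapping p x p patches -> token sequence
        x = x.reshape(B, C, T, H // p, p, W // p, p)
        x = x.permute(0, 2, 3, 5, 1, 4, 6).reshape(B, T * (H // p) * (W // p), C * p * p)
        tok = self.proj_in(x) + self.time_mlp(t.float().view(B, 1, 1) / 1000.0)
        tok = self.blocks(tok)
        out = self.proj_out(tok)                        # predicted noise per patch
        out = out.reshape(B, T, H // p, W // p, -1, p, p)
        return out.permute(0, 4, 1, 2, 5, 3, 6).reshape(B, -1, T, H, W)


# Toy usage: an 8-frame clip of 32x32 latents, fully masked
# (mask = 1 in missing regions, masked latent zeroed there).
model = InpaintDiTSketch()
z_t = torch.randn(1, 4, 8, 32, 32)
noise_pred = model(z_t, torch.zeros_like(z_t), torch.ones(1, 1, 8, 32, 32), torch.tensor([10]))
print(noise_pred.shape)  # torch.Size([1, 4, 8, 32, 32])
```

In a full pipeline the denoised latents would still need to be decoded back to frames (e.g., by a video VAE); that step, like everything above, is an assumption for illustration rather than a description of DiTPainter's actual architecture.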
Related papers
- OutDreamer: Video Outpainting with a Diffusion Transformer [37.512451098188635]
We introduce OutDreamer, a DiT-based video outpainting framework. We propose a mask-driven self-attention layer that dynamically integrates the given mask information. For long video outpainting, we employ a cross-video-clip refiner to iteratively generate missing content.
arXiv Detail & Related papers (2025-06-27T15:08:54Z) - EraserDiT: Fast Video Inpainting with Diffusion Transformer Model [6.616553739135743]
This paper introduces a novel video inpainting approach leveraging the Diffusion Transformer (DiT). DiT synergistically combines the advantages of diffusion models and transformer architectures to maintain long-term temporal consistency. It takes only 180 seconds to complete a video with a resolution of $1080 \times 1920$ and 121 frames without any acceleration method.
arXiv Detail & Related papers (2025-06-15T13:59:57Z) - FramePainter: Endowing Interactive Image Editing with Video Diffusion Priors [64.54220123913154]
We introduce FramePainter as an efficient instantiation of the image-to-video generation problem. It only uses a lightweight sparse control encoder to inject editing signals. It dominantly outperforms previous state-of-the-art methods with far less training data.
arXiv Detail & Related papers (2025-01-14T16:09:16Z) - Video Diffusion Models are Strong Video Inpainter [14.402778136825642]
We propose a novel First Frame Filling Video Diffusion Inpainting model (FFF-VDI). We propagate the noise latent information of future frames to fill the masked areas of the first frame's noise latent code. Next, we fine-tune the pre-trained image-to-video diffusion model to generate the inpainted video.
arXiv Detail & Related papers (2024-08-21T08:01:00Z) - Towards Online Real-Time Memory-based Video Inpainting Transformers [95.90235034520167]
Inpainting tasks have seen significant improvements in recent years with the rise of deep neural networks and, in particular, vision transformers.
We propose a framework to adapt existing inpainting transformers to online, real-time constraints by memorizing and refining redundant computations.
Using this framework with some of the most recent inpainting models, we show great online results with a consistent throughput above 20 frames per second.
arXiv Detail & Related papers (2024-03-24T14:02:25Z) - TDViT: Temporal Dilated Video Transformer for Dense Video Tasks [35.16197118579414]
The Temporal Dilated Video Transformer (TDViT) can efficiently extract video representations and effectively alleviate the negative effect of temporal redundancy.
Experiments are conducted on two different dense video benchmarks, i.e., ImageNet VID for video object detection and YouTube VIS for video instance segmentation.
arXiv Detail & Related papers (2024-02-14T15:41:07Z) - AVID: Any-Length Video Inpainting with Diffusion Model [30.860927136236374]
We introduce Any-Length Video Inpainting with Diffusion Model, dubbed AVID.
Our model is equipped with effective motion modules and adjustable structure guidance for fixed-length video inpainting.
Our experiments show our model can robustly deal with various inpainting types at different video duration ranges, with high quality.
arXiv Detail & Related papers (2023-12-06T18:56:14Z) - ProPainter: Improving Propagation and Transformer for Video Inpainting [98.70898369695517]
Flow-based propagation and spatiotemporal Transformer are two mainstream mechanisms in video inpainting (VI).
We introduce dual-domain propagation that combines the advantages of image and feature warping, exploiting global correspondences reliably.
We also propose a mask-guided sparse video Transformer, which achieves high efficiency by discarding redundant tokens.
arXiv Detail & Related papers (2023-09-07T17:57:29Z) - Hierarchical Masked 3D Diffusion Model for Video Outpainting [20.738731220322176]
We introduce a masked 3D diffusion model for video outpainting.
This allows us to use multiple guide frames to connect the results of multiple video clip inferences.
We also introduce a hybrid coarse-to-fine inference pipeline to alleviate the artifact accumulation problem.
arXiv Detail & Related papers (2023-09-05T10:52:21Z) - Latent-Shift: Latent Diffusion with Temporal Shift for Efficient Text-to-Video Generation [115.09597127418452]
Latent-Shift is an efficient text-to-video generation method based on a pretrained text-to-image generation model.
We show that Latent-Shift achieves comparable or better results while being significantly more efficient.
arXiv Detail & Related papers (2023-04-17T17:57:06Z) - Decoupled Spatial-Temporal Transformer for Video Inpainting [77.8621673355983]
Video inpainting aims to fill the given holes with realistic appearance but is still a challenging task even with prosperous deep learning approaches.
Recent works introduce the promising Transformer architecture into deep video inpainting and achieve better performance.
We propose a Decoupled Spatial-Temporal Transformer (DSTT) for improving video inpainting with exceptional efficiency.
arXiv Detail & Related papers (2021-04-14T05:47:46Z)