AVID: Any-Length Video Inpainting with Diffusion Model
- URL: http://arxiv.org/abs/2312.03816v3
- Date: Fri, 29 Mar 2024 19:50:38 GMT
- Title: AVID: Any-Length Video Inpainting with Diffusion Model
- Authors: Zhixing Zhang, Bichen Wu, Xiaoyan Wang, Yaqiao Luo, Luxin Zhang, Yinan Zhao, Peter Vajda, Dimitris Metaxas, Licheng Yu,
- Abstract summary: We introduce Any-Length Video Inpainting with Diffusion Model, dubbed as AVID.
Our model is equipped with effective motion modules and adjustable structure guidance, for fixed-length video inpainting.
Our experiments show our model can robustly deal with various inpainting types at different video duration ranges, with high quality.
- Score: 30.860927136236374
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recent advances in diffusion models have successfully enabled text-guided image inpainting. While it seems straightforward to extend such editing capability into the video domain, there have been fewer works regarding text-guided video inpainting. Given a video, a masked region at its initial frame, and an editing prompt, it requires a model to do infilling at each frame following the editing guidance while keeping the out-of-mask region intact. There are three main challenges in text-guided video inpainting: ($i$) temporal consistency of the edited video, ($ii$) supporting different inpainting types at different structural fidelity levels, and ($iii$) dealing with variable video length. To address these challenges, we introduce Any-Length Video Inpainting with Diffusion Model, dubbed as AVID. At its core, our model is equipped with effective motion modules and adjustable structure guidance, for fixed-length video inpainting. Building on top of that, we propose a novel Temporal MultiDiffusion sampling pipeline with a middle-frame attention guidance mechanism, facilitating the generation of videos with any desired duration. Our comprehensive experiments show our model can robustly deal with various inpainting types at different video duration ranges, with high quality. More visualization results are made publicly available at https://zhang-zx.github.io/AVID/ .
Related papers
- Video Diffusion Models are Strong Video Inpainter [14.402778136825642]
We propose a novel First Frame Filling Video Diffusion Inpainting model (FFF-VDI)
We propagate the noise latent information of future frames to fill the masked areas of the first frame's noise latent code.
Next, we fine-tune the pre-trained image-to-video diffusion model to generate the inpainted video.
arXiv Detail & Related papers (2024-08-21T08:01:00Z) - RAVE: Randomized Noise Shuffling for Fast and Consistent Video Editing
with Diffusion Models [19.792535444735957]
RAVE is a zero-shot video editing method that leverages pre-trained text-to-image diffusion models without additional training.
It produces high-quality videos while preserving original motion and semantic structure.
RAVE is capable of a wide range of edits, from local attribute modifications to shape transformations.
arXiv Detail & Related papers (2023-12-07T18:43:45Z) - VIDiff: Translating Videos via Multi-Modal Instructions with Diffusion
Models [96.55004961251889]
Video Instruction Diffusion (VIDiff) is a unified foundation model designed for a wide range of video tasks.
Our model can edit and translate the desired results within seconds based on user instructions.
We provide convincing generative results for diverse input videos and written instructions, both qualitatively and quantitatively.
arXiv Detail & Related papers (2023-11-30T18:59:52Z) - SEINE: Short-to-Long Video Diffusion Model for Generative Transition and
Prediction [93.26613503521664]
This paper presents a short-to-long video diffusion model, SEINE, that focuses on generative transition and prediction.
We propose a random-mask video diffusion model to automatically generate transitions based on textual descriptions.
Our model generates transition videos that ensure coherence and visual quality.
arXiv Detail & Related papers (2023-10-31T17:58:17Z) - Hierarchical Masked 3D Diffusion Model for Video Outpainting [20.738731220322176]
We introduce a masked 3D diffusion model for video outpainting.
This allows us to use multiple guide frames to connect the results of multiple video clip inferences.
We also introduce a hybrid coarse-to-fine inference pipeline to alleviate the artifact accumulation problem.
arXiv Detail & Related papers (2023-09-05T10:52:21Z) - TokenFlow: Consistent Diffusion Features for Consistent Video Editing [27.736354114287725]
We present a framework that harnesses the power of a text-to-image diffusion model for the task of text-driven video editing.
Our method generates a high-quality video that adheres to the target text, while preserving the spatial layout and motion of the input video.
Our framework does not require any training or fine-tuning, and can work in conjunction with any off-the-shelf text-to-image editing method.
arXiv Detail & Related papers (2023-07-19T18:00:03Z) - ControlVideo: Training-free Controllable Text-to-Video Generation [117.06302461557044]
ControlVideo is a framework to enable natural and efficient text-to-video generation.
It generates both short and long videos within several minutes using one NVIDIA 2080Ti.
arXiv Detail & Related papers (2023-05-22T14:48:53Z) - Zero-Shot Video Editing Using Off-The-Shelf Image Diffusion Models [68.31777975873742]
Recent attempts at video editing require significant text-to-video data and computation resources for training.
We propose vid2vid-zero, a simple yet effective method for zero-shot video editing.
Experiments and analyses show promising results in editing attributes, subjects, places, etc., in real-world videos.
arXiv Detail & Related papers (2023-03-30T17:59:25Z) - Edit-A-Video: Single Video Editing with Object-Aware Consistency [49.43316939996227]
We propose a video editing framework given only a pretrained TTI model and a single text, video> pair, which we term Edit-A-Video.
The framework consists of two stages: (1) inflating the 2D model into the 3D model by appending temporal modules tuning and on the source video (2) inverting the source video into the noise and editing with target text prompt and attention map injection.
We present extensive experimental results over various types of text and videos, and demonstrate the superiority of the proposed method compared to baselines in terms of background consistency, text alignment, and video editing quality.
arXiv Detail & Related papers (2023-03-14T14:35:59Z) - Learning Joint Spatial-Temporal Transformations for Video Inpainting [58.939131620135235]
We propose to learn a joint Spatial-Temporal Transformer Network (STTN) for video inpainting.
We simultaneously fill missing regions in all input frames by self-attention, and propose to optimize STTN by a spatial-temporal adversarial loss.
arXiv Detail & Related papers (2020-07-20T16:35:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.