Semantically Consistent Video Inpainting with Conditional Diffusion Models
- URL: http://arxiv.org/abs/2405.00251v1
- Date: Tue, 30 Apr 2024 23:49:26 GMT
- Title: Semantically Consistent Video Inpainting with Conditional Diffusion Models
- Authors: Dylan Green, William Harvey, Saeid Naderiparizi, Matthew Niedoba, Yunpeng Liu, Xiaoxuan Liang, Jonathan Lavington, Ke Zhang, Vasileios Lioutas, Setareh Dabiri, Adam Scibior, Berend Zwartsenberg, Frank Wood
- Abstract summary: We present a conditional video diffusion framework for video inpainting tasks that require synthesizing novel content not present in other frames.
We show that our method is capable of generating diverse, high-quality inpaintings and synthesizing new content that is spatially, temporally, and semantically consistent with the provided context.
- Score: 16.42354856518832
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Current state-of-the-art methods for video inpainting typically rely on optical flow or attention-based approaches to inpaint masked regions by propagating visual information across frames. While such approaches have led to significant progress on standard benchmarks, they struggle with tasks that require the synthesis of novel content that is not present in other frames. In this paper we reframe video inpainting as a conditional generative modeling problem and present a framework for solving such problems with conditional video diffusion models. We highlight the advantages of using a generative approach for this task, showing that our method is capable of generating diverse, high-quality inpaintings and synthesizing new content that is spatially, temporally, and semantically consistent with the provided context.
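As a rough illustration of the conditional generative framing above (not the authors' architecture or training setup), the sketch below shows one way to condition a video diffusion sampler on known pixels: a RePaint-style loop that, at every reverse step, re-imposes a noised copy of the unmasked context so the denoiser only has to synthesize the masked region consistently with it. The `denoiser(x, t)` epsilon-prediction interface, the noise schedule, and all shapes are assumptions for illustration.

```python
import torch

@torch.no_grad()
def inpaint_video(denoiser, frames, mask, betas):
    """Sample an inpainting for the masked pixels of a video clip.

    frames: (B, T, C, H, W) context frames (ground truth outside the mask)
    mask:   (B, T, 1, H, W) with 1 = missing pixel, 0 = known pixel
    betas:  (N,) diffusion noise schedule
    denoiser(x, t) is assumed to predict the noise added at step t.
    """
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    x = torch.randn_like(frames)                     # start from pure noise
    for t in reversed(range(len(betas))):
        # Re-impose the known pixels at the matching noise level so the
        # synthesized content stays consistent with the provided context.
        noise = torch.randn_like(frames) if t > 0 else torch.zeros_like(frames)
        known = alpha_bar[t].sqrt() * frames + (1 - alpha_bar[t]).sqrt() * noise
        x = mask * x + (1 - mask) * known

        # One ancestral DDPM reverse step over the whole clip at once.
        t_batch = torch.full((frames.shape[0],), t, device=frames.device)
        eps = denoiser(x, t_batch)
        x = (x - betas[t] / (1 - alpha_bar[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)
    return mask * x + (1 - mask) * frames            # keep the context pixels exact
```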
Related papers
- LatentColorization: Latent Diffusion-Based Speaker Video Colorization [1.2641141743223379]
We introduce a novel solution for achieving temporal consistency in video colorization.
We demonstrate strong improvements on established image quality metrics compared to other existing methods.
Our dataset encompasses a combination of conventional datasets and videos from television/movies.
arXiv Detail & Related papers (2024-05-09T12:06:06Z) - Lumiere: A Space-Time Diffusion Model for Video Generation [75.54967294846686]
We introduce a Space-Time U-Net architecture that generates the entire temporal duration of the video at once.
This is in contrast to existing video models which synthesize distant keyframes followed by temporal super-resolution.
By deploying both spatial and (importantly) temporal down- and up-sampling, our model learns to directly generate a full-frame-rate, low-resolution video.
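For intuition only, the hypothetical PyTorch module below shows what joint spatial and temporal down- and up-sampling can look like: a strided 3D convolution halves T, H, and W together, and a transposed 3D convolution restores them. It is a toy stand-in, not Lumiere's Space-Time U-Net.

```python
import torch
import torch.nn as nn

class SpaceTimeDownUp(nn.Module):
    """Toy block: downsample T, H and W together, then upsample back."""
    def __init__(self, channels: int = 64):
        super().__init__()
        # Strided 3D conv halves the temporal and spatial resolution at once.
        self.down = nn.Conv3d(channels, channels, kernel_size=3, stride=2, padding=1)
        # Transposed 3D conv restores the original temporal and spatial size.
        self.up = nn.ConvTranspose3d(channels, channels, kernel_size=4, stride=2, padding=1)

    def forward(self, x):                  # x: (B, C, T, H, W)
        return self.up(torch.relu(self.down(x)))

x = torch.randn(1, 64, 16, 64, 64)         # a 16-frame, 64x64 feature video
print(SpaceTimeDownUp()(x).shape)          # torch.Size([1, 64, 16, 64, 64])
```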
arXiv Detail & Related papers (2024-01-23T18:05:25Z) - RTQ: Rethinking Video-language Understanding Based on Image-text Model [55.278942477715084]
Video-language understanding presents unique challenges due to the inclusion of highly complex semantic details.
We propose a novel framework called RTQ, which addresses these challenges simultaneously.
Our model demonstrates outstanding performance even in the absence of video-language pre-training.
arXiv Detail & Related papers (2023-12-01T04:51:01Z) - LatentWarp: Consistent Diffusion Latents for Zero-Shot Video-to-Video Translation [21.815083817914843]
We propose a new zero-shot video-to-video translation framework, named LatentWarp.
Our approach is simple: to constrain the query tokens to be temporally consistent, we further incorporate a warping operation in the latent space.
Experiment results demonstrate the superiority of LatentWarp in achieving video-to-video translation with temporal coherence.
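The warping idea — aligning the previous frame's latent to the current frame with optical flow so that corresponding query tokens stay temporally consistent — can be sketched roughly as follows. The flow field, latent shapes, and the use of `grid_sample` are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def warp_latent(prev_latent, flow):
    """Warp the previous frame's latent with optical flow (latent-pixel units).

    prev_latent: (B, C, H, W) latent of the previous frame
    flow:        (B, 2, H, W) displacement mapping current -> previous coordinates
    """
    B, _, H, W = prev_latent.shape
    # Base sampling grid, then add the flow displacement at every location.
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).float().to(prev_latent.device)   # (H, W, 2)
    grid = grid.unsqueeze(0) + flow.permute(0, 2, 3, 1)                   # (B, H, W, 2)
    # Normalize to [-1, 1] as expected by grid_sample.
    gx = 2.0 * grid[..., 0] / (W - 1) - 1.0
    gy = 2.0 * grid[..., 1] / (H - 1) - 1.0
    # Sample the previous latent at the displaced locations.
    return F.grid_sample(prev_latent, torch.stack((gx, gy), dim=-1), align_corners=True)
```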
arXiv Detail & Related papers (2023-11-01T08:02:57Z) - Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation [93.18163456287164]
This paper proposes a novel text-guided video-to-video translation framework to adapt image models to videos.
Our framework achieves global style and local texture temporal consistency at a low cost.
arXiv Detail & Related papers (2023-06-13T17:52:23Z) - Multi-object Video Generation from Single Frame Layouts [84.55806837855846]
We propose a video generative framework capable of synthesizing global scenes with local objects.
Our framework is a non-trivial adaptation of image generation methods and is new to this field.
Our model has been evaluated on two widely-used video recognition benchmarks.
arXiv Detail & Related papers (2023-05-06T09:07:01Z) - Internal Video Inpainting by Implicit Long-range Propagation [39.89676105875726]
We propose a novel framework for video inpainting by adopting an internal learning strategy.
We show that this can be achieved implicitly by fitting a convolutional neural network to the known region.
We extend the proposed method to another challenging task: learning to remove an object from a 4K video given a single object mask in only one frame.
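A minimal sketch of the internal-learning strategy, under the assumption of a toy per-frame CNN and a masked reconstruction loss: the network is overfit to the known pixels of the one input video, and the missing region is read out from its prediction. This is illustrative only and omits the paper's actual architecture, inputs, and regularization.

```python
import torch
import torch.nn as nn

def internal_inpaint(frames, mask, steps: int = 2000):
    """Overfit a small CNN to the known pixels of one video, then read out the rest.

    frames: (T, 3, H, W) the input video
    mask:   (T, 1, H, W) with 1 = known pixel, 0 = missing pixel
    """
    net = nn.Sequential(                    # tiny per-frame CNN, for illustration only
        nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
        nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
        nn.Conv2d(64, 3, 3, padding=1),
    )
    opt = torch.optim.Adam(net.parameters(), lr=1e-3)
    for _ in range(steps):
        pred = net(frames * mask)
        # The loss only covers known pixels; the hole is filled implicitly by
        # what the network learns from the rest of the video.
        loss = ((pred - frames) ** 2 * mask).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        return mask * frames + (1 - mask) * net(frames * mask)
```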
arXiv Detail & Related papers (2021-08-04T08:56:28Z) - Short-Term and Long-Term Context Aggregation Network for Video Inpainting [126.06302824297948]
Video inpainting aims to restore missing regions of a video and has many applications such as video editing and object removal.
We present a novel context aggregation network to effectively exploit both short-term and long-term frame information for video inpainting.
Experiments show that it outperforms state-of-the-art methods with better inpainting results and fast inpainting speed.
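As a loose illustration of combining short-term and long-term frame information (not the paper's aggregation modules), the hypothetical function below averages features from adjacent frames and attends over features from distant frames for one target frame.

```python
import torch

def aggregate_context(target_feat, short_feats, long_feats):
    """Combine features from adjacent and distant frames for one target frame.

    target_feat: (C, H, W) features of the frame being inpainted
    short_feats: (Ns, C, H, W) features of temporally adjacent frames
    long_feats:  (Nl, C, H, W) features of distant reference frames
    """
    # Short-term: average of adjacent-frame features (a stand-in for
    # flow-aligned aggregation).
    short_ctx = short_feats.mean(dim=0)

    # Long-term: attention of the target over distant frames, so far-away
    # content can be borrowed wherever it matches.
    C, H, W = target_feat.shape
    q = target_feat.reshape(C, -1).t()                        # (HW, C)
    k = long_feats.reshape(long_feats.shape[0], C, -1)        # (Nl, C, HW)
    k = k.permute(0, 2, 1).reshape(-1, C)                     # (Nl*HW, C)
    attn = torch.softmax(q @ k.t() / C ** 0.5, dim=-1)        # (HW, Nl*HW)
    long_ctx = (attn @ k).t().reshape(C, H, W)

    return target_feat + short_ctx + long_ctx
```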
arXiv Detail & Related papers (2020-09-12T03:50:56Z) - Learning Joint Spatial-Temporal Transformations for Video Inpainting [58.939131620135235]
We propose to learn a joint Spatial-Temporal Transformer Network (STTN) for video inpainting.
We simultaneously fill missing regions in all input frames by self-attention, and propose to optimize STTN by a spatial-temporal adversarial loss.
arXiv Detail & Related papers (2020-07-20T16:35:48Z)
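The joint spatial-temporal attention idea in STTN — letting every patch of every frame attend to patches in all frames so missing regions are filled simultaneously — can be approximated with standard multi-head attention over all patch tokens, as in the toy module below; the token shapes and dimensions are assumed.

```python
import torch
import torch.nn as nn

class JointSpatialTemporalAttention(nn.Module):
    """Toy layer: every patch token attends to patch tokens from all frames."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens):
        # tokens: (B, T * patches_per_frame, dim) patch embeddings of all frames,
        # so content can be copied or synthesized across both space and time.
        out, _ = self.attn(tokens, tokens, tokens)
        return out

tokens = torch.randn(2, 8 * 64, 256)                  # 8 frames x 64 patches each
print(JointSpatialTemporalAttention()(tokens).shape)  # torch.Size([2, 512, 256])
```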
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.