Hierarchical Masked 3D Diffusion Model for Video Outpainting
- URL: http://arxiv.org/abs/2309.02119v3
- Date: Fri, 19 Jan 2024 08:50:28 GMT
- Title: Hierarchical Masked 3D Diffusion Model for Video Outpainting
- Authors: Fanda Fan, Chaoxu Guo, Litong Gong, Biao Wang, Tiezheng Ge, Yuning Jiang, Chunjie Luo, Jianfeng Zhan
- Abstract summary: We introduce a masked 3D diffusion model for video outpainting.
This allows us to use multiple guide frames to connect the results of multiple video clip inferences.
We also introduce a hybrid coarse-to-fine inference pipeline to alleviate the artifact accumulation problem.
- Score: 20.738731220322176
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video outpainting aims to adequately complete missing areas at the edges of
video frames. Compared to image outpainting, it presents an additional
challenge as the model should maintain the temporal consistency of the filled
area. In this paper, we introduce a masked 3D diffusion model for video
outpainting. We use the technique of mask modeling to train the 3D diffusion
model. This allows us to use multiple guide frames to connect the results of
multiple video clip inferences, thus ensuring temporal consistency and reducing
jitter between adjacent frames. Meanwhile, we extract the global frames of the
video as prompts and guide the model to obtain information other than the
current video clip using cross-attention. We also introduce a hybrid
coarse-to-fine inference pipeline to alleviate the artifact accumulation
problem. The existing coarse-to-fine pipeline only uses the infilling strategy,
which brings degradation because the time interval of the sparse frames is too
large. Our pipeline benefits from bidirectional learning of the mask modeling
and thus can employ a hybrid strategy of infilling and interpolation when
generating sparse frames. Experiments show that our method achieves
state-of-the-art results in video outpainting tasks. More results and code are
available at https://fanfanda.github.io/M3DDM/.
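The abstract describes connecting clip-level inferences through guide frames and generating sparse frames before dense ones. The Python sketch below illustrates only that scheduling idea and is not the authors' implementation. The helper names `plan_clips` and `guide_mask`, the two-level split, and the rule that a clip must hold at least `stride + 1` frames are all assumptions made for illustration. In this simplification, a clip with one guide frame continues from the previous clip, and a clip with two guide frames is interpolated between its endpoints.

```python
from typing import List, Tuple

# (stage, frame indices in the clip, guide frame indices within the clip)
Clip = Tuple[str, List[int], List[int]]


def plan_clips(num_frames: int, clip_len: int, stride: int) -> List[Clip]:
    """Plan a two-level coarse-to-fine pass (hypothetical helper, not from the paper)."""
    if num_frames < 2 or stride < 1 or clip_len < stride + 1:
        raise ValueError("need num_frames >= 2, stride >= 1, clip_len >= stride + 1")

    # Sparse key frames: every `stride`-th frame, plus the final frame.
    sparse = list(range(0, num_frames, stride))
    if sparse[-1] != num_frames - 1:
        sparse.append(num_frames - 1)

    plan: List[Clip] = []

    # Coarse pass: consecutive clips share one key frame, and the shared
    # frame serves as the guide frame of the later clip.
    start = 0
    while start < len(sparse) - 1:
        frames = sparse[start:start + clip_len]
        guides = [] if start == 0 else [frames[0]]
        plan.append(("coarse", frames, guides))
        start += clip_len - 1

    # Fine pass: fill each gap between adjacent key frames, guided by
    # both endpoints (an interpolation-style clip).
    for a, b in zip(sparse, sparse[1:]):
        if b - a > 1:
            plan.append(("fine", list(range(a, b + 1)), [a, b]))
    return plan


def guide_mask(frames: List[int], guides: List[int]) -> List[int]:
    """Per-frame flag: 1 if the frame is a guide frame, else 0."""
    return [1 if f in guides else 0 for f in frames]


if __name__ == "__main__":
    for stage, frames, guides in plan_clips(num_frames=33, clip_len=5, stride=4):
        print(stage, frames, guide_mask(frames, guides))
```

A real pipeline would run the diffusion model once per planned clip, feeding the already generated guide frames as conditioning. How the paper chooses guide frames and mixes infilling with interpolation at the coarse level is not specified in the abstract, so that part is left out here.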
Related papers
- Tex4D: Zero-shot 4D Scene Texturing with Video Diffusion Models [54.35214051961381]
3D meshes are widely used in computer vision and graphics for their efficiency in animation and minimal memory use in movies, games, AR, and VR.
However, creating temporally consistent and realistic textures for meshes remains labor-intensive for professional artists.
We present Tex4D, which integrates inherent geometry from mesh sequences with video diffusion models to produce consistent textures.
arXiv Detail & Related papers (2024-10-14T17:59:59Z)
- Video Diffusion Models are Strong Video Inpainter [14.402778136825642]
We propose a novel First Frame Filling Video Diffusion Inpainting model (FFF-VDI).
We propagate the noise latent information of future frames to fill the masked areas of the first frame's noise latent code.
Next, we fine-tune the pre-trained image-to-video diffusion model to generate the inpainted video.
arXiv Detail & Related papers (2024-08-21T08:01:00Z)
- SVG: 3D Stereoscopic Video Generation via Denoising Frame Matrix [60.48666051245761]
We propose a pose-free and training-free approach for generating 3D stereoscopic videos.
Our method warps a generated monocular video into camera views on a stereoscopic baseline using estimated video depth.
We develop a disocclusion boundary re-injection scheme that further improves the quality of video inpainting.
arXiv Detail & Related papers (2024-06-29T08:33:55Z)
- AVID: Any-Length Video Inpainting with Diffusion Model [30.860927136236374]
We introduce Any-Length Video Inpainting with Diffusion Model, dubbed AVID.
Our model is equipped with effective motion modules and adjustable structure guidance, for fixed-length video inpainting.
Our experiments show our model can robustly deal with various inpainting types at different video duration ranges, with high quality.
arXiv Detail & Related papers (2023-12-06T18:56:14Z)
- Learning 3D Photography Videos via Self-supervised Diffusion on Single Images [105.81348348510551]
3D photography renders a static image into a video with appealing 3D visual effects.
Existing approaches typically first conduct monocular depth estimation, then render the input frame to subsequent frames with various viewpoints.
We present a novel task: out-animation, which extends the space and time of input objects.
arXiv Detail & Related papers (2023-02-21T16:18:40Z)
- Unsupervised Video Interpolation by Learning Multilayered 2.5D Motion Fields [75.81417944207806]
This paper presents a self-supervised approach to video frame interpolation that requires only a single video.
We parameterize the video motions by solving an ordinary differential equation (ODE) defined on a time-varying motion field.
This implicit neural representation learns the video as a space-time continuum, allowing frame interpolation at any temporal resolution.
arXiv Detail & Related papers (2022-04-21T06:17:05Z)
- Efficient Video Segmentation Models with Per-frame Inference [117.97423110566963]
We focus on improving the temporal consistency without introducing overhead in inference.
We propose several techniques to learn from the video sequence, including a temporal consistency loss and online/offline knowledge distillation methods.
arXiv Detail & Related papers (2022-02-24T23:51:36Z)
- Learning Joint Spatial-Temporal Transformations for Video Inpainting [58.939131620135235]
We propose to learn a joint Spatial-Temporal Transformer Network (STTN) for video inpainting.
We simultaneously fill missing regions in all input frames by self-attention, and propose to optimize STTN by a spatial-temporal adversarial loss.
arXiv Detail & Related papers (2020-07-20T16:35:48Z)
- DVI: Depth Guided Video Inpainting for Autonomous Driving [35.94330601020169]
We present an automatic video inpainting algorithm that can remove traffic agents from videos.
By building a dense 3D map from stitched point clouds, frames within a video are geometrically correlated.
We are the first to fuse multiple videos for video inpainting.
arXiv Detail & Related papers (2020-07-17T09:29:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.