Learning Joint Spatial-Temporal Transformations for Video Inpainting
- URL: http://arxiv.org/abs/2007.10247v1
- Date: Mon, 20 Jul 2020 16:35:48 GMT
- Title: Learning Joint Spatial-Temporal Transformations for Video Inpainting
- Authors: Yanhong Zeng, Jianlong Fu, Hongyang Chao
- Abstract summary: We propose to learn a joint Spatial-Temporal Transformer Network (STTN) for video inpainting.
We simultaneously fill missing regions in all input frames by self-attention, and propose to optimize STTN by a spatial-temporal adversarial loss.
- Score: 58.939131620135235
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: High-quality video inpainting that completes missing regions in video frames
is a promising yet challenging task. State-of-the-art approaches adopt
attention models to complete a frame by searching missing contents from
reference frames, and further complete whole videos frame by frame. However,
these approaches can suffer from inconsistent attention results along spatial
and temporal dimensions, which often leads to blurriness and temporal artifacts
in videos. In this paper, we propose to learn a joint Spatial-Temporal
Transformer Network (STTN) for video inpainting. Specifically, we
simultaneously fill missing regions in all input frames by self-attention, and
propose to optimize STTN by a spatial-temporal adversarial loss. To show the
superiority of the proposed model, we conduct both quantitative and qualitative
evaluations by using standard stationary masks and more realistic moving object
masks. Demo videos are available at https://github.com/researchmm/STTN.
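As a rough illustration of the joint spatial-temporal self-attention described in the abstract, the sketch below lets patch tokens from all frames attend to one another and writes the attention output back only into the missing regions. This is a minimal sketch, not the released STTN code; the patch tokenization, the module name `PatchAttentionFill`, and the masking scheme are assumptions made for illustration.

```python
# Minimal sketch (not the authors' implementation) of joint spatial-temporal
# self-attention for video inpainting: patches from all frames attend to one
# another, so holes are filled from spatial and temporal context in one pass.
import torch
import torch.nn as nn


class PatchAttentionFill(nn.Module):
    """Fill masked patch tokens by attending over every patch of every frame."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, tokens: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # tokens: (B, T*N, C) patch features from T frames, N patches per frame
        # mask:   (B, T*N) with 1 for missing (hole) patches, 0 for valid ones
        # Hole keys are ignored, so every query (including hole queries) gathers
        # content only from valid patches anywhere in the video.
        key_padding_mask = mask.bool()
        filled, _ = self.attn(tokens, tokens, tokens,
                              key_padding_mask=key_padding_mask)
        out = self.proj(filled)
        # Keep original features where the video was valid; use the attention
        # output only inside the holes.
        return torch.where(mask.unsqueeze(-1).bool(), out, tokens)


if __name__ == "__main__":
    B, T, N, C = 2, 5, 64, 256              # batch, frames, patches/frame, channels
    tokens = torch.randn(B, T * N, C)
    mask = torch.zeros(B, T * N)
    mask[:, : N // 2] = 1                   # mark some patches of frame 0 as missing
    filled = PatchAttentionFill(C)(tokens, mask)
    print(filled.shape)                     # torch.Size([2, 320, 256])
```

The abstract also mentions optimizing STTN with a spatial-temporal adversarial loss. One common way to realize such a loss is a 3D-convolutional patch discriminator with a hinge objective; the sketch below assumes that form, and the paper's exact discriminator and loss may differ.

```python
# Assumed form of a spatial-temporal adversarial loss: a 3D-conv discriminator
# scores short space-time patches, pushing completions to look plausible across
# frames rather than only frame by frame.
import torch
import torch.nn as nn


class SpatioTemporalPatchDiscriminator(nn.Module):
    """Scores overlapping space-time patches of a video clip (B, C, T, H, W)."""

    def __init__(self, in_channels: int = 3, base: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(in_channels, base, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv3d(base, base * 2, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv3d(base * 2, 1, kernel_size=3, stride=1, padding=1),
        )

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        return self.net(video)  # one realism score per space-time patch


def hinge_d_loss(real_scores, fake_scores):
    # Discriminator: push real patches above +1 and completed patches below -1.
    return torch.relu(1 - real_scores).mean() + torch.relu(1 + fake_scores).mean()


def hinge_g_loss(fake_scores):
    # Generator: raise the discriminator's score on completed videos.
    return -fake_scores.mean()
```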
Related papers
- Hierarchical Masked 3D Diffusion Model for Video Outpainting [20.738731220322176]
We introduce a masked 3D diffusion model for video outpainting.
This allows us to use multiple guide frames to connect the results of multiple video clip inferences.
We also introduce a hybrid coarse-to-fine inference pipeline to alleviate the artifact accumulation problem.
arXiv Detail & Related papers (2023-09-05T10:52:21Z) - Unsupervised Video Interpolation by Learning Multilayered 2.5D Motion Fields [75.81417944207806]
This paper presents a self-supervised approach to video frame interpolation that requires only a single video.
We parameterize the video motions by solving an ordinary differential equation (ODE) defined on a time-varying motion field.
This implicit neural representation learns the video as a space-time continuum, allowing frame interpolation at any temporal resolution.
arXiv Detail & Related papers (2022-04-21T06:17:05Z) - Spatial-Temporal Residual Aggregation for High Resolution Video Inpainting [14.035620730770528]
Recent learning-based inpainting algorithms have achieved compelling results for completing missing regions after removing undesired objects in videos.
We propose STRA-Net, a novel spatial-temporal residual aggregation framework for high resolution video inpainting.
Both quantitative and qualitative evaluations show that we can produce more temporally coherent and visually appealing results than state-of-the-art methods on inpainting high-resolution videos.
arXiv Detail & Related papers (2021-11-05T15:50:31Z) - Leveraging Local Temporal Information for Multimodal Scene Classification [9.548744259567837]
Video scene classification models should capture the spatial (pixel-wise) and temporal (frame-wise) characteristics of a video effectively.
Transformer models with self-attention, which are designed to obtain contextualized representations for individual tokens given a sequence of tokens, are becoming increasingly popular in many computer vision tasks.
We propose a novel self-attention block that leverages both local and global temporal relationships between the video frames to obtain better contextualized representations for the individual frames.
arXiv Detail & Related papers (2021-10-26T19:58:32Z) - Spatiotemporal Inconsistency Learning for DeepFake Video Detection [51.747219106855624]
We present a novel temporal modeling paradigm in the Temporal Inconsistency Module (TIM) by exploiting the temporal difference over adjacent frames along both horizontal and vertical directions.
The Information Supplement Module (ISM) simultaneously utilizes the spatial information from the Spatial Inconsistency Module (SIM) and the temporal information from TIM to establish a more comprehensive spatial-temporal representation.
arXiv Detail & Related papers (2021-09-04T13:05:37Z) - Occlusion-Aware Video Object Inpainting [72.38919601150175]
This paper presents occlusion-aware video object inpainting, which recovers both the complete shape and appearance for occluded objects in videos.
Our technical contribution, VOIN, jointly performs video object shape completion and occluded texture generation.
For more realistic results, VOIN is optimized using both T-PatchGAN and a new spatio-temporal attention-based multi-class discriminator.
arXiv Detail & Related papers (2021-08-15T15:46:57Z) - StyleVideoGAN: A Temporal Generative Model using a Pretrained StyleGAN [70.31913835035206]
We present a novel approach to the video synthesis problem that helps to greatly improve visual quality.
We make use of a pre-trained StyleGAN network, the latent space of which allows control over the appearance of the objects it was trained for.
Our temporal architecture is then trained not on sequences of RGB frames, but on sequences of StyleGAN latent codes.
arXiv Detail & Related papers (2021-07-15T09:58:15Z) - Attention-guided Temporal Coherent Video Object Matting [78.82835351423383]
We propose a novel deep learning-based object matting method that can achieve temporally coherent matting results.
Its key component is an attention-based temporal aggregation module that maximizes image matting networks' strength.
We show how to effectively solve the trimap generation problem by fine-tuning a state-of-the-art video object segmentation network.
arXiv Detail & Related papers (2021-05-24T17:34:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.