Decoupled Spatial-Temporal Transformer for Video Inpainting
- URL: http://arxiv.org/abs/2104.06637v1
- Date: Wed, 14 Apr 2021 05:47:46 GMT
- Title: Decoupled Spatial-Temporal Transformer for Video Inpainting
- Authors: Rui Liu, Hanming Deng, Yangyi Huang, Xiaoyu Shi, Lewei Lu, Wenxiu Sun,
Xiaogang Wang, Jifeng Dai, Hongsheng Li
- Abstract summary: Video inpainting aims to fill the given holes with realistic appearance but is still a challenging task even with prosperous deep learning approaches.
Recent works introduce the promising Transformer architecture into deep video inpainting and achieve better performance.
We propose a Decoupled Spatial-Temporal Transformer (DSTT) for improving video inpainting with exceptional efficiency.
- Score: 77.8621673355983
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video inpainting aims to fill the given spatiotemporal holes with realistic
appearance but is still a challenging task even with prosperous deep learning
approaches. Recent works introduce the promising Transformer architecture into
deep video inpainting and achieve better performance. However, it still suffers
from synthesizing blurry textures as well as huge computational cost. To this
end, we propose a novel Decoupled Spatial-Temporal Transformer (DSTT) for
improving video inpainting with exceptional efficiency. Our proposed DSTT
disentangles the task of learning spatial-temporal attention into two sub-tasks:
one attends to temporal object movements at the same spatial locations across
different frames, handled by a temporally-decoupled Transformer block; the other
attends to similar background textures across all spatial positions of the same
frame, handled by a spatially-decoupled Transformer block.
The interweaving stack of these two blocks makes the model attend to background
textures and moving objects more precisely, so that plausible and temporally
coherent appearance can be propagated to fill the holes. In addition, a
hierarchical encoder is adopted before the stack of Transformer blocks to learn
robust, hierarchical features that maintain multi-level local spatial structure,
yielding more representative token vectors. The seamless combination of these
two designs forms a better spatial-temporal attention scheme, and the proposed
model outperforms state-of-the-art video inpainting approaches with
significantly boosted efficiency.
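To make the decoupling concrete, below is a minimal sketch of how the two attention axes could be separated, assuming encoder tokens arranged as a (frames, height, width, channels) tensor. The class name, layer sizes, and the plain PyTorch attention used here are illustrative assumptions, not the authors' implementation; the hierarchical encoder described above is omitted.

```python
# Minimal sketch (not the paper's code) of decoupled spatial-temporal attention.
import torch
import torch.nn as nn


class DecoupledBlock(nn.Module):
    """One Transformer block whose self-attention runs along a single axis."""

    def __init__(self, dim, heads, mode):
        super().__init__()
        assert mode in ("temporal", "spatial")
        self.mode = mode
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x):  # x: (T, H, W, C) token features
        t, h, w, c = x.shape
        if self.mode == "temporal":
            # attend across frames at the same spatial location
            seq = x.permute(1, 2, 0, 3).reshape(h * w, t, c)
        else:
            # attend across all spatial positions within the same frame
            seq = x.reshape(t, h * w, c)
        y = self.norm1(seq)
        seq = seq + self.attn(y, y, y, need_weights=False)[0]
        seq = seq + self.mlp(self.norm2(seq))
        if self.mode == "temporal":
            return seq.reshape(h, w, t, c).permute(2, 0, 1, 3)
        return seq.reshape(t, h, w, c)


# Interweaving stack: temporally- and spatially-decoupled blocks alternate.
blocks = nn.Sequential(*[DecoupledBlock(256, 4, m)
                         for m in ("temporal", "spatial") * 4])
tokens = torch.randn(5, 12, 20, 256)  # 5 frames of 12x20 tokens (illustrative sizes)
out = blocks(tokens)                  # same shape: (5, 12, 20, 256)
```

Splitting the attention this way keeps each block's cost linear in either the frame count or the per-frame token count, rather than quadratic in their product, which is the source of the efficiency gain the abstract claims.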
Related papers
- When Spatial meets Temporal in Action Recognition [34.53091498930863]
We introduce the Temporal Integration and Motion Enhancement (TIME) layer, a novel preprocessing technique designed to incorporate temporal information.
The TIME layer generates new video frames by rearranging the original sequence, preserving temporal order while embedding $N^2$ temporally evolving frames into a single spatial grid.
Our experiments show that the TIME layer enhances recognition accuracy, offering valuable insights for video processing tasks.
arXiv Detail & Related papers (2024-11-22T16:39:45Z) - Decouple Content and Motion for Conditional Image-to-Video Generation [6.634105805557556]
Conditional image-to-video (cI2V) generation aims to create a believable new video starting from a given condition, i.e., one image and text.
Previous cI2V generation methods conventionally perform in RGB pixel space, with limitations in modeling motion consistency and visual continuity.
We propose a novel approach by disentangling the target RGB pixels into two distinct components: spatial content and temporal motions.
arXiv Detail & Related papers (2023-11-24T06:08:27Z) - Latent-Shift: Latent Diffusion with Temporal Shift for Efficient
Text-to-Video Generation [115.09597127418452]
Latent-Shift is an efficient text-to-video generation method based on a pretrained text-to-image generation model.
We show that Latent-Shift achieves comparable or better results while being significantly more efficient.
arXiv Detail & Related papers (2023-04-17T17:57:06Z) - Blur Interpolation Transformer for Real-World Motion from Blur [52.10523711510876]
We propose a blur interpolation transformer (BiT) to unravel the underlying temporal correlation encoded in blur.
Based on multi-scale residual Swin transformer blocks, we introduce dual-end temporal supervision and temporally symmetric ensembling strategies.
In addition, we design a hybrid camera system to collect the first real-world dataset of one-to-many blur-sharp video pairs.
arXiv Detail & Related papers (2022-11-21T13:10:10Z) - Time-Space Transformers for Video Panoptic Segmentation [3.2489082010225494]
We propose a solution that simultaneously predicts pixel-level semantic and clip-level instance segmentation.
Our network, named VPS-Transformer, combines a convolutional architecture for single-frame panoptic segmentation and a video module based on an instantiation of a pure Transformer block.
arXiv Detail & Related papers (2022-10-07T13:30:11Z) - TubeDETR: Spatio-Temporal Video Grounding with Transformers [89.71617065426146]
We consider the problem of localizing a spatio-temporal tube in a video corresponding to a given text query.
To address this task, we propose TubeDETR, a transformer-based architecture inspired by the recent success of such models for text-conditioned object detection.
arXiv Detail & Related papers (2022-03-30T16:31:49Z) - Occlusion-Aware Video Object Inpainting [72.38919601150175]
This paper presents occlusion-aware video object inpainting, which recovers both the complete shape and appearance for occluded objects in videos.
Our technical contribution, VOIN, jointly performs video object shape completion and occluded texture generation.
For more realistic results, VOIN is optimized using both T-PatchGAN and a new spatio-temporal attention-based multi-class discriminator.
arXiv Detail & Related papers (2021-08-15T15:46:57Z) - Space-time Mixing Attention for Video Transformer [55.50839896863275]
We propose a Video Transformer model whose complexity scales linearly with the number of frames in the video sequence.
We demonstrate that our model produces very high recognition accuracy on the most popular video recognition datasets.
arXiv Detail & Related papers (2021-06-10T17:59:14Z)