DeViT: Deformed Vision Transformers in Video Inpainting
- URL: http://arxiv.org/abs/2209.13925v1
- Date: Wed, 28 Sep 2022 08:57:14 GMT
- Title: DeViT: Deformed Vision Transformers in Video Inpainting
- Authors: Jiayin Cai, Changlin Li, Xin Tao, Chun Yuan and Yu-Wing Tai
- Abstract summary: We extend previous Transformers with patch alignment by introducing Deformed Patch-based Homography (DePtH).
Second, we introduce Mask Pruning-based Patch Attention (MPPA) to improve patch-wise feature matching.
Third, we introduce a Spatial-Temporal weighting Adaptor (STA) module to obtain accurate attention to spatial-temporal tokens.
- Score: 59.73019717323264
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper proposes a novel video inpainting method. We make three main
contributions: First, we extend previous Transformers with patch alignment by
introducing Deformed Patch-based Homography (DePtH), which improves patch-level
feature alignment without additional supervision and benefits challenging
scenes with various deformations. Second, we introduce Mask Pruning-based Patch
Attention (MPPA) to improve patch-wise feature matching by pruning out less
essential features and using a saliency map; MPPA enhances the matching accuracy
of warped tokens that contain invalid pixels. Third, we introduce a
Spatial-Temporal weighting Adaptor (STA) module to obtain accurate attention to
spatial-temporal tokens under the guidance of the Deformation Factor learned
from DePtH, especially for videos with agile motion. Experimental results
demonstrate that our method outperforms recent methods qualitatively and
quantitatively and achieves a new state of the art.
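To make the second and third contributions concrete, below is a minimal PyTorch sketch of mask-pruning-based patch attention with an optional spatial-temporal weight. All names, shapes, the validity-ratio stand-in for the saliency map, and the thresholding rule are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def mask_pruned_attention(q, k, v, valid_ratio, st_weight=None, thresh=0.5):
    """Illustrative mask-pruning-based patch attention.

    q, k, v:     (B, N, D) query/key/value patch tokens.
    valid_ratio: (B, N) fraction of valid (non-hole) pixels in each key
                 patch; a crude stand-in for MPPA's saliency map.
    st_weight:   optional (B, N, N) positive spatial-temporal weights,
                 e.g. derived from a learned deformation factor as in STA.
    Keys dominated by invalid pixels are pruned so that warped tokens
    with holes cannot corrupt the feature matching. Assumes at least
    one key per sample survives pruning.
    """
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5    # (B, N, N)
    prune = (valid_ratio < thresh).unsqueeze(1)              # (B, 1, N)
    scores = scores.masked_fill(prune, float("-inf"))
    if st_weight is not None:
        scores = scores + st_weight.log()                    # bias the softmax
    return F.softmax(scores, dim=-1) @ v

# toy usage: 196 patch tokens of width 256, random validity per patch
q = k = v = torch.randn(2, 196, 256)
out = mask_pruned_attention(q, k, v, valid_ratio=torch.rand(2, 196))
```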
Related papers
- MS-Former: Memory-Supported Transformer for Weakly Supervised Change
Detection with Patch-Level Annotations [50.79913333804232]
We propose a memory-supported transformer (MS-Former) for weakly supervised change detection.
MS-Former consists of a bi-directional attention block (BAB) and a patch-level supervision scheme (PSS).
Experimental results on three benchmark datasets demonstrate the effectiveness of our proposed method in the change detection task.
arXiv Detail & Related papers (2023-11-16T09:57:29Z)
- UIA-ViT: Unsupervised Inconsistency-Aware Method based on Vision Transformer for Face Forgery Detection [52.91782218300844]
We propose a novel Unsupervised Inconsistency-Aware method based on Vision Transformer, called UIA-ViT.
Due to the self-attention mechanism, the attention map among patch embeddings naturally represents the consistency relation, making the Vision Transformer well suited to consistency representation learning.
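As a toy illustration of that observation (simplified: raw embeddings in place of learned query/key projections; not the authors' code):

```python
import torch

def consistency_map(patch_embed):
    # patch_embed: (N, D) patch embeddings from a ViT block.
    # Entry (i, j) of the returned (N, N) map scores how consistent
    # patch i is with patch j, read off a softmax self-attention.
    scores = patch_embed @ patch_embed.T / patch_embed.shape[-1] ** 0.5
    return scores.softmax(dim=-1)
```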
arXiv Detail & Related papers (2022-10-23T15:24:47Z)
- Behind Every Domain There is a Shift: Adapting Distortion-aware Vision Transformers for Panoramic Semantic Segmentation [73.48323921632506]
We address panoramic semantic segmentation, which is under-explored due to two critical challenges.
First, we propose an upgraded Transformer for Panoramic Semantic Segmentation, i.e., Trans4PASS+, equipped with Deformable Patch Embedding (DPE) and Deformable MLP (DMLPv2) modules.
Second, we enhance the Mutual Prototypical Adaptation (MPA) strategy via pseudo-label rectification for unsupervised domain adaptive panoramic segmentation.
Third, aside from Pinhole-to-Panoramic (Pin2Pan) adaptation, we create a new dataset (SynPASS) with 9,080 panoramic images.
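A minimal sketch in the spirit of deformable patch embedding: a small conv head predicts per-pixel offsets, the input is resampled at the shifted positions, and the result is patchified. Shapes, the tanh scaling, and module names are assumptions rather than the Trans4PASS+ code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformablePatchEmbed(nn.Module):
    """Toy deformable patch embedding: predict small per-pixel offsets,
    warp the input with grid_sample, then patchify with a strided conv."""
    def __init__(self, in_ch=3, embed_dim=96, patch=4):
        super().__init__()
        self.offset = nn.Conv2d(in_ch, 2, kernel_size=3, padding=1)
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size=patch, stride=patch)

    def forward(self, x):                              # x: (B, C, H, W)
        B, _, H, W = x.shape
        off = torch.tanh(self.offset(x)) * 0.1         # small normalized shifts
        gy, gx = torch.meshgrid(
            torch.linspace(-1, 1, H, device=x.device),
            torch.linspace(-1, 1, W, device=x.device),
            indexing="ij")
        grid = torch.stack((gx, gy), dim=-1).unsqueeze(0).expand(B, -1, -1, -1)
        grid = grid + off.permute(0, 2, 3, 1)          # shift sampling positions
        x = F.grid_sample(x, grid, align_corners=True) # deformed input
        x = self.proj(x)                               # (B, D, H/p, W/p)
        return x.flatten(2).transpose(1, 2)            # (B, N, D) tokens
```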
arXiv Detail & Related papers (2022-07-25T00:42:38Z)
- Modeling Image Composition for Complex Scene Generation [77.10533862854706]
We present a method that achieves state-of-the-art results on layout-to-image generation tasks.
After compressing RGB images into patch tokens, we propose the Transformer with Focal Attention (TwFA) to explore object-to-object, object-to-patch, and patch-to-patch dependencies.
arXiv Detail & Related papers (2022-06-02T08:34:25Z)
- MAT: Mask-Aware Transformer for Large Hole Image Inpainting [79.67039090195527]
We present a novel model for large hole inpainting, which unifies the merits of transformers and convolutions.
Experiments demonstrate the state-of-the-art performance of the new model on multiple benchmark datasets.
arXiv Detail & Related papers (2022-03-29T06:36:17Z)
- ViTransPAD: Video Transformer using convolution and self-attention for Face Presentation Attack Detection [15.70621878093133]
Face Presentation Attack Detection (PAD) is an important measure to prevent spoof attacks for face biometric systems.
Many works based on Convolutional Neural Networks (CNNs) formulate face PAD as an image-level binary classification task without considering the context.
We propose a Video-based Transformer for face PAD (ViTransPAD) with short/long-range spatio-temporal attention, which not only focuses on local details with short attention within a frame but also captures long-range dependencies over frames.
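A crude two-stage stand-in for that short/long-range attention pattern (toy shapes; not the ViTransPAD implementation):

```python
import torch

def short_long_attention(x):
    # x: (B, T, N, D) -- T frames, N patch tokens per frame.
    B, T, N, D = x.shape
    scale = D ** -0.5
    # short-range: attention among the N patches inside each frame
    a = (x @ x.transpose(-2, -1)) * scale              # (B, T, N, N)
    x = a.softmax(-1) @ x
    # long-range: attention across the T frames at each spatial location
    xt = x.permute(0, 2, 1, 3)                         # (B, N, T, D)
    a = (xt @ xt.transpose(-2, -1)) * scale            # (B, N, T, T)
    xt = a.softmax(-1) @ xt
    return xt.permute(0, 2, 1, 3)                      # (B, T, N, D)
```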
arXiv Detail & Related papers (2022-03-03T08:23:20Z)
- Short Range Correlation Transformer for Occluded Person Re-Identification [4.339510167603376]
We propose a partial feature transformer-based person re-identification framework named PFT.
The proposed PFT utilizes three modules to enhance the efficiency of the vision transformer.
Experimental results over occluded and holistic re-identification datasets demonstrate that the proposed PFT network achieves superior performance consistently.
arXiv Detail & Related papers (2022-01-04T11:12:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.