Flow-Guided Transformer for Video Inpainting
- URL: http://arxiv.org/abs/2208.06768v1
- Date: Sun, 14 Aug 2022 03:10:01 GMT
- Title: Flow-Guided Transformer for Video Inpainting
- Authors: Kaidong Zhang, Jingjing Fu, Dong Liu
- Abstract summary: We propose a flow-guided transformer, which leverages the motion discrepancy exposed by optical flows to guide attention retrieval in a transformer for high-fidelity video inpainting.
With the completed flows, we propagate the content across video frames and adopt the flow-guided transformer to synthesize the remaining corrupted regions.
We decouple the transformers along the temporal and spatial dimensions, so that we can easily integrate the locally relevant completed flows to guide spatial attention only.
- Score: 10.31469470212101
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a flow-guided transformer, which leverages the motion
discrepancy exposed by optical flows to guide attention retrieval in a
transformer for high-fidelity video inpainting. More specifically, we design a
novel flow completion network to complete the corrupted flows by exploiting the
relevant flow features in a local temporal window. With the completed flows, we
propagate the content across video frames and adopt the flow-guided
transformer to synthesize the remaining corrupted regions. We decouple the
transformers along the temporal and spatial dimensions, so that we can easily
integrate the locally relevant completed flows to guide spatial attention only.
Furthermore, we design a flow-reweight module to precisely control the impact
of the completed flows on each spatial transformer. For the sake of efficiency, we
introduce a window partition strategy to both the spatial and temporal transformers.
In particular, in the spatial transformer we design a dual-perspective spatial MHSA,
which integrates global tokens into the window-based attention. Extensive
experiments demonstrate the effectiveness of the proposed method both qualitatively
and quantitatively. Code is available at https://github.com/hitachinsk/FGT.
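
As a reading aid, below is a minimal, hypothetical PyTorch sketch of two ideas highlighted in the abstract: a window-partitioned spatial attention whose window tokens also attend to a small set of pooled global tokens (the "dual perspective" MHSA), and a sigmoid gate computed from completed-flow features that stands in for the flow-reweight idea. All class, function, and parameter names here are illustrative assumptions, not the authors' implementation; the official code at https://github.com/hitachinsk/FGT contains the actual architecture.

```python
# Hypothetical sketch only: single-head, no temporal transformer, no flow completion.
import torch
import torch.nn as nn
import torch.nn.functional as F


def window_partition(x, win):
    """Split a (B, H, W, C) feature map into non-overlapping (win x win) windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // win, win, W // win, win, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, win * win, C)  # (B*nW, win*win, C)


class DualPerspectiveFlowGuidedMHSA(nn.Module):
    """Window attention whose tokens also attend to pooled global tokens,
    with completed-flow features gating how strongly flow guidance is injected
    (a stand-in for the paper's flow-reweight module; single-head for brevity)."""

    def __init__(self, dim, win=8, global_tokens=16):
        super().__init__()
        self.win, self.global_tokens = win, global_tokens
        self.qkv = nn.Linear(dim, 3 * dim)
        self.flow_proj = nn.Linear(2, dim)      # lift 2-channel flow to feature dim
        self.flow_gate = nn.Linear(dim, dim)    # gate in (0, 1), assumption
        self.out = nn.Linear(dim, dim)

    def forward(self, feat, flow):
        # feat: (B, H, W, C) frame features, flow: (B, H, W, 2) completed optical flow
        B, H, W, C = feat.shape
        gate = torch.sigmoid(self.flow_gate(self.flow_proj(flow)))  # per-pixel gate
        guided = feat + feat * gate                                 # softly inject flow guidance

        # local perspective: tokens inside each spatial window
        local = window_partition(guided, self.win)                  # (B*nW, N, C)

        # global perspective: coarse pooled tokens shared by all windows
        pooled = F.adaptive_avg_pool2d(guided.permute(0, 3, 1, 2),
                                       int(self.global_tokens ** 0.5))
        glob = pooled.flatten(2).transpose(1, 2)                    # (B, G, C)
        nW = local.shape[0] // B
        glob = glob.repeat_interleave(nW, dim=0)                    # (B*nW, G, C)

        tokens = torch.cat([local, glob], dim=1)                    # dual perspective
        q, k, v = self.qkv(tokens).chunk(3, dim=-1)
        q = q[:, : local.shape[1]]                                  # only window tokens query
        attn = torch.softmax(q @ k.transpose(-2, -1) / C ** 0.5, dim=-1)
        out = self.out(attn @ v)                                    # (B*nW, N, C)

        # reverse the window partition back to (B, H, W, C)
        out = out.view(B, H // self.win, W // self.win, self.win, self.win, C)
        return out.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)


if __name__ == "__main__":
    feat = torch.randn(1, 32, 32, 64)
    flow = torch.randn(1, 32, 32, 2)
    print(DualPerspectiveFlowGuidedMHSA(64, win=8)(feat, flow).shape)  # (1, 32, 32, 64)
```

The sketch is only meant to make the "global tokens + window-based attention" and flow-gating descriptions above concrete; it omits the temporal transformer, the flow completion network, and the content propagation step described in the abstract.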
Related papers
- Video Motion Transfer with Diffusion Transformers [82.4796313201512]
We propose DiTFlow, a method for transferring the motion of a reference video to a newly synthesized one.
We first process the reference video with a pre-trained DiT to analyze cross-frame attention maps and extract a patch-wise motion signal.
We apply our strategy to transformer positional embeddings, granting us a boost in zero-shot motion transfer capabilities.
arXiv Detail & Related papers (2024-12-10T18:59:58Z)
- A Hybrid Transformer-Mamba Network for Single Image Deraining [70.64069487982916]
Existing deraining Transformers employ self-attention mechanisms with fixed-range windows or along channel dimensions.
We introduce a novel dual-branch hybrid Transformer-Mamba network, denoted as TransMamba, aimed at effectively capturing long-range rain-related dependencies.
arXiv Detail & Related papers (2024-08-31T10:03:19Z)
- WcDT: World-centric Diffusion Transformer for Traffic Scene Generation [13.616763172038846]
We introduce a novel approach for autonomous driving trajectory generation by harnessing the complementary strengths of diffusion probabilistic models and transformers.
Our proposed framework, termed the "World-Centric Diffusion Transformer" (WcDT), optimizes the entire trajectory generation process.
Our results show that the proposed approach exhibits superior performance in generating both realistic and diverse trajectories.
arXiv Detail & Related papers (2024-04-02T16:28:41Z)
- Motion-Aware Video Frame Interpolation [49.49668436390514]
We introduce a Motion-Aware Video Frame Interpolation (MA-VFI) network, which directly estimates intermediate optical flow from consecutive frames.
It not only extracts global semantic relationships and spatial details from input frames with different receptive fields, but also effectively reduces the required computational cost and complexity.
arXiv Detail & Related papers (2024-02-05T11:00:14Z)
- Dual Aggregation Transformer for Image Super-Resolution [92.41781921611646]
We propose a novel Transformer model, the Dual Aggregation Transformer (DAT), for image SR.
Our DAT aggregates features across spatial and channel dimensions, in an inter-block and intra-block dual manner.
Our experiments show that our DAT surpasses current methods.
arXiv Detail & Related papers (2023-08-07T07:39:39Z)
- RFR-WWANet: Weighted Window Attention-Based Recovery Feature Resolution Network for Unsupervised Image Registration [7.446209993071451]
The Swin transformer has attracted attention in medical image analysis due to its computational efficiency and long-range modeling capability.
Transformer-based registration models combine multiple voxels into a single semantic token.
This merging restricts the transformers to modeling and generating only coarse-grained spatial information.
We propose Recovery Feature Resolution Network (RFRNet), which allows the transformer to contribute fine-grained spatial information.
arXiv Detail & Related papers (2023-05-07T09:57:29Z)
- Dual-path Adaptation from Image to Video Transformers [62.056751480114784]
We efficiently transfer the superior representation power of vision foundation models, such as ViT and Swin, to video understanding with only a few trainable parameters.
We propose a novel DualPath adaptation separated into spatial and temporal adaptation paths, where a lightweight bottleneck adapter is employed in each transformer block.
arXiv Detail & Related papers (2023-03-17T09:37:07Z)
- Exploiting Optical Flow Guidance for Transformer-Based Video Inpainting [11.837764007052813]
We propose the flow-guided transformer (FGT) to pursue more effective and efficient video inpainting.
FGT++ is experimentally shown to outperform existing video inpainting networks.
arXiv Detail & Related papers (2023-01-24T14:44:44Z)
- Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer [63.99222215387881]
We propose Evo-ViT, a self-motivated slow-fast token evolution method for vision transformers.
Our method can significantly reduce the computational costs of vision transformers while maintaining comparable performance on image classification.
arXiv Detail & Related papers (2021-08-03T09:56:07Z)
- Augmented Shortcuts for Vision Transformers [49.70151144700589]
We study the relationship between shortcuts and feature diversity in vision transformer models.
We present an augmented shortcut scheme, which inserts additional paths with learnable parameters in parallel on the original shortcuts.
Experiments conducted on benchmark datasets demonstrate the effectiveness of the proposed method.
arXiv Detail & Related papers (2021-06-30T09:48:30Z)
- MODETR: Moving Object Detection with Transformers [2.4366811507669124]
Moving Object Detection (MOD) is a crucial task for the Autonomous Driving pipeline.
In this paper, we tackle this problem through multi-head attention mechanisms, both across the spatial and motion streams.
We propose MODETR, a Moving Object DEtection TRansformer network composed of multi-stream transformers for both spatial and motion modalities.
arXiv Detail & Related papers (2021-06-21T21:56:46Z)
This list is automatically generated from the titles and abstracts of the papers on this site.