FuseFormer: Fusing Fine-Grained Information in Transformers for Video
Inpainting
- URL: http://arxiv.org/abs/2109.02974v1
- Date: Tue, 7 Sep 2021 10:13:29 GMT
- Title: FuseFormer: Fusing Fine-Grained Information in Transformers for Video
Inpainting
- Authors: Rui Liu, Hanming Deng, Yangyi Huang, Xiaoyu Shi, Lewei Lu, Wenxiu Sun,
Xiaogang Wang, Jifeng Dai, Hongsheng Li
- Abstract summary: We propose FuseFormer, a Transformer model designed for video inpainting via fine-grained feature fusion.
We insert the soft composition and soft split into the feed-forward network, enabling the 1D linear layers to model 2D structure.
In both quantitative and qualitative evaluations, our proposed FuseFormer surpasses state-of-the-art methods.
- Score: 77.8621673355983
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer, as a strong and flexible architecture for modelling long-range
relations, has been widely explored in vision tasks. However, when applied to
video inpainting, which requires fine-grained representation, existing methods
still suffer from blurry edges in detailed regions due to hard patch splitting.
Here we tackle this problem by proposing FuseFormer, a Transformer model
designed for video inpainting via fine-grained feature fusion based on novel
Soft Split and Soft Composition operations. Soft split divides the feature map
into many patches with a given overlapping interval. Conversely, soft
composition stitches the patches back into a whole feature map, where pixels in
overlapping regions are summed up. These two modules are first used for
tokenization before the Transformer layers and de-tokenization after them,
providing an effective mapping between tokens and features. This enables
sub-patch-level information interaction and more effective feature propagation
between neighboring patches, resulting in vivid synthesized content for hole
regions in videos. Moreover, in FuseFormer we also insert soft composition and
soft split into the feed-forward network, enabling its 1D linear layers to model
2D structure and further enhancing sub-patch-level feature fusion. In both
quantitative and qualitative evaluations, our proposed FuseFormer surpasses
state-of-the-art methods. We also conduct detailed analyses to examine its
superiority.
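The soft split / soft composition pair described above can be pictured as overlapping patch extraction followed by overlap-add reconstruction. Below is a minimal PyTorch-style sketch assuming unfold/fold semantics; the class names, kernel size, stride, padding, hidden width, and the channel-space round trip inside the feed-forward block are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SoftSplit(nn.Module):
    """Soft split: extract overlapping patches from a 2D feature map and embed
    them as tokens. kernel_size/stride/padding are illustrative values."""

    def __init__(self, channels, hidden, kernel_size=7, stride=3, padding=3):
        super().__init__()
        self.kernel_size, self.stride, self.padding = kernel_size, stride, padding
        self.embed = nn.Linear(channels * kernel_size * kernel_size, hidden)

    def forward(self, feat):                               # feat: (B, C, H, W)
        patches = F.unfold(feat, self.kernel_size,
                           stride=self.stride, padding=self.padding)
        return self.embed(patches.transpose(1, 2))         # (B, L, hidden) tokens


class SoftComp(nn.Module):
    """Soft composition: project tokens back to patches and stitch them into a
    feature map; pixels in overlapping regions are summed (overlap-add)."""

    def __init__(self, channels, hidden, kernel_size=7, stride=3, padding=3):
        super().__init__()
        self.kernel_size, self.stride, self.padding = kernel_size, stride, padding
        self.proj = nn.Linear(hidden, channels * kernel_size * kernel_size)

    def forward(self, tokens, output_size):                # tokens: (B, L, hidden)
        patches = self.proj(tokens).transpose(1, 2)        # (B, C*k*k, L)
        return F.fold(patches, output_size, self.kernel_size,
                      stride=self.stride, padding=self.padding)


class FusionFeedForward(nn.Module):
    """Feed-forward block with soft composition / soft split placed between the
    two linear layers, so the 1D projections also aggregate 2D neighbourhood
    context. Hidden width, activation, and channel round trip are assumptions."""

    def __init__(self, hidden, channels, output_size):
        super().__init__()
        self.fc1 = nn.Linear(hidden, hidden * 4)
        self.fc2 = nn.Linear(hidden * 4, hidden)
        self.sc = SoftComp(channels, hidden * 4)
        self.ss = SoftSplit(channels, hidden * 4)
        self.output_size = output_size                     # (H, W) of the feature map

    def forward(self, tokens):                             # tokens: (B, L, hidden)
        x = F.gelu(self.fc1(tokens))
        feat = self.sc(x, self.output_size)                # tokens -> overlapping 2D map
        x = self.ss(feat)                                  # 2D map -> tokens again
        return self.fc2(x)


# Example shapes (hypothetical sizes): a 60x108 feature map with stride-3 7x7
# patches yields 20*36 = 720 overlapping tokens.
tokens = SoftSplit(channels=64, hidden=512)(torch.randn(1, 64, 60, 108))
out = FusionFeedForward(hidden=512, channels=64, output_size=(60, 108))(tokens)
print(tokens.shape, out.shape)  # (1, 720, 512) (1, 720, 512)
```

Because overlapping pixels are summed, each location in the recomposed map aggregates contributions from several neighbouring patches, which is the sub-patch-level fusion the abstract refers to; a practical implementation would also normalize by (or learn around) the per-pixel overlap count.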
Related papers
- Dynamic Texture Transfer using PatchMatch and Transformers [18.54386654063111]
We propose to handle the task of dynamic texture transfer via a simple yet effective model that utilizes both PatchMatch and Transformers.
The key idea is to decompose the task of dynamic texture transfer into two stages: in the first stage, the start frame of the target video with the desired dynamic texture is synthesized.
In the second stage, the synthesized image is decomposed into structure-agnostic patches, from which their corresponding subsequent patches can be predicted.
arXiv Detail & Related papers (2024-02-01T13:58:32Z) - Adaptive Human Matting for Dynamic Videos [62.026375402656754]
Adaptive Matting for Dynamic Videos, termed AdaM, is a framework for simultaneously differentiating foregrounds from backgrounds.
Two interconnected network designs are employed to achieve this goal.
We benchmark and study our methods on recently introduced datasets, showing that our matting achieves new best-in-class generalizability.
arXiv Detail & Related papers (2023-04-12T17:55:59Z) - Xformer: Hybrid X-Shaped Transformer for Image Denoising [114.37510775636811]
We present a hybrid X-shaped vision Transformer, named Xformer, which performs notably well on image denoising tasks.
Xformer achieves state-of-the-art performance on the synthetic and real-world image denoising tasks.
arXiv Detail & Related papers (2023-03-11T16:32:09Z) - TTVFI: Learning Trajectory-Aware Transformer for Video Frame
Interpolation [50.49396123016185]
Video frame interpolation (VFI) aims to synthesize an intermediate frame between two consecutive frames.
We propose a novel Trajectory-aware Transformer for Video Frame Interpolation (TTVFI).
Our method outperforms other state-of-the-art methods in four widely-used VFI benchmarks.
arXiv Detail & Related papers (2022-07-19T03:37:49Z) - Multi-feature Co-learning for Image Inpainting [2.4571440831539824]
In this paper, we design a deep multi-feature co-learning network for image inpainting.
To be specific, we first use two branches to learn structure features and texture features separately.
The proposed SDFF module integrates structure features into texture features, and meanwhile uses texture features as an auxiliary in generating structure features.
arXiv Detail & Related papers (2022-05-21T12:15:26Z) - Dual-Level Collaborative Transformer for Image Captioning [126.59298716978577]
We introduce a novel Dual-Level Collaborative Transformer (DLCT) network to realize the complementary advantages of the two features.
In addition, we propose a Locality-Constrained Cross Attention module to address the semantic noises caused by the direct fusion of these two features.
arXiv Detail & Related papers (2021-01-16T15:43:17Z) - Texture Transform Attention for Realistic Image Inpainting [6.275013056564918]
We propose a Texture Transform Attention network that better produces the missing region inpainting with fine details.
Texture Transform Attention is used to create a new reassembled texture map using fine textures and coarse semantics.
We evaluate our model end-to-end with the publicly available datasets CelebA-HQ and Places2.
arXiv Detail & Related papers (2020-12-08T06:28:51Z) - Region-adaptive Texture Enhancement for Detailed Person Image Synthesis [86.69934638569815]
RATE-Net is a novel framework for synthesizing person images with sharp texture details.
The proposed framework leverages an additional texture enhancing module to extract appearance information from the source image.
Experiments conducted on the DeepFashion benchmark dataset demonstrate the superiority of our framework compared with existing networks.
arXiv Detail & Related papers (2020-05-26T02:33:21Z)