Diverse Image Inpainting with Bidirectional and Autoregressive
Transformers
- URL: http://arxiv.org/abs/2104.12335v1
- Date: Mon, 26 Apr 2021 03:52:27 GMT
- Title: Diverse Image Inpainting with Bidirectional and Autoregressive
Transformers
- Authors: Yingchen Yu, Fangneng Zhan, Rongliang Wu, Jianxiong Pan, Kaiwen Cui,
Shijian Lu, Feiying Ma, Xuansong Xie, Chunyan Miao
- Abstract summary: We propose BAT-Fill, an image inpainting framework with a novel bidirectional autoregressive transformer (BAT).
BAT-Fill inherits the merits of transformers and CNNs in a two-stage manner, which allows it to generate high-resolution contents without being constrained by the quadratic complexity of attention in transformers.
- Score: 55.21000775547243
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Image inpainting is an underdetermined inverse problem, so it naturally
admits diverse contents that fill the missing or corrupted regions reasonably and
realistically. Prevalent approaches using convolutional neural networks (CNNs)
can synthesize visually pleasant contents, but CNNs suffer from limited
receptive fields for capturing global features. With image-level attention,
transformers can model long-range dependencies and generate diverse contents
by autoregressively modeling pixel-sequence distributions. However,
the unidirectional attention in transformers is suboptimal, as corrupted regions
can have arbitrary shapes with contexts from arbitrary directions. We propose
BAT-Fill, an image inpainting framework with a novel bidirectional
autoregressive transformer (BAT) that models deep bidirectional contexts for
autoregressive generation of diverse inpainting contents. BAT-Fill inherits the
merits of transformers and CNNs in a two-stage manner, which allows it to generate
high-resolution contents without being constrained by the quadratic complexity
of attention in transformers. Specifically, it first generates pluralistic
image structures at low resolution by adapting transformers, and then
synthesizes realistic high-resolution texture details with a CNN-based
up-sampling network. Extensive experiments over multiple datasets show that
BAT-Fill achieves superior diversity and fidelity in image inpainting, both
qualitatively and quantitatively.
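The sketch below illustrates the core idea of the first stage as the abstract describes it: a transformer with full (bidirectional) attention predicts masked tokens, and masked positions are then sampled one at a time so each new token conditions on all previously generated ones. This is a minimal toy sketch, not the published implementation; the model class, vocabulary size, and raster sampling order are illustrative assumptions, and the real BAT operates on discrete tokens of low-resolution structure maps.

```python
# Toy sketch of bidirectional-attention autoregressive sampling for
# masked-token inpainting, in the spirit of BAT-Fill's first stage.
# ToyBAT, VOCAB, MASK_ID, and the raster sampling order are assumptions.
import torch
import torch.nn as nn

VOCAB, MASK_ID, SEQ_LEN, DIM = 512, 512, 64, 256   # toy sizes, not the paper's

class ToyBAT(nn.Module):
    def __init__(self):
        super().__init__()
        self.tok = nn.Embedding(VOCAB + 1, DIM)            # +1 slot for [MASK]
        self.pos = nn.Parameter(torch.zeros(SEQ_LEN, DIM))
        layer = nn.TransformerEncoderLayer(DIM, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)  # no causal mask
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, tokens):                 # tokens: (B, SEQ_LEN) int64
        h = self.tok(tokens) + self.pos        # bidirectional attention: every
        return self.head(self.encoder(h))     # position attends to all others

@torch.no_grad()
def fill(model, tokens, temperature=1.0):
    """Sample masked positions one at a time, re-encoding after each step so
    every new token conditions on full bidirectional context."""
    tokens = tokens.clone()
    for pos in (tokens[0] == MASK_ID).nonzero().flatten().tolist():
        logits = model(tokens)[0, pos] / temperature
        tokens[0, pos] = torch.multinomial(logits.softmax(-1), 1).item()
    return tokens

model = ToyBAT().eval()
seq = torch.randint(0, VOCAB, (1, SEQ_LEN))
seq[0, 20:40] = MASK_ID                        # the "corrupted" region
print(fill(model, seq).shape)                  # torch.Size([1, 64])
```

Per the abstract, the second stage would then upsample the completed low-resolution structure with a CNN, which keeps the quadratic attention cost confined to the short low-resolution token sequence.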
Related papers
- FiTv2: Scalable and Improved Flexible Vision Transformer for Diffusion Model [76.84519526283083]
We present the Flexible Vision Transformer (FiT), a transformer architecture specifically designed for generating images with unrestricted resolutions and aspect ratios.
FiTv2 exhibits $2\times$ the convergence speed of FiT when incorporating advanced training-free extrapolation techniques.
Comprehensive experiments demonstrate the exceptional performance of FiTv2 across a broad range of resolutions.
arXiv Detail & Related papers (2024-10-17T15:51:49Z)
- SwinStyleformer is a favorable choice for image inversion [2.8115030277940947]
This paper proposes the first pure-Transformer inversion network, called SwinStyleformer.
Initial experiments found that an inversion network with a vanilla Transformer backbone could not successfully invert images, which motivates the Swin-based design.
arXiv Detail & Related papers (2024-06-19T02:08:45Z)
- FiT: Flexible Vision Transformer for Diffusion Model [81.85667773832279]
We present a transformer architecture specifically designed for generating images with unrestricted resolutions and aspect ratios.
Unlike traditional methods that perceive images as static-resolution grids, FiT conceptualizes images as sequences of dynamically-sized tokens.
Comprehensive experiments demonstrate the exceptional performance of FiT across a broad range of resolutions.
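A small sketch of the "sequence of dynamically-sized tokens" idea: the same patchify routine yields different sequence lengths for different resolutions and aspect ratios, so no resizing to a fixed grid is needed. The patch size and flattening scheme are assumptions for illustration, not FiT's exact design.

```python
# Illustrative sketch of treating an image of arbitrary resolution as a
# variable-length sequence of patch tokens, as the FiT summary describes.
import torch

def patchify(img: torch.Tensor, patch: int = 16):
    """Split a (C, H, W) image into an (N, C*patch*patch) token sequence,
    where N = (H // patch) * (W // patch) varies with the input resolution."""
    c, h, w = img.shape
    assert h % patch == 0 and w % patch == 0, "pad to a multiple of the patch size"
    return (img.view(c, h // patch, patch, w // patch, patch)
               .permute(1, 3, 0, 2, 4)
               .reshape(-1, c * patch * patch))

# Different aspect ratios yield different sequence lengths; a transformer
# with suitable positional encoding can consume both without resizing.
print(patchify(torch.randn(3, 256, 256)).shape)  # torch.Size([256, 768])
print(patchify(torch.randn(3, 192, 320)).shape)  # torch.Size([240, 768])
```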
arXiv Detail & Related papers (2024-02-19T18:59:07Z)
- Optimizing Vision Transformers for Medical Image Segmentation and Few-Shot Domain Adaptation [11.690799827071606]
We propose Convolutional Swin-Unet (CS-Unet) transformer blocks and optimise their settings with respect to patch embedding, projection, the feed-forward network, up-sampling, and skip connections.
CS-Unet can be trained from scratch and inherits the superiority of convolutions in each feature-processing phase.
Experiments show that CS-Unet without pre-training surpasses other state-of-the-art counterparts by large margins on two medical CT and MRI datasets with fewer parameters.
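As a rough illustration of folding convolutions into a transformer block in the spirit of CS-Unet's convolutional projection, the hypothetical block below mixes local context with a depthwise convolution before attention; the actual CS-Unet settings (patch embedding, feed-forward, up-sampling, skips) are not reproduced here.

```python
# Minimal sketch of a convolution-enhanced attention block. The depthwise-conv
# projection follows the general convolutional-attention idea; dimensions and
# head counts are illustrative assumptions.
import torch
import torch.nn as nn

class ConvAttention(nn.Module):
    def __init__(self, dim=96, heads=3):
        super().__init__()
        # Depthwise conv mixes local context before the linear QKV projection.
        self.dw = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)
        self.qkv = nn.Linear(dim, dim * 3)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.out = nn.Linear(dim, dim)

    def forward(self, x):                              # x: (B, dim, H, W)
        b, c, h, w = x.shape
        t = self.dw(x).flatten(2).transpose(1, 2)      # (B, H*W, dim)
        q, k, v = self.qkv(t).chunk(3, dim=-1)
        y, _ = self.attn(q, k, v, need_weights=False)
        return self.out(y).transpose(1, 2).view(b, c, h, w)

blk = ConvAttention()
print(blk(torch.randn(1, 96, 14, 14)).shape)           # torch.Size([1, 96, 14, 14])
```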
arXiv Detail & Related papers (2022-10-14T19:18:52Z)
- CSformer: Bridging Convolution and Transformer for Compressive Sensing [65.22377493627687]
This paper proposes a hybrid framework that integrates the detailed spatial information captured by CNNs with the global context provided by transformers for enhanced representation learning.
The proposed approach is an end-to-end compressive image sensing method, composed of adaptive sampling and recovery.
The experimental results demonstrate the effectiveness of the dedicated transformer-based architecture for compressive sensing.
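A toy end-to-end sampling-and-recovery pipeline of the kind the summary describes: a learned block-wise sampling operator followed by an initial reconstruction. The block size, sampling ratio, and the plain linear recovery are assumptions; CSformer's actual recovery couples a CNN with a transformer.

```python
# Toy compressive-sensing pipeline: learned sampling plus initial recovery.
# ToyCS and its hyperparameters are illustrative, not CSformer's design.
import torch
import torch.nn as nn

class ToyCS(nn.Module):
    def __init__(self, block=32, ratio=0.1):
        super().__init__()
        m = int(block * block * ratio)                 # measurements per block
        # Sampling as a stride-`block` convolution = one linear measurement
        # matrix applied to each non-overlapping image block.
        self.sample = nn.Conv2d(1, m, block, stride=block, bias=False)
        self.recover = nn.ConvTranspose2d(m, 1, block, stride=block, bias=False)

    def forward(self, x):
        y = self.sample(x)          # (B, m, H/block, W/block): compressed domain
        return self.recover(y)      # initial reconstruction at full resolution

model = ToyCS()
print(model(torch.randn(1, 1, 256, 256)).shape)        # torch.Size([1, 1, 256, 256])
```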
arXiv Detail & Related papers (2021-12-31T04:37:11Z)
- Restormer: Efficient Transformer for High-Resolution Image Restoration [118.9617735769827]
Convolutional neural networks (CNNs) perform well at learning generalizable image priors from large-scale data.
Transformers have shown significant performance gains on natural language and high-level vision tasks.
Our model, named Restoration Transformer (Restormer), achieves state-of-the-art results on several image restoration tasks.
arXiv Detail & Related papers (2021-11-18T18:59:10Z)
- Intriguing Properties of Vision Transformers [114.28522466830374]
Vision transformers (ViT) have demonstrated impressive performance across various machine vision problems.
We systematically study this question via an extensive set of experiments and comparisons with a high-performing convolutional neural network (CNN).
We show that the effective features of ViTs are due to the flexible and dynamic receptive fields made possible by the self-attention mechanism.
arXiv Detail & Related papers (2021-05-21T17:59:18Z)
- Incorporating Convolution Designs into Visual Transformers [24.562955955312187]
We propose a new Convolution-enhanced image Transformer (CeiT), which combines the advantages of CNNs in extracting low-level features and strengthening locality with the advantages of Transformers in establishing long-range dependencies.
Experimental results on ImageNet and seven downstream tasks show the effectiveness and generalization ability of CeiT compared with previous Transformers and state-of-the-art CNNs, without requiring a large amount of training data and extra CNN teachers.
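A hedged sketch of convolution-enhanced tokenization in the spirit of CeiT: extract low-level features with a small convolutional stem before forming patch tokens, rather than slicing raw pixels directly. Channel counts and kernel sizes are illustrative choices, not CeiT's published configuration.

```python
# Sketch of a convolutional stem feeding a transformer tokenizer. The stem
# layout and dimensions are assumptions for illustration.
import torch
import torch.nn as nn

class ConvStemTokenizer(nn.Module):
    def __init__(self, dim=192, patch=4):
        super().__init__()
        self.stem = nn.Sequential(                 # cheap low-level conv features
            nn.Conv2d(3, 32, 3, stride=2, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
        )
        # Patch embedding on the feature map rather than on raw pixels.
        self.proj = nn.Conv2d(32, dim, patch, stride=patch)

    def forward(self, x):                          # x: (B, 3, H, W)
        feats = self.stem(x)                       # (B, 32, H/2, W/2)
        tokens = self.proj(feats)                  # (B, dim, H/8, W/8)
        return tokens.flatten(2).transpose(1, 2)   # (B, N, dim) for a transformer

tok = ConvStemTokenizer()
print(tok(torch.randn(2, 3, 224, 224)).shape)      # torch.Size([2, 784, 192])
```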
arXiv Detail & Related papers (2021-03-22T13:16:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.