HINT: High-quality INPainting Transformer with Mask-Aware Encoding and
Enhanced Attention
- URL: http://arxiv.org/abs/2402.14185v1
- Date: Thu, 22 Feb 2024 00:14:26 GMT
- Title: HINT: High-quality INPainting Transformer with Mask-Aware Encoding and
Enhanced Attention
- Authors: Shuang Chen, Amir Atapour-Abarghouei, Hubert P. H. Shum
- Abstract summary: Existing image inpainting methods leverage convolution-based downsampling approaches to reduce spatial dimensions.
We propose an end-to-end High-quality INpainting Transformer, abbreviated as HINT, which consists of a novel mask-aware pixel-shuffle downsampling module.
We demonstrate the superior performance of HINT compared to contemporary state-of-the-art models on four datasets.
- Score: 14.055584700641212
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing image inpainting methods leverage convolution-based downsampling
approaches to reduce spatial dimensions. This may result in information loss
from corrupted images where the available information is inherently sparse,
especially for the scenario of large missing regions. Recent advances in
self-attention mechanisms within transformers have led to significant
improvements in many computer vision tasks including inpainting. However,
limited by computational costs, existing methods cannot fully exploit the
long-range modelling capabilities of such models. In this paper, we
propose an end-to-end High-quality INpainting Transformer, abbreviated as HINT,
which consists of a novel mask-aware pixel-shuffle downsampling module (MPD) to
preserve the visible information extracted from the corrupted image while
maintaining the integrity of the information available for high-level
inferences made within the model. Moreover, we propose a Spatially-activated
Channel Attention Layer (SCAL), an efficient self-attention mechanism that
incorporates spatial awareness to model the corrupted image at multiple scales.
To further enhance the effectiveness of SCAL, motivated by recent advances in
speech recognition, we introduce a sandwich structure that places feed-forward
networks before and after the SCAL module. We demonstrate the superior
performance of HINT compared to contemporary state-of-the-art models on four
datasets, CelebA, CelebA-HQ, Places2, and Dunhuang.
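The mask-aware pixel-shuffle idea from the abstract can be made concrete with a short sketch. The PyTorch module below is a hypothetical illustration, not the paper's MPD implementation: the class name, the 1x1 projection, and the max-based mask reduction are assumptions; only the use of pixel-unshuffle as a lossless alternative to strided downsampling comes from the abstract.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskAwarePixelShuffleDown(nn.Module):
    """Halve spatial resolution without discarding visible pixels.

    pixel_unshuffle is a lossless rearrangement: each 2x2 spatial block
    becomes 4 channels, so no visible pixel is averaged away the way it
    would be by a strided convolution or pooling layer.
    """
    def __init__(self, in_ch: int, out_ch: int, scale: int = 2):
        super().__init__()
        self.scale = scale
        # A 1x1 convolution mixes the rearranged channels down to the
        # target width (a hypothetical choice, not the paper's).
        self.proj = nn.Conv2d(in_ch * scale * scale, out_ch, kernel_size=1)

    def forward(self, x: torch.Tensor, mask: torch.Tensor):
        # x:    (B, C, H, W) features of the corrupted image
        # mask: (B, 1, H, W), 1 = visible, 0 = missing
        x = F.pixel_unshuffle(x, self.scale)     # (B, C*s*s, H/s, W/s)
        m = F.pixel_unshuffle(mask, self.scale)  # (B, s*s,   H/s, W/s)
        # Reduce back to one mask channel: treat a block as visible if
        # any pixel in it was visible (an assumption, not the paper's rule).
        m = m.amax(dim=1, keepdim=True)
        return self.proj(x), m
```

A strided convolution in the same position would blend visible and missing pixels before the network can tell them apart; the rearrangement keeps them separable until the learned 1x1 projection.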
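The "sandwich" placement of feed-forward networks can likewise be sketched. The block below is a hedged illustration assuming a Conformer/Macaron-style structure with half-step FFNs, the speech-recognition design the abstract alludes to; the `ChannelAttention` body is generic transposed attention standing in for SCAL, whose spatial-activation mechanism is not reproduced here, and all names and the sqrt(HW) scaling are assumptions.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Attention across channels rather than spatial positions.

    Cost grows as O(C^2 * HW) instead of O((HW)^2 * C), which is what
    makes multi-scale, high-resolution modelling affordable.
    """
    def __init__(self, dim: int):
        super().__init__()
        self.qkv = nn.Conv2d(dim, dim * 3, kernel_size=1)
        self.out = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q, k, v = self.qkv(x).flatten(2).chunk(3, dim=1)  # each (B, C, HW)
        attn = torch.softmax(q @ k.transpose(1, 2) / (h * w) ** 0.5, dim=-1)
        y = (attn @ v).view(b, c, h, w)                   # back to (B, C, H, W)
        return self.out(y)

class SandwichBlock(nn.Module):
    """Feed-forward networks before and after attention (Macaron-style)."""
    def __init__(self, dim: int, expansion: int = 4):
        super().__init__()
        def ffn() -> nn.Sequential:
            return nn.Sequential(
                nn.Conv2d(dim, dim * expansion, kernel_size=1),
                nn.GELU(),
                nn.Conv2d(dim * expansion, dim, kernel_size=1),
            )
        self.ffn_pre, self.ffn_post = ffn(), ffn()
        self.attn = ChannelAttention(dim)
        # GroupNorm with one group behaves like LayerNorm over channels.
        self.norm1, self.norm2, self.norm3 = (nn.GroupNorm(1, dim) for _ in range(3))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Half-step residual FFNs sandwich the attention layer, mirroring
        # the Conformer design from speech recognition.
        x = x + 0.5 * self.ffn_pre(self.norm1(x))
        x = x + self.attn(self.norm2(x))
        x = x + 0.5 * self.ffn_post(self.norm3(x))
        return x
```

The attraction of this arrangement is that each attention layer is both preceded and followed by a pointwise transformation, giving the efficient (but weaker) channel attention more nonlinear capacity without increasing its quadratic cost.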
Related papers
- Meissonic: Revitalizing Masked Generative Transformers for Efficient High-Resolution Text-to-Image Synthesis [62.06970466554273]
We present Meissonic, which elevates non-autoregressive masked image modeling (MIM) text-to-image generation to a level comparable with state-of-the-art diffusion models like SDXL.
We leverage high-quality training data, integrate micro-conditions informed by human preference scores, and employ feature compression layers to further enhance image fidelity and resolution.
Our model not only matches but often exceeds the performance of existing models like SDXL in generating high-quality, high-resolution images.
arXiv Detail & Related papers (2024-10-10T17:59:17Z)
- Multi-Scale Representation Learning for Image Restoration with State-Space Model [13.622411683295686]
We propose a novel Multi-Scale State-Space Model-based network (MS-Mamba) for efficient image restoration.
Our proposed method achieves new state-of-the-art performance while maintaining low computational complexity.
arXiv Detail & Related papers (2024-08-19T16:42:58Z)
- Robust CLIP-Based Detector for Exposing Diffusion Model-Generated Images [13.089550724738436]
Diffusion models (DMs) have revolutionized image generation, producing high-quality images with applications spanning various fields.
Their ability to create hyper-realistic images poses significant challenges in distinguishing between real and synthetic content.
This work introduces a robust detection framework that integrates image and text features extracted by the CLIP model with a Multilayer Perceptron (MLP) classifier.
arXiv Detail & Related papers (2024-04-19T14:30:41Z)
- DRCT: Saving Image Super-resolution away from Information Bottleneck [7.765333471208582]
Vision Transformer-based approaches for low-level vision tasks have achieved widespread success.
Dense-residual-connected Transformer (DRCT) is proposed to mitigate the loss of spatial information.
Our approach surpasses state-of-the-art methods on benchmark datasets.
arXiv Detail & Related papers (2024-03-31T15:34:45Z)
- HAT: Hybrid Attention Transformer for Image Restoration [61.74223315807691]
Transformer-based methods have shown impressive performance in image restoration tasks, such as image super-resolution and denoising.
We propose a new Hybrid Attention Transformer (HAT) to activate more input pixels for better restoration.
Our HAT achieves state-of-the-art performance both quantitatively and qualitatively.
arXiv Detail & Related papers (2023-09-11T05:17:55Z)
- Prompt-based Ingredient-Oriented All-in-One Image Restoration [0.0]
We propose a novel data ingredient-oriented approach to tackle multiple image degradation tasks.
Specifically, we utilize an encoder to capture features and introduce prompts with degradation-specific information to guide the decoder.
Our method performs competitively to the state-of-the-art.
arXiv Detail & Related papers (2023-09-06T15:05:04Z)
- Learning Enriched Features for Fast Image Restoration and Enhancement [166.17296369600774]
This paper pursues the holistic goal of maintaining spatially-precise high-resolution representations through the entire network.
We learn an enriched set of features that combines contextual information from multiple scales, while simultaneously preserving the high-resolution spatial details.
Our approach achieves state-of-the-art results for a variety of image processing tasks, including defocus deblurring, image denoising, super-resolution, and image enhancement.
arXiv Detail & Related papers (2022-04-19T17:59:45Z)
- MAT: Mask-Aware Transformer for Large Hole Image Inpainting [79.67039090195527]
We present a novel model for large hole inpainting, which unifies the merits of transformers and convolutions.
Experiments demonstrate the state-of-the-art performance of the new model on multiple benchmark datasets.
arXiv Detail & Related papers (2022-03-29T06:36:17Z)
- Less is More: Pay Less Attention in Vision Transformers [61.05787583247392]
The Less attention vIsion Transformer (LIT) builds upon the fact that convolutions, fully-connected layers, and self-attention have almost equivalent mathematical expressions for processing image patch sequences.
The proposed LIT achieves promising performance on image recognition tasks, including image classification, object detection and instance segmentation.
arXiv Detail & Related papers (2021-05-29T05:26:07Z)
- Learning Enriched Features for Real Image Restoration and Enhancement [166.17296369600774]
Convolutional neural networks (CNNs) have achieved dramatic improvements over conventional approaches for the image restoration task.
We present a novel architecture with the collective goals of maintaining spatially-precise high-resolution representations through the entire network.
Our approach learns an enriched set of features that combines contextual information from multiple scales, while simultaneously preserving the high-resolution spatial details.
arXiv Detail & Related papers (2020-03-15T11:04:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information listed and is not responsible for any consequences of its use.