HIPA: Hierarchical Patch Transformer for Single Image Super Resolution
- URL: http://arxiv.org/abs/2203.10247v2
- Date: Wed, 7 Jun 2023 01:39:31 GMT
- Title: HIPA: Hierarchical Patch Transformer for Single Image Super Resolution
- Authors: Qing Cai, Yiming Qian, Jinxing Li, Jun Lv, Yee-Hong Yang, Feng Wu,
David Zhang
- Abstract summary: This paper presents HIPA, a novel Transformer architecture that progressively recovers the high resolution image using a hierarchical patch partition.
We build a cascaded model that processes an input image in multiple stages, starting with tokens of small patch size and gradually merging them up to the full resolution.
Such a hierarchical patch mechanism not only explicitly enables feature aggregation at multiple resolutions but also adaptively learns patch-aware features for different image regions.
- Score: 62.7081074931892
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer-based architectures have started to emerge in single
image super resolution (SISR) and have achieved promising performance. Most existing Vision
Transformers divide images into the same number of patches with a fixed size,
which may not be optimal for restoring patches with different levels of texture
richness. This paper presents HIPA, a novel Transformer architecture that
progressively recovers the high resolution image using a hierarchical patch
partition. Specifically, we build a cascaded model that processes an input
image in multiple stages, starting with tokens of small patch size and
gradually merging them up to the full resolution. Such a hierarchical patch mechanism not
only explicitly enables feature aggregation at multiple resolutions but also
adaptively learns patch-aware features for different image regions, e.g., using
a smaller patch for areas with fine details and a larger patch for textureless
regions. Meanwhile, a new attention-based position encoding scheme for
Transformers is proposed that assigns different weights to different tokens,
letting the network focus on the tokens that deserve more attention; to the
best of our knowledge, this is the first such scheme. Furthermore, we also
propose a new multi-receptive field attention module to enlarge the
convolutional receptive field through different branches. The experimental
results on several public
datasets demonstrate the superior performance of the proposed HIPA over
previous methods quantitatively and qualitatively.
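As a concrete (and purely illustrative) reading of the cascade described in the abstract, the PyTorch sketch below runs an image through stages whose patch size grows (1 -> 2 -> 4 here). All names, depths, and widths are hypothetical, not the authors' implementation, and HIPA's attention-based position encoding and multi-receptive field attention are omitted for brevity.

import torch
import torch.nn as nn

class PatchStage(nn.Module):
    # One stage: patch-embed the current feature map, run a small
    # Transformer encoder over the tokens, fold them back to a map.
    def __init__(self, channels: int, patch: int, dim: int = 64):
        super().__init__()
        self.embed = nn.Conv2d(channels, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4,
                                           dim_feedforward=2 * dim,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)
        self.unembed = nn.ConvTranspose2d(dim, channels, kernel_size=patch,
                                          stride=patch)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b = x.size(0)
        tok = self.embed(x)                    # B x dim x H/p x W/p
        hh, ww = tok.shape[-2:]
        tok = tok.flatten(2).transpose(1, 2)   # B x N x dim token sequence
        tok = self.encoder(tok)
        tok = tok.transpose(1, 2).reshape(b, -1, hh, ww)
        return x + self.unembed(tok)           # residual around the stage

class HierarchicalSR(nn.Module):
    # Cascade with growing patch size, then a pixel-shuffle x2 upsampler.
    def __init__(self, channels: int = 32, scale: int = 2):
        super().__init__()
        self.head = nn.Conv2d(3, channels, 3, padding=1)
        self.stages = nn.ModuleList(
            PatchStage(channels, patch=p) for p in (1, 2, 4))
        self.tail = nn.Sequential(
            nn.Conv2d(channels, 3 * scale * scale, 3, padding=1),
            nn.PixelShuffle(scale))

    def forward(self, lr: torch.Tensor) -> torch.Tensor:
        feat = self.head(lr)
        for stage in self.stages:              # small patches first
            feat = stage(feat)
        return self.tail(feat)

sr = HierarchicalSR()
print(sr(torch.randn(1, 3, 48, 48)).shape)     # torch.Size([1, 3, 96, 96])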
Related papers
- DBAT: Dynamic Backward Attention Transformer for Material Segmentation
with Cross-Resolution Patches [8.812837829361923]
We propose the Dynamic Backward Attention Transformer (DBAT) to aggregate cross-resolution features.
Experiments show that our DBAT achieves an accuracy of 86.85%, which is the best performance among state-of-the-art real-time models.
We further align features with semantic labels through network dissection, showing that the proposed model extracts material-related features better than other methods.
arXiv Detail & Related papers (2023-05-06T03:47:20Z) - From Coarse to Fine: Hierarchical Pixel Integration for Lightweight
Image Super-Resolution [41.0555613285837]
Transformer-based models have achieved competitive performance in image super-resolution (SR).
We propose a new attention block whose insights come from the interpretation of the Local Attribution Map (LAM) for SR networks.
In the fine area, we use an Intra-Patch Self-Attention (IPSA) module to model long-range pixel dependencies in a local patch, as sketched below.
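As a rough sketch of the intra-patch idea (my reading of the summary; the class name and sizes are hypothetical, not the paper's IPSA code), self-attention is restricted to the pixels inside each non-overlapping patch:

import torch
import torch.nn as nn

class IntraPatchAttention(nn.Module):
    # Self-attention computed only among the p*p pixels of each patch.
    def __init__(self, dim: int, patch: int = 8, heads: int = 4):
        super().__init__()
        self.patch = patch
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        p = self.patch
        # B x C x H x W -> (B * num_patches) x (p*p) x C windows
        win = x.unfold(2, p, p).unfold(3, p, p)       # B,C,H/p,W/p,p,p
        win = win.permute(0, 2, 3, 4, 5, 1).reshape(-1, p * p, c)
        out, _ = self.attn(win, win, win)             # attention per patch
        # fold the windows back into a B x C x H x W map
        out = out.reshape(b, h // p, w // p, p, p, c)
        return out.permute(0, 5, 1, 3, 2, 4).reshape(b, c, h, w)

ipsa = IntraPatchAttention(dim=32)
print(ipsa(torch.randn(2, 32, 32, 32)).shape)          # [2, 32, 32, 32]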
arXiv Detail & Related papers (2022-11-30T06:32:34Z) - Masked Transformer for image Anomaly Localization [14.455765147827345]
We propose a new model for image anomaly detection based on Vision Transformer architecture with patch masking.
We show that multi-resolution patches and their collective embeddings provide a large improvement in the model's performance.
The proposed model has been tested on popular anomaly detection datasets such as MVTec and head CT.
arXiv Detail & Related papers (2022-10-27T15:30:48Z) - Accurate Image Restoration with Attention Retractable Transformer [50.05204240159985]
We propose Attention Retractable Transformer (ART) for image restoration.
ART presents both dense and sparse attention modules in the network; a rough sketch of the two token groupings follows below.
We conduct extensive experiments on image super-resolution, denoising, and JPEG compression artifact reduction tasks.
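Based only on this abstract (the function names and sizes below are my own, not the ART code), the dense/sparse distinction can be pictured as two ways of grouping the same token sequence before attention: contiguous windows versus strided samples that span the whole image. Per-group attention, e.g. nn.MultiheadAttention, would then run on each grouping.

import torch

def dense_groups(x: torch.Tensor, size: int) -> torch.Tensor:
    # (B, H*W, C) tokens -> (B*G, size*size, C) contiguous local windows.
    b, n, c = x.shape
    h = w = int(n ** 0.5)
    x = x.reshape(b, h // size, size, w // size, size, c)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, size * size, c)

def sparse_groups(x: torch.Tensor, stride: int) -> torch.Tensor:
    # (B, H*W, C) tokens -> groups of tokens sampled every `stride` pixels,
    # so every group covers the full image.
    b, n, c = x.shape
    h = w = int(n ** 0.5)
    x = x.reshape(b, h // stride, stride, w // stride, stride, c)
    return x.permute(0, 2, 4, 1, 3, 5).reshape(
        -1, (h // stride) * (w // stride), c)

tokens = torch.randn(2, 64 * 64, 32)
print(dense_groups(tokens, size=8).shape)     # [128, 64, 32] local windows
print(sparse_groups(tokens, stride=8).shape)  # [128, 64, 32] image-wide groups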
arXiv Detail & Related papers (2022-10-04T07:35:01Z) - Learned Distributed Image Compression with Multi-Scale Patch Matching in
Feature Domai [62.88240343479615]
We propose Multi-Scale Feature Domain Patch Matching (MSFDPM) to fully utilize side information at the decoder of the distributed image compression model.
MSFDPM consists of a side information feature extractor, a multi-scale feature domain patch matching module, and a multi-scale feature fusion network.
Our patch matching in a multi-scale feature domain further improves the compression rate by about 20% compared with patch matching in the image domain; a toy version of the matching step is sketched below.
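A toy, single-scale version of feature-domain patch matching might look as follows (names and sizes are hypothetical, not the MSFDPM code): each patch of the target feature map is replaced by its most similar side-information patch under cosine similarity; the multi-scale variant would repeat this over several resolutions and fuse the results.

import torch
import torch.nn.functional as F

def feature_patch_match(target: torch.Tensor, side: torch.Tensor,
                        patch: int = 4) -> torch.Tensor:
    # target, side: (B, C, H, W) feature maps at the same scale.
    b, c, h, w = target.shape
    t = F.unfold(target, patch, stride=patch)   # (B, C*p*p, L) patch columns
    s = F.unfold(side, patch, stride=patch)
    sim = torch.einsum('bcl,bck->blk',
                       F.normalize(t, dim=1), F.normalize(s, dim=1))
    idx = sim.argmax(dim=-1)                    # best side patch per target patch
    matched = torch.gather(s, 2, idx.unsqueeze(1).expand(-1, s.size(1), -1))
    return F.fold(matched, (h, w), patch, stride=patch)

tgt, sid = torch.randn(1, 16, 32, 32), torch.randn(1, 16, 32, 32)
print(feature_patch_match(tgt, sid).shape)      # (1, 16, 32, 32)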
arXiv Detail & Related papers (2022-09-06T14:06:46Z) - Modeling Image Composition for Complex Scene Generation [77.10533862854706]
We present a method that achieves state-of-the-art results on layout-to-image generation tasks.
After compressing RGB images into patch tokens, we propose the Transformer with Focal Attention (TwFA) to explore object-to-object, object-to-patch, and patch-to-patch dependencies.
arXiv Detail & Related papers (2022-06-02T08:34:25Z) - MAT: Mask-Aware Transformer for Large Hole Image Inpainting [79.67039090195527]
We present a novel model for large hole inpainting, which unifies the merits of transformers and convolutions.
Experiments demonstrate the state-of-the-art performance of the new model on multiple benchmark datasets.
arXiv Detail & Related papers (2022-03-29T06:36:17Z) - PPT Fusion: Pyramid Patch Transformerfor a Case Study in Image Fusion [37.993611194758195]
We propose a Patch Pyramid Transformer (PPT) to address the issue of extracting semantic information from an image.
The experimental results demonstrate its superior performance against the state-of-the-art fusion approaches.
arXiv Detail & Related papers (2021-07-29T13:57:45Z) - Less is More: Pay Less Attention in Vision Transformers [61.05787583247392]
The Less attention vIsion Transformer (LIT) builds upon the fact that convolutions, fully-connected layers, and self-attention have almost equivalent mathematical expressions for processing image patch sequences; the snippet below checks one instance of this equivalence.
The proposed LIT achieves promising performance on image recognition tasks, including image classification, object detection and instance segmentation.
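As a quick numerical check of one instance of that equivalence (shapes chosen arbitrarily): a 1x1 convolution over the patch grid computes exactly the same map as a fully-connected layer applied to each token independently.

import torch
import torch.nn as nn

dim_in, dim_out, h, w = 8, 16, 4, 4
tokens = torch.randn(1, h * w, dim_in)              # a patch-token sequence

fc = nn.Linear(dim_in, dim_out, bias=False)
conv = nn.Conv2d(dim_in, dim_out, kernel_size=1, bias=False)
with torch.no_grad():                               # give both the same weights
    conv.weight.copy_(fc.weight.view(dim_out, dim_in, 1, 1))

out_fc = fc(tokens)                                 # FC applied per token
grid = tokens.transpose(1, 2).reshape(1, dim_in, h, w)
out_conv = conv(grid).flatten(2).transpose(1, 2)    # 1x1 conv over the grid

print(torch.allclose(out_fc, out_conv, atol=1e-6))  # True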
arXiv Detail & Related papers (2021-05-29T05:26:07Z) - ResT: An Efficient Transformer for Visual Recognition [5.807423409327807]
This paper presents an efficient multi-scale vision Transformer, called ResT, that capably serves as a general-purpose backbone for image recognition.
We show that the proposed ResT outperforms recent state-of-the-art backbones by a large margin, demonstrating its potential as a strong backbone.
arXiv Detail & Related papers (2021-05-28T08:53:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.