SRFormerV2: Taking a Closer Look at Permuted Self-Attention for Image Super-Resolution
- URL: http://arxiv.org/abs/2303.09735v2
- Date: Wed, 14 Aug 2024 06:12:36 GMT
- Title: SRFormerV2: Taking a Closer Look at Permuted Self-Attention for Image Super-Resolution
- Authors: Yupeng Zhou, Zhen Li, Chun-Le Guo, Li Liu, Ming-Ming Cheng, Qibin Hou
- Abstract summary: We present SRFormer, a simple yet novel method that enjoys the benefit of large-window self-attention.
Our SRFormer achieves a 33.86 dB PSNR score on the Urban100 dataset, which is 0.46 dB higher than that of SwinIR.
Experiments show that our scaled model, named SRFormerV2, further improves the results and achieves state-of-the-art performance.
- Score: 74.48610723198514
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Previous works have shown that increasing the window size for Transformer-based image super-resolution models (e.g., SwinIR) can significantly improve model performance, but the computational overhead also grows considerably as the window size increases. In this paper, we present SRFormer, a simple yet novel method that enjoys the benefit of large-window self-attention while introducing even less computational burden. The core of our SRFormer is the permuted self-attention (PSA), which strikes an appropriate balance between channel and spatial information for self-attention. Without any bells and whistles, we show that our SRFormer achieves a 33.86 dB PSNR score on the Urban100 dataset, which is 0.46 dB higher than that of SwinIR, while using fewer parameters and computations. In addition, we attempt to scale up the model by further enlarging the window size and channel numbers to explore the potential of Transformer-based models. Experiments show that our scaled model, named SRFormerV2, further improves the results and achieves state-of-the-art performance. We hope our simple and effective approach can be useful for future research in super-resolution model design. The homepage is https://z-yupeng.github.io/SRFormer/.
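As a concrete illustration of the PSA idea described in the abstract, below is a minimal, single-head PyTorch sketch. It follows only what the abstract states: keys and values are projected to C/r^2 channels and each r x r spatial neighbourhood is then permuted into the channel dimension, so K and V keep C channels but hold only (S/r)^2 tokens per window, shrinking the attention map from S^2 x S^2 to S^2 x (S/r)^2. The class name, the shrink ratio r, and the single-head simplification are assumptions for illustration, not the authors' exact implementation (which is multi-head and includes positional bias, among other details).

```python
import torch
import torch.nn as nn

class PermutedSelfAttentionSketch(nn.Module):
    # Single-head PSA-style attention over one S x S window (illustrative).
    def __init__(self, dim, window=24, r=2):
        super().__init__()
        assert dim % (r * r) == 0 and window % r == 0
        self.window, self.r = window, r
        self.q = nn.Linear(dim, dim)
        # K/V are projected to dim / r^2 channels before the permutation
        self.kv = nn.Linear(dim, 2 * dim // (r * r))
        self.proj = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x):                 # x: (B, S*S, C) tokens of one window
        B, N, C = x.shape
        S, r = self.window, self.r
        q = self.q(x)                                  # (B, S^2, C)
        k, v = self.kv(x).chunk(2, dim=-1)             # each (B, S^2, C/r^2)

        def permute_into_channels(t):
            # fold each r x r spatial group into the channel axis:
            # (B, S^2, C/r^2) -> (B, (S/r)^2, C)
            t = t.view(B, S // r, r, S // r, r, C // (r * r))
            return t.permute(0, 1, 3, 2, 4, 5).reshape(B, (S // r) ** 2, C)

        k, v = permute_into_channels(k), permute_into_channels(v)
        attn = (q @ k.transpose(-2, -1) * self.scale).softmax(dim=-1)
        return self.proj(attn @ v)                     # (B, S^2, C)
```

For a 24 x 24 window with r = 2, K and V carry 144 tokens instead of 576, so the attention map (and its memory) shrinks by a factor of four while the queries still cover every pixel of the large window.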
Related papers
- GRFormer: Grouped Residual Self-Attention for Lightweight Single Image Super-Resolution [2.312414367096445]
Grouped Residual Self-Attention (GRSA) is specifically designed around two fundamental components of the self-attention layer.
ES-RPB replaces the original relative position bias to better represent positional information.
Experiments demonstrate that GRFormer outperforms state-of-the-art transformer-based methods on $\times$2, $\times$3, and $\times$4 SISR tasks.
arXiv Detail & Related papers (2024-08-14T11:56:35Z) - HiT-SR: Hierarchical Transformer for Efficient Image Super-Resolution [70.52256118833583]
We present a strategy to convert transformer-based SR networks into hierarchical transformers (HiT-SR).
Specifically, we first replace the commonly used fixed small windows with expanding hierarchical windows to aggregate features at different scales.
Considering the intensive computation required for large windows, we further design a spatial-channel correlation method whose complexity is linear in the window size; a sketch of the general idea follows below.
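The claim of linear complexity with respect to window size can be made concrete with a generic channel-correlation trick. The sketch below only illustrates how such linear scaling is achievable in principle; it is not HiT-SR's actual formulation. Correlating keys and values along the channel axis first yields a C x C map, so the cost is O(N * C^2) in the number of window tokens N rather than the O(N^2 * C) of standard window attention.

```python
import torch

def spatial_channel_correlation(q, k, v):
    # q, k, v: (B, N, C) tokens from one (possibly very large) window.
    # A C x C channel-correlation map replaces the usual N x N token map,
    # so the cost grows linearly with the window area N. Illustrative only.
    _, n, _ = k.shape
    corr = torch.softmax(k.transpose(-2, -1) @ v / n, dim=-1)  # (B, C, C)
    return q @ corr                                            # (B, N, C)
```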
arXiv Detail & Related papers (2024-07-08T12:42:10Z) - SparseSpikformer: A Co-Design Framework for Token and Weight Pruning in Spiking Transformer [12.717450255837178]
Spiking Neural Networks (SNNs) have the advantages of low power consumption and high energy efficiency.
The most advanced SNN, Spikformer, combines the self-attention module from Transformer with SNN to achieve remarkable performance.
We present SparseSpikformer, a co-design framework aimed at achieving sparsity in Spikformer through token and weight pruning techniques.
arXiv Detail & Related papers (2023-11-15T09:22:52Z) - HAT: Hybrid Attention Transformer for Image Restoration [61.74223315807691]
Transformer-based methods have shown impressive performance in image restoration tasks, such as image super-resolution and denoising.
We propose a new Hybrid Attention Transformer (HAT) to activate more input pixels for better restoration.
Our HAT achieves state-of-the-art performance both quantitatively and qualitatively.
arXiv Detail & Related papers (2023-09-11T05:17:55Z) - SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis [8.648456572970035]
We present SDXL, a latent diffusion model for text-to-image synthesis.
Compared to previous versions of Stable Diffusion, SDXL leverages a three times larger UNet backbone.
We show drastically improved performance compared to previous versions of Stable Diffusion.
arXiv Detail & Related papers (2023-07-04T23:04:57Z) - Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design [84.34416126115732]
Scaling laws have been recently employed to derive compute-optimal model size (number of parameters) for a given compute duration.
We advance and refine such methods to infer compute-optimal model shapes, such as width and depth, and successfully implement this in vision transformers.
Our shape-optimized vision transformer, SoViT, achieves results competitive with models that exceed twice its size, despite being pre-trained with an equivalent amount of compute.
arXiv Detail & Related papers (2023-05-22T13:39:28Z) - Advancing Plain Vision Transformer Towards Remote Sensing Foundation Model [97.9548609175831]
We resort to plain vision transformers with about 100 million parameters and make the first attempt to propose large vision models customized for remote sensing tasks.
Specifically, to handle the large image size and objects of various orientations in RS images, we propose a new rotated varied-size window attention.
Experiments on detection tasks demonstrate the superiority of our model over all state-of-the-art models, achieving 81.16% mAP on the DOTA-V1.0 dataset.
arXiv Detail & Related papers (2022-08-08T09:08:40Z) - Learned Queries for Efficient Local Attention [11.123272845092611]
The self-attention mechanism in vision transformers suffers from high latency and inefficient memory utilization.
We propose a new shift-invariant local attention layer, called query and attend (QnA), that aggregates the input locally in an overlapping manner; see the sketch after this summary.
We show improvements in speed and memory complexity while achieving accuracy comparable to state-of-the-art models.
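A hedged sketch of the learned-query idea follows; the class name, shapes, and the single learned query are illustrative assumptions rather than the paper's code. The query is a learned parameter shared across all windows, while keys and values are projected from overlapping patches extracted with unfold, which yields the shift-invariant, convolution-like local aggregation the summary describes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnedQueryLocalAttention(nn.Module):
    # QnA-style layer (illustrative): one learned query attends to each
    # overlapping window; keys/values come from the input patches.
    def __init__(self, dim, window=3, stride=1):
        super().__init__()
        self.window, self.stride = window, stride
        self.q = nn.Parameter(torch.randn(1, 1, dim))  # learned, input-independent
        self.kv = nn.Linear(dim, 2 * dim)
        self.scale = dim ** -0.5

    def forward(self, x):                                # x: (B, C, H, W)
        B, C, H, W = x.shape
        patches = F.unfold(x, self.window, stride=self.stride)  # (B, C*w*w, L)
        L = patches.shape[-1]
        patches = patches.view(B, C, self.window ** 2, L).permute(0, 3, 2, 1)
        k, v = self.kv(patches).chunk(2, dim=-1)         # each (B, L, w*w, C)
        attn = ((self.q * self.scale) @ k.transpose(-2, -1)).softmax(dim=-1)
        out = (attn @ v).squeeze(2)                      # (B, L, C)
        h_out = (H - self.window) // self.stride + 1
        return out.transpose(1, 2).reshape(B, C, h_out, -1)
```

With window=3 and stride=1, the layer aggregates 3 x 3 neighbourhoods much like a convolution, except that the mixing weights are produced by attention within each neighbourhood.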
arXiv Detail & Related papers (2021-12-21T18:52:33Z) - Enhancing sensor resolution improves CNN accuracy given the same number of parameters or FLOPS [53.10151901863263]
We show that it is almost always possible to modify a network such that it achieves higher accuracy at a higher input resolution while keeping the same number of parameters and/or FLOPS.
Preliminary empirical investigation over MNIST, Fashion MNIST, and CIFAR10 datasets demonstrates the efficiency of the proposed approach.
arXiv Detail & Related papers (2021-03-09T06:47:01Z)