CATANet: Efficient Content-Aware Token Aggregation for Lightweight Image Super-Resolution
- URL: http://arxiv.org/abs/2503.06896v1
- Date: Mon, 10 Mar 2025 04:00:27 GMT
- Title: CATANet: Efficient Content-Aware Token Aggregation for Lightweight Image Super-Resolution
- Authors: Xin Liu, Jie Liu, Jie Tang, Gangshan Wu
- Abstract summary: Transformer-based methods have demonstrated impressive performance in low-level visual tasks such as Image Super-Resolution (SR). However, these methods restrict attention to content-agnostic local regions, directly limiting the ability of attention to capture long-range dependencies. We propose a lightweight Content-Aware Token Aggregation Network (CATANet) to address these issues. Our method achieves superior performance, with a maximum PSNR improvement of 0.33 dB and nearly double the inference speed.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformer-based methods have demonstrated impressive performance in low-level visual tasks such as Image Super-Resolution (SR). However, their computational complexity grows quadratically with the spatial resolution. A series of works attempt to alleviate this problem by dividing Low-Resolution images into local windows, axial stripes, or dilated windows. SR typically leverages the redundancy of images for reconstruction, and this redundancy appears not only in local regions but also in long-range regions. However, these methods limit attention computation to content-agnostic local regions, directly limiting the ability of attention to capture long-range dependencies. To address these issues, we propose a lightweight Content-Aware Token Aggregation Network (CATANet). Specifically, we propose an efficient Content-Aware Token Aggregation module for aggregating long-range content-similar tokens, which shares token centers across all image tokens and updates them only during the training phase. Then we utilize intra-group self-attention to enable long-range information interaction. Moreover, we design an inter-group cross-attention to further enhance global information interaction. The experimental results show that, compared with the state-of-the-art cluster-based method SPIN, our method achieves superior performance, with a maximum PSNR improvement of 0.33 dB and nearly double the inference speed.
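The aggregation scheme described in the abstract can be sketched as follows. This is a minimal NumPy illustration of the general idea (shared centers, content-based grouping, attention restricted to each group), not the paper's implementation: the similarity rule, the number of centers, and the single-head attention form are assumptions chosen for exposition.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def assign_to_centers(tokens, centers):
    # Content-aware grouping: each token joins its most similar shared center.
    sims = tokens @ centers.T            # (N, K) similarity scores
    return sims.argmax(axis=1)           # (N,) group index per token

def intra_group_attention(tokens, groups, num_centers):
    # Plain scaled dot-product self-attention, restricted to tokens that
    # share a center, so distant but content-similar tokens interact.
    out = tokens.copy()
    for g in range(num_centers):
        idx = np.where(groups == g)[0]
        if idx.size == 0:
            continue
        t = tokens[idx]                  # (n_g, d) tokens in this group
        attn = softmax(t @ t.T / np.sqrt(t.shape[1]))
        out[idx] = attn @ t
    return out

# Toy run: 16 image tokens of dim 8, 4 shared token centers.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((16, 8))
centers = rng.standard_normal((4, 8))    # fixed at inference per the abstract
groups = assign_to_centers(tokens, centers)
refined = intra_group_attention(tokens, groups, 4)
print(refined.shape)                     # (16, 8)
```

Because the centers are shared and frozen after training, inference only pays for the argmax assignment plus attention within each (small) group, rather than quadratic attention over all tokens.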
Related papers
- When Large Vision-Language Model Meets Large Remote Sensing Imagery: Coarse-to-Fine Text-Guided Token Pruning [31.696397337675847]
Large Vision-Language Models (LVLMs) typically employ limited pre-defined grids to process images.
We propose a text-guided token pruning method with Dynamic Image Pyramid (DIP) integration.
Our method outperforms existing high-resolution strategies on four datasets using the same data.
arXiv Detail & Related papers (2025-03-10T17:51:16Z) - DDU-Net: A Domain Decomposition-based CNN for High-Resolution Image Segmentation on Multiple GPUs [46.873264197900916]
A domain decomposition-based U-Net architecture is introduced, which partitions input images into non-overlapping patches.
A communication network is added to facilitate inter-patch information exchange to enhance the understanding of spatial context.
Results show that the approach achieves a 2-3% higher intersection over union (IoU) score compared to the same network without inter-patch communication.
arXiv Detail & Related papers (2024-07-31T01:07:21Z) - Channel-Partitioned Windowed Attention And Frequency Learning for Single Image Super-Resolution [1.8506868409351092]
Window-based attention methods have shown great potential for computer vision tasks, particularly in Single Image Super-Resolution (SISR).
We propose a new Channel-Partitioned Attention Transformer (CPAT) to better capture long-range dependencies by sequentially expanding windows along the height and width of feature maps.
In addition, we propose a novel Spatial-Frequency Interaction Module (SFIM), which incorporates information from spatial and frequency domains to provide a more comprehensive information from feature maps.
arXiv Detail & Related papers (2024-07-23T07:17:10Z) - CiaoSR: Continuous Implicit Attention-in-Attention Network for
Arbitrary-Scale Image Super-Resolution [158.2282163651066]
This paper proposes a continuous implicit attention-in-attention network, called CiaoSR.
We explicitly design an implicit attention network to learn the ensemble weights for the nearby local features.
We embed a scale-aware attention in this implicit attention network to exploit additional non-local information.
arXiv Detail & Related papers (2022-12-08T15:57:46Z) - Accurate Image Restoration with Attention Retractable Transformer [50.05204240159985]
We propose Attention Retractable Transformer (ART) for image restoration.
ART presents both dense and sparse attention modules in the network.
We conduct extensive experiments on image super-resolution, denoising, and JPEG compression artifact reduction tasks.
arXiv Detail & Related papers (2022-10-04T07:35:01Z) - Learning Enriched Features for Fast Image Restoration and Enhancement [166.17296369600774]
This paper presents a holistic goal of maintaining spatially-precise high-resolution representations through the entire network.
We learn an enriched set of features that combines contextual information from multiple scales, while simultaneously preserving the high-resolution spatial details.
Our approach achieves state-of-the-art results for a variety of image processing tasks, including defocus deblurring, image denoising, super-resolution, and image enhancement.
arXiv Detail & Related papers (2022-04-19T17:59:45Z) - Learning Enriched Features for Real Image Restoration and Enhancement [166.17296369600774]
Convolutional neural networks (CNNs) have achieved dramatic improvements over conventional approaches for image restoration tasks.
We present a novel architecture with the collective goals of maintaining spatially-precise high-resolution representations through the entire network.
Our approach learns an enriched set of features that combines contextual information from multiple scales, while simultaneously preserving the high-resolution spatial details.
arXiv Detail & Related papers (2020-03-15T11:04:30Z) - Image Fine-grained Inpainting [89.17316318927621]
We present a one-stage model that utilizes dense combinations of dilated convolutions to obtain larger and more effective receptive fields.
To better train this efficient generator, in addition to the frequently-used VGG feature matching loss, we design a novel self-guided regression loss.
We also employ a discriminator with local and global branches to ensure local-global contents consistency.
arXiv Detail & Related papers (2020-02-07T03:45:25Z)
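The inpainting entry above relies on dilated convolutions to enlarge the receptive field without extra parameters. A minimal 1-D sketch of how dilation spaces out the kernel taps is shown below; the function and its parameters are illustrative assumptions, not code from any of the listed papers.

```python
import numpy as np

def dilated_conv1d(x, kernel, dilation):
    # 1-D dilated convolution (valid padding): taps are spaced `dilation`
    # apart, so a size-k kernel covers (k - 1) * dilation + 1 input positions.
    k = len(kernel)
    span = (k - 1) * dilation + 1
    out_len = len(x) - span + 1
    out = np.zeros(out_len)
    for i in range(out_len):
        for j in range(k):
            out[i] += kernel[j] * x[i + j * dilation]
    return out

x = np.arange(10, dtype=float)
kernel = np.array([1.0, 1.0, 1.0])
y1 = dilated_conv1d(x, kernel, dilation=1)  # receptive field of 3 inputs
y2 = dilated_conv1d(x, kernel, dilation=2)  # receptive field of 5 inputs
print(len(y1), len(y2))  # 8 6
```

With the same 3-tap kernel, raising the dilation from 1 to 2 widens the receptive field from 3 to 5 input positions, which is why stacking dilated convolutions covers large image regions cheaply.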
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.