Low-Resolution Self-Attention for Semantic Segmentation
- URL: http://arxiv.org/abs/2310.05026v1
- Date: Sun, 8 Oct 2023 06:10:09 GMT
- Title: Low-Resolution Self-Attention for Semantic Segmentation
- Authors: Yu-Huan Wu, Shi-Chen Zhang, Yun Liu, Le Zhang, Xin Zhan, Daquan Zhou,
Jiashi Feng, Ming-Ming Cheng, Liangli Zhen
- Abstract summary: We introduce the Low-Resolution Self-Attention (LRSA) mechanism to capture global context at a significantly reduced computational cost.
Our approach involves computing self-attention in a fixed low-resolution space regardless of the input image's resolution.
We demonstrate the effectiveness of our LRSA approach by building the LRFormer, a vision transformer with an encoder-decoder structure.
- Score: 96.81482872022237
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Semantic segmentation tasks naturally require high-resolution information for
pixel-wise segmentation and global context information for class prediction.
While existing vision transformers demonstrate promising performance, they
often utilize high-resolution context modeling, resulting in a computational
bottleneck. In this work, we challenge conventional wisdom and introduce the
Low-Resolution Self-Attention (LRSA) mechanism to capture global context at a
significantly reduced computational cost. Our approach involves computing
self-attention in a fixed low-resolution space regardless of the input image's
resolution, with additional 3x3 depth-wise convolutions to capture fine details
in the high-resolution space. We demonstrate the effectiveness of our LRSA
approach by building the LRFormer, a vision transformer with an encoder-decoder
structure. Extensive experiments on the ADE20K, COCO-Stuff, and Cityscapes
datasets demonstrate that LRFormer outperforms state-of-the-art models. The
code will be made available at https://github.com/yuhuan-wu/LRFormer.
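Below is a minimal PyTorch sketch of the LRSA idea as stated in the abstract: self-attention runs on a fixed low-resolution grid, so its cost is independent of the input resolution, while a 3x3 depth-wise convolution recovers fine detail in the high-resolution space. The module and parameter names (LowResSelfAttention, pool_size) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LowResSelfAttention(nn.Module):
    """Sketch of LRSA (names are hypothetical, not the authors' code):
    attend on a fixed low-resolution grid, then restore resolution;
    a 3x3 depth-wise conv keeps fine high-resolution detail."""

    def __init__(self, dim, pool_size=16, num_heads=4):
        super().__init__()
        self.pool_size = pool_size  # fixed low-res grid, e.g. 16x16 tokens
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.dwconv = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)  # depth-wise 3x3

    def forward(self, x):  # x: (B, C, H, W)
        B, C, H, W = x.shape
        # Global branch: attention over pool_size^2 tokens, so the cost
        # stays constant no matter how large H and W grow.
        low = F.adaptive_avg_pool2d(x, self.pool_size)       # (B, C, p, p)
        tokens = low.flatten(2).transpose(1, 2)              # (B, p*p, C)
        tokens, _ = self.attn(tokens, tokens, tokens)
        low = tokens.transpose(1, 2).reshape(B, C, self.pool_size, self.pool_size)
        glob = F.interpolate(low, size=(H, W), mode='bilinear', align_corners=False)
        # Local branch: depth-wise conv captures fine detail at full resolution.
        return x + glob + self.dwconv(x)

feat = torch.randn(1, 64, 128, 128)
print(LowResSelfAttention(64)(feat).shape)  # torch.Size([1, 64, 128, 128])
```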
Related papers
- HRDecoder: High-Resolution Decoder Network for Fundus Image Lesion Segmentation [12.606794661369959]
We propose HRDecoder, a simple High-Resolution Decoder network for fundus lesion segmentation.
It integrates a high-resolution representation learning module to capture fine-grained local features and a high-resolution fusion module to fuse multi-scale predictions.
Our method effectively improves the overall segmentation accuracy of fundus lesions while incurring reasonable memory and computational overhead and maintaining satisfactory inference speed.
arXiv Detail & Related papers (2024-11-06T15:13:31Z)
- UTSRMorph: A Unified Transformer and Superresolution Network for Unsupervised Medical Image Registration [4.068692674719378]
Complicated image registration is a key issue in medical image analysis.
We propose a novel unsupervised image registration method named the unified Transformer and superresolution (UTSRMorph) network.
arXiv Detail & Related papers (2024-10-27T06:28:43Z)
- A Single Transformer for Scalable Vision-Language Modeling [74.05173379908703]
We present SOLO, a single transformer for visiOn-Language mOdeling.
A unified single Transformer architecture, like SOLO, effectively addresses the scalability concerns of LVLMs.
In this paper, we introduce the first open-source training recipe for developing SOLO, an open-source 7B LVLM.
arXiv Detail & Related papers (2024-07-08T22:40:15Z)
- MLIC++: Linear Complexity Multi-Reference Entropy Modeling for Learned Image Compression [30.71965784982577]
We introduce MEM++, which captures a diverse range of correlations inherent in the latent representation.
MEM++ achieves state-of-the-art performance, reducing BD-rate by 13.39% on the Kodak dataset compared to VTM-17.0 in PSNR.
MLIC++ exhibits linear GPU memory consumption with resolution, making it highly suitable for high-resolution image coding.
arXiv Detail & Related papers (2023-07-28T09:11:37Z)
- Recursive Generalization Transformer for Image Super-Resolution [108.67898547357127]
We propose the Recursive Generalization Transformer (RGT) for image SR, which can capture global spatial information and is suitable for high-resolution images.
We combine the RG-SA with local self-attention to enhance the exploitation of the global context.
Our RGT outperforms recent state-of-the-art methods quantitatively and qualitatively.
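The abstract suggests a pattern where full-resolution features attend to a recursively aggregated, compact map, keeping attention cost low while queries stay high-resolution. A hedged sketch of that pattern follows; the class name GlobalCrossAttention and the pooling-based aggregation are my assumptions, and the actual RG-SA differs in detail.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalCrossAttention(nn.Module):
    """Illustrative reading of RG-SA: full-resolution queries cross-attend
    to keys/values from a recursively downsampled feature map, so cost
    scales with the compact map rather than the full image."""

    def __init__(self, dim, num_heads=4, down_steps=2):
        super().__init__()
        self.down_steps = down_steps  # number of recursive 2x aggregations
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):  # x: (B, C, H, W)
        B, C, H, W = x.shape
        kv = x
        for _ in range(self.down_steps):   # recursively shrink the K/V map
            kv = F.avg_pool2d(kv, 2)
        q = x.flatten(2).transpose(1, 2)   # (B, H*W, C) high-res queries
        kv = kv.flatten(2).transpose(1, 2) # (B, H*W / 4^steps, C) compact K/V
        out, _ = self.attn(q, kv, kv)      # global context for every pixel
        return out.transpose(1, 2).reshape(B, C, H, W)
```

In the paper this global branch is paired with local self-attention; that combination is omitted here for brevity.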
arXiv Detail & Related papers (2023-03-11T10:44:44Z)
- LSR: A Light-Weight Super-Resolution Method [36.14816868964436]
LSR predicts the residual image between the interpolated low-resolution (ILR) and high-resolution (HR) images using a self-supervised framework.
It consists of three modules: 1) generating a pool of rich and diversified representations in the neighborhood of a target pixel via unsupervised learning; 2) automatically selecting the subset of the representation pool most relevant to the underlying super-resolution task via supervised learning; 3) predicting the residual of the target pixel via regression.
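As a hedged illustration of LSR's overall formulation, HR ≈ ILR + predicted residual, the sketch below uses a small stand-in convolutional regressor; the paper's unsupervised representation pool and supervised feature selection are not reproduced here, and all names are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualSR(nn.Module):
    """Residual formulation of super-resolution: upsample the LR input,
    then predict and add the residual to approximate the HR image."""

    def __init__(self, scale=2):
        super().__init__()
        self.scale = scale
        self.regressor = nn.Sequential(      # stand-in for LSR's pipeline
            nn.Conv2d(3, 32, 3, padding=1),  # (representation pool,
            nn.ReLU(inplace=True),           #  feature selection,
            nn.Conv2d(32, 3, 3, padding=1),  #  per-pixel regression)
        )

    def forward(self, lr):  # lr: (B, 3, h, w)
        ilr = F.interpolate(lr, scale_factor=self.scale,
                            mode='bicubic', align_corners=False)
        return ilr + self.regressor(ilr)     # HR estimate = ILR + residual

lr = torch.randn(1, 3, 32, 32)
print(ResidualSR(scale=2)(lr).shape)  # torch.Size([1, 3, 64, 64])
```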
arXiv Detail & Related papers (2023-02-27T09:02:35Z)
- Learning Enriched Features for Fast Image Restoration and Enhancement [166.17296369600774]
This paper pursues the holistic goal of maintaining spatially precise, high-resolution representations through the entire network.
We learn an enriched set of features that combines contextual information from multiple scales, while simultaneously preserving the high-resolution spatial details.
Our approach achieves state-of-the-art results for a variety of image processing tasks, including defocus deblurring, image denoising, super-resolution, and image enhancement.
arXiv Detail & Related papers (2022-04-19T17:59:45Z)
- DepthFormer: Exploiting Long-Range Correlation and Local Information for Accurate Monocular Depth Estimation [50.08080424613603]
Long-range correlation is essential for accurate monocular depth estimation.
We propose to leverage the Transformer to model this global context with an effective attention mechanism.
Our proposed model, termed DepthFormer, surpasses state-of-the-art monocular depth estimation methods with prominent margins.
arXiv Detail & Related papers (2022-03-27T05:03:56Z)
- Deep Generative Adversarial Residual Convolutional Networks for Real-World Super-Resolution [31.934084942626257]
We propose a deep Super-Resolution Residual Convolutional Generative Adversarial Network (SRResCGAN)
It follows the real-world degradation setting by adversarially training the model with pixel-wise supervision in the HR domain from its generated LR counterpart.
The proposed network exploits the residual learning by minimizing the energy-based objective function with powerful image regularization and convex optimization techniques.
arXiv Detail & Related papers (2020-05-03T00:12:38Z)
- Learning Enriched Features for Real Image Restoration and Enhancement [166.17296369600774]
Convolutional neural networks (CNNs) have achieved dramatic improvements over conventional approaches for the image restoration task.
We present a novel architecture with the collective goals of maintaining spatially-precise high-resolution representations through the entire network.
Our approach learns an enriched set of features that combines contextual information from multiple scales, while simultaneously preserving the high-resolution spatial details.
arXiv Detail & Related papers (2020-03-15T11:04:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.