Lawin Transformer: Improving Semantic Segmentation Transformer with
Multi-Scale Representations via Large Window Attention
- URL: http://arxiv.org/abs/2201.01615v4
- Date: Wed, 9 Aug 2023 14:15:32 GMT
- Title: Lawin Transformer: Improving Semantic Segmentation Transformer with
Multi-Scale Representations via Large Window Attention
- Authors: Haotian Yan and Chuang Zhang and Ming Wu
- Abstract summary: Multi-scale representations are crucial for semantic segmentation.
In this paper, we introduce multi-scale representations into the semantic segmentation ViT via a window attention mechanism.
Our resulting ViT, Lawin Transformer, is composed of an efficient hierarchical vision transformer (HVT) as encoder and a LawinASPP as decoder.
- Score: 16.75003034164463
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-scale representations are crucial for semantic segmentation. The
community has witnessed the flourishing of semantic segmentation convolutional
neural networks (CNNs) that exploit multi-scale contextual information. Motivated
by the strength of the vision transformer (ViT) in image classification, several
semantic segmentation ViTs have recently been proposed, most of them attaining
impressive results but at the cost of computational efficiency. In this paper, we
introduce multi-scale representations into the semantic segmentation ViT via a
window attention mechanism, further improving both performance and efficiency.
To this end, we introduce large window attention, which allows a local window to
query a larger context window at only a small computational overhead. By
regulating the ratio of the context area to the query area, we enable
$\textit{large window attention}$ to capture contextual information at multiple
scales. Moreover, the spatial pyramid pooling framework is adopted to
collaborate with $\textit{large window attention}$, yielding a novel decoder
named $\textbf{la}$rge $\textbf{win}$dow attention spatial pyramid pooling
(LawinASPP) for the semantic segmentation ViT. Our resulting ViT, Lawin
Transformer, is composed of an efficient hierarchical vision transformer (HVT)
as encoder and a LawinASPP as decoder. The empirical results demonstrate that
Lawin Transformer offers improved efficiency compared to existing methods and
sets new state-of-the-art performance on the Cityscapes (84.4% mIoU), ADE20K
(56.2% mIoU) and COCO-Stuff datasets. The code will be released at
https://github.com/yan-hao-tian/lawin
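
As a rough illustration of the mechanism described in the abstract, the PyTorch sketch below shows one way large window attention could be realized. It is a minimal sketch, not the released implementation: it assumes the enlarged context window is average-pooled back to the query-window resolution so the attention cost stays close to plain window attention, and the module name, window size and ratio are chosen only for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LargeWindowAttention(nn.Module):
    """Sketch of large window attention: each w x w query window attends to a
    (ratio*w) x (ratio*w) context window centred on it. The context window is
    average-pooled back to w x w tokens so the attention cost stays close to
    plain window attention; this pooling step is an assumption made for the
    illustration, not necessarily the authors' exact design."""

    def __init__(self, dim, window_size=8, ratio=2, num_heads=4):
        super().__init__()
        self.w = window_size
        self.r = ratio
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        # x: (B, C, H, W); H and W are assumed divisible by window_size.
        B, C, H, W = x.shape
        w, r = self.w, self.r

        # Queries: non-overlapping local windows of size w x w.
        q = F.unfold(x, kernel_size=w, stride=w)                        # (B, C*w*w, N)
        N = q.shape[-1]
        q = q.transpose(1, 2).reshape(B * N, C, w * w).transpose(1, 2)  # (B*N, w*w, C)

        # Context: windows enlarged by `ratio`, centred on each query window,
        # then average-pooled back to w x w tokens.
        pad = (r * w - w) // 2
        ctx = F.unfold(x, kernel_size=r * w, stride=w, padding=pad)     # (B, C*(rw)^2, N)
        ctx = ctx.transpose(1, 2).reshape(B * N, C, r * w, r * w)
        ctx = F.adaptive_avg_pool2d(ctx, (w, w))                        # (B*N, C, w, w)
        ctx = ctx.flatten(2).transpose(1, 2)                            # (B*N, w*w, C)

        # Local window queries attend to the pooled large context window.
        out, _ = self.attn(q, ctx, ctx)                                 # (B*N, w*w, C)

        # Fold the windows back into the feature-map layout.
        out = out.transpose(1, 2).reshape(B, N, C * w * w).transpose(1, 2)
        return F.fold(out, output_size=(H, W), kernel_size=w, stride=w)


if __name__ == "__main__":
    x = torch.randn(1, 64, 32, 32)
    # A LawinASPP-style decoder would run several such branches in parallel,
    # e.g. ratios 2, 4 and 8 plus pooled global context, and concatenate them.
    y = LargeWindowAttention(dim=64, window_size=8, ratio=2)(x)
    print(y.shape)  # torch.Size([1, 64, 32, 32])
```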
Related papers
- SegViTv2: Exploring Efficient and Continual Semantic Segmentation with
Plain Vision Transformers [76.13755422671822]
This paper investigates the capability of plain Vision Transformers (ViTs) for semantic segmentation using the encoder-decoder framework.
We introduce a novel Attention-to-Mask (ATM) module to design a lightweight decoder effective for plain ViT.
Our decoder outperforms the popular UPerNet decoder across various ViT backbones while consuming only about $5\%$ of the computational cost.
arXiv Detail & Related papers (2023-06-09T22:29:56Z) - Making Vision Transformers Efficient from A Token Sparsification View [26.42498120556985]
We propose a novel Semantic Token ViT (STViT) for efficient global and local vision transformers.
Our method achieves results competitive with the original networks in object detection and instance segmentation, with over 30% FLOPs reduction for the backbone.
In addition, we design an STViT-R(ecover) network that restores detailed spatial information on top of STViT, making it work for downstream tasks.
arXiv Detail & Related papers (2023-03-15T15:12:36Z) - RTFormer: Efficient Design for Real-Time Semantic Segmentation with
Transformer [63.25665813125223]
We propose RTFormer, an efficient dual-resolution transformer for real-time semantic segmentation.
It achieves a better trade-off between performance and efficiency than CNN-based models.
Experiments on mainstream benchmarks demonstrate the effectiveness of our proposed RTFormer.
arXiv Detail & Related papers (2022-10-13T16:03:53Z) - SSformer: A Lightweight Transformer for Semantic Segmentation [7.787950060560868]
Swin Transformer set a new record in various vision tasks by using hierarchical architecture and shifted windows.
We design a lightweight yet effective transformer model, called SSformer.
Experimental results show the proposed SSformer yields comparable mIoU performance with state-of-the-art models.
arXiv Detail & Related papers (2022-08-03T12:57:00Z) - HorNet: Efficient High-Order Spatial Interactions with Recursive Gated
Convolutions [109.33112814212129]
We show that input-adaptive, long-range and high-order spatial interactions can be efficiently implemented with a convolution-based framework.
We present the Recursive Gated Convolution ($\textit{g}^\textit{n}$Conv) that performs high-order spatial interactions with gated convolutions.
Based on the operation, we construct a new family of generic vision backbones named HorNet.
arXiv Detail & Related papers (2022-07-28T17:59:02Z) - Vision Transformers: From Semantic Segmentation to Dense Prediction [139.15562023284187]
We explore the global context learning potentials of vision transformers (ViTs) for dense visual prediction.
Our motivation is that through learning global context at full receptive field layer by layer, ViTs may capture stronger long-range dependency information.
We formulate a family of Hierarchical Local-Global (HLG) Transformers, characterized by local attention within windows and global attention across windows in a pyramidal architecture.
arXiv Detail & Related papers (2022-07-19T15:49:35Z) - Local-to-Global Self-Attention in Vision Transformers [130.0369761612812]
Transformers have demonstrated great potential in computer vision tasks.
Some recent Transformer models adopt a hierarchical design, where self-attentions are only computed within local windows.
This design significantly improves efficiency but lacks global feature reasoning in the early stages.
In this work, we design a multi-path structure of the Transformer, which enables local-to-global reasoning at multiple granularities in each stage.
arXiv Detail & Related papers (2021-07-10T02:34:55Z) - CSWin Transformer: A General Vision Transformer Backbone with
Cross-Shaped Windows [99.36226415086243]
We present CSWin Transformer, an efficient and effective Transformer-based backbone for general-purpose vision tasks.
A challenging issue in Transformer design is that global self-attention is very expensive to compute whereas local self-attention often limits the field of interactions of each token.
arXiv Detail & Related papers (2021-07-01T17:59:56Z) - Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer [20.92010433074935]
We propose a new vision transformer, named Shuffle Transformer, which is highly efficient and easy to implement by modifying two lines of code.
The proposed architectures achieve excellent performance on a wide range of visual tasks including image-level classification, object detection, and semantic segmentation.
arXiv Detail & Related papers (2021-06-07T14:22:07Z)
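
The spatial shuffle mentioned in the Shuffle Transformer entry above lends itself to a short sketch. The function below is an assumed formulation, an analogue of channel shuffle applied to the spatial grid rather than the authors' code: tokens that share the same position inside their local window are regrouped into a new window, so the following window attention mixes information across previously separate windows.

```python
import torch


def spatial_shuffle(x, window_size):
    """Sketch of a spatial shuffle between window-attention layers, in the
    spirit of the Shuffle Transformer entry above. The exact formulation here
    is an assumption for illustration, not the authors' code."""
    B, H, W, C = x.shape
    w = window_size
    # Factor the grid into (window index) x (position inside window) ...
    x = x.reshape(B, H // w, w, W // w, w, C)
    # ... then swap the two factors: window index and intra-window position
    # trade places, which scatters each window's tokens across the grid.
    x = x.permute(0, 2, 1, 4, 3, 5).reshape(B, H, W, C)
    return x


if __name__ == "__main__":
    x = torch.arange(16.0).reshape(1, 4, 4, 1)   # toy 4x4 feature map, C=1
    y = spatial_shuffle(x, window_size=2)
    print(y[0, :, :, 0])  # each new 2x2 window now holds tokens from 4 old windows
```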