Efficient Hybrid Transformer: Learning Global-local Context for Urban
Scene Segmentation
- URL: http://arxiv.org/abs/2109.08937v1
- Date: Sat, 18 Sep 2021 13:55:38 GMT
- Title: Efficient Hybrid Transformer: Learning Global-local Context for Urban
Scene Segmentation
- Authors: Libo Wang, Shenghui Fang, Ce Zhang, Rui Li and Chenxi Duan
- Abstract summary: We propose an efficient hybrid Transformer (EHT) for semantic segmentation of urban scene images.
EHT combines the strengths of CNNs and Transformers, learning global-local context to strengthen the feature representation.
The proposed EHT achieves a 67.0% mIoU on the UAVid test set and outperforms other lightweight models significantly.
- Score: 11.237929167356725
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Semantic segmentation of fine-resolution urban scene images plays a vital
role in extensive practical applications, such as land cover mapping, urban
change detection, environmental protection and economic assessment. Driven by
rapid developments in deep learning technologies, convolutional neural networks
(CNNs) have dominated the semantic segmentation task for many years.
Convolutional neural networks adopt a hierarchical feature representation and
excel at extracting local context. However, the local nature of the convolution
operation limits the network's ability to capture the global information that is
crucial for improving fine-resolution image segmentation. Recently, Transformers
have become a hot topic in the computer vision domain. Vision Transformers
demonstrate a strong capability for global information modelling, boosting
many vision tasks, such as image classification, object detection and
especially semantic segmentation. In this paper, we propose an efficient hybrid
Transformer (EHT) for semantic segmentation of urban scene images. EHT combines
the strengths of CNNs and Transformers, learning global-local context to strengthen
the feature representation. Extensive experiments demonstrate that EHT has
higher efficiency with competitive accuracy compared with state-of-the-art
benchmark methods. Specifically, the proposed EHT achieves a 67.0% mIoU on the
UAVid test set and outperforms other lightweight models significantly. The code
will be available soon.
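As a rough illustration of the global-local idea described in the abstract, the sketch below pairs a small convolutional branch (local context) with a self-attention branch (global context) and fuses the two for dense prediction. The abstract does not specify EHT's actual architecture, so the module layout, channel sizes and fusion scheme here are assumptions for illustration only, not the authors' implementation.

```python
# Minimal hybrid CNN-Transformer segmentation sketch (illustrative, not EHT itself).
import torch
import torch.nn as nn
import torch.nn.functional as F


class GlobalLocalBlock(nn.Module):
    """Fuses local context (convolution branch) with global context (self-attention branch)."""

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        # Local branch: a 3x3 convolution captures neighbourhood context.
        self.local = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        # Global branch: self-attention over all spatial positions.
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)
        # 1x1 convolution merges the two branches.
        self.fuse = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        local = self.local(x)
        # Flatten spatial dimensions into a token sequence for attention.
        tokens = self.norm(x.flatten(2).transpose(1, 2))        # (B, H*W, C)
        global_ctx, _ = self.attn(tokens, tokens, tokens)
        global_ctx = global_ctx.transpose(1, 2).reshape(b, c, h, w)
        return self.fuse(torch.cat([local, global_ctx], dim=1))


class HybridSegmenter(nn.Module):
    """CNN stem -> global-local block -> upsampled per-pixel class logits."""

    def __init__(self, num_classes: int = 8, channels: int = 64):  # 8: e.g. the UAVid classes
        super().__init__()
        self.stem = nn.Sequential(                                # 4x downsampling CNN stem
            nn.Conv2d(3, channels, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.context = GlobalLocalBlock(channels)
        self.head = nn.Conv2d(channels, num_classes, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        logits = self.head(self.context(self.stem(x)))
        # Restore the input resolution for dense prediction.
        return F.interpolate(logits, size=x.shape[-2:], mode="bilinear",
                             align_corners=False)


if __name__ == "__main__":
    model = HybridSegmenter(num_classes=8)
    out = model(torch.randn(1, 3, 128, 128))
    print(out.shape)  # torch.Size([1, 8, 128, 128])
```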
Related papers
- ELGC-Net: Efficient Local-Global Context Aggregation for Remote Sensing Change Detection [65.59969454655996]
We propose an efficient change detection framework, ELGC-Net, which leverages rich contextual information to precisely estimate change regions.
Our proposed ELGC-Net achieves new state-of-the-art performance on remote sensing change detection benchmarks.
We also introduce ELGC-Net-LW, a lighter variant with significantly reduced computational complexity, suitable for resource-constrained settings.
arXiv Detail & Related papers (2024-03-26T17:46:25Z)
- Affine-Consistent Transformer for Multi-Class Cell Nuclei Detection [76.11864242047074]
We propose a novel Affine-Consistent Transformer (AC-Former), which directly yields a sequence of nucleus positions.
We introduce an Adaptive Affine Transformer (AAT) module, which can automatically learn the key spatial transformations to warp original images for local network training.
Experimental results demonstrate that the proposed method significantly outperforms existing state-of-the-art algorithms on various benchmarks.
arXiv Detail & Related papers (2023-10-22T02:27:02Z)
- Dual-Augmented Transformer Network for Weakly Supervised Semantic Segmentation [4.02487511510606]
Weakly supervised semantic segmentation (WSSS) is a fundamental computer vision task that aims to segment objects using only class-level labels.
Traditional methods adopt CNN-based networks and use the class activation map (CAM) strategy to discover object regions; a minimal CAM sketch is given after this list.
An alternative is to use vision transformers (ViTs) to encode the image and acquire global semantic information.
We propose a dual network with both CNN-based and transformer networks for mutually complementary learning.
arXiv Detail & Related papers (2023-09-30T08:41:11Z)
- TcGAN: Semantic-Aware and Structure-Preserved GANs with Individual Vision Transformer for Fast Arbitrary One-Shot Image Generation [11.207512995742999]
One-shot image generation (OSG) with generative adversarial networks that learn from the internal patches of a given image has attracted worldwide attention.
We propose a novel structure-preserved method TcGAN with individual vision transformer to overcome the shortcomings of the existing one-shot image generation methods.
arXiv Detail & Related papers (2023-02-16T03:05:59Z)
- RTFormer: Efficient Design for Real-Time Semantic Segmentation with Transformer [63.25665813125223]
We propose RTFormer, an efficient dual-resolution transformer for real-time semantic segmentation.
It achieves a better trade-off between performance and efficiency than CNN-based models.
Experiments on mainstream benchmarks demonstrate the effectiveness of our proposed RTFormer.
arXiv Detail & Related papers (2022-10-13T16:03:53Z)
- Vision Transformers: From Semantic Segmentation to Dense Prediction [139.15562023284187]
We explore the global context learning potentials of vision transformers (ViTs) for dense visual prediction.
Our motivation is that through learning global context at full receptive field layer by layer, ViTs may capture stronger long-range dependency information.
We formulate a family of Hierarchical Local-Global (HLG) Transformers, characterized by local attention within windows and global attention across windows in a pyramidal architecture.
arXiv Detail & Related papers (2022-07-19T15:49:35Z)
- Semantic Labeling of High Resolution Images Using EfficientUNets and Transformers [5.177947445379688]
We propose a new segmentation model that combines convolutional neural networks with deep transformers.
Our results demonstrate that the proposed methodology improves segmentation accuracy compared to state-of-the-art techniques.
arXiv Detail & Related papers (2022-06-20T12:03:54Z)
- Transformer-Guided Convolutional Neural Network for Cross-View Geolocalization [20.435023745201878]
We propose a novel Transformer-guided convolutional neural network (TransGCNN) architecture.
Our TransGCNN consists of a CNN backbone that extracts a feature map from the input image and a Transformer head that models the global context.
Experiments on popular benchmark datasets demonstrate that our model achieves top-1 accuracy of 94.12% and 84.92% on CVUSA and CVACT_val, respectively.
arXiv Detail & Related papers (2022-04-21T08:46:41Z)
- Rich CNN-Transformer Feature Aggregation Networks for Super-Resolution [50.10987776141901]
Recent vision transformers along with self-attention have achieved promising results on various computer vision tasks.
We introduce an effective hybrid architecture for super-resolution (SR) tasks, which leverages local features from CNNs and long-range dependencies captured by transformers.
Our proposed method achieves state-of-the-art SR results on numerous benchmark datasets.
arXiv Detail & Related papers (2022-03-15T06:52:25Z)
- Less is More: Pay Less Attention in Vision Transformers [61.05787583247392]
The Less attention vIsion Transformer (LIT) builds upon the fact that convolutions, fully-connected layers, and self-attention have almost equivalent mathematical expressions for processing image patch sequences.
The proposed LIT achieves promising performance on image recognition tasks, including image classification, object detection and instance segmentation.
arXiv Detail & Related papers (2021-05-29T05:26:07Z)
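For reference, the class activation map (CAM) strategy mentioned in the weakly supervised segmentation entry above can be sketched as follows: the final convolutional feature maps are weighted by the classifier weights of a target class to obtain a coarse localization map. The tiny backbone below is a hypothetical stand-in, not the network of any of the cited papers.

```python
# Minimal class activation map (CAM) sketch with a hypothetical tiny classifier.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyClassifier(nn.Module):
    def __init__(self, num_classes: int = 10, channels: int = 32):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, channels, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.fc = nn.Linear(channels, num_classes)   # applied after global average pooling

    def forward(self, x):
        feats = self.features(x)                     # (B, C, h, w)
        pooled = feats.mean(dim=(2, 3))              # global average pooling
        return self.fc(pooled), feats


def class_activation_map(model: TinyClassifier, image: torch.Tensor, cls: int) -> torch.Tensor:
    """Weight the final feature maps by the classifier weights of class `cls`."""
    _, feats = model(image)
    weights = model.fc.weight[cls]                   # (C,)
    cam = F.relu(torch.einsum("bchw,c->bhw", feats, weights))
    # Upsample to the input resolution and normalize to [0, 1] for use as a seed region.
    cam = F.interpolate(cam.unsqueeze(1), size=image.shape[-2:], mode="bilinear",
                        align_corners=False).squeeze(1)
    return cam / (cam.amax(dim=(1, 2), keepdim=True) + 1e-6)


if __name__ == "__main__":
    model = TinyClassifier()
    cam = class_activation_map(model, torch.randn(1, 3, 64, 64), cls=3)
    print(cam.shape)  # torch.Size([1, 64, 64])
```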