Transformer Meets Convolution: A Bilateral Awareness Net-work for
Semantic Segmentation of Very Fine Resolution Ur-ban Scene Images
- URL: http://arxiv.org/abs/2106.12413v1
- Date: Wed, 23 Jun 2021 13:57:36 GMT
- Title: Transformer Meets Convolution: A Bilateral Awareness Net-work for
Semantic Segmentation of Very Fine Resolution Ur-ban Scene Images
- Authors: Libo Wang, Rui Li, Dongzhi Wang, Chenxi Duan, Teng Wang, Xiaoliang
Meng
- Abstract summary: We propose a bilateral awareness network (BANet) which contains a dependency path and a texture path.
BANet captures the long-range relationships and fine-grained details in VFR images.
Experiments conducted on the three large-scale urban scene image segmentation datasets, i.e., ISPRS Vaihingen dataset, ISPRS Potsdam dataset, and UAVid dataset, demonstrate the effective-ness of BANet.
- Score: 6.460167724233707
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Semantic segmentation from very fine resolution (VFR) urban scene images
plays a significant role in several application scenarios including autonomous
driving, land cover classification, and urban planning, etc. However, the
tremendous details contained in the VFR image severely limit the potential of
the existing deep learning approaches. More seriously, the considerable
variations in scale and appearance of objects further deteriorate the
representational capacity of those se-mantic segmentation methods, leading to
the confusion of adjacent objects. Addressing such is-sues represents a
promising research field in the remote sensing community, which paves the way
for scene-level landscape pattern analysis and decision making. In this
manuscript, we pro-pose a bilateral awareness network (BANet) which contains a
dependency path and a texture path to fully capture the long-range
relationships and fine-grained details in VFR images. Specif-ically, the
dependency path is conducted based on the ResT, a novel Transformer backbone
with memory-efficient multi-head self-attention, while the texture path is
built on the stacked convo-lution operation. Besides, using the linear
attention mechanism, a feature aggregation module (FAM) is designed to
effectively fuse the dependency features and texture features. Extensive
experiments conducted on the three large-scale urban scene image segmentation
datasets, i.e., ISPRS Vaihingen dataset, ISPRS Potsdam dataset, and UAVid
dataset, demonstrate the effective-ness of our BANet. Specifically, a 64.6%
mIoU is achieved on the UAVid dataset.
Related papers
- Breaking the Frame: Image Retrieval by Visual Overlap Prediction [53.17564423756082]
We propose a novel visual place recognition approach, VOP, that efficiently addresses occlusions and complex scenes.
The proposed method enables the identification of visible image sections without requiring expensive feature detection and matching.
arXiv Detail & Related papers (2024-06-23T20:00:20Z) - Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation [63.15257949821558]
Referring Remote Sensing Image (RRSIS) is a new challenge that combines computer vision and natural language processing.
Traditional Referring Image (RIS) approaches have been impeded by the complex spatial scales and orientations found in aerial imagery.
We introduce the Rotated Multi-Scale Interaction Network (RMSIN), an innovative approach designed for the unique demands of RRSIS.
arXiv Detail & Related papers (2023-12-19T08:14:14Z) - Hi-ResNet: A High-Resolution Remote Sensing Network for Semantic
Segmentation [7.216053041550996]
High-resolution remote sensing (HRS) semantic segmentation extracts key objects from high-resolution coverage areas.
objects of the same category within HRS images show significant differences in scale and shape across diverse geographical environments.
We propose a High-resolution remote sensing network (Hi-ResNet) with efficient network structure designs.
arXiv Detail & Related papers (2023-05-22T03:58:25Z) - Hierarchical Disentanglement-Alignment Network for Robust SAR Vehicle
Recognition [18.38295403066007]
HDANet integrates feature disentanglement and alignment into a unified framework.
The proposed method demonstrates impressive robustness across nine operating conditions in the MSTAR dataset.
arXiv Detail & Related papers (2023-04-07T09:11:29Z) - Progressively Dual Prior Guided Few-shot Semantic Segmentation [57.37506990980975]
Few-shot semantic segmentation task aims at performing segmentation in query images with a few annotated support samples.
We propose a progressively dual prior guided few-shot semantic segmentation network.
arXiv Detail & Related papers (2022-11-20T16:19:47Z) - Aerial Images Meet Crowdsourced Trajectories: A New Approach to Robust
Road Extraction [110.61383502442598]
We introduce a novel neural network framework termed Cross-Modal Message Propagation Network (CMMPNet)
CMMPNet is composed of two deep Auto-Encoders for modality-specific representation learning and a tailor-designed Dual Enhancement Module for cross-modal representation refinement.
Experiments on three real-world benchmarks demonstrate the effectiveness of our CMMPNet for robust road extraction.
arXiv Detail & Related papers (2021-11-30T04:30:10Z) - Learning to Aggregate Multi-Scale Context for Instance Segmentation in
Remote Sensing Images [28.560068780733342]
A novel context aggregation network (CATNet) is proposed to improve the feature extraction process.
The proposed model exploits three lightweight plug-and-play modules, namely dense feature pyramid network (DenseFPN), spatial context pyramid ( SCP), and hierarchical region of interest extractor (HRoIE)
arXiv Detail & Related papers (2021-11-22T08:55:25Z) - Looking Outside the Window: Wider-Context Transformer for the Semantic
Segmentation of High-Resolution Remote Sensing Images [18.161847218988964]
We propose a Wider-Context Network (WiCNet) for the semantic segmentation of High-Resolution (HR) Remote Sensing Images (RSIs)
In the WiCNet, apart from a conventional feature extraction network, an extra context branch is designed to explicitly model the context information in a larger image area.
The information between the two branches is communicated through a Context Transformer, which is a novel design derived from the Vision Transformer to model the long-range context correlations.
arXiv Detail & Related papers (2021-06-29T23:41:54Z) - High-resolution Depth Maps Imaging via Attention-based Hierarchical
Multi-modal Fusion [84.24973877109181]
We propose a novel attention-based hierarchical multi-modal fusion network for guided DSR.
We show that our approach outperforms state-of-the-art methods in terms of reconstruction accuracy, running speed and memory efficiency.
arXiv Detail & Related papers (2021-04-04T03:28:33Z) - Adaptive Context-Aware Multi-Modal Network for Depth Completion [107.15344488719322]
We propose to adopt the graph propagation to capture the observed spatial contexts.
We then apply the attention mechanism on the propagation, which encourages the network to model the contextual information adaptively.
Finally, we introduce the symmetric gated fusion strategy to exploit the extracted multi-modal features effectively.
Our model, named Adaptive Context-Aware Multi-Modal Network (ACMNet), achieves the state-of-the-art performance on two benchmarks.
arXiv Detail & Related papers (2020-08-25T06:00:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.