Transformer Meets Convolution: A Bilateral Awareness Network for
Semantic Segmentation of Very Fine Resolution Urban Scene Images
- URL: http://arxiv.org/abs/2106.12413v1
- Date: Wed, 23 Jun 2021 13:57:36 GMT
- Title: Transformer Meets Convolution: A Bilateral Awareness Network for
Semantic Segmentation of Very Fine Resolution Urban Scene Images
- Authors: Libo Wang, Rui Li, Dongzhi Wang, Chenxi Duan, Teng Wang, Xiaoliang
Meng
- Abstract summary: We propose a bilateral awareness network (BANet) which contains a dependency path and a texture path.
BANet captures the long-range relationships and fine-grained details in VFR images.
Experiments conducted on three large-scale urban scene image segmentation datasets, i.e., the ISPRS Vaihingen dataset, the ISPRS Potsdam dataset, and the UAVid dataset, demonstrate the effectiveness of BANet.
- Score: 6.460167724233707
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Semantic segmentation from very fine resolution (VFR) urban scene images
plays a significant role in several application scenarios, including autonomous
driving, land cover classification, and urban planning. However, the
tremendous detail contained in VFR images severely limits the potential of
existing deep learning approaches. More seriously, the considerable
variations in the scale and appearance of objects further deteriorate the
representational capacity of those semantic segmentation methods, leading to
the confusion of adjacent objects. Addressing such issues represents a
promising research field in the remote sensing community, which paves the way
for scene-level landscape pattern analysis and decision making. In this
manuscript, we propose a bilateral awareness network (BANet) which contains a
dependency path and a texture path to fully capture the long-range
relationships and fine-grained details in VFR images. Specifically, the
dependency path is built on ResT, a novel Transformer backbone
with memory-efficient multi-head self-attention, while the texture path is
built on stacked convolution operations. Besides, using the linear
attention mechanism, a feature aggregation module (FAM) is designed to
effectively fuse the dependency features and texture features. Extensive
experiments conducted on three large-scale urban scene image segmentation
datasets, i.e., the ISPRS Vaihingen dataset, the ISPRS Potsdam dataset, and the UAVid
dataset, demonstrate the effectiveness of our BANet. Specifically, a 64.6%
mIoU is achieved on the UAVid dataset.
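The feature aggregation module fuses the two paths using a linear attention mechanism. As a rough illustration of why linear attention is attractive here, the sketch below implements kernelized attention with the common ELU+1 feature map, which reduces the cost from O(N²·d) for softmax attention to O(N·d²) over N tokens. This is a minimal, dependency-free sketch, not the authors' FAM implementation; the function names and toy dimensions are assumptions for illustration.

```python
import math

def elu_plus_one(x):
    # Feature map phi(x) = ELU(x) + 1; always positive, so attention
    # weights phi(q).phi(k) are non-negative and normalizable.
    return x + 1.0 if x >= 0 else math.exp(x)  # ELU(x) + 1 = e^x for x < 0

def linear_attention(Q, K, V):
    """Kernelized (linear) attention over N tokens.

    out_i = phi(q_i) . S / (phi(q_i) . z), where
    S = sum_j phi(k_j) v_j^T  (a d x dv summary matrix) and
    z = sum_j phi(k_j)        (the normalizer).
    Computing S and z once avoids the N x N attention matrix.
    """
    n, d = len(Q), len(Q[0])
    dv = len(V[0])
    phi_q = [[elu_plus_one(x) for x in row] for row in Q]
    phi_k = [[elu_plus_one(x) for x in row] for row in K]
    # Summary matrix S and normalizer z, each a single pass over the keys.
    S = [[sum(phi_k[j][a] * V[j][b] for j in range(n)) for b in range(dv)]
         for a in range(d)]
    z = [sum(phi_k[j][a] for j in range(n)) for a in range(d)]
    out = []
    for i in range(n):
        denom = sum(phi_q[i][a] * z[a] for a in range(d))
        out.append([sum(phi_q[i][a] * S[a][b] for a in range(d)) / denom
                    for b in range(dv)])
    return out
```

Because the implicit weights are positive and normalized, each output row is a convex combination of the value rows, just as in softmax attention, but the N x N score matrix is never materialized.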
Related papers
- BEVPose: Unveiling Scene Semantics through Pose-Guided Multi-Modal BEV Alignment [8.098296280937518]
We present BEVPose, a framework that integrates BEV representations from camera and lidar data, using sensor pose as a guiding supervisory signal.
By leveraging pose information, we align and fuse multi-modal sensory inputs, facilitating the learning of latent BEV embeddings that capture both geometric and semantic aspects of the environment.
arXiv Detail & Related papers (2024-10-28T12:40:27Z) - Deep Multimodal Fusion for Semantic Segmentation of Remote Sensing Earth Observation Data [0.08192907805418582]
This paper proposes a late fusion deep learning model (LF-DLM) for semantic segmentation.
One branch integrates detailed textures from aerial imagery captured by the UNetFormer with a Multi-Axis Vision Transformer (MaxViT) backbone.
The other branch captures complex spatio-temporal dynamics from the Sentinel-2 satellite image time series using a U-Net with Temporal Attention Encoder (U-TAE).
arXiv Detail & Related papers (2024-10-01T07:50:37Z) - Boosting Cross-Domain Point Classification via Distilling Relational Priors from 2D Transformers [59.0181939916084]
Traditional 3D networks mainly focus on local geometric details and ignore the topological structure between local geometries.
We propose a novel Relational Priors Distillation (RPD) method to extract relational priors from well-trained transformers on massive images.
Experiments on the PointDA-10 and the Sim-to-Real datasets verify that the proposed method consistently achieves the state-of-the-art performance of UDA for point cloud classification.
arXiv Detail & Related papers (2024-07-26T06:29:09Z) - Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation [63.15257949821558]
Referring Remote Sensing Image (RRSIS) is a new challenge that combines computer vision and natural language processing.
Traditional Referring Image (RIS) approaches have been impeded by the complex spatial scales and orientations found in aerial imagery.
We introduce the Rotated Multi-Scale Interaction Network (RMSIN), an innovative approach designed for the unique demands of RRSIS.
arXiv Detail & Related papers (2023-12-19T08:14:14Z) - Progressively Dual Prior Guided Few-shot Semantic Segmentation [57.37506990980975]
Few-shot semantic segmentation task aims at performing segmentation in query images with a few annotated support samples.
We propose a progressively dual prior guided few-shot semantic segmentation network.
arXiv Detail & Related papers (2022-11-20T16:19:47Z) - Aerial Images Meet Crowdsourced Trajectories: A New Approach to Robust
Road Extraction [110.61383502442598]
We introduce a novel neural network framework termed Cross-Modal Message Propagation Network (CMMPNet)
CMMPNet is composed of two deep Auto-Encoders for modality-specific representation learning and a tailor-designed Dual Enhancement Module for cross-modal representation refinement.
Experiments on three real-world benchmarks demonstrate the effectiveness of our CMMPNet for robust road extraction.
arXiv Detail & Related papers (2021-11-30T04:30:10Z) - Learning to Aggregate Multi-Scale Context for Instance Segmentation in
Remote Sensing Images [28.560068780733342]
A novel context aggregation network (CATNet) is proposed to improve the feature extraction process.
The proposed model exploits three lightweight plug-and-play modules, namely the dense feature pyramid network (DenseFPN), spatial context pyramid (SCP), and hierarchical region of interest extractor (HRoIE).
arXiv Detail & Related papers (2021-11-22T08:55:25Z) - Looking Outside the Window: Wider-Context Transformer for the Semantic
Segmentation of High-Resolution Remote Sensing Images [18.161847218988964]
We propose a Wider-Context Network (WiCNet) for the semantic segmentation of High-Resolution (HR) Remote Sensing Images (RSIs)
In the WiCNet, apart from a conventional feature extraction network, an extra context branch is designed to explicitly model the context information in a larger image area.
The information between the two branches is communicated through a Context Transformer, which is a novel design derived from the Vision Transformer to model the long-range context correlations.
arXiv Detail & Related papers (2021-06-29T23:41:54Z) - High-resolution Depth Maps Imaging via Attention-based Hierarchical
Multi-modal Fusion [84.24973877109181]
We propose a novel attention-based hierarchical multi-modal fusion network for guided DSR.
We show that our approach outperforms state-of-the-art methods in terms of reconstruction accuracy, running speed and memory efficiency.
arXiv Detail & Related papers (2021-04-04T03:28:33Z) - Adaptive Context-Aware Multi-Modal Network for Depth Completion [107.15344488719322]
We propose to adopt the graph propagation to capture the observed spatial contexts.
We then apply the attention mechanism on the propagation, which encourages the network to model the contextual information adaptively.
Finally, we introduce the symmetric gated fusion strategy to exploit the extracted multi-modal features effectively.
Our model, named Adaptive Context-Aware Multi-Modal Network (ACMNet), achieves the state-of-the-art performance on two benchmarks.
arXiv Detail & Related papers (2020-08-25T06:00:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.