PyramidMamba: Rethinking Pyramid Feature Fusion with Selective Space State Model for Semantic Segmentation of Remote Sensing Imagery
- URL: http://arxiv.org/abs/2406.10828v1
- Date: Sun, 16 Jun 2024 07:43:40 GMT
- Title: PyramidMamba: Rethinking Pyramid Feature Fusion with Selective Space State Model for Semantic Segmentation of Remote Sensing Imagery
- Authors: Libo Wang, Dongxu Li, Sijun Dong, Xiaoliang Meng, Xiaokang Zhang, Danfeng Hong
- Abstract summary: We propose a novel Mamba-based segmentation network, namely PyramidMamba.
Specifically, we design a dense spatial pyramid pooling (DSPP) to encode rich multi-scale semantic features and a pyramid fusion Mamba (PFM) to reduce semantic redundancy in multi-scale feature fusion.
Our PyramidMamba yields state-of-the-art performance on three publicly available datasets.
- Score: 30.522327480291295
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Semantic segmentation, as a basic tool for intelligent interpretation of remote sensing images, plays a vital role in many Earth Observation (EO) applications. Nowadays, accurate semantic segmentation of remote sensing images remains a challenge due to the complex spatial-temporal scenes and multi-scale geo-objects. Driven by the wave of deep learning (DL), CNN- and Transformer-based semantic segmentation methods have been explored widely, and these two architectures both revealed the importance of multi-scale feature representation for strengthening semantic information of geo-objects. However, the actual multi-scale feature fusion often comes with the semantic redundancy issue due to homogeneous semantic contents in pyramid features. To handle this issue, we propose a novel Mamba-based segmentation network, namely PyramidMamba. Specifically, we design a plug-and-play decoder, which develops a dense spatial pyramid pooling (DSPP) to encode rich multi-scale semantic features and a pyramid fusion Mamba (PFM) to reduce semantic redundancy in multi-scale feature fusion. Comprehensive ablation experiments illustrate the effectiveness and superiority of the proposed method in enhancing multi-scale feature representation as well as the great potential for real-time semantic segmentation. Moreover, our PyramidMamba yields state-of-the-art performance on three publicly available datasets, i.e. the OpenEarthMap (70.8% mIoU), ISPRS Vaihingen (84.8% mIoU) and Potsdam (88.0% mIoU) datasets. The code will be available at https://github.com/WangLibo1995/GeoSeg.
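The abstract names the two decoder modules but gives no implementation details. As a rough illustration only, here is a minimal, hypothetical PyTorch sketch of the two ideas: a DSPP-style module that pools at several grid sizes and densely concatenates the results, and a pyramid-fusion step that treats the stack of pyramid levels at each spatial position as a short sequence for a sequence model to mix. An `nn.LSTM` stands in for the Mamba selective state space block; all module names, grid sizes, and channel widths are assumptions, not the authors' code, for which see the linked GeoSeg repository.
```python
# Hypothetical sketch of a DSPP-style pooling module and a sequence-based
# pyramid fusion step. An LSTM stands in for the Mamba selective SSM block.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DSPPSketch(nn.Module):
    """Pool the input at several grid sizes, project each pooled map,
    upsample back, and densely concatenate with the input (assumed design)."""
    def __init__(self, in_ch, out_ch, grids=(1, 2, 4, 8)):
        super().__init__()
        self.grids = grids
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, 1, bias=False) for _ in grids)
        self.project = nn.Sequential(
            nn.Conv2d(in_ch + out_ch * len(grids), out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = [x]
        for g, conv in zip(self.grids, self.branches):
            y = conv(F.adaptive_avg_pool2d(x, g))
            feats.append(F.interpolate(y, (h, w), mode="bilinear",
                                       align_corners=False))
        return self.project(torch.cat(feats, dim=1))

class PyramidFusionSketch(nn.Module):
    """Align pyramid levels to the finest resolution, then mix the per-pixel
    stack of levels as a short sequence; the last state is the fused feature."""
    def __init__(self, ch):
        super().__init__()
        self.mixer = nn.LSTM(ch, ch, batch_first=True)  # stand-in for Mamba
        self.out = nn.Conv2d(ch, ch, 1)

    def forward(self, levels):           # levels[0] assumed finest resolution
        h, w = levels[0].shape[-2:]
        aligned = [F.interpolate(l, (h, w), mode="bilinear",
                                 align_corners=False) for l in levels]
        seq = torch.stack([a.flatten(2).transpose(1, 2) for a in aligned], 2)
        b, n, L, c = seq.shape           # (batch, H*W, levels, channels)
        mixed, _ = self.mixer(seq.reshape(b * n, L, c))
        fused = mixed[:, -1].reshape(b, n, c).transpose(1, 2)
        return self.out(fused.reshape(b, c, h, w))

if __name__ == "__main__":
    levels = [torch.randn(2, 64, s, s) for s in (32, 16, 8)]
    print(PyramidFusionSketch(64)(levels).shape)  # torch.Size([2, 64, 32, 32])
```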
Related papers
- Deep Multimodal Fusion for Semantic Segmentation of Remote Sensing Earth Observation Data [0.08192907805418582]
This paper proposes a late fusion deep learning model (LF-DLM) for semantic segmentation.
One branch integrates detailed textures from aerial imagery captured by UNetFormer with a Multi-Axis Vision Transformer (MaxViT) backbone.
The other branch captures complex spatio-temporal dynamics from the Sentinel-2 satellite image time series using a U-Net with Temporal Attention Encoder (U-TAE).
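The summary does not say how the two branch predictions are combined. As a generic illustration of late fusion, here is a hypothetical sketch in which each modality branch produces its own class logits and the result is a learned weighted average; the branch modules and weighting scheme are assumptions, not the paper's design.
```python
# Hypothetical late-fusion head: each modality branch predicts class logits
# independently; the fused output is a learned softmax-weighted average.
import torch
import torch.nn as nn

class LateFusionSketch(nn.Module):
    def __init__(self, branches):
        super().__init__()
        self.branches = nn.ModuleList(branches)
        self.weights = nn.Parameter(torch.zeros(len(branches)))

    def forward(self, inputs):
        # one input tensor per branch; branches assumed to output
        # same-shape logits of shape (B, num_classes, H, W)
        logits = torch.stack([b(x) for b, x in zip(self.branches, inputs)])
        w = torch.softmax(self.weights, 0).view(-1, 1, 1, 1, 1)
        return (w * logits).sum(0)
```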
arXiv Detail & Related papers (2024-10-01T07:50:37Z) - Multi-view Aggregation Network for Dichotomous Image Segmentation [76.75904424539543]
Dichotomous Image Segmentation (DIS) has recently emerged, targeting high-precision object segmentation from high-resolution natural images.
Existing methods rely on tedious multiple encoder-decoder streams and stages to gradually complete the global localization and local refinement.
Inspired by this, we model DIS as a multi-view object perception problem and propose a parsimonious multi-view aggregation network (MVANet).
Experiments on the popular DIS-5K dataset show that our MVANet significantly outperforms state-of-the-art methods in both accuracy and speed.
arXiv Detail & Related papers (2024-04-11T03:00:00Z) - MetaSegNet: Metadata-collaborative Vision-Language Representation Learning for Semantic Segmentation of Remote Sensing Images [7.0622873873577054]
We propose a novel metadata-collaborative segmentation network (MetaSegNet) for semantic segmentation of remote sensing images.
Unlike the common model structure that only uses unimodal visual data, we extract key characteristics from freely available remote sensing image metadata.
We construct an image encoder, a text encoder, and a cross-modal attention fusion subnetwork to extract image and text features.
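The entry names a cross-modal attention fusion subnetwork without detail. A minimal hypothetical sketch, in which image tokens attend to text tokens via standard multi-head attention, might look like this; the dimensions and residual/norm placement are assumptions.
```python
# Hypothetical cross-modal fusion: image tokens query text tokens.
import torch
import torch.nn as nn

class CrossModalFusionSketch(nn.Module):
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, img_tokens, txt_tokens):
        # img_tokens: (B, N_img, dim); txt_tokens: (B, N_txt, dim)
        out, _ = self.attn(img_tokens, txt_tokens, txt_tokens)
        return self.norm(img_tokens + out)  # residual + norm (assumed)
```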
arXiv Detail & Related papers (2023-12-20T03:16:34Z) - MCFNet: Multi-scale Covariance Feature Fusion Network for Real-time Semantic Segmentation [6.0118706234809975]
We propose a new architecture based on the Bilateral Segmentation Network (BiSeNet) called Multi-scale Covariance Feature Fusion Network (MCFNet).
Specifically, this network introduces a new feature refinement module and a new feature fusion module.
We evaluate our proposed model on Cityscapes, CamVid datasets and compare it with the state-of-the-art methods.
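The summary only names the refinement and fusion modules. As one plausible reading of "covariance feature fusion", here is a hypothetical sketch that reweights channels of a merged feature map by second-order (channel-covariance) statistics; MCFNet's actual formulation may differ.
```python
# Hypothetical covariance-based fusion: reweight channels of the merged
# feature map using second-order (channel covariance) statistics.
import torch
import torch.nn as nn

class CovarianceFusionSketch(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.fc = nn.Linear(ch, ch)

    def forward(self, a, b):             # two same-shape feature maps
        x = a + b                         # (B, C, H, W)
        f = x.flatten(2)                  # (B, C, H*W)
        f = f - f.mean(2, keepdim=True)
        cov = f @ f.transpose(1, 2) / f.shape[2]   # (B, C, C)
        w = torch.sigmoid(self.fc(cov.mean(2)))    # per-channel weight
        return x * w[:, :, None, None]
```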
arXiv Detail & Related papers (2023-12-12T12:20:27Z) - Multi-interactive Feature Learning and a Full-time Multi-modality Benchmark for Image Fusion and Segmentation [66.15246197473897]
Multi-modality image fusion and segmentation play a vital role in autonomous driving and robotic operation.
We propose a Multi-interactive Feature learning architecture for image fusion and Segmentation.
arXiv Detail & Related papers (2023-08-04T01:03:58Z) - S$^2$-FPN: Scale-aware Strip Attention Guided Feature Pyramid Network for Real-time Semantic Segmentation [6.744210626403423]
This paper presents a new model to achieve a trade-off between accuracy and speed for real-time road scene semantic segmentation.
Specifically, we propose a lightweight model named Scale-aware Strip Attention Guided Feature Pyramid Network (S$^2$-FPN).
Our network consists of three main modules: Attention Pyramid Fusion (APF) module, Scale-aware Strip Attention Module (SSAM), and Global Feature Upsample (GFU) module.
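The three module names above are all the summary gives. As an illustration of the general "strip attention" idea, here is a hypothetical sketch that pools the feature map into horizontal and vertical strips and uses them to gate the input; S$^2$-FPN's actual SSAM is likely more elaborate.
```python
# Hypothetical strip attention: pool along each spatial axis in turn and
# gate the input with the broadcast sum of the two strip descriptors.
import torch
import torch.nn as nn
import torch.nn.functional as F

class StripAttentionSketch(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.h_conv = nn.Conv2d(ch, ch, (1, 3), padding=(0, 1))
        self.v_conv = nn.Conv2d(ch, ch, (3, 1), padding=(1, 0))
        self.gate = nn.Conv2d(ch, ch, 1)

    def forward(self, x):
        h, w = x.shape[-2:]
        horiz = self.h_conv(F.adaptive_avg_pool2d(x, (1, w)))  # (B, C, 1, W)
        vert = self.v_conv(F.adaptive_avg_pool2d(x, (h, 1)))   # (B, C, H, 1)
        attn = torch.sigmoid(self.gate(horiz + vert))          # broadcasts
        return x * attn
```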
arXiv Detail & Related papers (2022-06-15T05:02:49Z) - Deep Sensor Fusion with Pyramid Fusion Networks for 3D Semantic Segmentation [0.0]
This work presents a pyramid-based deep fusion architecture for lidar and camera to improve 3D semantic segmentation of traffic scenes.
A novel Pyramid Fusion Backbone fuses feature maps at different scales to compute valuable multimodal, multi-scale features.
The approach is evaluated on two challenging outdoor datasets and different fusion strategies and setups are investigated.
arXiv Detail & Related papers (2022-05-26T20:57:19Z) - AF$_2$: Adaptive Focus Framework for Aerial Imagery Segmentation [86.44683367028914]
Aerial imagery segmentation has some unique challenges, the most critical of which is foreground-background imbalance.
We propose the Adaptive Focus Framework (AF$_2$), which adopts a hierarchical segmentation procedure and focuses on adaptively utilizing multi-scale representations.
AF$_2$ significantly improves accuracy on three widely used aerial benchmarks while running as fast as mainstream methods.
arXiv Detail & Related papers (2022-02-18T10:14:45Z) - Improving Video Instance Segmentation via Temporal Pyramid Routing [61.10753640148878]
Video Instance Segmentation (VIS) is a new and inherently multi-task problem that aims to detect, segment, and track each instance in a video sequence.
We propose a Temporal Pyramid Routing (TPR) strategy to conditionally align and conduct pixel-level aggregation from a feature pyramid pair of two adjacent frames.
Our approach is a plug-and-play module and can be easily applied to existing instance segmentation methods.
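The routing mechanism itself is not described here. A minimal hypothetical sketch of per-level temporal aggregation, where a learned gate blends the current frame's pyramid level with the previous frame's, could look as follows; TPR's actual conditional alignment is more sophisticated.
```python
# Hypothetical per-level temporal aggregation: a learned gate blends the
# current frame's features with the previous frame's at each pyramid level.
import torch
import torch.nn as nn

class TemporalGateSketch(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.gate = nn.Conv2d(2 * ch, ch, 1)

    def forward(self, cur, prev):        # same-shape features of frames t, t-1
        g = torch.sigmoid(self.gate(torch.cat([cur, prev], 1)))
        return g * cur + (1 - g) * prev  # per-pixel, per-channel blend
```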
arXiv Detail & Related papers (2021-07-28T03:57:12Z) - NeuralFusion: Online Depth Fusion in Latent Space [77.59420353185355]
We present a novel online depth map fusion approach that learns depth map aggregation in a latent feature space.
Our approach is real-time capable, handles high noise levels, and is particularly able to deal with gross outliers common for photometric stereo-based depth maps.
arXiv Detail & Related papers (2020-11-30T13:50:59Z) - Cross-layer Feature Pyramid Network for Salient Object Detection [102.20031050972429]
We propose a novel Cross-layer Feature Pyramid Network to improve the progressive fusion in salient object detection.
The features distributed to each layer carry both semantics and salient details from all other layers simultaneously, with reduced loss of important information.
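As a rough illustration of distributing every layer's features to every other layer, here is a hypothetical sketch in which each output level sums resized, 1x1-projected copies of all input levels; the paper's actual cross-layer scheme may weight and combine them differently.
```python
# Hypothetical cross-layer fusion: each output level aggregates resized,
# 1x1-projected copies of every input pyramid level.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossLayerFusionSketch(nn.Module):
    def __init__(self, ch, num_levels):
        super().__init__()
        self.proj = nn.ModuleList(
            nn.Conv2d(ch, ch, 1) for _ in range(num_levels))

    def forward(self, levels):
        outs = []
        for tgt in levels:
            size = tgt.shape[-2:]
            outs.append(sum(
                F.interpolate(p(l), size, mode="bilinear", align_corners=False)
                for p, l in zip(self.proj, levels)))
        return outs
```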
arXiv Detail & Related papers (2020-02-25T14:06:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.