SegMAN: Omni-scale Context Modeling with State Space Models and Local Attention for Semantic Segmentation
- URL: http://arxiv.org/abs/2412.11890v2
- Date: Thu, 27 Mar 2025 14:15:45 GMT
- Title: SegMAN: Omni-scale Context Modeling with State Space Models and Local Attention for Semantic Segmentation
- Authors: Yunxiang Fu, Meng Lou, Yizhou Yu
- Abstract summary: High-quality semantic segmentation relies on three key capabilities: global context modeling, local detail encoding, and multi-scale feature extraction. We introduce SegMAN, a novel linear-time model comprising a hybrid feature encoder dubbed SegMAN Encoder, and a decoder based on state space models.
- Score: 45.68176825375723
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: High-quality semantic segmentation relies on three key capabilities: global context modeling, local detail encoding, and multi-scale feature extraction. However, recent methods struggle to possess all these capabilities simultaneously. Hence, we aim to empower segmentation networks to simultaneously carry out efficient global context modeling, high-quality local detail encoding, and rich multi-scale feature representation for varying input resolutions. In this paper, we introduce SegMAN, a novel linear-time model comprising a hybrid feature encoder dubbed SegMAN Encoder, and a decoder based on state space models. Specifically, the SegMAN Encoder synergistically integrates sliding local attention with dynamic state space models, enabling highly efficient global context modeling while preserving fine-grained local details. Meanwhile, the MMSCopE module in our decoder enhances multi-scale context feature extraction and adaptively scales with the input resolution. Our SegMAN-B Encoder achieves 85.1% ImageNet-1k accuracy (+1.5% over VMamba-S with fewer parameters). When paired with our decoder, the full SegMAN-B model achieves 52.6% mIoU on ADE20K (+1.6% over SegNeXt-L with 15% fewer GFLOPs), 83.8% mIoU on Cityscapes (+2.1% over SegFormer-B3 with half the GFLOPs), and 1.6% higher mIoU than VWFormer-B3 on COCO-Stuff with lower GFLOPs. Our code is available at https://github.com/yunxiangfu2001/SegMAN.
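For illustration, below is a minimal PyTorch sketch of the kind of hybrid token-mixing block the abstract describes: local attention for fine-grained detail running alongside a linear-time state-space-style scan for global context. This is not the authors' implementation; the class names, the non-overlapping window partitioning, the simple concatenation-based fusion, and the simplified (non-selective) recurrence are all assumptions made for the sketch.

```python
# Illustrative sketch only: NOT the SegMAN implementation. Class names, window
# size, fusion scheme, and the toy (non-selective) SSM recurrence are assumptions.
import torch
import torch.nn as nn


class LocalWindowAttention(nn.Module):
    """Self-attention restricted to non-overlapping windows; a stand-in for
    the sliding local attention described in the abstract."""

    def __init__(self, dim, window=7, heads=4):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                       # x: (B, H, W, C)
        B, H, W, C = x.shape
        w = self.window
        # partition the feature map into (H//w * W//w) windows of w*w tokens
        x = x.view(B, H // w, w, W // w, w, C).permute(0, 1, 3, 2, 4, 5)
        x = x.reshape(-1, w * w, C)
        x, _ = self.attn(x, x, x)
        x = x.view(B, H // w, W // w, w, w, C).permute(0, 1, 3, 2, 4, 5)
        return x.reshape(B, H, W, C)


class SimpleSSMScan(nn.Module):
    """A toy linear-time recurrence over the flattened token sequence,
    standing in for a dynamic state space model (e.g., a 2D selective scan)."""

    def __init__(self, dim):
        super().__init__()
        self.decay = nn.Parameter(torch.zeros(dim))   # per-channel state decay (learned)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                       # x: (B, H, W, C)
        B, H, W, C = x.shape
        seq = x.reshape(B, H * W, C)
        state = torch.zeros(B, C, device=x.device, dtype=x.dtype)
        a = torch.sigmoid(self.decay)           # keep the recurrence stable in (0, 1)
        outs = []
        for t in range(seq.shape[1]):           # O(N) scan; real models fuse this on GPU
            state = a * state + (1 - a) * seq[:, t]
            outs.append(state)
        return self.proj(torch.stack(outs, dim=1)).reshape(B, H, W, C)


class HybridBlock(nn.Module):
    """Local attention and the SSM scan applied in parallel, then fused."""

    def __init__(self, dim, window=7):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.local = LocalWindowAttention(dim, window)
        self.glob = SimpleSSMScan(dim)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, x):                       # x: (B, H, W, C)
        h = self.norm(x)
        return x + self.fuse(torch.cat([self.local(h), self.glob(h)], dim=-1))


if __name__ == "__main__":
    feat = torch.randn(2, 56, 56, 64)           # H and W must be multiples of the window
    print(HybridBlock(64)(feat).shape)          # torch.Size([2, 56, 56, 64])
```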
Related papers
- LCM: Locally Constrained Compact Point Cloud Model for Masked Point Modeling [47.94285833315427]
We propose a Locally constrained Compact point cloud Model (LCM) consisting of a locally constrained compact encoder and a locally constrained Mamba-based decoder.
Our encoder replaces self-attention with our local aggregation layers to achieve an elegant balance between performance and efficiency.
The Mamba-based decoder ensures linear complexity while maximizing the perception of point cloud geometry information from unmasked patches with higher information density.
arXiv Detail & Related papers (2024-05-27T13:19:23Z)
- SCALAR-NeRF: SCAlable LARge-scale Neural Radiance Fields for Scene Reconstruction [66.69049158826677]
We introduce SCALAR-NeRF, a novel framework tailored for scalable large-scale neural scene reconstruction.
We structure the neural representation as an encoder-decoder architecture, where the encoder processes 3D point coordinates to produce encoded features.
We propose an effective and efficient methodology to fuse the outputs from these local models to attain the final reconstruction.
arXiv Detail & Related papers (2023-11-28T10:18:16Z)
- SegViTv2: Exploring Efficient and Continual Semantic Segmentation with Plain Vision Transformers [76.13755422671822]
This paper investigates the capability of plain Vision Transformers (ViTs) for semantic segmentation using the encoder-decoder framework.
We introduce a novel Attention-to-Mask (atm) module to design a lightweight decoder effective for plain ViT.
Our decoder outperforms the popular decoder UPerNet using various ViT backbones while consuming only about 5% of the computational cost.
arXiv Detail & Related papers (2023-06-09T22:29:56Z)
- MUSTER: A Multi-scale Transformer-based Decoder for Semantic Segmentation [19.83103856355554]
MUSTER is a transformer-based decoder that seamlessly integrates with hierarchical encoders.
MSKA units enable the fusion of multi-scale features from the encoder and decoder, facilitating comprehensive information integration.
On the challenging ADE20K dataset, our best model achieves a single-scale mIoU of 50.23 and a multi-scale mIoU of 51.88.
arXiv Detail & Related papers (2022-11-25T06:51:07Z)
- MALUNet: A Multi-Attention and Light-weight UNet for Skin Lesion Segmentation [13.456935850832565]
We propose a lightweight model that achieves competitive performance for skin lesion segmentation at minimal parameter and computational cost.
We combine four modules with our U-shape architecture to obtain a lightweight medical image segmentation model dubbed MALUNet.
Compared with UNet, our model improves the mIoU and DSC metrics by 2.39% and 1.49%, respectively, with 44x fewer parameters and 166x lower computational complexity.
arXiv Detail & Related papers (2022-11-03T13:19:22Z)
- An efficient encoder-decoder architecture with top-down attention for speech separation [25.092542427133704]
We propose TDANet, a bio-inspired, efficient encoder-decoder architecture that mimics the brain's top-down attention.
On three benchmark datasets, TDANet consistently achieved separation performance competitive with previous state-of-the-art (SOTA) methods.
arXiv Detail & Related papers (2022-09-30T03:09:53Z)
- SegNeXt: Rethinking Convolutional Attention Design for Semantic Segmentation [100.89770978711464]
We present SegNeXt, a simple convolutional network architecture for semantic segmentation.
We show that convolutional attention is a more efficient and effective way to encode contextual information than the self-attention mechanism in transformers.
arXiv Detail & Related papers (2022-09-18T14:33:49Z)
- An Efficient Multi-Scale Fusion Network for 3D Organ at Risk (OAR) Segmentation [2.6770199357488242]
We propose a new OAR segmentation framework called OARFocalFuseNet.
It fuses multi-scale features and employs focal modulation for capturing global-local context across multiple scales.
Our best performing method (OARFocalFuseNet) obtained a Dice coefficient of 0.7995 and a Hausdorff distance of 5.1435 on the OpenKBP dataset.
arXiv Detail & Related papers (2022-08-15T19:40:18Z)
- LegoNN: Building Modular Encoder-Decoder Models [117.47858131603112]
State-of-the-art encoder-decoder models are constructed and trained end-to-end as an atomic unit.
No component of the model can be (re-)used without the others, making it impossible to share parts.
We describe LegoNN, a procedure for building encoder-decoder architectures whose parts can be applied to other tasks without the need for fine-tuning.
arXiv Detail & Related papers (2022-06-07T14:08:07Z)
- SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers [79.646577541655]
We present SegFormer, a semantic segmentation framework which unifies Transformers with lightweight multilayer perceptron (MLP) decoders.
SegFormer comprises a novel, hierarchically structured encoder that outputs multiscale features.
The proposed decoder aggregates information from different layers, thus combining both local and global attention to render powerful representations (see the sketch after this list).
arXiv Detail & Related papers (2021-05-31T17:59:51Z)
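Below is a minimal PyTorch sketch of an all-MLP decode head in the spirit of the SegFormer entry above: per-stage encoder features are projected to a shared width, upsampled to the highest-resolution stage, concatenated, and fused before per-pixel classification. The channel widths, class count, and module names are assumptions for illustration, not the released SegFormer implementation.

```python
# Illustrative sketch only: an all-MLP decode head in the spirit of SegFormer.
# Channel widths, class count, and names are assumptions, not the released code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AllMLPDecoder(nn.Module):
    def __init__(self, in_dims=(64, 128, 320, 512), embed_dim=256, num_classes=150):
        super().__init__()
        # one linear (1x1 conv) projection per encoder stage
        self.proj = nn.ModuleList(nn.Conv2d(d, embed_dim, 1) for d in in_dims)
        self.fuse = nn.Conv2d(embed_dim * len(in_dims), embed_dim, 1)
        self.head = nn.Conv2d(embed_dim, num_classes, 1)

    def forward(self, feats):          # feats: list of (B, C_i, H_i, W_i), coarser with depth
        target = feats[0].shape[2:]    # fuse everything at the finest stage's resolution
        ups = [F.interpolate(p(f), size=target, mode="bilinear", align_corners=False)
               for p, f in zip(self.proj, feats)]
        return self.head(self.fuse(torch.cat(ups, dim=1)))   # (B, num_classes, H_0, W_0)


if __name__ == "__main__":
    feats = [torch.randn(1, c, 64 // 2 ** i, 64 // 2 ** i)
             for i, c in enumerate((64, 128, 320, 512))]
    print(AllMLPDecoder()(feats).shape)   # torch.Size([1, 150, 64, 64])
```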