SegMAN: Omni-scale Context Modeling with State Space Models and Local Attention for Semantic Segmentation
- URL: http://arxiv.org/abs/2412.11890v1
- Date: Mon, 16 Dec 2024 15:38:25 GMT
- Title: SegMAN: Omni-scale Context Modeling with State Space Models and Local Attention for Semantic Segmentation
- Authors: Yunxiang Fu, Meng Lou, Yizhou Yu
- Abstract summary: High-quality semantic segmentation relies on three key capabilities: global context modeling, local detail encoding, and multi-scale feature extraction.
We introduce SegMAN, a novel linear-time model comprising a hybrid feature encoder dubbed SegMAN Encoder, and a decoder based on state space models.
We comprehensively evaluate SegMAN on three challenging datasets: ADE20K, Cityscapes, and COCO-Stuff.
- Score: 45.68176825375723
- Abstract: High-quality semantic segmentation relies on three key capabilities: global context modeling, local detail encoding, and multi-scale feature extraction. However, recent methods struggle to provide all of these capabilities simultaneously. Hence, we aim to empower segmentation networks to simultaneously carry out efficient global context modeling, high-quality local detail encoding, and rich multi-scale feature representation for varying input resolutions. In this paper, we introduce SegMAN, a novel linear-time model comprising a hybrid feature encoder dubbed SegMAN Encoder, and a decoder based on state space models. Specifically, the SegMAN Encoder synergistically integrates sliding local attention with dynamic state space models, enabling highly efficient global context modeling while preserving fine-grained local details. Meanwhile, the MMSCopE module in our decoder enhances multi-scale context feature extraction and adaptively scales with the input resolution. We comprehensively evaluate SegMAN on three challenging datasets: ADE20K, Cityscapes, and COCO-Stuff. For instance, SegMAN-B achieves 52.6% mIoU on ADE20K, outperforming SegNeXt-L by 1.6% mIoU while reducing GFLOPs by over 15%. On Cityscapes, SegMAN-B attains 83.8% mIoU, surpassing SegFormer-B3 by 2.1% mIoU with approximately half the GFLOPs. Similarly, SegMAN-B improves upon VWFormer-B3 by 1.6% mIoU with lower GFLOPs on the COCO-Stuff dataset. Our code is available at https://github.com/yunxiangfu2001/SegMAN.
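The central architectural claim is the pairing of sliding local attention with a linear-time state space scan inside one encoder block. Below is a minimal, self-contained PyTorch sketch of that pairing; it substitutes non-overlapping window attention for sliding local attention and a simple per-channel diagonal recurrence for the dynamic state space model, so all module names and design choices are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch: windowed local attention (local detail) + diagonal SSM scan
# (linear-time global context), fused by residual addition.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalWindowAttention(nn.Module):
    """Multi-head self-attention inside non-overlapping windows."""
    def __init__(self, dim, window=8, heads=4):
        super().__init__()
        self.window, self.heads = window, heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                      # x: (B, H, W, C)
        B, H, W, C = x.shape
        w = self.window
        # Partition the feature map into (B * num_windows, w*w, C) token groups.
        x = x.view(B, H // w, w, W // w, w, C).permute(0, 1, 3, 2, 4, 5)
        x = x.reshape(-1, w * w, C)
        qkv = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(t.shape[0], -1, self.heads, C // self.heads)
                     .transpose(1, 2) for t in qkv)
        out = F.scaled_dot_product_attention(q, k, v)   # (Bn, heads, w*w, C/h)
        out = self.proj(out.transpose(1, 2).reshape(-1, w * w, C))
        out = out.view(B, H // w, W // w, w, w, C).permute(0, 1, 3, 2, 4, 5)
        return out.reshape(B, H, W, C)

class DiagonalSSM(nn.Module):
    """Per-channel linear recurrence h_t = a * h_{t-1} + x_t over flattened tokens."""
    def __init__(self, dim):
        super().__init__()
        self.decay_logit = nn.Parameter(torch.zeros(dim))  # a = sigmoid(decay_logit)
        self.out = nn.Linear(dim, dim)

    def forward(self, x):                      # x: (B, H, W, C)
        B, H, W, C = x.shape
        seq = x.reshape(B, H * W, C)
        a = torch.sigmoid(self.decay_logit)    # per-channel decay in (0, 1)
        h = torch.zeros(B, C, device=x.device)
        outs = []
        for t in range(seq.shape[1]):          # O(HW) scan; a loop for clarity
            h = a * h + seq[:, t]
            outs.append(h)
        return self.out(torch.stack(outs, dim=1)).reshape(B, H, W, C)

class HybridBlock(nn.Module):
    """Local attention for fine detail plus an SSM scan for global context."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.local = LocalWindowAttention(dim)
        self.ssm = DiagonalSSM(dim)

    def forward(self, x):
        y = self.norm(x)
        return x + self.local(y) + self.ssm(y)

feats = torch.randn(2, 32, 32, 64)             # (B, H, W, C); H, W divisible by 8
print(HybridBlock(64)(feats).shape)            # torch.Size([2, 32, 32, 64])
```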
Related papers
- SCALAR-NeRF: SCAlable LARge-scale Neural Radiance Fields for Scene Reconstruction [66.69049158826677]
We introduce SCALAR-NeRF, a novel framework tailored for scalable large-scale neural scene reconstruction.
We structure the neural representation as an encoder-decoder architecture, where the encoder processes 3D point coordinates to produce encoded features.
We propose an effective and efficient methodology to fuse the outputs of per-block local models into the final reconstruction.
arXiv Detail & Related papers (2023-11-28T10:18:16Z)
- MALUNet: A Multi-Attention and Light-weight UNet for Skin Lesion Segmentation [13.456935850832565]
We propose a light-weight model that achieves competitive performance for skin lesion segmentation at minimal parameter and computational cost.
We combine four modules with our U-shape architecture to obtain a light-weight medical image segmentation model dubbed MALUNet.
Compared with UNet, our model improves the mIoU and DSC metrics by 2.39% and 1.49%, respectively, with a 44x and 166x reduction in the number of parameters and computational complexity.
arXiv Detail & Related papers (2022-11-03T13:19:22Z)
- SegNeXt: Rethinking Convolutional Attention Design for Semantic Segmentation [100.89770978711464]
We present SegNeXt, a simple convolutional network architecture for semantic segmentation.
We show that convolutional attention is a more efficient and effective way to encode contextual information than the self-attention mechanism in transformers.
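This convolutional-attention idea can be made concrete with a short sketch: depthwise convolutions at several kernel sizes produce an attention map that reweights the input, capturing multi-scale context without quadratic self-attention. The branch kernel sizes and names below are illustrative assumptions, not the paper's exact configuration.

```python
# Hedged sketch of multi-scale convolutional attention.
import torch
import torch.nn as nn

class ConvAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.base = nn.Conv2d(dim, dim, 5, padding=2, groups=dim)
        # Pairs of strip convolutions approximate large kernels cheaply.
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(dim, dim, (1, k), padding=(0, k // 2), groups=dim),
                nn.Conv2d(dim, dim, (k, 1), padding=(k // 2, 0), groups=dim),
            )
            for k in (7, 11, 21)
        ])
        self.mix = nn.Conv2d(dim, dim, 1)      # channel mixing for the attention map

    def forward(self, x):                      # x: (B, C, H, W)
        attn = self.base(x)
        attn = attn + sum(branch(attn) for branch in self.branches)
        return self.mix(attn) * x              # attention map reweights the input

x = torch.randn(2, 64, 32, 32)
print(ConvAttention(64)(x).shape)              # torch.Size([2, 64, 32, 32])
```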
arXiv Detail & Related papers (2022-09-18T14:33:49Z)
- An Efficient Multi-Scale Fusion Network for 3D Organ at Risk (OAR) Segmentation [2.6770199357488242]
We propose a new OAR segmentation framework called OARFocalFuseNet.
It fuses multi-scale features and employs focal modulation for capturing global-local context across multiple scales.
Our best performing method (OARFocalFuseNet) obtained a Dice coefficient of 0.7995 and a Hausdorff distance of 5.1435 on the OpenKBP dataset.
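Focal modulation, as invoked above, can be sketched briefly: stacked depthwise convolutions gather context at growing receptive fields, and a gated sum of those contexts modulates a query projection. The level count, kernel sizes, and names below are illustrative, not the OARFocalFuseNet implementation.

```python
# Hedged sketch of focal modulation for global-local context.
import torch
import torch.nn as nn

class FocalModulation(nn.Module):
    def __init__(self, dim, levels=3):
        super().__init__()
        self.levels = levels
        self.f = nn.Conv2d(dim, 2 * dim + levels + 1, 1)   # query, context, gates
        self.ctx_layers = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(dim, dim, 3 + 2 * l, padding=1 + l, groups=dim),
                nn.GELU(),
            )
            for l in range(levels)
        ])
        self.h = nn.Conv2d(dim, dim, 1)        # modulator projection
        self.proj = nn.Conv2d(dim, dim, 1)

    def forward(self, x):                      # x: (B, C, H, W)
        C = x.shape[1]
        q, ctx, gates = torch.split(self.f(x), [C, C, self.levels + 1], dim=1)
        agg = 0
        for l, layer in enumerate(self.ctx_layers):
            ctx = layer(ctx)                   # receptive field grows per level
            agg = agg + ctx * gates[:, l:l + 1]
        # Final level: global average pooling, gated like the local levels.
        agg = agg + ctx.mean((2, 3), keepdim=True) * gates[:, self.levels:]
        return self.proj(q * self.h(agg))      # context modulates the query

x = torch.randn(2, 64, 32, 32)
print(FocalModulation(64)(x).shape)            # torch.Size([2, 64, 32, 32])
```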
arXiv Detail & Related papers (2022-08-15T19:40:18Z)
- LegoNN: Building Modular Encoder-Decoder Models [117.47858131603112]
State-of-the-art encoder-decoder models are constructed and trained end-to-end as an atomic unit.
No component of the model can be (re-)used without the others, making it impossible to share parts.
We describe LegoNN, a procedure for building encoder-decoder architectures whose parts can be applied to other tasks without the need for fine-tuning.
arXiv Detail & Related papers (2022-06-07T14:08:07Z)
- SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers [79.646577541655]
We present SegFormer, a semantic segmentation framework which unifies Transformers with lightweight multilayer perceptron (MLP) decoders.
SegFormer comprises a novel, hierarchically structured encoder which outputs multiscale features.
The proposed decoder aggregates information from different layers, combining both local and global attention to render powerful representations.
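A minimal sketch of such an all-MLP decoder, assuming illustrative stage widths rather than SegFormer's exact configuration: per-stage linear (1x1) projections unify channels, lower-resolution maps are upsampled to the highest feature resolution, and a final linear layer fuses the concatenation before per-pixel classification.

```python
# Hedged sketch of an all-MLP decoder over multiscale encoder features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLPDecoder(nn.Module):
    def __init__(self, in_dims=(32, 64, 160, 256), dim=256, num_classes=150):
        super().__init__()
        # 1x1 convolutions act as per-pixel linear layers over channels.
        self.proj = nn.ModuleList(nn.Conv2d(c, dim, 1) for c in in_dims)
        self.fuse = nn.Conv2d(dim * len(in_dims), dim, 1)
        self.classify = nn.Conv2d(dim, num_classes, 1)

    def forward(self, feats):                  # multiscale features, high-res first
        size = feats[0].shape[2:]
        ups = [F.interpolate(p(f), size=size, mode='bilinear', align_corners=False)
               for p, f in zip(self.proj, feats)]
        return self.classify(self.fuse(torch.cat(ups, dim=1)))

feats = [torch.randn(2, c, 64 // 2**i, 64 // 2**i)
         for i, c in enumerate((32, 64, 160, 256))]
print(MLPDecoder()(feats).shape)               # torch.Size([2, 150, 64, 64])
```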
arXiv Detail & Related papers (2021-05-31T17:59:51Z)
- Scaling Semantic Segmentation Beyond 1K Classes on a Single GPU [87.48110331544885]
We propose a novel training methodology for training and scaling existing semantic segmentation models.
We demonstrate a clear benefit of our approach on a dataset with 1284 classes, bootstrapped from LVIS and COCO annotations, with three times better mIoU than the DeepLabV3+ model.
arXiv Detail & Related papers (2020-12-14T13:12:38Z)