MacFormer: Semantic Segmentation with Fine Object Boundaries
- URL: http://arxiv.org/abs/2408.05699v1
- Date: Sun, 11 Aug 2024 05:36:10 GMT
- Authors: Guoan Xu, Wenfeng Huang, Tao Wu, Ligeng Chen, Wenjing Jia, Guangwei Gao, Xiatian Zhu, Stuart Perry
- Abstract summary: We introduce a new semantic segmentation architecture, "MacFormer", which features two key components.
Firstly, using learnable agent tokens, a Mutual Agent Cross-Attention (MACA) mechanism effectively facilitates the bidirectional integration of features across encoder and decoder layers.
Secondly, a Frequency Enhancement Module (FEM) in the decoder leverages high-frequency and low-frequency components to boost features in the frequency domain.
MacFormer is demonstrated to be compatible with various network architectures and outperforms existing methods in both accuracy and efficiency on the benchmark datasets ADE20K and Cityscapes.
- Score: 38.430631361558426
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Semantic segmentation involves assigning a specific category to each pixel in an image. While Vision Transformer-based models have made significant progress, current semantic segmentation methods often struggle with precise predictions in localized areas like object boundaries. To tackle this challenge, we introduce a new semantic segmentation architecture, "MacFormer", which features two key components. Firstly, using learnable agent tokens, a Mutual Agent Cross-Attention (MACA) mechanism effectively facilitates the bidirectional integration of features across encoder and decoder layers. This enables better preservation of low-level features, such as elementary edges, during decoding. Secondly, a Frequency Enhancement Module (FEM) in the decoder leverages high-frequency and low-frequency components to boost features in the frequency domain, benefiting object boundaries with minimal computational complexity increase. MacFormer is demonstrated to be compatible with various network architectures and outperforms existing methods in both accuracy and efficiency on benchmark datasets ADE20K and Cityscapes under different computational constraints.
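The abstract names the two components without detailing them; below is a minimal, hypothetical PyTorch sketch of how agent-token cross-attention and FFT-based frequency reweighting could look. The class names, agent count, band cutoff, gains, and residual connections are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class MutualAgentCrossAttention(nn.Module):
    """Bidirectional encoder/decoder exchange through learnable agent tokens
    (a guess at MACA's shape; the paper's exact wiring may differ)."""

    def __init__(self, dim: int, num_agents: int = 16, heads: int = 4):
        super().__init__()  # dim must be divisible by heads
        self.agents = nn.Parameter(torch.randn(1, num_agents, dim))
        self.enc_to_agent = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.agent_to_dec = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, enc_tokens, dec_tokens):  # (B, N_e, D), (B, N_d, D)
        # Agents first gather (low-level) encoder features ...
        agents = self.agents.expand(enc_tokens.size(0), -1, -1)
        agents, _ = self.enc_to_agent(agents, enc_tokens, enc_tokens)
        # ... then decoder tokens read them back: the bidirectional path.
        fused, _ = self.agent_to_dec(dec_tokens, agents, agents)
        return dec_tokens + fused


class FrequencyEnhancementModule(nn.Module):
    """Reweights low- and high-frequency bands of a feature map via FFT.
    The radial band split and per-channel gains are assumptions."""

    def __init__(self, channels: int, cutoff: float = 0.25):
        super().__init__()
        self.cutoff = cutoff
        self.low_gain = nn.Parameter(torch.ones(channels, 1, 1))
        self.high_gain = nn.Parameter(torch.ones(channels, 1, 1))

    def forward(self, x):  # x: (B, C, H, W)
        freq = torch.fft.rfft2(x, norm="ortho")
        fy = torch.fft.fftfreq(x.shape[-2], device=x.device).abs()[:, None]
        fx = torch.fft.rfftfreq(x.shape[-1], device=x.device)[None, :]
        low = ((fy ** 2 + fx ** 2).sqrt() < self.cutoff).to(x.dtype)
        gains = low * self.low_gain + (1.0 - low) * self.high_gain
        out = torch.fft.irfft2(freq * gains, s=x.shape[-2:], norm="ortho")
        return x + out  # residual connection, also an assumption
```

One motivation for routing through a small agent set: encoder-decoder exchange costs roughly O(N * num_agents) instead of the O(N_e * N_d) of full cross-attention.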
Related papers
- SCASeg: Strip Cross-Attention for Efficient Semantic Segmentation [37.2240333333522]
Vision Transformer (ViT) has achieved notable success in computer vision, with its variants extensively validated across various downstream tasks, including semantic segmentation.
This paper proposes Strip Cross-Attention (SCASeg), an innovative decoder head explicitly designed for semantic segmentation.
arXiv Detail & Related papers (2024-11-26T03:00:09Z)
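The SCASeg entry only names strip cross-attention; below is a generic, hypothetical sketch of cross-attention against strip-pooled keys and values, one plausible reading of how strip-shaped attention cuts cost. It is not the paper's actual decoder head.

```python
import torch
import torch.nn as nn


class StripCrossAttention(nn.Module):
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()  # dim must be divisible by heads
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):  # x: (B, C, H, W)
        b, c, h, w = x.shape
        # Compress the map into H horizontal + W vertical strip tokens,
        # so attention is O(H*W*(H+W)) instead of O((H*W)^2).
        rows = x.mean(dim=3).transpose(1, 2)     # (B, H, C)
        cols = x.mean(dim=2).transpose(1, 2)     # (B, W, C)
        strips = torch.cat([rows, cols], dim=1)  # (B, H+W, C)
        queries = x.flatten(2).transpose(1, 2)   # (B, H*W, C)
        out, _ = self.attn(queries, strips, strips)
        return (queries + out).transpose(1, 2).reshape(b, c, h, w)
```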
- MSDNet: Multi-Scale Decoder for Few-Shot Semantic Segmentation via Transformer-Guided Prototyping [1.1557852082644071]
Few-shot semantic segmentation addresses the challenge of segmenting objects in query images with only a handful of examples.
We propose a new few-shot semantic segmentation framework based on the transformer architecture.
Our model with only 1.5 million parameters demonstrates competitive performance while overcoming limitations of existing methodologies.
arXiv Detail & Related papers (2024-09-17T16:14:03Z)
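MSDNet's abstract mentions transformer-guided prototyping. The snippet below shows only the generic few-shot prototyping recipe (masked average pooling plus cosine matching) that such methods typically build on; the function and its threshold are illustrative, not MSDNet's actual design.

```python
import torch
import torch.nn.functional as F


def prototype_match(support_feat, support_mask, query_feat):
    """support_feat/query_feat: (B, C, H, W); support_mask: (B, 1, H, W) in {0, 1}."""
    # Masked average pooling: average support features inside the object mask.
    proto = (support_feat * support_mask).sum(dim=(2, 3)) / (
        support_mask.sum(dim=(2, 3)) + 1e-6
    )  # (B, C)
    # Cosine similarity between the prototype and every query location.
    sim = F.cosine_similarity(query_feat, proto[:, :, None, None], dim=1)
    return sim  # (B, H, W); thresholded or decoded downstream


# Usage: sim = prototype_match(fs, ms, fq); pred = sim > 0.5
```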
- Mobile-Seed: Joint Semantic Segmentation and Boundary Detection for Mobile Robots [17.90723909170376]
We introduce Mobile-Seed, a lightweight framework for simultaneous semantic segmentation and boundary detection.
Our framework features a two-stream encoder, an active fusion decoder (AFD) and a dual-task regularization approach.
Experiments on the Cityscapes dataset have shown that Mobile-Seed achieves notable improvement over the state-of-the-art (SOTA) baseline.
arXiv Detail & Related papers (2023-11-21T14:53:02Z)
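Mobile-Seed's active fusion decoder (AFD) is only named in the entry; the sketch below is one plausible reading, a learned per-pixel gate that injects boundary-stream evidence into the semantic stream. The module and its gating are assumptions.

```python
import torch
import torch.nn as nn


class GatedDualTaskFusion(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, semantic, boundary):  # both (B, C, H, W)
        # Per-pixel weights decide how much boundary evidence to inject.
        g = self.gate(torch.cat([semantic, boundary], dim=1))
        return semantic + g * boundary
```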
- Object Segmentation by Mining Cross-Modal Semantics [68.88086621181628]
We propose a novel approach that mines cross-modal semantics to guide the fusion and decoding of multimodal features.
Specifically, we propose a novel network, termed XMSNet, consisting of (1) all-round attentive fusion (AF), (2) a coarse-to-fine decoder (CFD), and (3) cross-layer self-supervision.
arXiv Detail & Related papers (2023-05-17T14:30:11Z)
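XMSNet's components are only listed by name. As a stand-in for "all-round attentive fusion", here is a generic, hypothetical sketch of attention-weighted fusion of two modality features (e.g., RGB and depth); it is not the paper's module.

```python
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Predict two scalar modality weights from the pooled concatenation.
        self.weight = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * channels, 2, kernel_size=1),
            nn.Softmax(dim=1),
        )

    def forward(self, rgb, depth):  # both (B, C, H, W)
        w = self.weight(torch.cat([rgb, depth], dim=1))  # (B, 2, 1, 1)
        return w[:, 0:1] * rgb + w[:, 1:2] * depth
```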
- LENet: Lightweight And Efficient LiDAR Semantic Segmentation Using Multi-Scale Convolution Attention [0.0]
We propose LENet, a projection-based network with an encoder-decoder structure for LiDAR semantic segmentation.
The encoder is composed of a novel multi-scale convolutional attention (MSCA) module with varying receptive field sizes to capture features.
We show that our proposed method is lighter, more efficient, and robust compared to state-of-the-art semantic segmentation methods.
arXiv Detail & Related papers (2023-01-11T02:51:38Z)
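The MSCA module is described only as multi-scale convolutional attention with varying receptive fields. A minimal sketch along those lines: parallel depthwise convolutions whose summed output gates the input. The kernel sizes are assumptions.

```python
import torch
import torch.nn as nn


class MultiScaleConvAttention(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Depthwise branches with growing (assumed) receptive fields.
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, k, padding=k // 2, groups=channels)
            for k in (3, 5, 7)
        ])
        self.mix = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):  # x: (B, C, H, W)
        attn = self.mix(sum(branch(x) for branch in self.branches))
        return attn * x  # attention-weighted features
```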
- Segmenter: Transformer for Semantic Segmentation [79.9887988699159]
We introduce Segmenter, a transformer model for semantic segmentation.
We build on the recent Vision Transformer (ViT) and extend it to semantic segmentation.
It outperforms the state of the art on the challenging ADE20K dataset and performs on par on Pascal Context and Cityscapes.
arXiv Detail & Related papers (2021-05-12T13:01:44Z)
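Segmenter decodes with a mask transformer: learnable class tokens are processed jointly with patch tokens, and per-class masks come from the scalar product of the two sets of embeddings. A condensed sketch; depth, heads, and normalization details are simplified from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MaskTransformerDecoder(nn.Module):
    def __init__(self, dim: int, num_classes: int, depth: int = 2):
        super().__init__()  # dim must be divisible by nhead
        self.cls_tokens = nn.Parameter(torch.randn(1, num_classes, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, patches):  # patches: (B, N, D) from the ViT encoder
        b, n = patches.size(0), patches.size(1)
        tokens = torch.cat([patches, self.cls_tokens.expand(b, -1, -1)], dim=1)
        tokens = self.blocks(tokens)
        patch_t, class_t = tokens[:, :n], tokens[:, n:]
        # Per-class masks: similarity between patch and class embeddings.
        masks = F.normalize(patch_t, dim=-1) @ F.normalize(class_t, dim=-1).transpose(1, 2)
        return masks  # (B, N, num_classes); reshape to a (B, K, H/16, W/16) map
```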
- Boundary-Aware Segmentation Network for Mobile and Web Applications [60.815545591314915]
Boundary-Aware Network (BASNet) couples a predict-refine architecture with a hybrid loss for highly accurate image segmentation.
BASNet runs at over 70 fps on a single GPU, which benefits many potential real-world applications.
Based on BASNet, we further developed two (close to) commercial applications: AR COPY & PASTE, in which BASNet is integrated with augmented reality for "COPYING" and "PASTING" real-world objects, and OBJECT CUT, a web-based tool for automatic object background removal.
arXiv Detail & Related papers (2021-01-12T19:20:26Z)
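BASNet's hybrid loss combines BCE, SSIM, and IoU terms. A compact sketch of the BCE and soft-IoU parts follows; the window-based SSIM term is omitted for brevity.

```python
import torch
import torch.nn.functional as F


def hybrid_loss(pred, target, eps: float = 1e-6):
    """pred: raw logits (B, 1, H, W); target: binary mask (B, 1, H, W)."""
    bce = F.binary_cross_entropy_with_logits(pred, target)
    prob = torch.sigmoid(pred)
    inter = (prob * target).sum(dim=(2, 3))
    union = prob.sum(dim=(2, 3)) + target.sum(dim=(2, 3)) - inter
    iou = 1.0 - ((inter + eps) / (union + eps)).mean()  # soft IoU loss
    return bce + iou  # the paper also adds an SSIM term
```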
- A Holistically-Guided Decoder for Deep Representation Learning with Applications to Semantic Segmentation and Object Detection [74.88284082187462]
One common strategy is to adopt dilated convolutions in the backbone networks to extract high-resolution feature maps.
We propose a novel holistically-guided decoder to obtain high-resolution, semantic-rich feature maps.
arXiv Detail & Related papers (2020-12-18T10:51:49Z)
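The "common strategy" the entry mentions, dilated convolutions for high-resolution backbone features, enlarges the receptive field without striding, so spatial resolution is preserved. A minimal illustration:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 128, 128)
# Stride-free 3x3 conv with dilation 2: receptive field of a 5x5 kernel,
# while the output stays 128x128 because padding matches the dilation.
dilated = nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2)
print(dilated(x).shape)  # torch.Size([1, 64, 128, 128])
```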
- CARAFE++: Unified Content-Aware ReAssembly of FEatures [132.49582482421246]
We propose unified Content-Aware ReAssembly of FEatures (CARAFE++), a universal, lightweight, and highly effective operator for feature reassembly.
CARAFE++ generates adaptive kernels on-the-fly to enable instance-specific content-aware handling.
It shows consistent and substantial gains across all the tasks with negligible computational overhead.
arXiv Detail & Related papers (2020-12-07T07:34:57Z)
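CARAFE's core idea is to predict a reassembly kernel per output location from the content itself, normalize it with a softmax, and apply it over a local neighborhood to upsample. Below is a simplified sketch of that path; the encoder layer and kernel sizes follow common defaults rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CARAFEUpsample(nn.Module):
    def __init__(self, channels: int, scale: int = 2, k_up: int = 5):
        super().__init__()
        self.scale, self.k_up = scale, k_up
        # Predict scale^2 kernels of size k_up x k_up per input location.
        self.encoder = nn.Conv2d(
            channels, (scale ** 2) * (k_up ** 2), kernel_size=3, padding=1
        )

    def forward(self, x):  # x: (B, C, H, W)
        b, c, h, w = x.shape
        s, k = self.scale, self.k_up
        # Move predicted kernels to the upsampled grid and normalize them.
        kernels = F.pixel_shuffle(self.encoder(x), s)  # (B, k*k, sH, sW)
        kernels = F.softmax(kernels, dim=1)
        # Gather each input location's k x k neighborhood ...
        patches = F.unfold(x, k, padding=k // 2)       # (B, C*k*k, H*W)
        patches = patches.view(b, c, k * k, h, w)
        # ... and let every output pixel reuse its source pixel's neighborhood.
        patches = patches.repeat_interleave(s, dim=3).repeat_interleave(s, dim=4)
        return (patches * kernels.unsqueeze(1)).sum(dim=2)  # (B, C, sH, sW)
```

For scale=2 and k_up=5, a (1, 64, 32, 32) input yields a (1, 64, 64, 64) output; unlike bilinear upsampling, the interpolation weights differ at every location because they are predicted from the features.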