Learning Content-enhanced Mask Transformer for Domain Generalized
Urban-Scene Segmentation
- URL: http://arxiv.org/abs/2307.00371v5
- Date: Sun, 17 Dec 2023 15:50:36 GMT
- Title: Learning Content-enhanced Mask Transformer for Domain Generalized
Urban-Scene Segmentation
- Authors: Qi Bi, Shaodi You, Theo Gevers
- Abstract summary: Domain-generalized urban-scene semantic segmentation (USSS) aims to learn generalized semantic predictions across diverse urban-scene styles.
Existing approaches typically rely on convolutional neural networks (CNNs) to learn the content of urban scenes.
We propose a Content-enhanced Mask TransFormer (CMFormer) for domain-generalized USSS.
- Score: 28.165600284392042
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Domain-generalized urban-scene semantic segmentation (USSS) aims to learn
generalized semantic predictions across diverse urban-scene styles. Unlike
domain gap challenges, USSS is unique in that the semantic categories are often
similar in different urban scenes, while the styles can vary significantly due
to changes in urban landscapes, weather conditions, lighting, and other
factors. Existing approaches typically rely on convolutional neural networks
(CNNs) to learn the content of urban scenes.
In this paper, we propose a Content-enhanced Mask TransFormer (CMFormer) for
domain-generalized USSS. The main idea is to enhance the focus of the
fundamental component, the mask attention mechanism, in Transformer
segmentation models on content information. To achieve this, we introduce a
novel content-enhanced mask attention mechanism. It learns mask queries from
both the image feature and its down-sampled counterpart, as lower-resolution
image features usually contain more robust content information and are less
sensitive to style variations. These features are fused into a Transformer
decoder and integrated into a multi-resolution content-enhanced mask attention
learning scheme.
Extensive experiments conducted on various domain-generalized urban-scene
segmentation datasets demonstrate that the proposed CMFormer significantly
outperforms existing CNN-based methods for domain-generalized semantic
segmentation, achieving improvements of up to 14.00% in terms of mIoU (mean
intersection over union). The source code is publicly available at
https://github.com/BiQiWHU/CMFormer.
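The abstract describes the content-enhanced mask attention only at a high level. As a rough illustration, the following is a minimal PyTorch sketch of how mask queries might attend to both an image feature and its down-sampled counterpart before the two query updates are fused; the module name, the concatenation-based fusion, and all hyper-parameters are assumptions made for this sketch, not the authors' implementation (see the linked repository for the official code).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContentEnhancedMaskAttention(nn.Module):
    """Illustrative sketch (not the official CMFormer code): mask queries
    attend to a full-resolution feature and its down-sampled counterpart,
    and the two query updates are fused."""

    def __init__(self, dim=256, num_heads=8, down_factor=2):
        super().__init__()
        self.attn_hi = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_lo = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)  # assumed fusion: concatenate, then project
        self.down_factor = down_factor

    def forward(self, queries, feat):
        # queries: (B, N, C) mask queries; feat: (B, C, H, W) image feature
        feat_lo = F.avg_pool2d(feat, self.down_factor)      # lower-resolution feature: more robust
                                                             # content, less sensitive to style
        kv_hi = feat.flatten(2).transpose(1, 2)              # (B, H*W, C)
        kv_lo = feat_lo.flatten(2).transpose(1, 2)           # (B, H'*W', C)
        q_hi, _ = self.attn_hi(queries, kv_hi, kv_hi)        # attention on the full-resolution feature
        q_lo, _ = self.attn_lo(queries, kv_lo, kv_lo)        # attention on the down-sampled feature
        return self.fuse(torch.cat([q_hi, q_lo], dim=-1))    # content-enhanced mask queries
```

In a Mask2Former-style decoder, this attention would additionally be restricted by the predicted masks and repeated across several pyramid resolutions; the sketch only illustrates the dual-resolution query update that the abstract highlights.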
Related papers
- Learning Spectral-Decomposed Tokens for Domain Generalized Semantic Segmentation [38.0401463751139]
We present a novel Spectral-dEcomposed Token (SET) learning framework to advance the frontier of domain generalized semantic segmentation.
Particularly, the frozen VFM features are first decomposed into the phase and amplitude components in the frequency space.
We develop an attention optimization method to bridge the gap between style-affected representation and static tokens during inference.
arXiv Detail & Related papers (2024-07-26T07:50:48Z) - FANet: Feature Amplification Network for Semantic Segmentation in Cluttered Background [9.970265640589966]
Existing deep learning approaches overlook semantic cues that are crucial for semantic segmentation in complex scenarios.
We propose a feature amplification network (FANet) as a backbone network that incorporates semantic information using a novel feature enhancement module at multi-stages.
Our experimental results demonstrate state-of-the-art performance compared to existing methods.
arXiv Detail & Related papers (2024-07-12T15:57:52Z) - Intra-Source Style Augmentation for Improved Domain Generalization [21.591831983223997]
We propose an intra-source style augmentation (ISSA) method to improve domain generalization in semantic segmentation.
ISSA is model-agnostic and readily applicable to both CNNs and Transformers.
It is also complementary to other domain generalization techniques, e.g., it improves the recent state-of-the-art solution RobustNet by 3% mIoU on Cityscapes to Dark Zürich.
arXiv Detail & Related papers (2022-10-18T21:33:25Z) - Vision Transformers: From Semantic Segmentation to Dense Prediction [139.15562023284187]
We explore the global context learning potentials of vision transformers (ViTs) for dense visual prediction.
Our motivation is that, by learning global context at a full receptive field layer by layer, ViTs may capture stronger long-range dependency information.
We formulate a family of Hierarchical Local-Global (HLG) Transformers, characterized by local attention within windows and global-attention across windows in a pyramidal architecture.
arXiv Detail & Related papers (2022-07-19T15:49:35Z) - SemAffiNet: Semantic-Affine Transformation for Point Cloud Segmentation [94.11915008006483]
We propose SemAffiNet for point cloud semantic segmentation.
We conduct extensive experiments on the ScanNetV2 and NYUv2 datasets.
arXiv Detail & Related papers (2022-05-26T17:00:23Z) - AF$_2$: Adaptive Focus Framework for Aerial Imagery Segmentation [86.44683367028914]
Aerial imagery segmentation poses some unique challenges, the most critical of which is foreground-background imbalance.
We propose the Adaptive Focus Framework (AF$_2$), which adopts a hierarchical segmentation procedure and focuses on adaptively utilizing multi-scale representations.
AF$_2$ significantly improves accuracy on three widely used aerial benchmarks while running as fast as mainstream methods.
arXiv Detail & Related papers (2022-02-18T10:14:45Z) - SeMask: Semantically Masked Transformers for Semantic Segmentation [10.15763397352378]
SeMask is a framework that incorporates semantic information into the encoder with the help of a semantic attention operation.
Our framework achieves a new state-of-the-art of 58.22% mIoU on the ADE20K dataset and improvements of over 3% in the mIoU metric on the Cityscapes dataset.
arXiv Detail & Related papers (2021-12-23T18:56:02Z) - Efficient Hybrid Transformer: Learning Global-local Context for Urban
Scene Segmentation [11.237929167356725]
We propose an efficient hybrid Transformer (EHT) for semantic segmentation of urban scene images.
EHT takes advantage of both CNNs and Transformers, learning global-local context to strengthen the feature representation.
The proposed EHT achieves a 67.0% mIoU on the UAVid test set and outperforms other lightweight models significantly.
arXiv Detail & Related papers (2021-09-18T13:55:38Z) - Semantic Attention and Scale Complementary Network for Instance
Segmentation in Remote Sensing Images [54.08240004593062]
We propose an end-to-end multi-category instance segmentation model, which consists of a Semantic Attention (SEA) module and a Scale Complementary Mask Branch (SCMB).
The SEA module contains a simple, fully convolutional semantic segmentation branch with extra supervision to strengthen the activation of instances of interest on the feature map.
SCMB extends the original single mask branch to trident mask branches and introduces complementary mask supervision at different scales.
arXiv Detail & Related papers (2021-07-25T08:53:59Z) - Less is More: Pay Less Attention in Vision Transformers [61.05787583247392]
The Less attention vIsion Transformer (LIT) builds upon the fact that convolutions, fully-connected layers, and self-attention layers have almost equivalent mathematical expressions for processing image patch sequences.
The proposed LIT achieves promising performance on image recognition tasks, including image classification, object detection and instance segmentation.
arXiv Detail & Related papers (2021-05-29T05:26:07Z) - DoFE: Domain-oriented Feature Embedding for Generalizable Fundus Image
Segmentation on Unseen Datasets [96.92018649136217]
We present a novel Domain-oriented Feature Embedding (DoFE) framework to improve the generalization ability of CNNs on unseen target domains.
Our DoFE framework dynamically enriches the image features with additional domain prior knowledge learned from multi-source domains.
Our framework generates satisfying segmentation results on unseen datasets and surpasses other domain generalization and network regularization methods.
arXiv Detail & Related papers (2020-10-13T07:28:39Z)