U-MixFormer: UNet-like Transformer with Mix-Attention for Efficient
Semantic Segmentation
- URL: http://arxiv.org/abs/2312.06272v1
- Date: Mon, 11 Dec 2023 10:19:42 GMT
- Title: U-MixFormer: UNet-like Transformer with Mix-Attention for Efficient
Semantic Segmentation
- Authors: Seul-Ki Yeom and Julian von Klitzing
- Abstract summary: CNN-based U-Net has seen significant progress in high-resolution medical imaging and remote sensing.
This dual success inspired us to merge the strengths of both, leading to the inception of a U-Net-based vision transformer decoder.
We propose a novel transformer decoder, U-MixFormer, built upon the U-Net structure, designed for efficient semantic segmentation.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Semantic segmentation has witnessed remarkable advancements with the
adaptation of the Transformer architecture. Parallel to the strides made by the
Transformer, CNN-based U-Net has seen significant progress, especially in
high-resolution medical imaging and remote sensing. This dual success inspired
us to merge the strengths of both, leading to the inception of a U-Net-based
vision transformer decoder tailored for efficient contextual encoding. Here, we
propose a novel transformer decoder, U-MixFormer, built upon the U-Net
structure, designed for efficient semantic segmentation. Our approach
distinguishes itself from the previous transformer methods by leveraging
lateral connections between the encoder and decoder stages as feature queries
for the attention modules, in contrast to the traditional reliance on skip
connections. Moreover, we innovatively mix hierarchical feature maps from
various encoder and decoder stages to form a unified representation for keys
and values, giving rise to our unique mix-attention module. Our approach
demonstrates state-of-the-art performance across various configurations.
Extensive experiments show that U-MixFormer outperforms SegFormer, FeedFormer,
and SegNeXt by a large margin. For example, U-MixFormer-B0 surpasses
SegFormer-B0 and FeedFormer-B0 by 3.8% and 2.0% mIoU while using 27.3% and
21.8% less computation, respectively, and outperforms SegNeXt (with the
MSCAN-T encoder) by 3.3% mIoU on ADE20K. Code available at
https://github.com/julian-klitzing/u-mixformer.
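To make the mix-attention idea above concrete, the sketch below uses one lateral stage feature as the query and a concatenation of several stage features as keys and values. It assumes PyTorch; the module names, single-head formulation, and toy shapes are illustrative assumptions rather than the authors' implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn

class MixAttention(nn.Module):
    """Minimal sketch of the mix-attention idea: a lateral stage feature acts
    as the query, while keys/values come from a mix of hierarchical feature
    maps (illustrative, single attention head)."""

    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, query_feat, mixed_feats):
        # query_feat: (B, N_q, C) tokens from one lateral encoder/decoder stage
        # mixed_feats: list of (B, N_i, C) tokens from several stages
        mixed = torch.cat(mixed_feats, dim=1)            # unified key/value set
        q = self.q(query_feat)                           # (B, N_q, C)
        k, v = self.kv(mixed).chunk(2, dim=-1)           # (B, N_kv, C) each
        attn = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5
        out = attn.softmax(dim=-1) @ v                   # (B, N_q, C)
        return self.proj(out)

# Toy usage: query tokens from one stage attend over tokens mixed from three stages.
if __name__ == "__main__":
    B, C = 2, 64
    stage_feats = [torch.randn(B, n, C) for n in (1024, 256, 64)]
    refined = MixAttention(C)(stage_feats[1], stage_feats)
    print(refined.shape)   # torch.Size([2, 256, 64])
```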
Related papers
- TransXNet: Learning Both Global and Local Dynamics with a Dual Dynamic
Token Mixer for Visual Recognition [71.6546914957701]
We propose a lightweight Dual Dynamic Token Mixer (D-Mixer) that aggregates global information and local details in an input-dependent way.
We use D-Mixer as the basic building block to design TransXNet, a novel hybrid CNN-Transformer vision backbone network.
In the ImageNet-1K image classification task, TransXNet-T surpasses Swin-T by 0.3% in top-1 accuracy while requiring less than half of the computational cost.
arXiv Detail & Related papers (2023-10-30T09:35:56Z)
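The summary above only states that D-Mixer aggregates global information and local details in an input-dependent way. The sketch below is one plausible reading, not the TransXNet design: channels are split between a self-attention branch and a depthwise convolution whose kernels are generated from the input.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualTokenMixerSketch(nn.Module):
    """Illustrative dual token mixer: half the channels go through global
    self-attention, the other half through a depthwise conv whose 3x3 kernels
    are generated from the input (input-dependent local mixing)."""

    def __init__(self, dim, heads=4, k=3):
        super().__init__()
        assert dim % 2 == 0
        self.half, self.k = dim // 2, k
        self.attn = nn.MultiheadAttention(self.half, heads, batch_first=True)
        # generates one k*k kernel per channel, conditioned on the input
        self.kernel_gen = nn.Linear(self.half, self.half * k * k)

    def forward(self, x):                        # x: (B, C, H, W)
        B, C, H, W = x.shape
        xg, xl = x.split(self.half, dim=1)

        # global branch: plain self-attention over flattened tokens
        t = xg.flatten(2).transpose(1, 2)        # (B, H*W, C/2)
        t, _ = self.attn(t, t, t)
        xg = t.transpose(1, 2).reshape(B, self.half, H, W)

        # local branch: input-dependent depthwise 3x3 convolution
        ctx = xl.mean(dim=(2, 3))                # (B, C/2) global context
        w = self.kernel_gen(ctx).view(B * self.half, 1, self.k, self.k)
        xl = F.conv2d(xl.reshape(1, B * self.half, H, W), w,
                      padding=self.k // 2, groups=B * self.half)
        return torch.cat([xg, xl.reshape(B, self.half, H, W)], dim=1)

if __name__ == "__main__":
    y = DualTokenMixerSketch(64)(torch.randn(2, 64, 32, 32))
    print(y.shape)   # torch.Size([2, 64, 32, 32])
```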
- SegViTv2: Exploring Efficient and Continual Semantic Segmentation with
Plain Vision Transformers [76.13755422671822]
This paper investigates the capability of plain Vision Transformers (ViTs) for semantic segmentation using the encoder-decoder framework.
We introduce a novel Attention-to-Mask (ATM) module to design a lightweight decoder effective for plain ViT.
Our decoder outperforms the popular decoder UPerNet using various ViT backbones while consuming only about 5% of the computational cost.
arXiv Detail & Related papers (2023-06-09T22:29:56Z)
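As a rough illustration of the Attention-to-Mask idea summarized above: learnable class queries cross-attend to patch tokens, and the raw similarity map itself is reused as per-class masks. The shapes, names, and the way class scores are produced here are simplifying assumptions, not the SegViTv2 implementation.

```python
import torch
import torch.nn as nn

class AttentionToMaskSketch(nn.Module):
    """Sketch of the attention-to-mask idea: class queries attend to patch
    tokens and the similarity map doubles as a per-class segmentation mask."""

    def __init__(self, dim, num_classes):
        super().__init__()
        self.class_queries = nn.Parameter(torch.randn(num_classes, dim))
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.cls_head = nn.Linear(dim, num_classes + 1)   # +1 for "no object"

    def forward(self, patch_tokens, hw):
        # patch_tokens: (B, N, C) from a plain ViT; hw = (H, W) with N == H*W
        B, N, C = patch_tokens.shape
        q = self.q(self.class_queries).expand(B, -1, -1)  # (B, K, C)
        k = self.k(patch_tokens)                           # (B, N, C)
        sim = q @ k.transpose(-2, -1) / C ** 0.5           # (B, K, N)
        masks = torch.sigmoid(sim).view(B, -1, *hw)        # per-class masks
        # class scores from the (un-updated) queries, kept simple for brevity
        logits = self.cls_head(q)                          # (B, K, K+1)
        return masks, logits

if __name__ == "__main__":
    masks, logits = AttentionToMaskSketch(64, num_classes=19)(
        torch.randn(2, 16 * 16, 64), (16, 16))
    print(masks.shape, logits.shape)   # (2, 19, 16, 16) (2, 19, 20)
```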
- Enhancing Medical Image Segmentation with TransCeption: A Multi-Scale
Feature Fusion Approach [3.9548535445908928]
CNN-based methods have been the cornerstone of medical image segmentation due to their promising performance and robustness.
Transformer-based approaches are now prevalent because they enlarge the receptive field to model global contextual correlation.
We propose TransCeption for medical image segmentation, a pure transformer-based U-shape network featured by incorporating the inception-like module into the encoder.
arXiv Detail & Related papers (2023-01-25T22:09:07Z)
- MUSTER: A Multi-scale Transformer-based Decoder for Semantic Segmentation [19.83103856355554]
MUSTER is a transformer-based decoder that seamlessly integrates with hierarchical encoders.
MSKA units enable the fusion of multi-scale features from the encoder and decoder, facilitating comprehensive information integration.
On the challenging ADE20K dataset, our best model achieves a single-scale mIoU of 50.23 and a multi-scale mIoU of 51.88.
arXiv Detail & Related papers (2022-11-25T06:51:07Z)
- FcaFormer: Forward Cross Attention in Hybrid Vision Transformer [29.09883780571206]
We propose forward cross attention for the hybrid vision transformer (FcaFormer).
Our FcaFormer achieves 83.1% top-1 accuracy on ImageNet with only 16.3 million parameters and about 3.6 billion MACs.
This saves almost half of the parameters and some computational cost while achieving 0.7% higher accuracy than the distilled EfficientFormer.
arXiv Detail & Related papers (2022-11-14T08:43:44Z)
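A hedged sketch of the forward-cross-attention idea in the FcaFormer summary above: tokens produced by earlier blocks are carried forward and appended to the key/value set of later blocks. The block structure and the token subsampling in the toy usage are assumptions.

```python
import torch
import torch.nn as nn

class ForwardCrossAttentionSketch(nn.Module):
    """Illustrative block: current tokens attend not only to themselves but
    also to tokens carried forward from earlier blocks."""

    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, earlier_tokens):
        # x: (B, N, C) current tokens; earlier_tokens: list of (B, M_i, C)
        context = torch.cat([x, *earlier_tokens], dim=1)   # extra keys/values
        out, _ = self.attn(self.norm(x), context, context)
        return x + out

# Toy usage: the second block reuses tokens produced by the first one.
if __name__ == "__main__":
    B, N, C = 2, 196, 64
    blk1, blk2 = ForwardCrossAttentionSketch(C), ForwardCrossAttentionSketch(C)
    x1 = blk1(torch.randn(B, N, C), [])
    x2 = blk2(x1, [x1[:, ::4]])      # carry forward a subsampled token set
    print(x2.shape)                  # torch.Size([2, 196, 64])
```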
- Adaptive Split-Fusion Transformer [90.04885335911729]
We propose an Adaptive Split-Fusion Transformer (ASF-former) to treat convolutional and attention branches differently with adaptive weights.
Experiments on standard benchmarks, such as ImageNet-1K, show that our ASF-former outperforms its CNN and transformer counterparts, as well as prior hybrid designs, in terms of accuracy.
arXiv Detail & Related papers (2022-04-26T10:00:28Z)
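The sketch below illustrates one way to read the adaptive split-fusion idea above: channels are split between a convolutional branch and an attention branch, and a gate predicted from the pooled input reweights the two outputs before they are recombined. The gating and fusion details are assumptions, not the ASF-former design.

```python
import torch
import torch.nn as nn

class AdaptiveSplitFusionSketch(nn.Module):
    """Illustrative split-fusion unit: a depthwise-conv branch and a
    self-attention branch, recombined with input-dependent weights."""

    def __init__(self, dim, heads=4):
        super().__init__()
        assert dim % 2 == 0
        half = dim // 2
        self.conv = nn.Conv2d(half, half, 3, padding=1, groups=half)
        self.attn = nn.MultiheadAttention(half, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(dim, 2), nn.Softmax(dim=-1))

    def forward(self, x):                        # x: (B, C, H, W)
        B, C, H, W = x.shape
        xc, xa = x.chunk(2, dim=1)
        c_out = self.conv(xc)                                   # local branch
        t = xa.flatten(2).transpose(1, 2)                       # (B, H*W, C/2)
        a_out, _ = self.attn(t, t, t)
        a_out = a_out.transpose(1, 2).reshape(B, C // 2, H, W)  # global branch
        w = self.gate(x.mean(dim=(2, 3)))                       # (B, 2) weights
        return torch.cat([w[:, 0:1, None, None] * c_out,
                          w[:, 1:2, None, None] * a_out], dim=1)

if __name__ == "__main__":
    print(AdaptiveSplitFusionSketch(64)(torch.randn(2, 64, 16, 16)).shape)
```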
- SegFormer: Simple and Efficient Design for Semantic Segmentation with
Transformers [79.646577541655]
We present SegFormer, a semantic segmentation framework which unifies Transformers with lightweight multilayer perceptron (MLP) decoders.
SegFormer comprises a novel hierarchically structured encoder which outputs multiscale features.
The proposed decoder aggregates information from different layers, combining both local and global attention to render powerful representations.
arXiv Detail & Related papers (2021-05-31T17:59:51Z)
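A minimal sketch of such an all-MLP decoder, assuming PyTorch and placeholder channel widths: per-stage linear projections unify the channel dimension, the maps are upsampled to the finest resolution, concatenated, fused, and classified. It follows the published SegFormer recipe only at a high level; the exact hyperparameters and layers here are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AllMLPDecoderSketch(nn.Module):
    """Sketch of an all-MLP decoder: project each stage to a common width,
    upsample to the finest resolution, concatenate, fuse, and classify."""

    def __init__(self, in_dims=(32, 64, 160, 256), embed_dim=256, num_classes=150):
        super().__init__()
        self.proj = nn.ModuleList(nn.Linear(d, embed_dim) for d in in_dims)
        self.fuse = nn.Conv2d(embed_dim * len(in_dims), embed_dim, 1)
        self.classify = nn.Conv2d(embed_dim, num_classes, 1)

    def forward(self, feats):                    # feats: list of (B, C_i, H_i, W_i)
        target = feats[0].shape[2:]              # finest spatial resolution
        outs = []
        for f, proj in zip(feats, self.proj):
            B, _, H, W = f.shape
            f = proj(f.flatten(2).transpose(1, 2))             # (B, H*W, E)
            f = f.transpose(1, 2).reshape(B, -1, H, W)
            outs.append(F.interpolate(f, size=target, mode="bilinear",
                                      align_corners=False))
        return self.classify(self.fuse(torch.cat(outs, dim=1)))

if __name__ == "__main__":
    feats = [torch.randn(2, c, s, s) for c, s in
             zip((32, 64, 160, 256), (64, 32, 16, 8))]
    print(AllMLPDecoderSketch()(feats).shape)    # torch.Size([2, 150, 64, 64])
```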
- Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation [63.46694853953092]
Swin-Unet is a UNet-like pure Transformer for medical image segmentation.
Tokenized image patches are fed into the Transformer-based U-shaped Encoder-Decoder architecture.
arXiv Detail & Related papers (2021-05-12T09:30:26Z)
- Multi-Encoder Learning and Stream Fusion for Transformer-Based
End-to-End Automatic Speech Recognition [30.941564693248512]
We investigate various fusion techniques for the all-attention-based encoder-decoder architecture known as the transformer.
We introduce a novel multi-encoder learning method that performs a weighted combination of two encoder-decoder multi-head attention outputs only during training.
We achieve state-of-the-art performance for transformer-based models on Wall Street Journal, with a significant relative WER reduction of 19% compared to the current benchmark approach.
arXiv Detail & Related papers (2021-03-31T21:07:43Z)
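To illustrate the weighted combination of two encoder-decoder attention outputs mentioned above, the sketch below fuses two cross-attention streams with a single learnable weight. The sigmoid gate and residual wiring are assumptions, not the paper's exact scheme.

```python
import torch
import torch.nn as nn

class WeightedCrossAttentionFusion(nn.Module):
    """Sketch: combine the outputs of two encoder-decoder attention modules
    (one per encoder stream) with a learnable fusion weight."""

    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn_a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_b = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.alpha = nn.Parameter(torch.tensor(0.5))   # fusion weight

    def forward(self, dec, enc_a, enc_b):
        # dec: (B, T, C) decoder states; enc_a/enc_b: (B, S_i, C) encoder streams
        out_a, _ = self.attn_a(dec, enc_a, enc_a)
        out_b, _ = self.attn_b(dec, enc_b, enc_b)
        w = torch.sigmoid(self.alpha)
        return dec + w * out_a + (1.0 - w) * out_b

if __name__ == "__main__":
    m = WeightedCrossAttentionFusion(64)
    y = m(torch.randn(2, 10, 64), torch.randn(2, 50, 64), torch.randn(2, 40, 64))
    print(y.shape)   # torch.Size([2, 10, 64])
```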
- Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective
with Transformers [149.78470371525754]
We treat semantic segmentation as a sequence-to-sequence prediction task. Specifically, we deploy a pure transformer to encode an image as a sequence of patches.
With the global context modeled in every layer of the transformer, this encoder can be combined with a simple decoder to provide a powerful segmentation model, termed SEgmentation TRansformer (SETR).
SETR achieves new state of the art on ADE20K (50.28% mIoU), Pascal Context (55.83% mIoU) and competitive results on Cityscapes.
arXiv Detail & Related papers (2020-12-31T18:55:57Z)
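A compact sketch of the sequence-to-sequence recipe summarized above: the image is embedded as a sequence of patches, a plain Transformer encoder models global context at every layer, and a simple decoder produces the segmentation. The sizes and the naive bilinear decoder are illustrative; SETR's progressive upsampling and multi-level aggregation decoders are not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SequenceToSequenceSegSketch(nn.Module):
    """Sketch of the SETR-style recipe: patchify the image, run a plain
    Transformer encoder over the patch sequence, reshape to 2D, and predict
    masks with a simple 1x1 head plus bilinear upsampling."""

    def __init__(self, img_size=224, patch=16, dim=256, depth=4, num_classes=150):
        super().__init__()
        self.grid = img_size // patch
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.pos = nn.Parameter(torch.zeros(1, self.grid ** 2, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Conv2d(dim, num_classes, 1)

    def forward(self, x):                        # x: (B, 3, H, W)
        B = x.shape[0]
        t = self.embed(x).flatten(2).transpose(1, 2) + self.pos   # patch tokens
        t = self.encoder(t)                                        # global context
        f = t.transpose(1, 2).reshape(B, -1, self.grid, self.grid)
        return F.interpolate(self.head(f), size=x.shape[2:],
                             mode="bilinear", align_corners=False)

if __name__ == "__main__":
    print(SequenceToSequenceSegSketch()(torch.randn(1, 3, 224, 224)).shape)
```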