U-MixFormer: UNet-like Transformer with Mix-Attention for Efficient
Semantic Segmentation
- URL: http://arxiv.org/abs/2312.06272v1
- Date: Mon, 11 Dec 2023 10:19:42 GMT
- Title: U-MixFormer: UNet-like Transformer with Mix-Attention for Efficient
Semantic Segmentation
- Authors: Seul-Ki Yeom and Julian von Klitzing
- Abstract summary: CNN-based U-Net has seen significant progress in high-resolution medical imaging and remote sensing.
This dual success inspired us to merge the strengths of both, leading to the inception of a U-Net-based vision transformer decoder.
We propose a novel transformer decoder, U-MixFormer, built upon the U-Net structure, designed for efficient semantic segmentation.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Semantic segmentation has witnessed remarkable advancements with the
adaptation of the Transformer architecture. Parallel to the strides made by the
Transformer, CNN-based U-Net has seen significant progress, especially in
high-resolution medical imaging and remote sensing. This dual success inspired
us to merge the strengths of both, leading to the inception of a U-Net-based
vision transformer decoder tailored for efficient contextual encoding. Here, we
propose a novel transformer decoder, U-MixFormer, built upon the U-Net
structure, designed for efficient semantic segmentation. Our approach
distinguishes itself from the previous transformer methods by leveraging
lateral connections between the encoder and decoder stages as feature queries
for the attention modules, in contrast to the traditional reliance on skip
connections. Moreover, we innovatively mix hierarchical feature maps from
various encoder and decoder stages to form a unified representation for keys
and values, giving rise to our unique mix-attention module. Our approach
demonstrates state-of-the-art performance across various configurations.
Extensive experiments show that U-MixFormer outperforms SegFormer, FeedFormer,
and SegNeXt by a large margin. For example, U-MixFormer-B0 surpasses
SegFormer-B0 and FeedFormer-B0 by 3.8% and 2.0% mIoU while using 27.3% and
21.8% less computation, respectively, and outperforms SegNeXt (with the
MSCAN-T encoder) by 3.3% mIoU on ADE20K. Code available at
https://github.com/julian-klitzing/u-mixformer.
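To make the mix-attention idea above concrete, the sketch below uses one lateral stage feature as the query and a concatenation of several stage features as keys and values. It assumes PyTorch; the module names, single-head formulation, and toy shapes are illustrative assumptions rather than the authors' implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn

class MixAttention(nn.Module):
    """Minimal sketch of the mix-attention idea: a lateral stage feature acts
    as the query, while keys/values come from a mix of hierarchical feature
    maps (illustrative, single attention head)."""

    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, query_feat, mixed_feats):
        # query_feat: (B, N_q, C) tokens from one lateral encoder/decoder stage
        # mixed_feats: list of (B, N_i, C) tokens from several stages
        mixed = torch.cat(mixed_feats, dim=1)            # unified key/value set
        q = self.q(query_feat)                           # (B, N_q, C)
        k, v = self.kv(mixed).chunk(2, dim=-1)           # (B, N_kv, C) each
        attn = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5
        out = attn.softmax(dim=-1) @ v                   # (B, N_q, C)
        return self.proj(out)

# Toy usage: query tokens from one stage attend over tokens mixed from three stages.
if __name__ == "__main__":
    B, C = 2, 64
    stage_feats = [torch.randn(B, n, C) for n in (1024, 256, 64)]
    refined = MixAttention(C)(stage_feats[1], stage_feats)
    print(refined.shape)   # torch.Size([2, 256, 64])
```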
Related papers
- TransXNet: Learning Both Global and Local Dynamics with a Dual Dynamic
Token Mixer for Visual Recognition [71.6546914957701]
We propose a lightweight Dual Dynamic Token Mixer (D-Mixer) that aggregates global information and local details in an input-dependent way.
We use D-Mixer as the basic building block to design TransXNet, a novel hybrid CNN-Transformer vision backbone network.
In the ImageNet-1K image classification task, TransXNet-T surpasses Swin-T by 0.3% in top-1 accuracy while requiring less than half of the computational cost.
arXiv Detail & Related papers (2023-10-30T09:35:56Z)
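The summary above only states that D-Mixer aggregates global information and local details in an input-dependent way. The sketch below is one plausible reading, not the TransXNet design: channels are split between a self-attention branch and a depthwise convolution whose kernels are generated from the input.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualTokenMixerSketch(nn.Module):
    """Illustrative dual token mixer: half the channels go through global
    self-attention, the other half through a depthwise conv whose 3x3 kernels
    are generated from the input (input-dependent local mixing)."""

    def __init__(self, dim, heads=4, k=3):
        super().__init__()
        assert dim % 2 == 0
        self.half, self.k = dim // 2, k
        self.attn = nn.MultiheadAttention(self.half, heads, batch_first=True)
        # generates one k*k kernel per channel, conditioned on the input
        self.kernel_gen = nn.Linear(self.half, self.half * k * k)

    def forward(self, x):                        # x: (B, C, H, W)
        B, C, H, W = x.shape
        xg, xl = x.split(self.half, dim=1)

        # global branch: plain self-attention over flattened tokens
        t = xg.flatten(2).transpose(1, 2)        # (B, H*W, C/2)
        t, _ = self.attn(t, t, t)
        xg = t.transpose(1, 2).reshape(B, self.half, H, W)

        # local branch: input-dependent depthwise 3x3 convolution
        ctx = xl.mean(dim=(2, 3))                # (B, C/2) global context
        w = self.kernel_gen(ctx).view(B * self.half, 1, self.k, self.k)
        xl = F.conv2d(xl.reshape(1, B * self.half, H, W), w,
                      padding=self.k // 2, groups=B * self.half)
        return torch.cat([xg, xl.reshape(B, self.half, H, W)], dim=1)

if __name__ == "__main__":
    y = DualTokenMixerSketch(64)(torch.randn(2, 64, 32, 32))
    print(y.shape)   # torch.Size([2, 64, 32, 32])
```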
- SegViTv2: Exploring Efficient and Continual Semantic Segmentation with
Plain Vision Transformers [76.13755422671822]
This paper investigates the capability of plain Vision Transformers (ViTs) for semantic segmentation using the encoder-decoder framework.
We introduce a novel Attention-to-Mask (ATM) module to design a lightweight decoder effective for plain ViT.
Our decoder outperforms the popular decoder UPerNet using various ViT backbones while consuming only about 5% of the computational cost.
arXiv Detail & Related papers (2023-06-09T22:29:56Z)
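As a rough illustration of the Attention-to-Mask idea summarized above: learnable class queries cross-attend to patch tokens, and the raw similarity map itself is reused as per-class masks. The shapes, names, and the way class scores are produced here are simplifying assumptions, not the SegViTv2 implementation.

```python
import torch
import torch.nn as nn

class AttentionToMaskSketch(nn.Module):
    """Sketch of the attention-to-mask idea: class queries attend to patch
    tokens and the similarity map doubles as a per-class segmentation mask."""

    def __init__(self, dim, num_classes):
        super().__init__()
        self.class_queries = nn.Parameter(torch.randn(num_classes, dim))
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.cls_head = nn.Linear(dim, num_classes + 1)   # +1 for "no object"

    def forward(self, patch_tokens, hw):
        # patch_tokens: (B, N, C) from a plain ViT; hw = (H, W) with N == H*W
        B, N, C = patch_tokens.shape
        q = self.q(self.class_queries).expand(B, -1, -1)  # (B, K, C)
        k = self.k(patch_tokens)                           # (B, N, C)
        sim = q @ k.transpose(-2, -1) / C ** 0.5           # (B, K, N)
        masks = torch.sigmoid(sim).view(B, -1, *hw)        # per-class masks
        # class scores from the (un-updated) queries, kept simple for brevity
        logits = self.cls_head(q)                          # (B, K, K+1)
        return masks, logits

if __name__ == "__main__":
    masks, logits = AttentionToMaskSketch(64, num_classes=19)(
        torch.randn(2, 16 * 16, 64), (16, 16))
    print(masks.shape, logits.shape)   # (2, 19, 16, 16) (2, 19, 20)
```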
- Enhancing Medical Image Segmentation with TransCeption: A Multi-Scale
Feature Fusion Approach [3.9548535445908928]
CNN-based methods have been the cornerstone of medical image segmentation due to their promising performance and robustness.
Transformer-based approaches are now prevalent because they enlarge the receptive field to model global contextual correlation.
We propose TransCeption for medical image segmentation, a pure transformer-based U-shape network featured by incorporating the inception-like module into the encoder.
arXiv Detail & Related papers (2023-01-25T22:09:07Z)
- MUSTER: A Multi-scale Transformer-based Decoder for Semantic Segmentation [19.83103856355554]
MUSTER is a transformer-based decoder that seamlessly integrates with hierarchical encoders.
MSKA units enable the fusion of multi-scale features from the encoder and decoder, facilitating comprehensive information integration.
On the challenging ADE20K dataset, our best model achieves a single-scale mIoU of 50.23 and a multi-scale mIoU of 51.88.
arXiv Detail & Related papers (2022-11-25T06:51:07Z)
- FcaFormer: Forward Cross Attention in Hybrid Vision Transformer [29.09883780571206]
We propose forward cross attention for the hybrid vision transformer (FcaFormer).
Our FcaFormer achieves 83.1% top-1 accuracy on ImageNet with only 16.3 million parameters and about 3.6 billion MACs.
This saves almost half of the parameters and some computational cost while achieving 0.7% higher accuracy than the distilled EfficientFormer.
arXiv Detail & Related papers (2022-11-14T08:43:44Z)
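A hedged sketch of the forward-cross-attention idea in the FcaFormer summary above: tokens produced by earlier blocks are carried forward and appended to the key/value set of later blocks. The block structure and the token subsampling in the toy usage are assumptions.

```python
import torch
import torch.nn as nn

class ForwardCrossAttentionSketch(nn.Module):
    """Illustrative block: current tokens attend not only to themselves but
    also to tokens carried forward from earlier blocks."""

    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, earlier_tokens):
        # x: (B, N, C) current tokens; earlier_tokens: list of (B, M_i, C)
        context = torch.cat([x, *earlier_tokens], dim=1)   # extra keys/values
        out, _ = self.attn(self.norm(x), context, context)
        return x + out

# Toy usage: the second block reuses tokens produced by the first one.
if __name__ == "__main__":
    B, N, C = 2, 196, 64
    blk1, blk2 = ForwardCrossAttentionSketch(C), ForwardCrossAttentionSketch(C)
    x1 = blk1(torch.randn(B, N, C), [])
    x2 = blk2(x1, [x1[:, ::4]])      # carry forward a subsampled token set
    print(x2.shape)                  # torch.Size([2, 196, 64])
```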
- Adaptive Split-Fusion Transformer [90.04885335911729]
We propose an Adaptive Split-Fusion Transformer (ASF-former) to treat convolutional and attention branches differently with adaptive weights.
Experiments on standard benchmarks, such as ImageNet-1K, show that our ASF-former outperforms its CNN and transformer counterparts, as well as prior hybrid designs, in terms of accuracy.
arXiv Detail & Related papers (2022-04-26T10:00:28Z)
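The sketch below illustrates one way to read the adaptive split-fusion idea above: channels are split between a convolutional branch and an attention branch, and a gate predicted from the pooled input reweights the two outputs before they are recombined. The gating and fusion details are assumptions, not the ASF-former design.

```python
import torch
import torch.nn as nn

class AdaptiveSplitFusionSketch(nn.Module):
    """Illustrative split-fusion unit: a depthwise-conv branch and a
    self-attention branch, recombined with input-dependent weights."""

    def __init__(self, dim, heads=4):
        super().__init__()
        assert dim % 2 == 0
        half = dim // 2
        self.conv = nn.Conv2d(half, half, 3, padding=1, groups=half)
        self.attn = nn.MultiheadAttention(half, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(dim, 2), nn.Softmax(dim=-1))

    def forward(self, x):                        # x: (B, C, H, W)
        B, C, H, W = x.shape
        xc, xa = x.chunk(2, dim=1)
        c_out = self.conv(xc)                                   # local branch
        t = xa.flatten(2).transpose(1, 2)                       # (B, H*W, C/2)
        a_out, _ = self.attn(t, t, t)
        a_out = a_out.transpose(1, 2).reshape(B, C // 2, H, W)  # global branch
        w = self.gate(x.mean(dim=(2, 3)))                       # (B, 2) weights
        return torch.cat([w[:, 0:1, None, None] * c_out,
                          w[:, 1:2, None, None] * a_out], dim=1)

if __name__ == "__main__":
    print(AdaptiveSplitFusionSketch(64)(torch.randn(2, 64, 16, 16)).shape)
```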
- SegFormer: Simple and Efficient Design for Semantic Segmentation with
Transformers [79.646577541655]
We present SegFormer, a semantic segmentation framework which unifies Transformers with lightweight multilayer perceptron (MLP) decoders.
SegFormer comprises a novel hierarchically structured encoder which outputs multiscale features.
The proposed decoder aggregates information from different layers, combining both local and global attention to render powerful representations.
arXiv Detail & Related papers (2021-05-31T17:59:51Z)
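A minimal sketch of such an all-MLP decoder, assuming PyTorch and placeholder channel widths: per-stage linear projections unify the channel dimension, the maps are upsampled to the finest resolution, concatenated, fused, and classified. It follows the published SegFormer recipe only at a high level; the exact hyperparameters and layers here are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AllMLPDecoderSketch(nn.Module):
    """Sketch of an all-MLP decoder: project each stage to a common width,
    upsample to the finest resolution, concatenate, fuse, and classify."""

    def __init__(self, in_dims=(32, 64, 160, 256), embed_dim=256, num_classes=150):
        super().__init__()
        self.proj = nn.ModuleList(nn.Linear(d, embed_dim) for d in in_dims)
        self.fuse = nn.Conv2d(embed_dim * len(in_dims), embed_dim, 1)
        self.classify = nn.Conv2d(embed_dim, num_classes, 1)

    def forward(self, feats):                    # feats: list of (B, C_i, H_i, W_i)
        target = feats[0].shape[2:]              # finest spatial resolution
        outs = []
        for f, proj in zip(feats, self.proj):
            B, _, H, W = f.shape
            f = proj(f.flatten(2).transpose(1, 2))             # (B, H*W, E)
            f = f.transpose(1, 2).reshape(B, -1, H, W)
            outs.append(F.interpolate(f, size=target, mode="bilinear",
                                      align_corners=False))
        return self.classify(self.fuse(torch.cat(outs, dim=1)))

if __name__ == "__main__":
    feats = [torch.randn(2, c, s, s) for c, s in
             zip((32, 64, 160, 256), (64, 32, 16, 8))]
    print(AllMLPDecoderSketch()(feats).shape)    # torch.Size([2, 150, 64, 64])
```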
- Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation [63.46694853953092]
Swin-Unet is a UNet-like pure Transformer for medical image segmentation.
Tokenized image patches are fed into the Transformer-based U-shaped Encoder-Decoder architecture.
arXiv Detail & Related papers (2021-05-12T09:30:26Z)
- Multi-Encoder Learning and Stream Fusion for Transformer-Based
End-to-End Automatic Speech Recognition [30.941564693248512]
We investigate various fusion techniques for the all-attention-based encoder-decoder architecture known as the transformer.
We introduce a novel multi-encoder learning method that performs a weighted combination of two encoder-decoder multi-head attention outputs only during training.
We achieve state-of-the-art performance for transformer-based models on Wall Street Journal, with a significant relative WER reduction of 19% compared to the current benchmark approach.
arXiv Detail & Related papers (2021-03-31T21:07:43Z)
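To illustrate the weighted combination of two encoder-decoder attention outputs mentioned above, the sketch below fuses two cross-attention streams with a single learnable weight. The sigmoid gate and residual wiring are assumptions, not the paper's exact scheme.

```python
import torch
import torch.nn as nn

class WeightedCrossAttentionFusion(nn.Module):
    """Sketch: combine the outputs of two encoder-decoder attention modules
    (one per encoder stream) with a learnable fusion weight."""

    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn_a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_b = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.alpha = nn.Parameter(torch.tensor(0.5))   # fusion weight

    def forward(self, dec, enc_a, enc_b):
        # dec: (B, T, C) decoder states; enc_a/enc_b: (B, S_i, C) encoder streams
        out_a, _ = self.attn_a(dec, enc_a, enc_a)
        out_b, _ = self.attn_b(dec, enc_b, enc_b)
        w = torch.sigmoid(self.alpha)
        return dec + w * out_a + (1.0 - w) * out_b

if __name__ == "__main__":
    m = WeightedCrossAttentionFusion(64)
    y = m(torch.randn(2, 10, 64), torch.randn(2, 50, 64), torch.randn(2, 40, 64))
    print(y.shape)   # torch.Size([2, 10, 64])
```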
- Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective
with Transformers [149.78470371525754]
We treat semantic segmentation as a sequence-to-sequence prediction task. Specifically, we deploy a pure transformer to encode an image as a sequence of patches.
With the global context modeled in every layer of the transformer, this encoder can be combined with a simple decoder to provide a powerful segmentation model, termed SEgmentation TRansformer (SETR).
SETR achieves new state of the art on ADE20K (50.28% mIoU), Pascal Context (55.83% mIoU) and competitive results on Cityscapes.
arXiv Detail & Related papers (2020-12-31T18:55:57Z)
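A compact sketch of the sequence-to-sequence recipe summarized above: the image is embedded as a sequence of patches, a plain Transformer encoder models global context at every layer, and a simple decoder produces the segmentation. The sizes and the naive bilinear decoder are illustrative; SETR's progressive upsampling and multi-level aggregation decoders are not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SequenceToSequenceSegSketch(nn.Module):
    """Sketch of the SETR-style recipe: patchify the image, run a plain
    Transformer encoder over the patch sequence, reshape to 2D, and predict
    masks with a simple 1x1 head plus bilinear upsampling."""

    def __init__(self, img_size=224, patch=16, dim=256, depth=4, num_classes=150):
        super().__init__()
        self.grid = img_size // patch
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.pos = nn.Parameter(torch.zeros(1, self.grid ** 2, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Conv2d(dim, num_classes, 1)

    def forward(self, x):                        # x: (B, 3, H, W)
        B = x.shape[0]
        t = self.embed(x).flatten(2).transpose(1, 2) + self.pos   # patch tokens
        t = self.encoder(t)                                        # global context
        f = t.transpose(1, 2).reshape(B, -1, self.grid, self.grid)
        return F.interpolate(self.head(f), size=x.shape[2:],
                             mode="bilinear", align_corners=False)

if __name__ == "__main__":
    print(SequenceToSequenceSegSketch()(torch.randn(1, 3, 224, 224)).shape)
```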