RTFormer: Efficient Design for Real-Time Semantic Segmentation with
Transformer
- URL: http://arxiv.org/abs/2210.07124v1
- Date: Thu, 13 Oct 2022 16:03:53 GMT
- Title: RTFormer: Efficient Design for Real-Time Semantic Segmentation with
Transformer
- Authors: Jian Wang, Chenhui Gou, Qiman Wu, Haocheng Feng, Junyu Han, Errui
Ding, Jingdong Wang
- Abstract summary: We propose RTFormer, an efficient dual-resolution transformer for real-time semantic segmentation.
It achieves a better trade-off between performance and efficiency than CNN-based models.
Experiments on mainstream benchmarks demonstrate the effectiveness of our proposed RTFormer.
- Score: 63.25665813125223
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, transformer-based networks have shown impressive results in
semantic segmentation. Yet for real-time semantic segmentation, pure CNN-based
approaches still dominate this field, due to the time-consuming computation
mechanism of transformers. We propose RTFormer, an efficient dual-resolution
transformer for real-time semantic segmentation, which achieves a better
trade-off between performance and efficiency than CNN-based models. To achieve
high inference efficiency on GPU-like devices, our RTFormer leverages
GPU-Friendly Attention with linear complexity and discards the multi-head
mechanism. Besides, we find that cross-resolution attention is a more efficient
way to gather global context information for the high-resolution branch, by
spreading the high-level knowledge learned in the low-resolution branch.
Extensive experiments on mainstream benchmarks demonstrate the effectiveness of
our proposed RTFormer: it achieves state-of-the-art results on Cityscapes,
CamVid and COCOStuff, and shows promising results on ADE20K. Code is available
at PaddleSeg: https://github.com/PaddlePaddle/PaddleSeg.
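The exact GPU-Friendly Attention design is described in the paper; as a rough illustration of the two ingredients the abstract names (linear complexity and no multi-head mechanism), here is a minimal external-attention-style sketch in which tokens attend to a small learnable memory, so the cost grows linearly with the number of pixels. All names and shapes are illustrative, not the paper's API.

```python
import numpy as np

def linear_attention(x, mk, mv):
    """Single-head, external-attention-style sketch with linear complexity.

    x:  (N, d) input tokens (N = H*W pixels)
    mk: (M, d) learnable key memory, with M a small constant
    mv: (M, d) learnable value memory
    Cost is O(N * M * d): linear in the number of tokens N.
    """
    attn = x @ mk.T                                  # (N, M) similarity to memory keys
    attn = np.exp(attn - attn.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)          # softmax over the M memory slots
    return attn @ mv                                 # (N, d) aggregated global context

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 8))    # 16 tokens, 8 channels
mk = rng.standard_normal((4, 8))    # memory size M = 4
mv = rng.standard_normal((4, 8))
out = linear_attention(x, mk, mv)
print(out.shape)                    # (16, 8)
```

Because M is fixed, doubling the image resolution only doubles the cost, unlike the quadratic growth of standard self-attention.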
Related papers
- CSFNet: A Cosine Similarity Fusion Network for Real-Time RGB-X Semantic Segmentation of Driving Scenes [0.0]
Multimodal semantic segmentation methods suffer from high computational complexity and low inference speed.
We propose the Cosine Similarity Fusion Network (CSFNet) as a real-time RGB-X semantic segmentation model.
CSFNet has competitive accuracy with state-of-the-art methods while being state-of-the-art in terms of speed.
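CSFNet's actual fusion module is defined in its paper; purely as a hypothetical sketch of the idea its name suggests, the auxiliary modality can be gated by its per-pixel cosine similarity to the RGB features before fusing. The function name and weighting scheme below are illustrative assumptions.

```python
import numpy as np

def cosine_similarity_fusion(f_rgb, f_x, eps=1e-8):
    """Hypothetical sketch: weight the auxiliary modality by its per-pixel
    cosine similarity to the RGB features, then sum.

    f_rgb, f_x: (N, C) per-pixel feature vectors from the two modalities.
    """
    num = (f_rgb * f_x).sum(axis=1)
    den = np.linalg.norm(f_rgb, axis=1) * np.linalg.norm(f_x, axis=1) + eps
    sim = (num / den)[:, None]           # (N, 1) cosine similarity in [-1, 1]
    weight = 0.5 * (sim + 1.0)           # rescale to [0, 1]
    return f_rgb + weight * f_x          # similarity-gated fusion

f_rgb = np.ones((4, 3))
f_x = np.ones((4, 3))                    # perfectly aligned modalities
fused = cosine_similarity_fusion(f_rgb, f_x)
print(fused[0])                          # ~[2. 2. 2.]: full contribution from f_x
```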
arXiv Detail & Related papers (2024-07-01T14:34:32Z) - Efficient Remote Sensing Segmentation With Generative Adversarial
Transformer [5.728847418491545]
This paper proposes an efficient Generative Adversarial Transformer (GATrans) for achieving high-precision semantic segmentation.
The framework utilizes a Global Transformer Network (GTNet) as the generator, efficiently extracting multi-level features.
We validate the effectiveness of our approach through extensive experiments on the Vaihingen dataset, achieving an average F1 score of 90.17% and an overall accuracy of 91.92%.
arXiv Detail & Related papers (2023-10-02T15:46:59Z) - UNETR++: Delving into Efficient and Accurate 3D Medical Image Segmentation [93.88170217725805]
We propose a 3D medical image segmentation approach, named UNETR++, that offers both high-quality segmentation masks as well as efficiency in terms of parameters, compute cost, and inference speed.
The core of our design is the introduction of a novel efficient paired attention (EPA) block that efficiently learns spatial and channel-wise discriminative features.
Our evaluations on five benchmarks, Synapse, BTCV, ACDC, BraTS, and Decathlon-Lung, reveal the effectiveness of our contributions in terms of both efficiency and accuracy.
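The EPA block's exact formulation is in the UNETR++ paper; the sketch below only illustrates the general idea of pairing a spatial-attention branch with a channel-attention branch over shared queries and keys. All weight names and the combination rule are assumptions for illustration.

```python
import numpy as np

def softmax(a, axis):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def paired_attention(x, wq, wk, wv_spatial, wv_channel):
    """Illustrative sketch of paired spatial + channel attention that shares
    one set of queries/keys between the two branches.

    x: (N, C) tokens; all weights: (C, C).
    """
    q, k = x @ wq, x @ wk
    vs, vc = x @ wv_spatial, x @ wv_channel
    spatial = softmax(q @ k.T, axis=-1) @ vs    # (N, C): attends over tokens
    channel = vc @ softmax(q.T @ k, axis=-1)    # (N, C): attends over channels
    return spatial + channel

rng = np.random.default_rng(1)
x = rng.standard_normal((6, 5))
weights = [rng.standard_normal((5, 5)) for _ in range(4)]
out = paired_attention(x, *weights)
print(out.shape)                                # (6, 5)
```

Sharing Q/K between branches, as sketched here, is what would let such a block learn spatial and channel-wise features without duplicating projection parameters.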
arXiv Detail & Related papers (2022-12-08T18:59:57Z) - Distortion-Aware Network Pruning and Feature Reuse for Real-time Video
Segmentation [49.17930380106643]
We propose a novel framework to speed up any architecture with skip-connections for real-time vision tasks.
Specifically, at the arrival of each frame, we transform the features from the previous frame to reuse them at specific spatial bins.
We then perform partial computation of the backbone network on the regions of the current frame that capture temporal differences between the current and previous frame.
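The paper's framework transforms and reuses features at specific spatial bins; the toy sketch below shows only the core reuse-or-recompute idea with a hypothetical API (the real method runs the backbone partially rather than fully, and warps features rather than copying them).

```python
import numpy as np

def reuse_or_recompute(prev_feat, cur_frame, prev_frame, backbone, thresh=0.1):
    """Toy sketch of frame-to-frame feature reuse (names are illustrative):
    recompute features only where the current frame differs from the previous
    one, and reuse the cached features everywhere else.

    prev_feat: (H, W, C) cached features; frames: (H, W) grayscale.
    """
    changed = np.abs(cur_frame - prev_frame) > thresh     # (H, W) change mask
    new_feat = prev_feat.copy()                           # reuse by default
    if changed.any():
        full = backbone(cur_frame)                        # stand-in for partial compute
        new_feat[changed] = full[changed]                 # refresh only changed bins
    return new_feat, changed

backbone = lambda f: np.stack([f, f * 2], axis=-1)        # toy "backbone": (H,W)->(H,W,2)
prev_frame = np.zeros((4, 4))
cur_frame = prev_frame.copy()
cur_frame[0, 0] = 1.0                                     # one pixel changed
feat, mask = reuse_or_recompute(np.zeros((4, 4, 2)), cur_frame, prev_frame, backbone)
print(mask.sum())                                         # 1: only one bin recomputed
```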
arXiv Detail & Related papers (2022-06-20T07:20:02Z) - MISSU: 3D Medical Image Segmentation via Self-distilling TransUNet [55.16833099336073]
We propose to self-distill a Transformer-based UNet for medical image segmentation.
It simultaneously learns global semantic information and local spatial-detailed features.
Our MISSU achieves the best performance over previous state-of-the-art methods.
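MISSU's distillation scheme is specified in its paper; as a generic illustration of the self-distillation idea (one part of a network supervising another part of the same network), here is a minimal loss term with assumed names and an assumed MSE objective.

```python
import numpy as np

def self_distillation_loss(shallow_feat, deep_feat):
    """Generic self-distillation sketch: shallow-layer features are pushed
    toward deeper-layer features of the same network, treated as a fixed
    teacher signal (i.e., no gradient flows through deep_feat here)."""
    target = deep_feat
    return float(np.mean((shallow_feat - target) ** 2))

loss = self_distillation_loss(np.zeros((2, 3)), np.ones((2, 3)))
print(loss)   # 1.0
```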
arXiv Detail & Related papers (2022-06-02T07:38:53Z) - Lawin Transformer: Improving Semantic Segmentation Transformer with
Multi-Scale Representations via Large Window Attention [16.75003034164463]
Multi-scale representations are crucial for semantic segmentation.
In this paper, we introduce multi-scale representations into the semantic segmentation ViT via a window attention mechanism.
Our resulting ViT, Lawin Transformer, is composed of an efficient hierarchical vision transformer (HVT) as encoder and a LawinASPP as decoder.
arXiv Detail & Related papers (2022-01-05T13:51:20Z) - Real-time Semantic Segmentation with Fast Attention [94.88466483540692]
We propose a novel architecture for semantic segmentation of high-resolution images and videos in real-time.
The proposed architecture relies on our fast spatial attention, which is a simple yet efficient modification of the popular self-attention mechanism.
Results on multiple datasets demonstrate superior performance, with better accuracy and speed than existing approaches.
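The fast-attention modification referred to here replaces the softmax with L2 normalization of queries and keys, which makes the attention product associative: computing Q(KᵀV) instead of (QKᵀ)V turns the quadratic cost in the number of pixels into a linear one. A minimal sketch, with illustrative shapes:

```python
import numpy as np

def fast_attention(q, k, v):
    """Fast-attention sketch: L2-normalize Q and K instead of applying softmax,
    then exploit associativity to compute Q @ (K^T @ V) in O(N * d * c)
    rather than (Q @ K^T) @ V in O(N^2 * d).

    q, k: (N, d); v: (N, c).
    """
    qn = q / (np.linalg.norm(q, axis=1, keepdims=True) + 1e-8)
    kn = k / (np.linalg.norm(k, axis=1, keepdims=True) + 1e-8)
    return (qn @ (kn.T @ v)) / q.shape[0]    # small (d, c) intermediate, linear in N

rng = np.random.default_rng(0)
q = rng.standard_normal((32, 8))
k = rng.standard_normal((32, 8))
v = rng.standard_normal((32, 4))
out = fast_attention(q, k, v)
print(out.shape)                             # (32, 4)
```

Both multiplication orders give the same result by associativity; only the intermediate size, and hence the cost, differs.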
arXiv Detail & Related papers (2020-07-07T22:37:16Z) - Real-Time High-Performance Semantic Image Segmentation of Urban Street
Scenes [98.65457534223539]
We propose a real-time high-performance DCNN-based method for robust semantic segmentation of urban street scenes.
The proposed method achieves 73.6% and 68.0% mean Intersection over Union (mIoU) at inference speeds of 51.0 fps and 39.3 fps, respectively.
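For readers unfamiliar with the metric quoted throughout this list, mean Intersection over Union averages, over classes, the overlap between predicted and ground-truth masks divided by their union. A minimal reference implementation:

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Mean Intersection over Union over the classes present in pred or gt."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:                       # skip classes absent from both
            ious.append(inter / union)
    return float(np.mean(ious))

pred = np.array([0, 0, 1, 1])
gt   = np.array([0, 1, 1, 1])
# class 0: inter 1 / union 2; class 1: inter 2 / union 3 -> mean 7/12
print(mean_iou(pred, gt, 2))                # 0.5833333333333333
```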
arXiv Detail & Related papers (2020-03-11T08:45:53Z) - FarSee-Net: Real-Time Semantic Segmentation by Efficient Multi-scale
Context Aggregation and Feature Space Super-resolution [14.226301825772174]
We introduce a novel and efficient module called Cascaded Factorized Atrous Spatial Pyramid Pooling (CF-ASPP)
It is a lightweight cascaded structure for Convolutional Neural Networks (CNNs) to efficiently leverage context information.
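CF-ASPP's cascaded, factorized structure is defined in the paper; the building block it relies on, the atrous (dilated) convolution, can be illustrated in one dimension: spacing the filter taps by a dilation factor enlarges the receptive field with no extra parameters. The helper below is a teaching sketch, not the paper's module.

```python
import numpy as np

def dilated_conv1d(x, w, dilation):
    """Atrous (dilated) 1D convolution sketch: filter taps are spaced
    `dilation` samples apart, so a k-tap filter covers (k-1)*dilation + 1
    input samples at the parameter cost of k weights."""
    k = len(w)
    span = (k - 1) * dilation
    out = np.zeros(len(x) - span)           # 'valid' output length
    for i in range(len(out)):
        out[i] = sum(w[j] * x[i + j * dilation] for j in range(k))
    return out

x = np.arange(10, dtype=float)
res = dilated_conv1d(x, np.array([1.0, 1.0, 1.0]), dilation=2)
print(res)                                  # [ 6.  9. 12. 15. 18. 21.]
```

Stacking such convolutions with increasing dilation rates, as ASPP-style modules do, aggregates multi-scale context cheaply, which is the efficiency argument behind the module above.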
We achieve 68.4% mIoU at 84 fps on the Cityscapes test set with a single Nvidia Titan X (Maxwell) GPU card.
arXiv Detail & Related papers (2020-03-09T03:53:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.