Revisiting Multi-Scale Feature Fusion for Semantic Segmentation
- URL: http://arxiv.org/abs/2203.12683v1
- Date: Wed, 23 Mar 2022 19:14:11 GMT
- Title: Revisiting Multi-Scale Feature Fusion for Semantic Segmentation
- Authors: Tianjian Meng, Golnaz Ghiasi, Reza Mahjourian, Quoc V. Le, Mingxing Tan
- Abstract summary: In this paper, we demonstrate that neither high internal resolution nor atrous convolutions are necessary for accurate semantic segmentation.
We develop a simplified segmentation model, named ESeg, which has neither high internal resolution nor expensive atrous convolutions.
Our simple method can achieve better accuracy with faster speed than prior art across multiple datasets.
- Score: 90.32746095413447
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: It is commonly believed that high internal resolution combined with expensive
operations (e.g., atrous convolutions) is necessary for accurate semantic
segmentation, resulting in slow speed and large memory usage. In this paper, we
question this belief and demonstrate that neither high internal resolution nor
atrous convolutions are necessary. Our intuition is that although segmentation
is a dense per-pixel prediction task, the semantics of each pixel often depend
on both nearby neighbors and far-away context; therefore, a more powerful
multi-scale feature fusion network plays a critical role. Following this
intuition, we revisit the conventional multi-scale feature space (typically
capped at P5) and extend it to a much richer space, up to P9, where the
smallest features are only 1/512 of the input size and thus have very large
receptive fields. To process such a rich feature space, we leverage the recent
BiFPN to fuse the multi-scale features. Based on these insights, we develop a
simplified segmentation model, named ESeg, which has neither high internal
resolution nor expensive atrous convolutions. Perhaps surprisingly, our simple
method can achieve better accuracy with faster speed than prior art across
multiple datasets. In real-time settings, ESeg-Lite-S achieves 76.0% mIoU on
Cityscapes [12] at 189 FPS, outperforming FasterSeg [9] (73.1% mIoU at 170
FPS). Our ESeg-Lite-L runs at 79 FPS and achieves 80.1% mIoU, largely closing
the gap between real-time and high-performance segmentation models.
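To make the core idea concrete, here is a minimal PyTorch sketch of a feature space extended to P9, where each level halves resolution so P9 is 1/512 of the input, followed by a simplified one-pass BiFPN-style weighted fusion. Channel widths, module names, and fusion details are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch: P1-P9 pyramid plus one simplified BiFPN-style fusion pass.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidBackbone(nn.Module):
    """Builds P1..P9, where P_i has stride 2**i (P9 is 1/512 of the input)."""
    def __init__(self, channels=64, levels=9):
        super().__init__()
        self.stem = nn.Conv2d(3, channels, 3, stride=2, padding=1)  # -> P1
        self.downs = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, stride=2, padding=1)
            for _ in range(levels - 1))  # P2..P9

    def forward(self, x):
        feats = [self.stem(x)]
        for down in self.downs:
            feats.append(down(feats[-1]))
        return feats  # [P1, ..., P9]

class SimpleBiFPNLayer(nn.Module):
    """One top-down + bottom-up pass with learnable weighted fusion."""
    def __init__(self, channels, levels):
        super().__init__()
        self.w_td = nn.Parameter(torch.ones(levels, 2))  # top-down weights
        self.w_bu = nn.Parameter(torch.ones(levels, 2))  # bottom-up weights
        self.convs = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=1) for _ in range(levels))

    @staticmethod
    def _fuse(w, a, b):
        w = F.relu(w)  # fast normalized fusion, as in BiFPN
        return (w[0] * a + w[1] * b) / (w.sum() + 1e-4)

    def forward(self, feats):
        # Top-down: propagate coarse, large-receptive-field context downward.
        td = list(feats)
        for i in range(len(feats) - 2, -1, -1):
            up = F.interpolate(td[i + 1], size=td[i].shape[-2:], mode="nearest")
            td[i] = self._fuse(self.w_td[i], td[i], up)
        # Bottom-up: propagate fine spatial detail back upward.
        out = list(td)
        for i in range(1, len(td)):
            dn = F.max_pool2d(out[i - 1], 2, ceil_mode=True)
            dn = F.interpolate(dn, size=out[i].shape[-2:], mode="nearest")
            out[i] = self._fuse(self.w_bu[i], out[i], dn)
        return [conv(f) for conv, f in zip(self.convs, out)]

# For a 512x512 input this yields nine levels, from 256x256 down to 1x1.
feats = PyramidBackbone()(torch.randn(1, 3, 512, 512))
fused = SimpleBiFPNLayer(64, 9)(feats)
```

A real BiFPN stacks several such layers and uses depthwise-separable convolutions; the point of the sketch is only that nothing about levels beyond P5 requires atrous convolutions or high internal resolution.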
Related papers
- Efficient Context Integration through Factorized Pyramidal Learning for Ultra-Lightweight Semantic Segmentation [1.0499611180329804]
We propose a novel Factorized Pyramidal Learning (FPL) module to aggregate rich contextual information in an efficient manner.
We decompose the spatial pyramid into two stages, which enables simple and efficient feature fusion within the module and mitigates the notorious checkerboard effect.
Based on the FPL module and FIR unit, we propose an ultra-lightweight real-time network, called FPLNet, which achieves state-of-the-art accuracy-efficiency trade-off.
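The abstract does not spell out the module's internals; the sketch below is one plausible reading, under loose assumptions, of a pyramid decomposed into two cascaded stages whose first-stage outputs are fused before further dilation, which is one way such a design can suppress checkerboard artifacts.

```python
# Hypothetical two-stage pyramid, loosely inspired by the FPL description.
# Layer types, dilation rates, and the fusion scheme are all assumptions.
import torch
import torch.nn as nn

class TwoStagePyramid(nn.Module):
    def __init__(self, ch):
        super().__init__()
        # Stage 1: small dilation rates gather local context cheaply
        # (depthwise convolutions keep the parameter count low).
        self.stage1 = nn.ModuleList(
            nn.Conv2d(ch, ch, 3, padding=r, dilation=r, groups=ch)
            for r in (1, 2))
        # Stage 2: larger rates applied on top of the fused stage-1 output.
        self.stage2 = nn.ModuleList(
            nn.Conv2d(ch, ch, 3, padding=r, dilation=r, groups=ch)
            for r in (3, 5))
        self.project = nn.Conv2d(2 * ch, ch, 1)

    def forward(self, x):
        # Fusing stage-1 outputs before dilating further keeps neighboring
        # pixels coupled, which is one way to suppress checkerboard artifacts
        # that arise when large-dilation branches act on the input directly.
        s1 = sum(conv(x) for conv in self.stage1)
        s2 = torch.cat([conv(s1) for conv in self.stage2], dim=1)
        return self.project(s2) + x
```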
arXiv Detail & Related papers (2023-02-23T05:34:51Z)
- Lightweight and Progressively-Scalable Networks for Semantic Segmentation [100.63114424262234]
Multi-scale learning frameworks have been regarded as a capable class of models to boost semantic segmentation.
In this paper, we thoroughly analyze the design of convolutional blocks and the ways of interactions across multiple scales.
We devise Lightweight and Progressively-Scalable Networks (LPS-Net) that expand the network complexity in a greedy manner.
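A hedged sketch of what greedy complexity expansion can look like in code; the `Config` fields, the candidate expansions, and the `evaluate`/`latency` callables are hypothetical stand-ins, not LPS-Net's actual search space or procedure.

```python
# Illustrative greedy expansion loop over hypothetical network knobs.
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Config:
    width: float = 1.0       # channel multiplier
    depth: int = 1           # blocks per stage
    resolution: float = 0.5  # input-scale factor

def greedy_expand(evaluate, latency, budget_ms, steps=10):
    """Repeatedly apply the single expansion that scores best within budget."""
    cfg = Config()
    for _ in range(steps):
        candidates = [
            replace(cfg, width=cfg.width * 1.25),
            replace(cfg, depth=cfg.depth + 1),
            replace(cfg, resolution=min(1.0, cfg.resolution + 0.125)),
        ]
        feasible = [c for c in candidates if latency(c) <= budget_ms]
        if not feasible:
            break  # every expansion would blow the latency budget
        cfg = max(feasible, key=evaluate)  # keep the most accurate expansion
    return cfg
```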
arXiv Detail & Related papers (2022-07-27T16:00:28Z)
- EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction [67.11722682878722]
This work presents EfficientViT, a new family of high-resolution vision models with novel multi-scale linear attention.
Our multi-scale linear attention achieves a global receptive field together with multi-scale learning.
EfficientViT delivers remarkable performance gains over previous state-of-the-art models.
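The key trick behind linear attention is replacing softmax with a kernel feature map (ReLU here), so that the K^T V product can be computed first and the cost becomes linear rather than quadratic in the token count. A minimal sketch, with all shapes assumed and EfficientViT's multi-scale aggregation around this kernel omitted.

```python
# Minimal ReLU linear attention kernel.
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    # q, k: (batch, tokens, dim); v: (batch, tokens, dim_v)
    q, k = F.relu(q), F.relu(k)
    kv = torch.einsum("bnd,bne->bde", k, v)                        # dim x dim_v summary
    z = 1.0 / (torch.einsum("bnd,bd->bn", q, k.sum(dim=1)) + eps)  # per-token normalizer
    return torch.einsum("bnd,bde,bn->bne", q, kv, z)
```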
arXiv Detail & Related papers (2022-05-29T20:07:23Z)
- Rethinking BiSeNet For Real-time Semantic Segmentation [6.622485130017622]
BiSeNet has proven to be a popular two-stream network for real-time segmentation.
We propose a novel structure named Short-Term Dense Concatenate network (STDC) by removing structure redundancy.
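A rough sketch of the concatenation pattern suggested by the name: a few cascaded convolutions with progressively halved widths whose outputs are concatenated, giving multiple receptive fields in one tensor. The exact widths, kernels, and strides in STDC differ; this is an assumption-laden illustration.

```python
# Sketch of a Short-Term Dense Concatenate style block.
import torch
import torch.nn as nn

class STDCBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # Widths N/2, N/4, N/8, N/8 so the concatenation sums back to N.
        widths = [out_ch // 2, out_ch // 4, out_ch // 8, out_ch // 8]
        self.convs = nn.ModuleList()
        prev = in_ch
        for i, c in enumerate(widths):
            k = 1 if i == 0 else 3  # cheap 1x1 first, then 3x3 convolutions
            self.convs.append(nn.Sequential(
                nn.Conv2d(prev, c, k, padding=k // 2, bias=False),
                nn.BatchNorm2d(c),
                nn.ReLU(inplace=True)))
            prev = c

    def forward(self, x):
        outs = []
        for conv in self.convs:
            x = conv(x)
            outs.append(x)  # keep every intermediate (short-term) feature
        return torch.cat(outs, dim=1)
```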
arXiv Detail & Related papers (2021-04-27T13:49:47Z)
- AttaNet: Attention-Augmented Network for Fast and Accurate Scene Parsing [12.409365458889082]
We propose a new model, called Attention-Augmented Network (AttaNet), to capture both global context and multilevel semantics.
AttaNet consists of two primary modules: the Strip Attention Module (SAM) and the Attention Fusion Module (AFM).
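A hedged sketch of the strip-attention idea: keys and values are pooled along one spatial axis into thin strips, so each pixel attends over W strip positions instead of H*W pixels. AttaNet's actual SAM likely differs in detail.

```python
# Generic strip attention: pool keys/values along the height axis.
import torch
import torch.nn as nn

class StripAttention(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.q = nn.Conv2d(ch, ch, 1)
        self.k = nn.Conv2d(ch, ch, 1)
        self.v = nn.Conv2d(ch, ch, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)        # (b, h*w, c)
        k = self.k(x).mean(dim=2)                       # pool height -> (b, c, w)
        v = self.v(x).mean(dim=2).transpose(1, 2)       # (b, w, c)
        attn = torch.softmax(q @ k / c ** 0.5, dim=-1)  # (b, h*w, w)
        out = (attn @ v).transpose(1, 2).reshape(b, c, h, w)
        return out + x  # residual, so the module is safe to drop in
```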
arXiv Detail & Related papers (2021-03-10T08:38:29Z)
- Real-time Semantic Segmentation with Fast Attention [94.88466483540692]
We propose a novel architecture for semantic segmentation of high-resolution images and videos in real-time.
The proposed architecture relies on our fast spatial attention, which is a simple yet efficient modification of the popular self-attention mechanism.
Results on multiple datasets demonstrate superior performance, with better accuracy and speed than existing approaches.
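One common reading of this modification is to L2-normalize queries and keys in place of softmax, which lets K^T V be computed first and drops the cost from quadratic to linear in the pixel count. A minimal sketch under that assumption, not the paper's exact formulation.

```python
# Fast-attention-style kernel: normalization instead of softmax.
import torch
import torch.nn.functional as F

def fast_attention(q, k, v):
    # q, k, v: (batch, tokens, dim); cost is O(n * d^2) instead of O(n^2 * d)
    n = q.shape[1]
    q = F.normalize(q, dim=-1)  # cosine-style similarity replaces softmax
    k = F.normalize(k, dim=-1)
    return q @ (k.transpose(1, 2) @ v) / n
```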
arXiv Detail & Related papers (2020-07-07T22:37:16Z)
- Real-time Semantic Segmentation via Spatial-detail Guided Context Propagation [49.70144583431999]
We propose the spatial-detail guided context propagation network (SGCPNet) for achieving real-time semantic segmentation.
It uses the spatial details of shallow layers to guide the propagation of the low-resolution global contexts, so that the lost spatial information can be effectively reconstructed.
It achieves 69.5% mIoU segmentation accuracy, while its speed reaches 178.5 FPS on 768x1536 images on a GeForce GTX 1080 Ti GPU card.
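A loose sketch of detail-guided propagation: low-resolution context is upsampled, then modulated by gates predicted from shallow high-resolution features so that spatial detail steers the reconstruction. The gating form and names are assumptions, not SGCPNet's exact design.

```python
# Hypothetical detail-guided context propagation block.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GuidedPropagation(nn.Module):
    def __init__(self, detail_ch, context_ch):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(detail_ch, context_ch, 3, padding=1),
            nn.Sigmoid())  # per-pixel, per-channel gate from spatial detail
        self.refine = nn.Conv2d(context_ch, context_ch, 3, padding=1)

    def forward(self, detail, context):
        # detail: shallow, high-res features; context: deep, low-res features
        context = F.interpolate(context, size=detail.shape[-2:],
                                mode="bilinear", align_corners=False)
        return self.refine(context * self.gate(detail)) + context
```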
arXiv Detail & Related papers (2020-05-22T07:07:26Z)
- BiSeNet V2: Bilateral Network with Guided Aggregation for Real-time Semantic Segmentation [118.46210049742993]
We propose an efficient and effective architecture with a good trade-off between speed and accuracy, termed Bilateral Segmentation Network (BiSeNet V2).
For a 2,048x1,024 input, we achieve 72.6% Mean IoU on the Cityscapes test set with a speed of 156 FPS on one NVIDIA GeForce 1080 Ti card, which is significantly faster than existing methods while achieving better segmentation accuracy.
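A compressed sketch of the bilateral layout: a wide, shallow detail branch at 1/8 resolution and a deep, narrow semantic branch at 1/32 resolution, merged before the segmentation head. The layer recipe and the simple additive merge below are stand-ins for the paper's actual blocks and guided aggregation.

```python
# Schematic two-branch (bilateral) segmentation network.
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_bn_relu(cin, cout, stride=1):
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class BilateralNet(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        # Detail branch: few layers, many channels, stride 8 overall.
        self.detail = nn.Sequential(
            conv_bn_relu(3, 64, 2), conv_bn_relu(64, 64, 2),
            conv_bn_relu(64, 128, 2))
        # Semantic branch: more layers, fewer channels, stride 32 overall.
        self.semantic = nn.Sequential(
            conv_bn_relu(3, 16, 2), conv_bn_relu(16, 32, 2),
            conv_bn_relu(32, 64, 2), conv_bn_relu(64, 128, 2),
            conv_bn_relu(128, 128, 2))
        self.head = nn.Conv2d(128, num_classes, 1)

    def forward(self, x):
        d = self.detail(x)    # 1/8 resolution, detail-rich
        s = self.semantic(x)  # 1/32 resolution, context-rich
        s = F.interpolate(s, size=d.shape[-2:], mode="bilinear",
                          align_corners=False)
        logits = self.head(d + s)  # simple aggregation stand-in
        return F.interpolate(logits, size=x.shape[-2:], mode="bilinear",
                             align_corners=False)
```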
arXiv Detail & Related papers (2020-04-05T10:26:38Z)
- FarSee-Net: Real-Time Semantic Segmentation by Efficient Multi-scale Context Aggregation and Feature Space Super-resolution [14.226301825772174]
We introduce a novel and efficient module called Cascaded Factorized Atrous Spatial Pyramid Pooling (CF-ASPP).
It is a lightweight cascaded structure for Convolutional Neural Networks (CNNs) to efficiently leverage context information.
We achieve 68.4% mIoU at 84 fps on the Cityscapes test set with a single Nvidia Titan X (Maxwell) GPU card.
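A hedged sketch of what a cascaded, factorized ASPP can look like: each atrous 3x3 convolution is factorized into a depthwise atrous convolution plus a 1x1 pointwise convolution, and two pyramid stages are cascaded so later branches operate on already-aggregated context. The rates and structure are assumptions about CF-ASPP, not its published recipe.

```python
# Illustrative cascaded, factorized atrous spatial pyramid pooling.
import torch
import torch.nn as nn

def factorized_atrous(ch, rate):
    return nn.Sequential(
        nn.Conv2d(ch, ch, 3, padding=rate, dilation=rate, groups=ch,
                  bias=False),                  # depthwise atrous conv
        nn.Conv2d(ch, ch, 1, bias=False),       # pointwise mixing
        nn.BatchNorm2d(ch), nn.ReLU(inplace=True))

class CascadedFactorizedASPP(nn.Module):
    def __init__(self, ch, rates=((2, 4), (8, 16))):
        super().__init__()
        self.stages = nn.ModuleList(
            nn.ModuleList(factorized_atrous(ch, r) for r in stage)
            for stage in rates)
        self.project = nn.Conv2d(ch * (1 + sum(len(s) for s in rates)), ch, 1)

    def forward(self, x):
        outs, cur = [x], x
        for stage in self.stages:
            branch = [conv(cur) for conv in stage]
            outs.extend(branch)
            cur = sum(branch)  # cascade: next stage sees aggregated context
        return self.project(torch.cat(outs, dim=1))
```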
arXiv Detail & Related papers (2020-03-09T03:53:57Z)