DWRSeg: Rethinking Efficient Acquisition of Multi-scale Contextual
Information for Real-time Semantic Segmentation
- URL: http://arxiv.org/abs/2212.01173v3
- Date: Wed, 13 Sep 2023 14:52:30 GMT
- Title: DWRSeg: Rethinking Efficient Acquisition of Multi-scale Contextual
Information for Real-time Semantic Segmentation
- Authors: Haoran Wei, Xu Liu, Shouchun Xu, Zhongjian Dai, Yaping Dai, Xiangyang
Xu
- Abstract summary: We propose a highly efficient multi-scale feature extraction method, which decomposes the original single-step method into two steps, Region Residualization-Semantic Residualization.
We achieve an mIoU of 72.7% on the Cityscapes test set at a speed of 319.5 FPS on one NVIDIA GeForce GTX 1080 Ti card, which exceeds the latest methods by 69.5 FPS in speed and 0.8% in mIoU.
- Score: 10.379708894083217
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many current works directly adopt multi-rate depth-wise dilated convolutions
to capture multi-scale contextual information simultaneously from one input
feature map, thus improving the feature extraction efficiency for real-time
semantic segmentation. However, this design may lead to difficult access to
multi-scale contextual information because of the unreasonable structure and
hyperparameters. To lower the difficulty of drawing multi-scale contextual
information, we propose a highly efficient multi-scale feature extraction
method, which decomposes the original single-step method into two steps, Region
Residualization-Semantic Residualization. In this method, the multi-rate
depth-wise dilated convolutions take a simpler role in feature extraction:
performing simple semantic-based morphological filtering with one desired
receptive field in the second step based on each concise feature map of region
form provided by the first step, to improve their efficiency. Moreover, the
dilation rates and the capacity of dilated convolutions for each network stage
are elaborated to fully utilize all the feature maps of region form that can be
achieved. Accordingly, we design a novel Dilation-wise Residual (DWR) module and
a Simple Inverted Residual (SIR) module for the high and low level network,
respectively, and form a powerful DWR Segmentation (DWRSeg) network. Extensive
experiments on the Cityscapes and CamVid datasets demonstrate the effectiveness
of our method by achieving a state-of-the-art trade-off between accuracy and
inference speed, in addition to being lighter weight. Without pretraining or
resorting to any training trick, we achieve an mIoU of 72.7% on the Cityscapes
test set at a speed of 319.5 FPS on one NVIDIA GeForce GTX 1080 Ti card, which
exceeds the latest methods by 69.5 FPS in speed and 0.8% in mIoU. The code and
trained models are publicly available.
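The building block named in the abstract, multi-rate depth-wise dilated convolution, can be illustrated with a minimal pure-Python sketch. This is not the authors' DWR module: the function names, the all-ones 3x3 kernel, and the dilation rates (1, 3, 5) are assumptions chosen for illustration. The sketch filters a single channel with one kernel at several dilation rates and reports the receptive field each rate covers.

```python
from typing import Dict, List, Sequence

Grid = List[List[float]]


def dilated_depthwise_conv(channel: Grid, kernel: Grid, dilation: int) -> Grid:
    """Filter ONE channel with ONE k x k kernel (depth-wise), zero-padded.

    The kernel taps are spaced `dilation` pixels apart, so the output has
    the same height and width as the input.
    """
    k = len(kernel)
    half = k // 2
    h, w = len(channel), len(channel[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for i in range(k):
                for j in range(k):
                    yy = y + (i - half) * dilation
                    xx = x + (j - half) * dilation
                    # Taps that fall outside the map read as zero padding.
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += kernel[i][j] * channel[yy][xx]
            out[y][x] = acc
    return out


def receptive_field(kernel_size: int, dilation: int) -> int:
    """Side length of the window one dilated kernel covers."""
    return (kernel_size - 1) * dilation + 1


def multi_rate_branches(
    channel: Grid, kernel: Grid, rates: Sequence[int]
) -> Dict[int, Grid]:
    """Run the same channel through one branch per dilation rate."""
    return {d: dilated_depthwise_conv(channel, kernel, d) for d in rates}


if __name__ == "__main__":
    # Hypothetical rates for illustration only.
    for d in (1, 3, 5):
        print(d, "->", receptive_field(3, d))  # 1 -> 3, 3 -> 7, 5 -> 11
```

Each branch touches only nine input pixels regardless of the rate, which is why dilation is a cheap way to widen the receptive field. How the branches are fed and merged is where the paper's two-step design comes in, and that is not reproduced here.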
Related papers
- Real-Time Scene Text Detection with Differentiable Binarization and
Adaptive Scale Fusion [62.269219152425556]
Segmentation-based scene text detection methods have drawn extensive attention in the scene text detection field.
We propose a Differentiable Binarization (DB) module that integrates the binarization process into a segmentation network.
An efficient Adaptive Scale Fusion (ASF) module is proposed to improve the scale robustness by fusing features of different scales adaptively.
arXiv Detail & Related papers (2022-02-21T15:30:14Z)
- A Novel Multi-Stage Training Approach for Human Activity Recognition
from Multimodal Wearable Sensor Data Using Deep Neural Network [11.946078871080836]
Deep neural networks are an effective choice for automatically recognizing human actions from the data of various wearable sensors.
In this paper, we propose a novel multi-stage training approach that increases diversity in this feature extraction process.
arXiv Detail & Related papers (2021-01-03T20:48:56Z)
- Multi-view Depth Estimation using Epipolar Spatio-Temporal Networks [87.50632573601283]
We present a novel method for multi-view depth estimation from a single video.
Our method achieves temporally coherent depth estimation results by using a novel Epipolar Spatio-Temporal (EST) transformer.
To reduce the computational cost, inspired by recent Mixture-of-Experts models, we design a compact hybrid network.
arXiv Detail & Related papers (2020-11-26T04:04:21Z)
- Adaptive Context-Aware Multi-Modal Network for Depth Completion [107.15344488719322]
We propose adopting graph propagation to capture the observed spatial contexts.
We then apply the attention mechanism on the propagation, which encourages the network to model the contextual information adaptively.
Finally, we introduce the symmetric gated fusion strategy to exploit the extracted multi-modal features effectively.
Our model, named Adaptive Context-Aware Multi-Modal Network (ACMNet), achieves state-of-the-art performance on two benchmarks.
arXiv Detail & Related papers (2020-08-25T06:00:06Z)
- Parameter Sharing Exploration and Hetero-Center based Triplet Loss for
Visible-Thermal Person Re-Identification [17.402673438396345]
This paper focuses on the visible-thermal cross-modality person re-identification (VT Re-ID) task.
Our proposed method distinctly outperforms the state-of-the-art methods by large margins.
arXiv Detail & Related papers (2020-08-14T07:40:35Z)
- Real-time Semantic Segmentation with Fast Attention [94.88466483540692]
We propose a novel architecture for semantic segmentation of high-resolution images and videos in real-time.
The proposed architecture relies on our fast spatial attention, which is a simple yet efficient modification of the popular self-attention mechanism.
Results on multiple datasets demonstrate better accuracy and speed than existing approaches.
arXiv Detail & Related papers (2020-07-07T22:37:16Z)
- MetricUNet: Synergistic Image- and Voxel-Level Learning for Precise CT
Prostate Segmentation via Online Sampling [66.01558025094333]
We propose a two-stage framework, with the first stage to quickly localize the prostate region and the second stage to precisely segment the prostate.
We introduce a novel online metric learning module through voxel-wise sampling in the multi-task network.
Our method can effectively learn more representative voxel-level features compared with the conventional learning methods with cross-entropy or Dice loss.
arXiv Detail & Related papers (2020-05-15T10:37:02Z)
- Real-Time High-Performance Semantic Image Segmentation of Urban Street
Scenes [98.65457534223539]
We propose a real-time high-performance DCNN-based method for robust semantic segmentation of urban street scenes.
The proposed method achieves 73.6% and 68.0% mean Intersection over Union (mIoU) at inference speeds of 51.0 fps and 39.3 fps, respectively.
arXiv Detail & Related papers (2020-03-11T08:45:53Z)
- FarSee-Net: Real-Time Semantic Segmentation by Efficient Multi-scale
Context Aggregation and Feature Space Super-resolution [14.226301825772174]
We introduce a novel and efficient module called Cascaded Factorized Atrous Spatial Pyramid Pooling (CF-ASPP).
It is a lightweight cascaded structure for Convolutional Neural Networks (CNNs) to efficiently leverage context information.
We achieve 68.4% mIoU at 84 fps on the Cityscapes test set with a single NVIDIA Titan X (Maxwell) GPU card.
arXiv Detail & Related papers (2020-03-09T03:53:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.