Rethinking Dilated Convolution for Real-time Semantic Segmentation
- URL: http://arxiv.org/abs/2111.09957v3
- Date: Mon, 27 Nov 2023 07:46:08 GMT
- Title: Rethinking Dilated Convolution for Real-time Semantic Segmentation
- Authors: Roland Gao
- Abstract summary: We take a different approach by using dilated convolutions with large dilation rates throughout the backbone.
Our model RegSeg achieves competitive results on real-time Cityscapes and CamVid datasets.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The field-of-view is an important metric when designing a model for semantic
segmentation. To obtain a large field-of-view, previous approaches generally
choose to rapidly downsample the resolution, usually with average pooling or
stride-2 convolutions. We take a different approach by using dilated
convolutions with large dilation rates throughout the backbone, allowing the
backbone to easily tune its field-of-view by adjusting its dilation rates, and
show that it's competitive with existing approaches. To effectively use the
dilated convolution, we show a simple upper bound on the dilation rate in order
to not leave gaps in between the convolutional weights, and design an
SE-ResNeXt inspired block structure that uses two parallel $3\times 3$
convolutions with different dilation rates to preserve the local details.
Manually tuning the dilation rates for every block can be difficult, so we also
introduce a differentiable neural architecture search method that uses gradient
descent to optimize the dilation rates. In addition, we propose a lightweight
decoder that restores local information better than common alternatives. To
demonstrate the effectiveness of our approach, our model RegSeg achieves
competitive results on real-time Cityscapes and CamVid datasets. Using a T4 GPU
with mixed precision, RegSeg achieves 78.3 mIOU on Cityscapes test set at $37$
FPS, and 80.9 mIOU on CamVid test set at $112$ FPS, both without ImageNet
pretraining.
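The abstract's two key ideas, a field-of-view that grows with the dilation rates and an upper bound on each rate to avoid gaps between convolutional weights, can be sketched with a short receptive-field calculation. This is an illustrative sketch, not the paper's exact formulation: the dilation rates below are made up, and the gap condition used here is the intuitive one (tap spacing must not exceed the per-tap coverage of the incoming features), which may differ from the bound derived in the paper.

```python
# Sketch: receptive-field growth of stacked stride-1 dilated 3x3 convolutions,
# plus a simple "no gaps" check. A 3x3 conv with dilation d has taps spaced
# d pixels apart, so each layer grows the receptive field by 2*d per side pair.
# Neighboring taps see overlapping (or touching) input windows as long as d
# does not exceed the receptive field r of the incoming features, since each
# tap covers r input pixels while taps sit d pixels apart.

def receptive_field(dilations):
    """Receptive field (side length, in pixels) after stacked dilated 3x3
    convs, and the indices of layers whose dilation leaves gaps."""
    r = 1  # a raw input pixel sees only itself
    gaps = []
    for i, d in enumerate(dilations):
        if d > r:  # tap spacing exceeds per-tap coverage -> holes in coverage
            gaps.append(i)
        r += 2 * d  # two extra taps on each axis, each d pixels further out
    return r, gaps

# Illustrative schedule: small dilations first, then large ones.
print(receptive_field([1, 1, 2, 4, 14]))  # (45, []) -- large FOV, no gaps
print(receptive_field([1, 10]))           # (23, [1]) -- dilation 10 > r = 3
```

Note how the second schedule reaches a sizable field-of-view but layer 1 violates the gap condition, which is the kind of configuration the paper's upper bound rules out.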
Related papers
- AIR-HLoc: Adaptive Retrieved Images Selection for Efficient Visual Localisation [8.789742514363777]
State-of-the-art hierarchical localisation pipelines (HLoc) employ image retrieval (IR) to establish 2D-3D correspondences.
This paper investigates the relationship between global and local descriptors.
We propose an adaptive strategy that adjusts $k$ based on the similarity between the query's global descriptor and those in the database.
arXiv Detail & Related papers (2024-03-27T06:17:21Z)
- GS-SLAM: Dense Visual SLAM with 3D Gaussian Splatting [51.96353586773191]
We introduce GS-SLAM, which first utilizes a 3D Gaussian representation in a Simultaneous Localization and Mapping (SLAM) system.
Our method utilizes a real-time differentiable splatting rendering pipeline that offers significant speedup to map optimization and RGB-D rendering.
Our method achieves competitive performance compared with existing state-of-the-art real-time methods on the Replica, TUM-RGBD datasets.
arXiv Detail & Related papers (2023-11-20T12:08:23Z)
- FocusTune: Tuning Visual Localization through Focus-Guided Sampling [61.79440120153917]
FocusTune is a focus-guided sampling technique to improve the performance of visual localization algorithms.
We demonstrate that FocusTune improves or matches state-of-the-art performance whilst keeping ACE's appealing low storage and compute requirements.
This combination of high performance and low compute and storage requirements is particularly promising for applications in areas like mobile robotics and augmented reality.
arXiv Detail & Related papers (2023-11-06T04:58:47Z)
- MixSTE: Seq2seq Mixed Spatio-Temporal Encoder for 3D Human Pose Estimation in Video [75.23812405203778]
Recent solutions estimate 3D human pose from a 2D keypoint sequence by considering body joints among all frames globally to learn spatio-temporal correlation.
We propose MixSTE, which has a temporal transformer block to separately model the temporal motion of each joint and a spatial transformer block to model inter-joint spatial correlation.
In addition, the network output is extended from the central frame to all frames of the input video, improving the coherence between the input and output.
arXiv Detail & Related papers (2022-03-02T04:20:59Z)
- Correlate-and-Excite: Real-Time Stereo Matching via Guided Cost Volume Excitation [65.83008812026635]
We construct Guided Cost volume Excitation (GCE) and show that simple channel excitation of cost volume guided by image can improve performance considerably.
We present an end-to-end network that we call Correlate-and-Excite (CoEx).
arXiv Detail & Related papers (2021-08-12T14:32:26Z)
- Sequential Place Learning: Heuristic-Free High-Performance Long-Term Place Recognition [24.70946979449572]
We develop a learning-based CNN+LSTM architecture, trainable via backpropagation through time, for viewpoint- and appearance-invariant place recognition.
Our model outperforms 15 classical methods while setting new state-of-the-art performance standards.
In addition, we show that SPL can be up to 70x faster to deploy than classical methods on a 729 km route.
arXiv Detail & Related papers (2021-03-02T22:57:43Z)
- Inception Convolution with Efficient Dilation Search [121.41030859447487]
Dilated convolution is a critical variant of the standard convolutional neural network for controlling effective receptive fields and handling the large scale variance of objects.
We propose a new variant of dilated convolution, namely inception (dilated) convolution, where the convolutions have independent dilation rates among different axes, channels and layers.
To fit the complex inception convolution to the data, we develop a simple yet effective dilation search algorithm (EDO) based on statistical optimization.
arXiv Detail & Related papers (2020-12-25T14:58:35Z)
- Displacement-Invariant Cost Computation for Efficient Stereo Matching [122.94051630000934]
Deep learning methods have dominated stereo matching leaderboards by yielding unprecedented disparity accuracy.
But their inference time is typically slow, on the order of seconds for a pair of 540p images.
We propose a displacement-invariant cost module to compute the matching costs without needing a 4D feature volume.
arXiv Detail & Related papers (2020-12-01T23:58:16Z)
- AANet: Adaptive Aggregation Network for Efficient Stereo Matching [33.39794232337985]
Current state-of-the-art stereo models are mostly based on costly 3D convolutions.
We propose a sparse points based intra-scale cost aggregation method to alleviate the edge-fattening issue.
We also approximate traditional cross-scale cost aggregation algorithm with neural network layers to handle large textureless regions.
arXiv Detail & Related papers (2020-04-20T18:07:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.