MSMD-Net: Deep Stereo Matching with Multi-scale and Multi-dimension Cost Volume
- URL: http://arxiv.org/abs/2006.12797v2
- Date: Fri, 25 Sep 2020 11:21:08 GMT
- Title: MSMD-Net: Deep Stereo Matching with Multi-scale and Multi-dimension Cost Volume
- Authors: Zhelun Shen, Yuchao Dai, Zhibo Rao
- Abstract summary: We propose MSMD-Net to construct multi-scale and multi-dimension cost volumes.
Our method shows strong cross-domain generalization and outperforms the best prior work by a notable margin while running three to five times faster.
- Score: 33.07553434167063
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep end-to-end learning based stereo matching methods have achieved great success, as witnessed by the leaderboards of different benchmark datasets (KITTI, Middlebury, ETH3D, etc.). However, real scenarios require approaches to deliver not only state-of-the-art accuracy but also real-time speed and cross-domain generalization, which existing methods cannot satisfy. In this paper, we propose MSMD-Net (Multi-Scale and Multi-Dimension) to construct multi-scale and multi-dimension cost volumes. At the multi-scale level, we generate four 4D combination volumes at different scales and integrate them with an encoder-decoder process to predict an initial disparity estimate. At the multi-dimension level, we additionally construct a 3D warped correlation volume and use it to refine the initial disparity map with residual learning. These two cost volumes of different dimensions are complementary to each other and boost the performance of disparity estimation. Additionally, we propose a switch training strategy to alleviate the overfitting issue that appears in the pre-training process and to further improve the generalization ability and accuracy of the final disparity estimation. Our proposed method was evaluated on several benchmark datasets and ranked first on the KITTI 2012 leaderboard and second on the KITTI 2015 leaderboard as of September 9, 2020. In addition, our method shows strong cross-domain generalization and outperforms the best prior work by a noteworthy margin while running three to five times faster. The code of MSMD-Net is available at https://github.com/gallenszl/MSMD-Net.
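As a rough illustration of the two cost-volume types described in the abstract, the following PyTorch sketch builds a single-scale 4D combination (concatenation) volume and a 3D correlation volume from features warped by an initial disparity map. This is a reconstruction for illustration only, not the authors' code: the function names, feature shapes, and the correlation search radius are assumptions made for the example.

```python
# Minimal sketch of the two cost-volume types (assumed helpers, not MSMD-Net's code).
import torch
import torch.nn.functional as F


def build_combination_volume(left_feat, right_feat, max_disp):
    """4D 'combination' (concatenation) volume: [B, 2C, D, H, W].

    left_feat, right_feat: [B, C, H, W] feature maps at one scale.
    max_disp: number of disparity hypotheses at this scale.
    """
    B, C, H, W = left_feat.shape
    volume = left_feat.new_zeros(B, 2 * C, max_disp, H, W)
    for d in range(max_disp):
        if d == 0:
            volume[:, :C, d] = left_feat
            volume[:, C:, d] = right_feat
        else:
            # Left pixel x matches right pixel x - d.
            volume[:, :C, d, :, d:] = left_feat[:, :, :, d:]
            volume[:, C:, d, :, d:] = right_feat[:, :, :, :-d]
    return volume


def warp_right_to_left(right_feat, disparity):
    """Warp right-view features to the left view by a horizontal shift
    given by a disparity map of shape [B, 1, H, W]."""
    B, C, H, W = right_feat.shape
    xs = torch.linspace(0, W - 1, W, device=right_feat.device)
    ys = torch.linspace(0, H - 1, H, device=right_feat.device)
    grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")
    grid_x = grid_x.unsqueeze(0) - disparity.squeeze(1)   # shift left by disparity
    grid_y = grid_y.unsqueeze(0).expand(B, -1, -1)
    # Normalize to [-1, 1] for grid_sample.
    grid = torch.stack(
        (2.0 * grid_x / (W - 1) - 1.0, 2.0 * grid_y / (H - 1) - 1.0), dim=-1
    )
    return F.grid_sample(right_feat, grid, align_corners=True)


def build_warped_correlation_volume(left_feat, right_feat, disparity, radius=4):
    """3D correlation volume around an initial disparity: [B, 2*radius+1, H, W].

    Correlates left features with right features warped by
    (initial disparity + residual offset); a refinement network can
    consume this to predict a disparity residual.
    """
    costs = []
    for offset in range(-radius, radius + 1):
        warped = warp_right_to_left(right_feat, disparity + offset)
        costs.append((left_feat * warped).mean(dim=1))     # [B, H, W]
    return torch.stack(costs, dim=1)
```

In the full method, the combination volume would be built at four scales and fused by the encoder-decoder, while the warped correlation volume feeds the residual refinement stage; the sketch only shows the volume construction itself.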
Related papers
- MM-Mixing: Multi-Modal Mixing Alignment for 3D Understanding [64.65145700121442]
We introduce MM-Mixing, a multi-modal mixing alignment framework for 3D understanding.
Our proposed two-stage training pipeline combines feature-level and input-level mixing to optimize the 3D encoder.
We demonstrate that MM-Mixing significantly improves baseline performance across various learning scenarios.
arXiv Detail & Related papers (2024-05-28T18:44:15Z)
- FULLER: Unified Multi-modality Multi-task 3D Perception via Multi-level Gradient Calibration [89.4165092674947]
Multi-modality fusion and multi-task learning are becoming prominent in the 3D autonomous driving scenario.
Previous works manually coordinate the learning framework with empirical knowledge, which may lead to sub-optimal solutions.
We propose a novel yet simple multi-level gradient calibration learning framework across tasks and modalities during optimization.
arXiv Detail & Related papers (2023-07-31T12:50:15Z)
- 3M: Multi-loss, Multi-path and Multi-level Neural Networks for speech recognition [31.992543274210835]
We identify and integrate several approaches to achieve further improvements for ASR tasks.
Specifically, multi-loss refers to the joint CTC/AED loss and multi-path denotes the Mixture-of-Experts (MoE) architecture.
We evaluate the proposed method on the public WenetSpeech dataset, and experimental results show that it provides a 12.2%-17.6% relative CER improvement.
arXiv Detail & Related papers (2022-04-07T03:10:49Z)
- Curvature-guided dynamic scale networks for Multi-view Stereo [10.667165962654996]
This paper focuses on learning a robust feature extraction network to enhance the performance of matching costs without heavy computation.
We present a dynamic scale feature extraction network, namely, CDSFNet.
It is composed of multiple novel convolution layers, each of which can select a proper patch scale for each pixel guided by the normal curvature of the image surface.
arXiv Detail & Related papers (2021-12-11T14:41:05Z)
- IterMVS: Iterative Probability Estimation for Efficient Multi-View Stereo [71.84742490020611]
IterMVS is a new data-driven method for high-resolution multi-view stereo.
We propose a novel GRU-based estimator that encodes pixel-wise probability distributions of depth in its hidden state.
We verify the efficiency and effectiveness of our method on DTU, Tanks&Temples and ETH3D.
arXiv Detail & Related papers (2021-12-09T18:58:02Z)
- Encoder-decoder with Multi-level Attention for 3D Human Shape and Pose Estimation [61.98690211671168]
We propose a Multi-level Attention-Decoder Network (MAED) to model multi-level attentions in a unified framework.
With the training set of 3DPW, MAED outperforms previous state-of-the-art methods by 6.2, 7.2, and 2.4 mm of PA-MPJPE.
arXiv Detail & Related papers (2021-09-06T09:06:17Z)
- 3D Point Cloud Registration with Multi-Scale Architecture and Self-supervised Fine-tuning [5.629161809575013]
MS-SVConv is a fast multi-scale deep neural network that outputs features from point clouds for 3D registration between two scenes.
We show significant improvements compared to state-of-the-art methods on the competitive and well-known 3DMatch benchmark.
We present a strategy to fine-tune MS-SVConv on unknown datasets in a self-supervised way, which leads to state-of-the-art results on ETH and TUM datasets.
arXiv Detail & Related papers (2021-03-26T15:38:33Z)
- Full Matching on Low Resolution for Disparity Estimation [84.45201205560431]
A Multistage Full Matching disparity estimation scheme (MFM) is proposed in this work.
We demonstrate that all similarity scores can be decoupled directly from the low-resolution 4D volume step by step, instead of estimating a low-resolution 3D cost volume.
Experimental results demonstrate that the proposed method achieves more accurate disparity estimation and outperforms state-of-the-art methods on the Scene Flow, KITTI 2012 and KITTI 2015 datasets.
arXiv Detail & Related papers (2020-12-10T11:11:23Z)
- Displacement-Invariant Cost Computation for Efficient Stereo Matching [122.94051630000934]
Deep learning methods have dominated stereo matching leaderboards by yielding unprecedented disparity accuracy.
But their inference time is typically slow, on the order of seconds for a pair of 540p images.
We propose a displacement-invariant cost module to compute the matching costs without needing a 4D feature volume.
arXiv Detail & Related papers (2020-12-01T23:58:16Z)
- HITNet: Hierarchical Iterative Tile Refinement Network for Real-time Stereo Matching [18.801346154045138]
HITNet is a novel neural network architecture for real-time stereo matching.
Our architecture is inherently multi-resolution allowing the propagation of information across different levels.
At the time of writing, HITNet ranks 1st-3rd on all the metrics published on the ETH3D website for two-view stereo.
arXiv Detail & Related papers (2020-07-23T17:11:48Z)