Accurate and Efficient Stereo Matching via Attention Concatenation Volume
- URL: http://arxiv.org/abs/2209.12699v3
- Date: Mon, 20 Nov 2023 06:26:47 GMT
- Title: Accurate and Efficient Stereo Matching via Attention Concatenation Volume
- Authors: Gangwei Xu, Yun Wang, Junda Cheng, Jinhui Tang, Xin Yang
- Abstract summary: We present a novel cost volume construction method, named attention concatenation volume (ACV).
ACV generates attention weights from correlation clues to suppress redundant information and enhance matching-related information in the concatenation volume.
We further design a fast version of ACV, named Fast-ACV, which enables real-time performance and generates high-likelihood disparity hypotheses.
- Score: 33.615312186946866
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Stereo matching is a fundamental building block for many vision and
robotics applications. An informative and concise cost volume representation is
vital for stereo matching with high accuracy and efficiency. In this paper, we
present a novel cost volume construction method, named attention concatenation
volume (ACV), which generates attention weights from correlation clues to
suppress redundant information and enhance matching-related information in the
concatenation volume. The ACV can be seamlessly embedded into most stereo
matching networks; the resulting networks can use a more lightweight
aggregation network while achieving higher accuracy. We further design a fast
version of ACV, named Fast-ACV, to enable real-time performance: it generates
high-likelihood disparity hypotheses and the corresponding attention weights
from low-resolution correlation clues, which significantly reduces
computational and memory cost while maintaining satisfactory accuracy. The core
idea of Fast-ACV is volume attention propagation (VAP), which automatically
selects accurate correlation values from an upsampled correlation volume and
propagates them to surrounding pixels with ambiguous correlation clues.
Furthermore, we design a highly accurate network, ACVNet, and a real-time
network, Fast-ACVNet, based on ACV and Fast-ACV respectively; they achieve
state-of-the-art performance on several benchmarks (i.e., our ACVNet ranks 2nd
on KITTI 2015 and Scene Flow and 3rd on KITTI 2012 and ETH3D among all
published methods; our Fast-ACVNet outperforms almost all state-of-the-art
real-time methods on Scene Flow, KITTI 2012 and 2015, and meanwhile has better
generalization ability).
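To make the ACV idea concrete, here is a minimal PyTorch sketch (not the authors' implementation): attention weights predicted from a correlation volume reweight a concatenation volume built from the same left/right features. The feature shapes, maximum disparity, and single-convolution attention head are illustrative assumptions.
```python
import torch
import torch.nn as nn

def build_correlation_volume(fl, fr, max_disp):
    """Single-channel correlation: channel-wise mean of the left/right
    feature product at each candidate disparity. Output: (B, 1, D, H, W)."""
    B, C, H, W = fl.shape
    vol = fl.new_zeros(B, 1, max_disp, H, W)
    for d in range(max_disp):
        if d == 0:
            vol[:, 0, d] = (fl * fr).mean(dim=1)
        else:
            vol[:, 0, d, :, d:] = (fl[:, :, :, d:] * fr[:, :, :, :-d]).mean(dim=1)
    return vol

def build_concat_volume(fl, fr, max_disp):
    """Concatenation volume: left features stacked with shifted right features.
    Output: (B, 2C, D, H, W)."""
    B, C, H, W = fl.shape
    vol = fl.new_zeros(B, 2 * C, max_disp, H, W)
    for d in range(max_disp):
        if d == 0:
            vol[:, :C, d] = fl
            vol[:, C:, d] = fr
        else:
            vol[:, :C, d, :, d:] = fl[:, :, :, d:]
            vol[:, C:, d, :, d:] = fr[:, :, :, :-d]
    return vol

class AttentionConcatVolume(nn.Module):
    """Hypothetical ACV-style filtering: attention weights predicted from the
    correlation volume modulate the concatenation volume."""
    def __init__(self, max_disp=48):
        super().__init__()
        self.max_disp = max_disp
        # Assumed attention head: a small 3D conv over the correlation volume.
        self.att_head = nn.Conv3d(1, 1, kernel_size=3, padding=1)

    def forward(self, fl, fr):
        corr = build_correlation_volume(fl, fr, self.max_disp)   # (B,1,D,H,W)
        concat = build_concat_volume(fl, fr, self.max_disp)      # (B,2C,D,H,W)
        att = torch.sigmoid(self.att_head(corr))                 # (B,1,D,H,W)
        return att * concat                                       # filtered volume

# Usage with random features at an assumed 1/4 resolution.
fl = torch.randn(1, 32, 64, 128)
fr = torch.randn(1, 32, 64, 128)
acv = AttentionConcatVolume(max_disp=48)
print(acv(fl, fr).shape)  # torch.Size([1, 64, 48, 64, 128])
```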
Related papers
- FasterViT: Fast Vision Transformers with Hierarchical Attention [63.50580266223651]
We design a new family of hybrid CNN-ViT neural networks, named FasterViT, with a focus on high image throughput for computer vision (CV) applications.
Our newly introduced Hierarchical Attention (HAT) approach decomposes global self-attention with quadratic complexity into a multi-level attention with reduced computational costs.
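As a rough, non-authoritative sketch of this kind of decomposition (not the actual HAT module), the snippet below runs self-attention inside fixed-size windows and then a second attention over per-window summary tokens; the window size, mean pooling, and use of a plain nn.MultiheadAttention are assumptions.
```python
import torch
import torch.nn as nn

class HierarchicalAttentionSketch(nn.Module):
    """Two-level attention: self-attention within fixed windows plus
    self-attention among per-window summary tokens for global mixing.
    Illustrative approximation only, not the FasterViT HAT block."""
    def __init__(self, dim=64, window=16, heads=4):
        super().__init__()
        self.window = window
        self.local_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                       # x: (B, N, dim), N divisible by window
        B, N, D = x.shape
        w = self.window
        xw = x.view(B * N // w, w, D)           # split the sequence into windows
        local, _ = self.local_attn(xw, xw, xw)  # quadratic only within each window
        summary = local.mean(dim=1).view(B, N // w, D)  # one token per window
        mixed, _ = self.global_attn(summary, summary, summary)
        # Broadcast the globally mixed summaries back onto their windows.
        return local.view(B, N, D) + mixed.repeat_interleave(w, dim=1)

x = torch.randn(2, 256, 64)                     # assumed token sequence
print(HierarchicalAttentionSketch()(x).shape)   # torch.Size([2, 256, 64])
```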
arXiv Detail & Related papers (2023-06-09T18:41:37Z)
- Perceiver-VL: Efficient Vision-and-Language Modeling with Iterative Latent Attention [100.81495948184649]
We present Perceiver-VL, a vision-and-language framework that efficiently handles high-dimensional multimodal inputs such as long videos and text.
Our framework scales with linear complexity, in contrast to the quadratic complexity of self-attention used in many state-of-the-art transformer-based models.
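A minimal sketch of why latent cross-attention scales linearly: a fixed number of latent vectors attend to the (possibly very long) input, so only the cross-attention term grows with input length. The latent count, dimensions, and iteration count below are assumptions, not Perceiver-VL's configuration.
```python
import torch
import torch.nn as nn

class LatentCrossAttentionSketch(nn.Module):
    """A fixed-size latent array cross-attends to an arbitrarily long input,
    so cost grows linearly with input length (illustrative sketch only)."""
    def __init__(self, dim=128, num_latents=64, heads=4, iters=2):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.iters = iters

    def forward(self, inputs):                        # inputs: (B, N, dim), N can be large
        B = inputs.size(0)
        z = self.latents.unsqueeze(0).expand(B, -1, -1)
        for _ in range(self.iters):                   # iterative attention
            z = z + self.cross(z, inputs, inputs)[0]  # O(num_latents * N)
            z = z + self.self_attn(z, z, z)[0]        # O(num_latents^2), independent of N
        return z

video_and_text = torch.randn(2, 4096, 128)            # assumed flattened multimodal tokens
print(LatentCrossAttentionSketch()(video_and_text).shape)  # torch.Size([2, 64, 128])
```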
arXiv Detail & Related papers (2022-11-21T18:22:39Z)
- Streaming Audio-Visual Speech Recognition with Alignment Regularization [69.30185151873707]
We propose a streaming AV-ASR system based on a hybrid connectionist temporal classification (CTC)/attention neural network architecture.
The proposed AV-ASR model achieves WERs of 2.0% and 2.6% on the Lip Reading Sentences 3 dataset in an offline and an online setup, respectively.
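A common way to train such a hybrid model is a weighted sum of a CTC loss on the encoder outputs and a cross-entropy loss on the attention decoder outputs; the sketch below illustrates that combination with assumed tensor layouts and an assumed 0.3 CTC weight, not the paper's exact recipe.
```python
import torch
import torch.nn.functional as F

def hybrid_ctc_attention_loss(ctc_log_probs, ctc_in_lens, decoder_logits,
                              targets, target_lens, blank=0, ctc_weight=0.3):
    """Weighted sum of a CTC loss (encoder branch) and cross-entropy
    (attention decoder branch), as in hybrid CTC/attention training.
    The 0.3 weight and tensor layouts are assumptions for illustration."""
    # ctc_log_probs: (T, B, V) log-softmax over vocab; decoder_logits: (B, L, V)
    ctc = F.ctc_loss(ctc_log_probs, targets, ctc_in_lens, target_lens, blank=blank)
    ce = F.cross_entropy(decoder_logits.transpose(1, 2), targets,
                         ignore_index=-100)   # pad targets with -100 in practice
    return ctc_weight * ctc + (1.0 - ctc_weight) * ce

# Toy shapes: T=50 encoder frames, batch B=2, vocab V=30, label length L=10.
T, B, V, L = 50, 2, 30, 10
ctc_log_probs = torch.randn(T, B, V).log_softmax(-1)
targets = torch.randint(1, V, (B, L))
loss = hybrid_ctc_attention_loss(ctc_log_probs, torch.full((B,), T),
                                 torch.randn(B, L, V), targets,
                                 torch.full((B,), L))
print(loss)
```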
arXiv Detail & Related papers (2022-11-03T20:20:47Z)
- ACVNet: Attention Concatenation Volume for Accurate and Efficient Stereo Matching [7.39503547452922]
We present a novel cost volume construction method which generates attention weights from correlation clues to suppress redundant information.
To generate reliable attention weights, we propose multi-level adaptive patch matching to improve the distinctiveness of the matching cost.
The proposed cost volume is named attention concatenation volume (ACV) which can be seamlessly embedded into most stereo matching networks.
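The summary does not spell out the patch-matching details; purely as an illustration of the general idea, the sketch below correlates dilated 3x3 patch descriptors at several levels instead of single-pixel features, which is one way to make the matching cost more distinctive. The kernel size, dilation levels, and averaging are assumptions.
```python
import torch
import torch.nn.functional as F

def patch_descriptors(f, dil):
    """Flatten a dilated 3x3 neighbourhood into the channel dimension."""
    B, C, H, W = f.shape
    patches = F.unfold(f, kernel_size=3, dilation=dil, padding=dil)  # (B, C*9, H*W)
    return patches.view(B, C * 9, H, W)

def multilevel_patch_correlation(fl, fr, max_disp=48, dilations=(1, 2, 3)):
    """Correlation cost over 3x3 patches at several dilation levels
    (an illustrative stand-in for multi-level patch matching)."""
    B, C, H, W = fl.shape
    cost = fl.new_zeros(B, max_disp, H, W)
    for dil in dilations:
        pl, pr = patch_descriptors(fl, dil), patch_descriptors(fr, dil)
        for d in range(max_disp):
            if d == 0:
                cost[:, d] += (pl * pr).mean(dim=1)
            else:
                cost[:, d, :, d:] += (pl[:, :, :, d:] * pr[:, :, :, :-d]).mean(dim=1)
    return cost / len(dilations)

fl, fr = torch.randn(1, 16, 40, 80), torch.randn(1, 16, 40, 80)
print(multilevel_patch_correlation(fl, fr, max_disp=24).shape)  # torch.Size([1, 24, 40, 80])
```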
arXiv Detail & Related papers (2022-03-04T06:28:58Z)
- Multi-scale Iterative Residuals for Fast and Scalable Stereo Matching [13.76996108304056]
This paper presents an iterative multi-scale coarse-to-fine refinement (iCFR) framework to bridge the gap between accuracy and speed in stereo matching.
We use multi-scale warped features to estimate disparity residuals and restrict the disparity search range in the cost volume to a minimum.
Finally, we apply a refinement network to recover the loss of precision which is inherent in multi-scale approaches.
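A hedged sketch of the coarse-to-fine pattern described above: at each scale, the right-view features are warped with the current disparity estimate and a small network predicts a disparity residual. The warping details, residual head, and 2x scale factor are assumptions rather than the iCFR implementation.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def warp_right_to_left(fr, disparity):
    """Warp right-view features to the left view using the current disparity."""
    B, C, H, W = fr.shape
    xs = torch.arange(W, device=fr.device).view(1, 1, W).expand(B, H, W).float()
    ys = torch.arange(H, device=fr.device).view(1, H, 1).expand(B, H, W).float()
    x_src = xs - disparity.squeeze(1)                  # sample right image at x - d
    grid = torch.stack([2 * x_src / (W - 1) - 1, 2 * ys / (H - 1) - 1], dim=-1)
    return F.grid_sample(fr, grid, align_corners=True)

class ResidualHead(nn.Module):
    """Placeholder residual predictor: left features, warped right features and
    the upsampled disparity go in, a disparity residual comes out."""
    def __init__(self, c):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(2 * c + 1, 32, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(32, 1, 3, padding=1))

    def forward(self, fl, fr_warped, disp):
        return self.net(torch.cat([fl, fr_warped, disp], dim=1))

def coarse_to_fine(disp_coarse, left_feats, right_feats, heads):
    """Refine a coarse disparity map scale by scale with warped-feature residuals."""
    disp = disp_coarse
    for fl, fr, head in zip(left_feats, right_feats, heads):    # coarse -> fine
        disp = 2.0 * F.interpolate(disp, size=fl.shape[-2:], mode='bilinear',
                                   align_corners=True)          # upsample, rescale by 2
        disp = disp + head(fl, warp_right_to_left(fr, disp), disp)
    return disp

# Toy usage: three scales of 16-channel features (shapes are assumptions).
left = [torch.randn(1, 16, h, 2 * h) for h in (32, 64, 128)]
right = [torch.randn(1, 16, h, 2 * h) for h in (32, 64, 128)]
heads = [ResidualHead(16) for _ in left]
print(coarse_to_fine(torch.zeros(1, 1, 16, 32), left, right, heads).shape)
```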
arXiv Detail & Related papers (2021-10-25T09:54:17Z)
- Correlate-and-Excite: Real-Time Stereo Matching via Guided Cost Volume Excitation [65.83008812026635]
We construct Guided Cost volume Excitation (GCE) and show that simple channel excitation of the cost volume, guided by the image, can improve performance considerably.
We present an end-to-end network that we call Correlate-and-Excite (CoEx).
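A minimal sketch of the guided-excitation idea, assuming a squeeze-and-excitation-style gate: image features predict per-channel weights that rescale the cost volume. The 1x1 projection and shapes are illustrative, not the exact GCE layer.
```python
import torch
import torch.nn as nn

class GuidedCostExcitationSketch(nn.Module):
    """Illustrative guided excitation: image features predict per-channel,
    per-pixel gates that rescale the cost volume (not the exact GCE layer)."""
    def __init__(self, img_channels, cost_channels):
        super().__init__()
        self.gate = nn.Sequential(nn.Conv2d(img_channels, cost_channels, 1),
                                  nn.Sigmoid())

    def forward(self, cost_volume, img_feat):
        # cost_volume: (B, C, D, H, W), img_feat: (B, C_img, H, W)
        g = self.gate(img_feat).unsqueeze(2)        # (B, C, 1, H, W)
        return cost_volume * g                      # broadcast over disparity dim

cost = torch.randn(1, 8, 24, 64, 128)               # assumed cost volume
img = torch.randn(1, 32, 64, 128)                    # assumed image features
print(GuidedCostExcitationSketch(32, 8)(cost, img).shape)  # torch.Size([1, 8, 24, 64, 128])
```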
arXiv Detail & Related papers (2021-08-12T14:32:26Z)
- SCV-Stereo: Learning Stereo Matching from a Sparse Cost Volume [14.801038005597855]
Convolutional neural network (CNN)-based stereo matching approaches generally require a dense cost volume (DCV) for disparity estimation.
We propose SCV-Stereo, a novel CNN architecture, capable of learning dense stereo matching from sparse cost volume representations.
Our inspiration is derived from the fact that DCV representations are somewhat redundant and can be replaced with SCV representations.
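As an illustration of the sparse-volume idea (not SCV-Stereo's actual representation), the sketch below builds a dense correlation volume and keeps only the top-k disparity candidates per pixel; k and the correlation measure are assumptions.
```python
import torch

def sparse_cost_volume(fl, fr, max_disp=48, k=8):
    """Keep only the k best-matching disparities per pixel instead of the full
    dense cost volume (illustrative sketch of the sparse-volume idea)."""
    B, C, H, W = fl.shape
    # Invalid disparities near the left border stay at -inf and rank last.
    dense = fl.new_full((B, max_disp, H, W), float('-inf'))
    for d in range(max_disp):
        if d == 0:
            dense[:, d] = (fl * fr).mean(dim=1)
        else:
            dense[:, d, :, d:] = (fl[:, :, :, d:] * fr[:, :, :, :-d]).mean(dim=1)
    costs, disps = dense.topk(k, dim=1)              # (B, k, H, W) values and indices
    return costs, disps

fl, fr = torch.randn(1, 16, 40, 80), torch.randn(1, 16, 40, 80)
costs, disps = sparse_cost_volume(fl, fr)
print(costs.shape, disps.shape)                      # torch.Size([1, 8, 40, 80]) twice
```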
arXiv Detail & Related papers (2021-07-17T05:45:44Z)
- ES-Net: An Efficient Stereo Matching Network [4.8986598953553555]
Existing stereo matching networks typically use slow and computationally expensive 3D convolutions to improve the performance.
We propose the Efficient Stereo Network (ESNet), which achieves high performance and efficient inference at the same time.
arXiv Detail & Related papers (2021-03-05T20:11:39Z)
- Multi-Task Network for Noise-Robust Keyword Spotting and Speaker Verification using CTC-based Soft VAD and Global Query Attention [13.883985850789443]
Keyword spotting (KWS) and speaker verification (SV) have been studied independently, but the acoustic and speaker domains are complementary.
We propose a multi-task network that performs KWS and SV simultaneously to fully utilize the interrelated domain information.
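A minimal sketch of such a multi-task setup, assuming a shared encoder with a frame-level KWS head and a pooled speaker-embedding head; the GRU encoder, layer sizes, and mean pooling are placeholders, not the paper's architecture (which also uses a CTC-based soft VAD and global query attention).
```python
import torch
import torch.nn as nn

class MultiTaskKwsSvSketch(nn.Module):
    """Shared acoustic encoder with two task heads: keyword posteriors for KWS
    and an utterance-level embedding for speaker verification (illustrative only)."""
    def __init__(self, feat_dim=40, hidden=128, num_keywords=12, emb_dim=64):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, num_layers=2, batch_first=True)
        self.kws_head = nn.Linear(hidden, num_keywords)
        self.sv_head = nn.Linear(hidden, emb_dim)

    def forward(self, x):                        # x: (B, T, feat_dim)
        h, _ = self.encoder(x)                   # frame-level shared representation
        kws_logits = self.kws_head(h)            # per-frame keyword scores
        spk_emb = self.sv_head(h.mean(dim=1))    # pooled utterance embedding
        return kws_logits, spk_emb

x = torch.randn(2, 100, 40)                      # assumed log-mel features
kws, emb = MultiTaskKwsSvSketch()(x)
print(kws.shape, emb.shape)                      # (2, 100, 12) and (2, 64)
```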
arXiv Detail & Related papers (2020-05-08T05:58:46Z)
- Toward fast and accurate human pose estimation via soft-gated skip connections [97.06882200076096]
This paper is on highly accurate and highly efficient human pose estimation.
We re-analyze the skip-connection design choice in the context of improving both accuracy and efficiency over the state of the art.
Our model achieves state-of-the-art results on the MPII and LSP datasets.
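As a hedged illustration of what a soft-gated skip connection might look like (not the paper's exact block), the sketch below scales the identity path of a residual block by a learnable gate; the scalar gate and its initialization are assumptions.
```python
import torch
import torch.nn as nn

class SoftGatedSkip(nn.Module):
    """Residual block whose skip path is scaled by a learnable gate
    (an illustrative reading of soft-gated skip connections)."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                                  nn.ReLU(),
                                  nn.Conv2d(channels, channels, 3, padding=1))
        self.alpha = nn.Parameter(torch.ones(1))   # assumed scalar gate

    def forward(self, x):
        return self.alpha * x + self.body(x)

x = torch.randn(1, 32, 64, 64)
print(SoftGatedSkip(32)(x).shape)                  # torch.Size([1, 32, 64, 64])
```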
arXiv Detail & Related papers (2020-02-25T18:51:51Z)