Revisiting Stereo Depth Estimation From a Sequence-to-Sequence
Perspective with Transformers
- URL: http://arxiv.org/abs/2011.02910v4
- Date: Wed, 25 Aug 2021 18:35:45 GMT
- Title: Revisiting Stereo Depth Estimation From a Sequence-to-Sequence
Perspective with Transformers
- Authors: Zhaoshuo Li, Xingtong Liu, Nathan Drenkow, Andy Ding, Francis X.
Creighton, Russell H. Taylor, Mathias Unberath
- Abstract summary: Stereo depth estimation relies on optimal correspondence matching between pixels on epipolar lines in the left and right images to infer depth.
In this work, we revisit the problem from a sequence-to-sequence correspondence perspective to replace cost volume construction with dense pixel matching using position information and attention.
We report promising results on both synthetic and real-world datasets and demonstrate that STTR generalizes across different domains, even without fine-tuning.
- Score: 11.669086751865091
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Stereo depth estimation relies on optimal correspondence matching between
pixels on epipolar lines in the left and right images to infer depth. In this
work, we revisit the problem from a sequence-to-sequence correspondence
perspective to replace cost volume construction with dense pixel matching using
position information and attention. This approach, named STereo TRansformer
(STTR), has several advantages: It 1) relaxes the limitation of a fixed
disparity range, 2) identifies occluded regions and provides confidence
estimates, and 3) imposes uniqueness constraints during the matching process.
We report promising results on both synthetic and real-world datasets and
demonstrate that STTR generalizes across different domains, even without
fine-tuning.
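The attention-based dense matching described in the abstract can be illustrated with a minimal NumPy sketch. This is an illustration only, not the paper's implementation: STTR uses learned features, positional encodings, and an optimal-transport layer for the hard uniqueness constraint, while this sketch uses raw dot-product attention with a row-wise softmax as a soft stand-in.

```python
import numpy as np

def match_epipolar_line(feat_left, feat_right):
    """Attention-style dense matching of two feature sequences taken from
    the same epipolar line (one scanline per image), each of shape (W, C).
    Returns a soft disparity estimate and a confidence per left pixel."""
    w, c = feat_left.shape
    # Scaled dot-product attention scores between every left/right pixel pair.
    scores = feat_left @ feat_right.T / np.sqrt(c)          # (W, W)
    # Row-wise softmax: each left pixel distributes unit mass over right
    # pixels, a soft stand-in for the uniqueness constraint.
    attn = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)
    # The expected matched position gives a sub-pixel disparity; the peak
    # attention weight serves as a confidence estimate.
    matched = attn @ np.arange(w)
    disparity = np.arange(w) - matched
    confidence = attn.max(axis=1)
    return disparity, confidence

# Toy check: a shifted copy of the same features should recover the shift.
rng = np.random.default_rng(0)
right = rng.standard_normal((16, 256))
left = np.roll(right, 2, axis=0)   # left pixel i matches right pixel i-2
disp, conf = match_epipolar_line(left, right)
# Interior pixels recover disparity ~2; pixels 0-1 wrap because of np.roll.
```

Note that nothing here bounds the disparity in advance, which is the sense in which sequence-to-sequence matching relaxes the fixed disparity range of a cost volume.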
Related papers
- Quantity-Aware Coarse-to-Fine Correspondence for Image-to-Point Cloud
Registration [4.954184310509112]
Image-to-point cloud registration aims to determine the relative camera pose between an RGB image and a reference point cloud.
Matching individual points with pixels can be inherently ambiguous due to modality gaps.
We propose a framework to capture quantity-aware correspondences between local point sets and pixel patches.
arXiv Detail & Related papers (2023-07-14T03:55:54Z)
- Explicit Correspondence Matching for Generalizable Neural Radiance
Fields [49.49773108695526]
We present a new NeRF method that is able to generalize to new unseen scenarios and perform novel view synthesis with as few as two source views.
The explicit correspondence matching is quantified with the cosine similarity between image features sampled at the 2D projections of a 3D point on different views.
Our method achieves state-of-the-art results on different evaluation settings, with the experiments showing a strong correlation between our learned cosine feature similarity and volume density.
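The correspondence measure described above, cosine similarity between image features sampled from different views, reduces to a short NumPy sketch (feature arrays and shapes here are illustrative assumptions, not the paper's interface):

```python
import numpy as np

def cosine_similarity(f_a, f_b, eps=1e-8):
    """Cosine similarity between per-point feature vectors sampled from two
    views; f_a and f_b have shape (N, C) for N points with C-dim features."""
    num = (f_a * f_b).sum(axis=-1)
    den = np.linalg.norm(f_a, axis=-1) * np.linalg.norm(f_b, axis=-1) + eps
    return num / den

# Parallel features score 1, orthogonal features score 0.
a = np.array([[1.0, 0.0], [1.0, 1.0]])
b = np.array([[2.0, 0.0], [-1.0, 1.0]])
sim = cosine_similarity(a, b)
```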
arXiv Detail & Related papers (2023-04-24T17:46:01Z)
- Evaluation of a Canonical Image Representation for Sidescan Sonar [4.961559590556073]
Sidescan sonar (SSS) covers a wide range and provides photo-realistic, high-resolution images.
SSS projects the 3D seafloor to 2D images, which are distorted by the AUV's altitude, target's range and sensor's resolution.
In this paper, a canonical transformation method consisting of intensity correction and slant range correction is proposed to decrease the above distortion.
arXiv Detail & Related papers (2023-04-18T19:08:12Z)
- Rectifying homographies for stereo vision: analytical solution for
minimal distortion [0.0]
Rectification is used to simplify the subsequent stereo correspondence problem.
This work proposes a closed-form solution for the rectifying homographies that minimise perspective distortion.
arXiv Detail & Related papers (2022-02-28T22:35:47Z)
- LocalTrans: A Multiscale Local Transformer Network for Cross-Resolution
Homography Estimation [52.63874513999119]
Cross-resolution image alignment is a key problem in multiscale giga photography.
Existing deep homography methods neglect the explicit formulation of correspondences between the inputs, which leads to degraded accuracy in cross-resolution settings.
We propose a local transformer network embedded within a multiscale structure to explicitly learn correspondences between the multimodal inputs.
arXiv Detail & Related papers (2021-06-08T02:51:45Z)
- TFill: Image Completion via a Transformer-Based Architecture [69.62228639870114]
We propose treating image completion as a directionless sequence-to-sequence prediction task.
We employ a restrictive CNN with small, non-overlapping receptive fields (RF) for token representation.
In a second phase, to improve appearance consistency between visible and generated regions, a novel attention-aware layer (AAL) is introduced.
arXiv Detail & Related papers (2021-04-02T01:42:01Z)
- Fusion of Range and Stereo Data for High-Resolution Scene-Modeling [20.824550995195057]
This paper addresses the problem of range-stereo fusion, for the construction of high-resolution depth maps.
We combine low-resolution depth data with high-resolution stereo data, in a maximum a posteriori (MAP) formulation.
The accuracy of the method is not compromised, owing to three properties of the data-term in the energy function.
arXiv Detail & Related papers (2020-12-12T09:37:42Z)
- StereoGAN: Bridging Synthetic-to-Real Domain Gap by Joint Optimization
of Domain Translation and Stereo Matching [56.95846963856928]
Large-scale synthetic datasets are beneficial to stereo matching but usually introduce known domain bias.
We propose an end-to-end training framework with domain translation and stereo matching networks to tackle this challenge.
arXiv Detail & Related papers (2020-05-05T03:11:38Z)
- RANSAC-Flow: generic two-stage image alignment [53.11926395028508]
We show that a simple unsupervised two-stage approach performs surprisingly well, achieving competitive results across a range of tasks and datasets.
arXiv Detail & Related papers (2020-04-03T12:37:58Z)
- Deep Semantic Matching with Foreground Detection and Cycle-Consistency [103.22976097225457]
We address weakly supervised semantic matching based on a deep network.
We explicitly estimate the foreground regions to suppress the effect of background clutter.
We develop cycle-consistent losses to enforce the predicted transformations across multiple images to be geometrically plausible and consistent.
arXiv Detail & Related papers (2020-03-31T22:38:09Z)
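The cycle-consistency idea in the entry above can be sketched in a few lines. This is a simplified illustration, assuming each predicted transformation is a 3x3 homogeneous 2D matrix (the paper predicts richer per-image transformations); the principle is that composing the transformations around a cycle of images should return the identity.

```python
import numpy as np

def cycle_consistency_loss(T_ab, T_bc, T_ca):
    """Penalize deviation of the composed cycle A->B->C->A from identity.
    Each argument is a 3x3 homogeneous 2D transformation matrix."""
    cycle = T_ca @ T_bc @ T_ab
    return np.linalg.norm(cycle - np.eye(3))

def translation(tx, ty):
    """3x3 homogeneous 2D translation matrix (hypothetical helper)."""
    T = np.eye(3)
    T[0, 2], T[1, 2] = tx, ty
    return T

# A geometrically consistent cycle (translations summing to zero) has ~0 loss.
loss_ok = cycle_consistency_loss(
    translation(1, 0), translation(0, 2), translation(-1, -2))
# An inconsistent cycle leaves a residual transform and a positive loss.
loss_bad = cycle_consistency_loss(
    translation(1, 0), translation(0, 2), translation(0, 0))
```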
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.