WT-MVSNet: Window-based Transformers for Multi-view Stereo
- URL: http://arxiv.org/abs/2205.14319v1
- Date: Sat, 28 May 2022 03:32:09 GMT
- Title: WT-MVSNet: Window-based Transformers for Multi-view Stereo
- Authors: Jinli Liao, Yikang Ding, Yoli Shavit, Dihe Huang, Shihao Ren, Jia Guo,
Wensen Feng, Kai Zhang
- Abstract summary: We introduce a Window-based Epipolar Transformer (WET) which reduces matching redundancy by using epipolar constraints.
A second Shifted WT is employed to aggregate global information within the cost volume.
We present a novel Cost Transformer (CT) to replace 3D convolutions for cost volume regularization.
- Score: 12.25150988628149
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, Transformers have been shown to enhance the performance of multi-view
stereo by enabling long-range feature interaction. In this work, we propose
Window-based Transformers (WT) for local feature matching and global feature
aggregation in multi-view stereo. We introduce a Window-based Epipolar
Transformer (WET) which reduces matching redundancy by using epipolar
constraints. Since point-to-line matching is sensitive to erroneous camera poses
and calibration, we match windows near the epipolar lines. A second Shifted WT
is employed to aggregate global information within the cost volume. We present a
novel Cost Transformer (CT) to replace 3D convolutions for cost volume
regularization. In order to better constrain the estimated depth maps from
multiple views, we further design a novel geometric consistency loss (Geo Loss)
that penalizes unreliable areas where multi-view consistency is not satisfied.
Our WT multi-view stereo method (WT-MVSNet) achieves state-of-the-art
performance across multiple datasets and ranks $1^{st}$ on the Tanks and Temples
benchmark.
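
The Geo Loss is only named in the abstract, but geometric consistency in MVS is conventionally measured with a forward-backward reprojection check between the depth maps of neighboring views. The following is a minimal NumPy sketch of that building block, not the authors' implementation: the function names, the single shared intrinsic matrix `K`, the nearest-neighbour depth lookup, and the 1-pixel threshold are all illustrative assumptions.

```python
import numpy as np

def project(K, R, t, pts):
    """Project 3D points (N, 3) from one camera frame into another view's pixels."""
    p = pts @ R.T + t              # rigid transform into the target camera frame
    uv = p @ K.T                   # pinhole projection (sketch assumes z > 0 throughout)
    return uv[:, :2] / uv[:, 2:3]

def backproject(K, depth, us, vs):
    """Lift pixels (us, vs) with per-pixel depths into 3D camera coordinates."""
    ones = np.ones_like(us)
    rays = np.stack([us, vs, ones], axis=-1) @ np.linalg.inv(K).T
    return rays * depth[:, None]

def consistency_mask(K, R, t, depth_ref, depth_src, pix_thresh=1.0):
    """Flag reference pixels whose depth is consistent with a source view.

    A pixel passes if, after projecting into the source view, reading the
    source depth there, and projecting back, it lands within `pix_thresh`
    pixels of where it started (forward-backward reprojection error).
    (R, t) maps reference-camera coordinates to source-camera coordinates.
    """
    h, w = depth_ref.shape
    vs, us = np.mgrid[0:h, 0:w]
    us = us.ravel().astype(np.float64)
    vs = vs.ravel().astype(np.float64)

    # Reference pixel -> 3D point -> source view.
    pts_ref = backproject(K, depth_ref.ravel(), us, vs)
    uv_src = project(K, R, t, pts_ref)

    # Read the source depth at the projected (rounded) locations.
    u_s = np.clip(np.round(uv_src[:, 0]).astype(int), 0, w - 1)
    v_s = np.clip(np.round(uv_src[:, 1]).astype(int), 0, h - 1)
    d_src = depth_src[v_s, u_s]

    # Source pixel -> 3D point -> back into the reference view.
    pts_src = backproject(K, d_src, uv_src[:, 0], uv_src[:, 1])
    uv_back = project(K, R.T, -R.T @ t, pts_src)  # inverse rigid transform

    err = np.linalg.norm(uv_back - np.stack([us, vs], axis=-1), axis=-1)
    return (err < pix_thresh).reshape(h, w)
```

A loss built on such a check would presumably use a soft version of the reprojection error, down-weighting or penalizing pixels where the views disagree, rather than the hard mask shown here.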
Related papers
- CT-MVSNet: Efficient Multi-View Stereo with Cross-scale Transformer [8.962657021133925]
Cross-scale transformer (CT) processes feature representations at different stages without additional computation.
We introduce an adaptive matching-aware transformer (AMT) that employs different interactive attention combinations at multiple scales.
We also present a dual-feature guided aggregation (DFGA) that embeds the coarse global semantic information into the finer cost volume construction.
arXiv Detail & Related papers (2023-12-14T01:33:18Z)
- TransY-Net: Learning Fully Transformer Networks for Change Detection of Remote Sensing Images [64.63004710817239]
We propose a novel Transformer-based learning framework named TransY-Net for remote sensing image CD.
It improves the feature extraction from a global view and combines multi-level visual features in a pyramid manner.
Our proposed method achieves a new state-of-the-art performance on four optical and two SAR image CD benchmarks.
arXiv Detail & Related papers (2023-10-22T07:42:19Z)
- DSVT: Dynamic Sparse Voxel Transformer with Rotated Sets [95.84755169585492]
We present Dynamic Sparse Voxel Transformer (DSVT), a single-stride window-based voxel Transformer backbone for outdoor 3D perception.
Our model achieves state-of-the-art performance with a broad range of 3D perception tasks.
arXiv Detail & Related papers (2023-01-15T09:31:58Z)
- Hierarchical Local-Global Transformer for Temporal Sentence Grounding [58.247592985849124]
This paper studies the multimedia problem of temporal sentence grounding.
It aims to accurately determine the specific video segment in an untrimmed video according to a given sentence query.
arXiv Detail & Related papers (2022-08-31T14:16:56Z)
- Deep Laparoscopic Stereo Matching with Transformers [46.18206008056612]
The self-attention mechanism, successfully employed in the transformer architecture, has shown promise in many computer vision tasks.
We propose a new hybrid deep stereo matching framework (HybridStereoNet) that combines the best of the CNN and the transformer in a unified design.
arXiv Detail & Related papers (2022-07-25T12:54:32Z)
- TransVG++: End-to-End Visual Grounding with Language Conditioned Vision Transformer [188.00681648113223]
We explore neat yet effective Transformer-based frameworks for visual grounding.
TransVG establishes multi-modal correspondences by Transformers and localizes referred regions by directly regressing box coordinates.
We upgrade our framework to a purely Transformer-based one by leveraging Vision Transformer (ViT) for vision feature encoding.
arXiv Detail & Related papers (2022-06-14T06:27:38Z)
- Multiview Stereo with Cascaded Epipolar RAFT [73.7619703879639]
We address multiview stereo (MVS), an important 3D vision task that reconstructs a 3D model such as a dense point cloud from multiple calibrated images.
We propose CER-MVS, a new approach based on the RAFT (Recurrent All-Pairs Field Transforms) architecture developed for optical flow. CER-MVS introduces five new changes to RAFT: epipolar cost volumes, cost volume cascading, multiview fusion of cost volumes, dynamic supervision, and multiresolution fusion of depth maps.
arXiv Detail & Related papers (2022-05-09T18:17:05Z)
- Multi-View Stereo with Transformer [31.83069394719813]
This paper proposes a network, referred to as MVSTR, for Multi-View Stereo (MVS).
It is built upon the Transformer architecture and is capable of extracting dense features with global context and 3D consistency.
Experimental results show that the proposed MVSTR achieves the best overall performance on the DTU dataset and strong generalization on the Tanks & Temples benchmark dataset.
arXiv Detail & Related papers (2021-12-01T08:06:59Z)
- VoRTX: Volumetric 3D Reconstruction With Transformers for Voxelwise View Selection and Fusion [68.68537312256144]
VoRTX is an end-to-end volumetric 3D reconstruction network using transformers for wide-baseline, multi-view feature fusion.
We train our model on ScanNet and show that it produces better reconstructions than state-of-the-art methods.
arXiv Detail & Related papers (2021-12-01T02:18:11Z)
- Multiview Detection with Shadow Transformer (and View-Coherent Data Augmentation) [25.598840284457548]
We propose a novel multiview detector, MVDeTr, that adopts a shadow transformer to aggregate multiview information.
Unlike convolutions, shadow transformer attends differently at different positions and cameras to deal with various shadow-like distortions.
We report new state-of-the-art accuracy with the proposed system.
arXiv Detail & Related papers (2021-08-12T17:59:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.