Multi-View Stereo with Transformer
- URL: http://arxiv.org/abs/2112.00336v1
- Date: Wed, 1 Dec 2021 08:06:59 GMT
- Title: Multi-View Stereo with Transformer
- Authors: Jie Zhu, Bo Peng, Wanqing Li, Haifeng Shen, Zhe Zhang, Jianjun Lei
- Abstract summary: This paper proposes a network, referred to as MVSTR, for Multi-View Stereo (MVS).
It is built upon the Transformer and is capable of extracting dense features with global context and 3D consistency.
Experimental results show that the proposed MVSTR achieves the best overall performance on the DTU dataset and strong generalization on the Tanks & Temples benchmark dataset.
- Score: 31.83069394719813
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper proposes a network, referred to as MVSTR, for Multi-View Stereo (MVS). It is built upon the Transformer and is capable of extracting dense features with global context and 3D consistency, which are crucial to achieving reliable matching for MVS. Specifically, to tackle the problem of the limited receptive field of existing CNN-based MVS methods, a global-context Transformer module is first proposed to explore intra-view global context. In addition, to further enable dense features to be 3D-consistent, a 3D-geometry Transformer module is built with a well-designed cross-view attention mechanism to facilitate inter-view information interaction. Experimental results show that the proposed MVSTR achieves the best overall performance on the DTU dataset and strong generalization on the Tanks & Temples benchmark dataset.
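The cross-view attention idea above can be illustrated with a minimal PyTorch sketch in which flattened reference-view features query one source view's features. This is a sketch under assumed shapes and module names; the paper's 3D-geometry Transformer is more elaborate and handles multiple source views.

```python
import torch
import torch.nn as nn

class CrossViewAttention(nn.Module):
    """Illustrative cross-view attention: reference-view tokens query
    source-view tokens so dense features pick up inter-view information.
    (A sketch, not MVSTR's actual module.)"""

    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, ref_tokens, src_tokens):
        # ref_tokens: (B, H*W, C) flattened reference-view feature map
        # src_tokens: (B, H*W, C) flattened source-view feature map
        fused, _ = self.attn(query=ref_tokens, key=src_tokens, value=src_tokens)
        return self.norm(ref_tokens + fused)  # residual + norm

# Toy usage: fuse one source view into the reference view's features.
B, H, W, C = 1, 32, 40, 64
ref, src = torch.randn(B, H * W, C), torch.randn(B, H * W, C)
print(CrossViewAttention(dim=C)(ref, src).shape)  # torch.Size([1, 1280, 64])
```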
Related papers
- MsSVT++: Mixed-scale Sparse Voxel Transformer with Center Voting for 3D Object Detection [19.8309983660935]
MsSVT++ is an innovative Mixed-scale Sparse Voxel Transformer.
It simultaneously captures both long-range scene context and fine-grained object detail through a divide-and-conquer approach, as sketched below.
MsSVT++ consistently delivers exceptional performance across diverse datasets.
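A minimal dense analogue of the mixed-scale idea, assuming a 2D feature map for simplicity: two attention branches attend within windows of different sizes (fine detail vs. long-range context) and their outputs are fused. The real MsSVT++ operates on sparse 3D voxels and assigns scales to groups of attention heads; the window sizes, widths, and 1x1-conv fusion here are illustrative assumptions.

```python
import torch
import torch.nn as nn

def window_attention(x, attn, win):
    """Self-attention within non-overlapping win x win windows.
    x: (B, C, H, W) with H and W divisible by win."""
    B, C, H, W = x.shape
    t = (x.reshape(B, C, H // win, win, W // win, win)
          .permute(0, 2, 4, 3, 5, 1)              # (B, Hw, Ww, win, win, C)
          .reshape(-1, win * win, C))             # one window per batch row
    out, _ = attn(t, t, t)
    return (out.reshape(B, H // win, W // win, win, win, C)
               .permute(0, 5, 1, 3, 2, 4)
               .reshape(B, C, H, W))

class MixedScaleBlock(nn.Module):
    """Sketch: a small-window branch captures fine detail, a large-window
    branch captures context; a 1x1 conv fuses the two."""

    def __init__(self, dim=32, heads=4):
        super().__init__()
        self.fine = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.coarse = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Conv2d(2 * dim, dim, kernel_size=1)

    def forward(self, x):
        a = window_attention(x, self.fine, win=4)     # fine-grained detail
        b = window_attention(x, self.coarse, win=16)  # long-range context
        return self.fuse(torch.cat([a, b], dim=1))

x = torch.randn(1, 32, 32, 32)           # H, W divisible by both window sizes
print(MixedScaleBlock(dim=32)(x).shape)  # torch.Size([1, 32, 32, 32])
```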
arXiv Detail & Related papers (2024-01-22T06:42:23Z)
- MVSFormer++: Revealing the Devil in Transformer's Details for Multi-View Stereo [60.75684891484619]
We introduce MVSFormer++, a method that exploits the inherent characteristics of attention to enhance various components of the MVS pipeline.
We employ different attention mechanisms for the feature encoder and cost volume regularization, focusing on feature and spatial aggregations respectively.
Comprehensive experiments on DTU, Tanks-and-Temples, BlendedMVS, and ETH3D validate the effectiveness of the proposed method.
arXiv Detail & Related papers (2024-01-22T03:22:49Z)
- CT-MVSNet: Efficient Multi-View Stereo with Cross-scale Transformer [8.962657021133925]
Cross-scale transformer (CT) processes feature representations at different stages without additional computation.
We introduce an adaptive matching-aware transformer (AMT) that employs different interactive attention combinations at multiple scales.
We also present a dual-feature guided aggregation (DFGA) that embeds the coarse global semantic information into the finer cost volume construction.
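One plausible reading of dual-feature guidance, sketched below: coarse-stage semantics are upsampled and injected into the finer stage's features before its cost volume is built. The module name, channel sizes, and concat-plus-conv fusion are assumptions, not CT-MVSNet's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualFeatureGuide(nn.Module):
    """Hypothetical sketch: fuse upsampled coarse global semantics into
    fine-stage features that will feed the finer cost volume."""

    def __init__(self, coarse_ch=64, fine_ch=32):
        super().__init__()
        self.fuse = nn.Conv2d(coarse_ch + fine_ch, fine_ch, kernel_size=3, padding=1)

    def forward(self, coarse_feat, fine_feat):
        # coarse_feat: (B, coarse_ch, H/2, W/2); fine_feat: (B, fine_ch, H, W)
        up = F.interpolate(coarse_feat, size=fine_feat.shape[-2:],
                           mode="bilinear", align_corners=False)
        return self.fuse(torch.cat([up, fine_feat], dim=1))
```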
arXiv Detail & Related papers (2023-12-14T01:33:18Z)
- SIM-Trans: Structure Information Modeling Transformer for Fine-grained Visual Categorization [59.732036564862796]
We propose the Structure Information Modeling Transformer (SIM-Trans) to incorporate object structure information into transformer for enhancing discriminative representation learning.
The two proposed modules are lightweight, can be plugged into any transformer network, and are easily trained end-to-end.
Experiments and analyses demonstrate that the proposed SIM-Trans achieves state-of-the-art performance on fine-grained visual categorization benchmarks.
arXiv Detail & Related papers (2022-08-31T03:00:07Z)
- TransCMD: Cross-Modal Decoder Equipped with Transformer for RGB-D Salient Object Detection [86.94578023985677]
In this work, we rethink this task from the perspective of global information alignment and transformation.
Specifically, the proposed method (TransCMD) cascades several cross-modal integration units to construct a top-down transformer-based information propagation path.
Experimental results on seven RGB-D SOD benchmark datasets demonstrate that a simple two-stream encoder-decoder framework can surpass state-of-the-art purely CNN-based methods.
arXiv Detail & Related papers (2021-12-04T15:45:34Z)
- Transformer-based Network for RGB-D Saliency Detection [82.6665619584628]
Key to RGB-D saliency detection is to fully mine and fuse information at multiple scales across the two modalities.
We show that the transformer is a uniform operation that is highly effective in both feature fusion and feature enhancement.
Our proposed network performs favorably against state-of-the-art RGB-D saliency detection methods.
arXiv Detail & Related papers (2021-12-01T15:53:58Z)
- VoRTX: Volumetric 3D Reconstruction With Transformers for Voxelwise View Selection and Fusion [68.68537312256144]
VoRTX is an end-to-end volumetric 3D reconstruction network using transformers for wide-baseline, multi-view feature fusion.
We train our model on ScanNet and show that it produces better reconstructions than state-of-the-art methods.
arXiv Detail & Related papers (2021-12-01T02:18:11Z)
- TransMVSNet: Global Context-aware Multi-view Stereo Network with Transformers [6.205844084751411]
We present TransMVSNet, based on our exploration of feature matching in multi-view stereo (MVS).
We propose a powerful Feature Matching Transformer (FMT) to leverage intra- (self-) and inter- (cross-) attention to aggregate long-range context information.
Our method achieves state-of-the-art performance on DTU dataset, Tanks and Temples benchmark, and BlendedMVS dataset.
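The intra-/inter-attention pattern can be illustrated as a block that alternates self-attention within a view and cross-attention into another view. This is a plain-attention sketch with assumed sizes; the paper's FMT is more elaborate (e.g., efficiency-oriented attention inside a coarse-to-fine pipeline).

```python
import torch
import torch.nn as nn

class FMTStyleBlock(nn.Module):
    """Sketch of one matching layer: intra-view (self-) attention
    followed by inter-view (cross-) attention, with residuals."""

    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, ref, src):
        # ref, src: (B, N, dim) flattened reference / source view features
        ref = self.norm1(ref + self.self_attn(ref, ref, ref)[0])   # intra-view
        ref = self.norm2(ref + self.cross_attn(ref, src, src)[0])  # inter-view
        return ref
```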
arXiv Detail & Related papers (2021-11-29T15:31:49Z)
- AA-RMVSNet: Adaptive Aggregation Recurrent Multi-view Stereo Network [8.127449025802436]
We present a novel recurrent multi-view stereo network based on long short-term memory (LSTM) with adaptive aggregation, namely AA-RMVSNet.
We first introduce an intra-view aggregation module to adaptively extract image features by using context-aware convolution and multi-scale aggregation.
We propose an inter-view cost volume aggregation module for adaptive pixel-wise view aggregation, which is able to preserve better-matched pairs among all views.
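Adaptive pixel-wise view aggregation can be pictured as learning a per-view, per-pixel weight that replaces a fixed mean or variance when merging per-view matching costs, so better-matched views dominate. The shapes and the 1x1 Conv3d weight predictor below are illustrative assumptions, not AA-RMVSNet's exact module.

```python
import torch
import torch.nn as nn

class AdaptiveViewAggregation(nn.Module):
    """Sketch: softmax-normalized, pixel-wise weights over source views."""

    def __init__(self, feat_ch=32):
        super().__init__()
        # predicts one weight per pixel/depth sample from each view's cost slice
        self.weight_net = nn.Conv3d(feat_ch, 1, kernel_size=1)

    def forward(self, per_view_costs):
        # per_view_costs: (B, V, C, D, H, W), one cost volume per source view
        B, V, C, D, H, W = per_view_costs.shape
        w = self.weight_net(per_view_costs.reshape(B * V, C, D, H, W))
        w = torch.softmax(w.reshape(B, V, 1, D, H, W), dim=1)  # over views
        return (w * per_view_costs).sum(dim=1)                 # (B, C, D, H, W)
```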
arXiv Detail & Related papers (2021-08-09T06:10:48Z)
- TransVG: End-to-End Visual Grounding with Transformers [102.11922622103613]
We present a transformer-based framework for visual grounding, namely TransVG, to address the task of grounding a language query to an image.
We show that complex fusion modules can be replaced by a simple stack of transformer encoder layers, achieving higher performance.
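That simplification can be sketched as stacking a learnable [REG] token with visual and text tokens, running plain transformer encoder layers, and regressing the box from the output [REG] token. The depth, width, and linear box head below are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SimpleGroundingFusion(nn.Module):
    """Sketch of TransVG-style fusion with a learnable [REG] token."""

    def __init__(self, dim=256, heads=8, layers=4):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)
        self.reg_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.box_head = nn.Linear(dim, 4)  # (cx, cy, w, h), normalized

    def forward(self, vis_tokens, txt_tokens):
        # vis_tokens: (B, Nv, dim); txt_tokens: (B, Nt, dim)
        reg = self.reg_token.expand(vis_tokens.size(0), -1, -1)
        x = self.encoder(torch.cat([reg, vis_tokens, txt_tokens], dim=1))
        return self.box_head(x[:, 0]).sigmoid()  # box from the [REG] token
```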
arXiv Detail & Related papers (2021-04-17T13:35:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.