MVSFormer++: Revealing the Devil in Transformer's Details for Multi-View Stereo
- URL: http://arxiv.org/abs/2401.11673v1
- Date: Mon, 22 Jan 2024 03:22:49 GMT
- Title: MVSFormer++: Revealing the Devil in Transformer's Details for Multi-View Stereo
- Authors: Chenjie Cao, Xinlin Ren, Yanwei Fu
- Abstract summary: We introduce MVSFormer++, a method that maximizes the inherent characteristics of attention to enhance various components of the MVS pipeline.
We employ different attention mechanisms for the feature encoder and cost volume regularization, focusing on feature and spatial aggregations respectively.
Comprehensive experiments on DTU, Tanks-and-Temples, BlendedMVS, and ETH3D validate the effectiveness of the proposed method.
- Score: 60.75684891484619
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advancements in learning-based Multi-View Stereo (MVS) methods have
prominently featured transformer-based models with attention mechanisms.
However, existing approaches have not thoroughly investigated the profound
influence of transformers on different MVS modules, resulting in limited depth
estimation capabilities. In this paper, we introduce MVSFormer++, a method that
prudently maximizes the inherent characteristics of attention to enhance
various components of the MVS pipeline. Formally, our approach involves
infusing cross-view information into the pre-trained DINOv2 model to facilitate
MVS learning. Furthermore, we employ different attention mechanisms for the
feature encoder and cost volume regularization, focusing on feature and spatial
aggregations respectively. Additionally, we uncover that some design details
would substantially impact the performance of transformer modules in MVS,
including normalized 3D positional encoding, adaptive attention scaling, and
the position of layer normalization. Comprehensive experiments on DTU,
Tanks-and-Temples, BlendedMVS, and ETH3D validate the effectiveness of the
proposed method. Notably, MVSFormer++ achieves state-of-the-art performance on
the challenging DTU and Tanks-and-Temples benchmarks.
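To make the abstract's design details more concrete, below is a minimal, hypothetical sketch of two of them: normalized 3D positional encoding and adaptive attention scaling. All function names, tensor shapes, and formulas are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch: normalized 3D positional encoding and adaptive attention
# scaling as the abstract describes them at a high level. Names, shapes,
# and formulas are assumptions for illustration only.
import math
import torch
import torch.nn.functional as F

def normalized_3d_pos_enc(coords, num_freqs=8):
    """Sinusoidal encoding of 3D voxel coordinates pre-normalized to [0, 1].

    coords: (N, 3) tensor of (x, y, depth) positions in a cost volume.
    Per-axis normalization keeps the encoding range consistent across
    cost volumes of different spatial and depth resolutions.
    """
    coords = (coords - coords.amin(dim=0)) / (
        coords.amax(dim=0) - coords.amin(dim=0) + 1e-6
    )
    freqs = (2.0 ** torch.arange(num_freqs)) * math.pi             # (F,)
    angles = coords[..., None] * freqs                             # (N, 3, F)
    return torch.cat([angles.sin(), angles.cos()], -1).flatten(1)  # (N, 6F)

def adaptive_scaled_attention(q, k, v, log_tau):
    """Dot-product attention with a learnable log-temperature log_tau in
    place of the fixed 1/sqrt(d) factor -- one plausible reading of
    'adaptive attention scaling'."""
    scale = torch.exp(log_tau) / math.sqrt(q.shape[-1])
    attn = F.softmax((q @ k.transpose(-2, -1)) * scale, dim=-1)
    return attn @ v

# Example (hypothetical): encode a 4x4x4 grid of cost-volume coordinates.
grid = torch.stack(torch.meshgrid(
    torch.arange(4.0), torch.arange(4.0), torch.arange(4.0),
    indexing="ij"), dim=-1).reshape(-1, 3)
pe = normalized_3d_pos_enc(grid)  # (64, 48)
```

One plausible motivation for the "normalized" qualifier is exactly what the sketch shows: normalizing coordinates before the sinusoidal encoding keeps the embedding range stable across volumes of different resolutions.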
Related papers
- MA-AVT: Modality Alignment for Parameter-Efficient Audio-Visual Transformers [41.54004590821323]
We propose MA-AVT, a new parameter-efficient audio-visual transformer employing deep modality alignment for multimodal semantic features.
Specifically, we introduce joint unimodal and multimodal token learning for aligning the two modalities with a frozen modality-shared transformer.
Unlike prior work that only aligns coarse features from the output of unimodal encoders, we introduce blockwise contrastive learning to align coarse-to-fine-grain hierarchical features.
arXiv Detail & Related papers (2024-06-07T13:35:44Z)
- SMPLer: Taming Transformers for Monocular 3D Human Shape and Pose Estimation [74.07836010698801]
We propose SMPLer, an SMPL-based Transformer framework for monocular 3D human shape and pose estimation.
SMPLer incorporates two key ingredients: a decoupled attention operation and an SMPL-based target representation.
Extensive experiments demonstrate the effectiveness of SMPLer against existing 3D human shape and pose estimation methods.
arXiv Detail & Related papers (2024-04-23T17:59:59Z)
- CT-MVSNet: Efficient Multi-View Stereo with Cross-scale Transformer [8.962657021133925]
The cross-scale transformer (CT) processes feature representations at different stages without additional computation.
We introduce an adaptive matching-aware transformer (AMT) that employs different interactive attention combinations at multiple scales.
We also present a dual-feature guided aggregation (DFGA) that embeds the coarse global semantic information into the finer cost volume construction.
arXiv Detail & Related papers (2023-12-14T01:33:18Z)
- MMViT: Multiscale Multiview Vision Transformers [36.93551299085767]
We present Multiscale Multiview Vision Transformers (MMViT), which introduces multiscale feature maps and multiview encodings to transformer models.
Our model encodes different views of the input signal and builds several channel-resolution feature stages to process the multiple views of the input at different resolutions in parallel.
We demonstrate the effectiveness of MMViT on audio and image classification tasks, achieving state-of-the-art results.
arXiv Detail & Related papers (2023-04-28T21:51:41Z)
- Demystify Transformers & Convolutions in Modern Image Deep Networks [82.32018252867277]
This paper aims to identify the real gains of popular convolution and attention operators through a detailed study.
We find that the key difference among these feature transformation modules, such as attention or convolution, lies in their spatial feature aggregation approach.
Our experiments on various tasks and an analysis of inductive bias show a significant performance boost due to advanced network-level and block-level designs.
arXiv Detail & Related papers (2022-11-10T18:59:43Z)
- SSMTL++: Revisiting Self-Supervised Multi-Task Learning for Video Anomaly Detection [108.57862846523858]
We revisit the self-supervised multi-task learning framework, proposing several updates to the original method.
We modernize the 3D convolutional backbone by introducing multi-head self-attention modules.
In our attempt to further improve the model, we study additional self-supervised learning tasks, such as predicting segmentation maps.
arXiv Detail & Related papers (2022-07-16T19:25:41Z)
- Multiview Stereo with Cascaded Epipolar RAFT [73.7619703879639]
We address multiview stereo (MVS), an important 3D vision task that reconstructs a 3D model such as a dense point cloud from multiple calibrated images.
We propose CER-MVS, a new approach based on the RAFT (Recurrent All-Pairs Field Transforms) architecture developed for optical flow. CER-MVS introduces five new changes to RAFT: epipolar cost volumes, cost volume cascading, multiview fusion of cost volumes, dynamic supervision, and multiresolution fusion of depth maps.
arXiv Detail & Related papers (2022-05-09T18:17:05Z)
- Multi-View Stereo with Transformer [31.83069394719813]
This paper proposes a network, referred to as MVSTR, for Multi-View Stereo (MVS).
It is built upon the Transformer architecture and is capable of extracting dense features with global context and 3D consistency.
Experimental results show that the proposed MVSTR achieves the best overall performance on the DTU dataset and strong generalization on the Tanks & Temples benchmark dataset.
arXiv Detail & Related papers (2021-12-01T08:06:59Z)
- TransMVSNet: Global Context-aware Multi-view Stereo Network with Transformers [6.205844084751411]
We present TransMVSNet, based on our exploration of feature matching in multi-view stereo (MVS).
We propose a powerful Feature Matching Transformer (FMT) to leverage intra- (self-) and inter- (cross-) attention to aggregate long-range context information (a generic sketch of this self/cross-attention pattern follows this list).
Our method achieves state-of-the-art performance on DTU dataset, Tanks and Temples benchmark, and BlendedMVS dataset.
arXiv Detail & Related papers (2021-11-29T15:31:49Z)
- Digging into Uncertainty in Self-supervised Multi-view Stereo [57.04768354383339]
We propose a novel Uncertainty reduction Multi-view Stereo (UMVS) framework for self-supervised learning.
Our framework achieves the best performance among unsupervised MVS methods, with performance competitive with its supervised counterparts.
arXiv Detail & Related papers (2021-08-30T02:53:08Z)
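As referenced in the TransMVSNet entry above, a recurring pattern in MVS transformers alternates intra- (self-) attention within the reference view and inter- (cross-) attention toward a source view. The sketch below is a generic, hypothetical illustration of that pattern; the dimensions, pre-norm placement, and interleaving schedule are assumptions, not the paper's actual architecture.

```python
# Hedged sketch of alternating intra- (self-) and inter- (cross-) attention.
# A generic illustration of the pattern only; all sizes are assumptions.
import torch
import torch.nn as nn

class SelfCrossBlock(nn.Module):
    """Self-attention within the reference view's features, followed by
    cross-attention from the reference to a source view."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, ref, src):
        # Intra-attention: aggregate long-range context inside the reference.
        r = self.norm1(ref)
        ref = ref + self.self_attn(r, r, r)[0]
        # Inter-attention: reference tokens query the source view's tokens.
        ref = ref + self.cross_attn(self.norm2(ref), src, src)[0]
        return ref

# Example usage (hypothetical shapes): tokens are flattened image features.
block = SelfCrossBlock(dim=64, heads=4)
ref = torch.rand(2, 1024, 64)   # (batch, tokens, dim) reference view
src = torch.rand(2, 1024, 64)   # source view
out = block(ref, src)           # (2, 1024, 64)
```

Residual connections around each attention pass, as above, are the standard way to let such blocks be stacked without degrading the original features.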
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.