MVSTER: Epipolar Transformer for Efficient Multi-View Stereo
- URL: http://arxiv.org/abs/2204.07346v1
- Date: Fri, 15 Apr 2022 06:47:57 GMT
- Title: MVSTER: Epipolar Transformer for Efficient Multi-View Stereo
- Authors: Xiaofeng Wang, Zheng Zhu, Fangbo Qin, Yun Ye, Guan Huang, Xu Chi,
Yijia He and Xingang Wang
- Abstract summary: Learning-based Multi-View Stereo methods warp source images into 3D volumes, which are fused as a cost volume to be regularized by subsequent networks.
Previous methods utilize extra networks to learn 2D information as fusing cues, underusing 3D spatial correlations and bringing additional computation costs.
We present MVSTER, which leverages the proposed epipolar Transformer to learn both 2D semantics and 3D spatial associations efficiently.
- Score: 26.640495084316925
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Learning-based Multi-View Stereo (MVS) methods warp source images into the
reference camera frustum to form 3D volumes, which are fused as a cost volume
to be regularized by subsequent networks. The fusing step plays a vital role in
bridging 2D semantics and 3D spatial associations. However, previous methods
utilize extra networks to learn 2D information as fusing cues, underusing 3D
spatial correlations and bringing additional computation costs. Therefore, we
present MVSTER, which leverages the proposed epipolar Transformer to learn both
2D semantics and 3D spatial associations efficiently. Specifically, the
epipolar Transformer utilizes a detachable monocular depth estimator to enhance
2D semantics and uses cross-attention to construct data-dependent 3D
associations along the epipolar line. Additionally, MVSTER is built in a cascade
structure, where entropy-regularized optimal transport is leveraged to
propagate finer depth estimations in each stage. Extensive experiments show
MVSTER achieves state-of-the-art reconstruction performance with significantly
higher efficiency: Compared with MVSNet and CasMVSNet, our MVSTER achieves 34%
and 14% relative improvements on the DTU benchmark, with 80% and 51% relative
reductions in running time. MVSTER also ranks first on Tanks&Temples-Advanced
among all published works. Code is released at https://github.com/JeffWang987.
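
The warping step described in the abstract can be made concrete with a short sketch. Below is a generic, MVSNet-style differentiable homography warp that maps source-view features onto fronto-parallel depth planes of the reference frustum; the per-view volumes it produces are what get fused into a cost volume. Function names, matrix conventions, and the bilinear sampler are illustrative assumptions, not MVSTER's actual implementation.

```python
import torch
import torch.nn.functional as F

def homo_warp(src_feat, src_proj, ref_proj, depth_values):
    """Warp source-view features onto depth planes of the reference frustum.

    src_feat:     (B, C, H, W) source-view feature map
    src_proj:     (B, 4, 4) source projection matrix (intrinsics @ extrinsics)
    ref_proj:     (B, 4, 4) reference projection matrix
    depth_values: (B, D) depth hypotheses per sample
    returns:      (B, C, D, H, W) warped feature volume
    """
    B, C, H, W = src_feat.shape
    D = depth_values.shape[1]

    # Relative transform taking reference pixels to source pixels.
    proj = src_proj @ torch.inverse(ref_proj)            # (B, 4, 4)
    rot, trans = proj[:, :3, :3], proj[:, :3, 3:4]       # (B, 3, 3), (B, 3, 1)

    # Homogeneous pixel grid of the reference view.
    y, x = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                          torch.arange(W, dtype=torch.float32), indexing="ij")
    pix = torch.stack([x, y, torch.ones_like(x)], dim=0).view(3, -1)  # (3, H*W)
    pix = pix.unsqueeze(0).expand(B, -1, -1)                          # (B, 3, H*W)

    # Back-project each pixel to every depth hypothesis, then project into the source view.
    cam = (rot @ pix).unsqueeze(2) * depth_values.view(B, 1, D, 1)    # (B, 3, D, H*W)
    cam = cam + trans.unsqueeze(3)
    xy = cam[:, :2] / cam[:, 2:3].clamp(min=1e-6)                     # (B, 2, D, H*W)

    # Normalize to [-1, 1] and sample source features bilinearly.
    gx = 2.0 * xy[:, 0] / (W - 1) - 1.0
    gy = 2.0 * xy[:, 1] / (H - 1) - 1.0
    grid = torch.stack([gx, gy], dim=-1).view(B, D * H, W, 2)
    warped = F.grid_sample(src_feat, grid, mode="bilinear",
                           padding_mode="zeros", align_corners=True)
    return warped.view(B, C, D, H, W)
```

Repeating this warp for every source view yields the per-view 3D volumes the abstract refers to; how they are fused is where the epipolar Transformer comes in.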
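The cross-attention along the epipolar line can likewise be sketched briefly: the reference feature at a pixel serves as the query, and the D warped source samples for that pixel, which lie on its epipolar line in the source view, serve as keys, so the attention weights give a data-dependent reweighting of the source volume before cost aggregation. The module name, the 1x1 projections, and the fusion rule below are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class EpipolarCrossAttention(nn.Module):
    """Per-pixel cross-attention over the D epipolar samples of a warped source volume."""

    def __init__(self, channels):
        super().__init__()
        self.to_q = nn.Conv2d(channels, channels, kernel_size=1)  # queries from reference features
        self.to_k = nn.Conv3d(channels, channels, kernel_size=1)  # keys from the warped volume
        self.scale = channels ** -0.5

    def forward(self, ref_feat, warped_vol):
        """
        ref_feat:   (B, C, H, W)    reference-view features
        warped_vol: (B, C, D, H, W) source features warped to D depth hypotheses
        returns attention weights (B, D, H, W) and the reweighted volume (B, C, D, H, W)
        """
        q = self.to_q(ref_feat)         # (B, C, H, W)
        k = self.to_k(warped_vol)       # (B, C, D, H, W)

        # Dot-product similarity between each query pixel and its D epipolar samples.
        attn = (q.unsqueeze(2) * k).sum(dim=1) * self.scale  # (B, D, H, W)
        attn = attn.softmax(dim=1)

        # Data-dependent reweighting of the source volume before fusion.
        fused = warped_vol * attn.unsqueeze(1)
        return attn, fused
```

In a multi-view setting, the reweighted volumes from all source views could then be averaged into a single cost volume for regularization, as in standard MVS pipelines.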
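The entropy-regularized optimal transport used in the cascade can be illustrated with plain Sinkhorn iterations. The sketch below matches a predicted distribution over D depth hypotheses to a target distribution and uses the resulting transport cost as a loss; the marginals, cost matrix, regularization strength, and iteration count are illustrative choices, since the abstract does not spell out how MVSTER formulates its transport problem.

```python
import torch

def sinkhorn(cost, mu, nu, eps=0.1, n_iters=50):
    """Entropy-regularized optimal transport via Sinkhorn iterations.

    cost: (N, M) transport cost between N source bins and M target bins
    mu:   (N,)   source marginal (e.g., a predicted depth distribution)
    nu:   (M,)   target marginal (e.g., a one-hot ground-truth depth bin)
    returns the (N, M) transport plan.
    """
    K = torch.exp(-cost / eps)                 # Gibbs kernel
    u, v = torch.ones_like(mu), torch.ones_like(nu)
    for _ in range(n_iters):                   # alternating marginal projections
        u = mu / (K @ v).clamp(min=1e-9)
        v = nu / (K.t() @ u).clamp(min=1e-9)
    return u.unsqueeze(1) * K * v.unsqueeze(0)

# Illustrative use for one pixel: mass placed on hypotheses far from the
# ground-truth bin is penalized in proportion to their distance.
D = 8
hyp = torch.linspace(0.0, 1.0, D)                     # normalized depth hypotheses
cost = (hyp.unsqueeze(1) - hyp.unsqueeze(0)).abs()    # (D, D) distance between bins
pred = torch.softmax(torch.randn(D), dim=0)           # predicted depth distribution
target = torch.zeros(D)
target[3] = 1.0                                       # ground-truth depth bin
plan = sinkhorn(cost, pred, target)
ot_loss = (plan * cost).sum()                         # transport cost as a training signal
```

Unlike a plain cross-entropy on the winning bin, this objective stays informative when the prediction concentrates near, but not exactly on, the ground-truth hypothesis, which is the intuition behind using it to propagate finer depth estimates across cascade stages.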
Related papers
- SparseFusion: Efficient Sparse Multi-Modal Fusion Framework for Long-Range 3D Perception [47.000734648271006]
We introduce SparseFusion, a novel multi-modal fusion framework built upon sparse 3D features to facilitate efficient long-range perception.
The proposed module introduces sparsity from both semantic and geometric aspects, filling only the grids in which foreground objects potentially reside.
On the long-range Argoverse2 dataset, SparseFusion reduces the memory footprint and accelerates inference by about a factor of two compared to dense detectors.
arXiv Detail & Related papers (2024-03-15T05:59:10Z)
- NDC-Scene: Boost Monocular 3D Semantic Scene Completion in Normalized Device Coordinates Space [77.6067460464962]
Monocular 3D Semantic Scene Completion (SSC) has garnered significant attention in recent years due to its potential to predict complex semantics and geometry shapes from a single image, requiring no 3D inputs.
We identify several critical issues in current state-of-the-art methods, including the Feature Ambiguity of projected 2D features in the ray to the 3D space, the Pose Ambiguity of the 3D convolution, and the Imbalance in the 3D convolution across different depth levels.
We devise a novel Normalized Device Coordinates scene completion network (NDC-Scene) that directly extends the 2D feature map to the normalized device coordinates space.
arXiv Detail & Related papers (2023-09-26T02:09:52Z)
- Fully Sparse Fusion for 3D Object Detection [69.32694845027927]
Currently prevalent multimodal 3D detection methods are built upon LiDAR-based detectors that usually use dense Bird's-Eye-View feature maps.
Fully sparse architectures are gaining attention as they are highly efficient in long-range perception.
In this paper, we study how to effectively leverage image modality in the emerging fully sparse architecture.
arXiv Detail & Related papers (2023-04-24T17:57:43Z)
- Joint-MAE: 2D-3D Joint Masked Autoencoders for 3D Point Cloud Pre-training [65.75399500494343]
Masked Autoencoders (MAE) have shown promising performance in self-supervised learning for 2D and 3D computer vision.
We propose Joint-MAE, a 2D-3D joint MAE framework for self-supervised 3D point cloud pre-training.
arXiv Detail & Related papers (2023-02-27T17:56:18Z)
- Monocular Scene Reconstruction with 3D SDF Transformers [17.565474518578178]
We propose an SDF transformer network, which replaces the role of the 3D CNN for better 3D feature aggregation.
Experiments on multiple datasets show that this 3D transformer network generates a more accurate and complete reconstruction.
arXiv Detail & Related papers (2023-01-31T09:54:20Z)
- VTP: Volumetric Transformer for Multi-view Multi-person 3D Pose Estimation [4.603321798937854]
Volumetric Transformer Pose estimator (VTP) is the first 3D transformer framework for multi-view multi-person 3D human pose estimation.
VTP aggregates features from 2D keypoints in all camera views and learns the relationships in the 3D voxel space in an end-to-end fashion.
arXiv Detail & Related papers (2022-05-25T09:26:42Z)
- MixSTE: Seq2seq Mixed Spatio-Temporal Encoder for 3D Human Pose Estimation in Video [75.23812405203778]
Recent solutions have been introduced to estimate 3D human pose from a 2D keypoint sequence by considering body joints among all frames globally to learn spatio-temporal correlation.
We propose MixSTE, which has a temporal transformer block to separately model the temporal motion of each joint and a spatial transformer block to learn inter-joint spatial correlation.
In addition, the network output is extended from the central frame to the entire frames of the input video, improving the coherence between the input and output sequences.
arXiv Detail & Related papers (2022-03-02T04:20:59Z)
- Multi-initialization Optimization Network for Accurate 3D Human Pose and Shape Estimation [75.44912541912252]
We propose a three-stage framework named Multi-Initialization Optimization Network (MION).
In the first stage, we strategically select different coarse 3D reconstruction candidates which are compatible with the 2D keypoints of the input sample.
In the second stage, we design a mesh refinement transformer (MRT) to respectively refine each coarse reconstruction result via a self-attention mechanism.
Finally, a Consistency Estimation Network (CEN) is proposed to find the best result from multiple candidates by evaluating whether the visual evidence in the RGB image matches a given 3D reconstruction.
arXiv Detail & Related papers (2021-12-24T02:43:58Z)
- SRH-Net: Stacked Recurrent Hourglass Network for Stereo Matching [33.66537830990198]
We decouple the 4D cubic cost volume used by 3D convolutional filters into sequential cost maps along the direction of disparity.
A novel recurrent module, Stacked Recurrent Hourglass (SRH), is proposed to process each cost map.
The proposed architecture is implemented in an end-to-end pipeline and evaluated on public datasets.
arXiv Detail & Related papers (2021-05-25T00:10:56Z)
- Multi-view Depth Estimation using Epipolar Spatio-Temporal Networks [87.50632573601283]
We present a novel method for multi-view depth estimation from a single video.
Our method achieves temporally coherent depth estimation results by using a novel Epipolar Spatio-Temporal (EST) transformer.
To reduce the computational cost, inspired by recent Mixture-of-Experts models, we design a compact hybrid network.
arXiv Detail & Related papers (2020-11-26T04:04:21Z)