VoRTX: Volumetric 3D Reconstruction With Transformers for Voxelwise View
Selection and Fusion
- URL: http://arxiv.org/abs/2112.00236v1
- Date: Wed, 1 Dec 2021 02:18:11 GMT
- Title: VoRTX: Volumetric 3D Reconstruction With Transformers for Voxelwise View
Selection and Fusion
- Authors: Noah Stier, Alexander Rich, Pradeep Sen, Tobias Höllerer
- Abstract summary: VoRTX is an end-to-end volumetric 3D reconstruction network using transformers for wide-baseline, multi-view feature fusion.
We train our model on ScanNet and show that it produces better reconstructions than state-of-the-art methods.
- Score: 68.68537312256144
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent volumetric 3D reconstruction methods can produce very accurate
results, with plausible geometry even for unobserved surfaces. However, they
face an undesirable trade-off when it comes to multi-view fusion. They can fuse
all available view information by global averaging, thus losing fine detail, or
they can heuristically cluster views for local fusion, thus restricting their
ability to consider all views jointly. Our key insight is that greater detail
can be retained without restricting view diversity by learning a view-fusion
function conditioned on camera pose and image content. We propose to learn this
multi-view fusion using a transformer. To this end, we introduce VoRTX, an
end-to-end volumetric 3D reconstruction network using transformers for
wide-baseline, multi-view feature fusion. Our model is occlusion-aware,
leveraging the transformer architecture to predict an initial, projective scene
geometry estimate. This estimate is used to avoid backprojecting image features
through surfaces into occluded regions. We train our model on ScanNet and show
that it produces better reconstructions than state-of-the-art methods. We also
demonstrate generalization without any fine-tuning, outperforming the same
state-of-the-art methods on two other datasets, TUM-RGBD and ICL-NUIM.
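The fusion idea described in the abstract can be pictured with a minimal PyTorch sketch (not the authors' implementation): image features backprojected into each voxel from every view become tokens conditioned on a camera-pose encoding, a small transformer encoder fuses them, and views that do not observe the voxel (out of frustum, or behind the initial geometry estimate) are masked out of the attention. The module sizes, pose encoding, and masking scheme below are illustrative assumptions, not values from the paper.

```python
# Hedged sketch of voxelwise, pose-conditioned multi-view fusion with a transformer.
import torch
import torch.nn as nn


class VoxelwiseViewFusion(nn.Module):
    """Fuses features backprojected from V views into each voxel via self-attention."""

    def __init__(self, feat_dim=64, pose_dim=16, num_heads=4, num_layers=2):
        super().__init__()
        # Project (image feature + camera-pose encoding) into a common token dimension.
        self.token_proj = nn.Linear(feat_dim + pose_dim, feat_dim)
        layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=num_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, view_feats, pose_enc, valid_mask):
        """
        view_feats: (N_voxels, V, feat_dim)  image features backprojected into voxels
        pose_enc:   (N_voxels, V, pose_dim)  per-view camera pose / direction encoding
        valid_mask: (N_voxels, V) bool       False where a view does not observe the voxel
        returns:    (N_voxels, feat_dim)     fused per-voxel feature
        """
        tokens = self.token_proj(torch.cat([view_feats, pose_enc], dim=-1))
        # Occluded / out-of-frustum views are excluded from the attention.
        fused = self.encoder(tokens, src_key_padding_mask=~valid_mask)
        # Pool only over the views that actually observe the voxel.
        weights = valid_mask.unsqueeze(-1).float()
        return (fused * weights).sum(dim=1) / weights.sum(dim=1).clamp(min=1.0)


if __name__ == "__main__":
    N, V, F, P = 1024, 8, 64, 16           # voxels, views, feature dim, pose-enc dim
    fusion = VoxelwiseViewFusion(F, P)
    feats = torch.randn(N, V, F)
    poses = torch.randn(N, V, P)
    mask = torch.rand(N, V) > 0.3           # pretend ~30% of view/voxel pairs are occluded
    mask[:, 0] = True                       # keep at least one valid view per voxel
    print(fusion(feats, poses, mask).shape)  # torch.Size([1024, 64])
```

In contrast to global averaging, the attention weights here depend on both image content and camera pose, so informative views can dominate the fused feature without discarding the others.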
Related papers
- GenS: Generalizable Neural Surface Reconstruction from Multi-View Images [20.184657468900852]
GenS is an end-to-end generalizable neural surface reconstruction model.
Our representation recovers high-frequency details while maintaining global smoothness.
Experiments on popular benchmarks show that our model can generalize well to new scenes.
arXiv Detail & Related papers (2024-06-04T17:13:10Z) - MVD-Fusion: Single-view 3D via Depth-consistent Multi-view Generation [54.27399121779011]
We present MVD-Fusion: a method for single-view 3D inference via generative modeling of multi-view-consistent RGB-D images.
We show that our approach can yield more accurate synthesis compared to recent state-of-the-art, including distillation-based 3D inference and prior multi-view generation methods.
arXiv Detail & Related papers (2024-04-04T17:59:57Z) - Hyper-VolTran: Fast and Generalizable One-Shot Image to 3D Object
Structure via HyperNetworks [53.67497327319569]
We introduce a novel neural rendering technique to solve image-to-3D from a single view.
Our approach employs the signed distance function as the surface representation and incorporates generalizable priors through geometry-encoding volumes and HyperNetworks.
Our experiments show the advantages of our approach, producing consistent results with rapid generation.
arXiv Detail & Related papers (2023-12-24T08:42:37Z) - UpFusion: Novel View Diffusion from Unposed Sparse View Observations [66.36092764694502]
UpFusion can perform novel view synthesis and infer 3D representations for an object given a sparse set of reference images.
We show that this approach generates high-fidelity novel views, and that synthesis quality improves as additional (unposed) images are provided.
arXiv Detail & Related papers (2023-12-11T18:59:55Z) - MVTN: Learning Multi-View Transformations for 3D Understanding [60.15214023270087]
We introduce the Multi-View Transformation Network (MVTN), which uses differentiable rendering to determine optimal viewpoints for 3D shape recognition.
MVTN can be trained end-to-end with any multi-view network for 3D shape recognition.
Our approach demonstrates state-of-the-art performance in 3D classification and shape retrieval on several benchmarks.
arXiv Detail & Related papers (2022-12-27T12:09:16Z) - Vision Transformer for NeRF-Based View Synthesis from a Single Input
Image [49.956005709863355]
We propose to leverage both the global and local features to form an expressive 3D representation.
To synthesize a novel view, we train a multilayer perceptron (MLP) network conditioned on the learned 3D representation to perform volume rendering.
Our method can render novel views from only a single input image and generalize across multiple object categories using a single model.
arXiv Detail & Related papers (2022-07-12T17:52:04Z) - VPFusion: Joint 3D Volume and Pixel-Aligned Feature Fusion for Single
and Multi-view 3D Reconstruction [23.21446438011893]
VPFusion attains high-quality reconstruction using both a 3D feature volume, to capture 3D-structure-aware context, and pixel-aligned image features, to capture fine local detail.
Existing approaches use RNNs, feature pooling, or attention computed independently in each view for multi-view fusion.
We show improved multi-view feature fusion by establishing transformer-based pairwise view association.
arXiv Detail & Related papers (2022-03-14T23:30:58Z) - TransformerFusion: Monocular RGB Scene Reconstruction using Transformers [26.87200488085741]
TransformerFusion is a transformer-based 3D scene reconstruction approach.
The network learns to attend to the most relevant image frames for each 3D location in the scene.
Features are fused in a coarse-to-fine fashion, storing fine-level features only where needed.
arXiv Detail & Related papers (2021-07-05T18:00:11Z) - LegoFormer: Transformers for Block-by-Block Multi-view 3D Reconstruction [45.16128577837725]
Most modern deep learning-based multi-view 3D reconstruction techniques use RNNs or fusion modules to combine information from multiple images after encoding them.
We propose LegoFormer, a transformer-based model that unifies object reconstruction under a single framework and parametrizes the reconstructed occupancy grid by its decomposition factors.
arXiv Detail & Related papers (2021-06-23T00:15:08Z) - Multi-view 3D Reconstruction with Transformer [34.756336770583154]
We reformulate the multi-view 3D reconstruction as a sequence-to-sequence prediction problem.
We propose a new framework named 3D Volume Transformer (VolT) for such a task.
Our method achieves a new state-of-the-art accuracy in multi-view reconstruction with fewer parameters.
arXiv Detail & Related papers (2021-03-24T03:14:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.