Multi-view 3D Reconstruction with Transformer
- URL: http://arxiv.org/abs/2103.12957v1
- Date: Wed, 24 Mar 2021 03:14:49 GMT
- Title: Multi-view 3D Reconstruction with Transformer
- Authors: Dan Wang, Xinrui Cui, Xun Chen, Zhengxia Zou, Tianyang Shi, Septimiu
Salcudean, Z. Jane Wang, Rabab Ward
- Abstract summary: We reformulate the multi-view 3D reconstruction as a sequence-to-sequence prediction problem.
We propose a new framework named 3D Volume Transformer (VolT) for such a task.
Our method achieves a new state-of-the-art accuracy in multi-view reconstruction with fewer parameters.
- Score: 34.756336770583154
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep CNN-based methods have so far achieved state-of-the-art results in
multi-view 3D object reconstruction. Despite the considerable progress, the two
core modules of these methods - multi-view feature extraction and fusion - are
usually investigated separately, and the object relations in different views
are rarely explored. In this paper, inspired by the recent success of
self-attention-based Transformer models, we reformulate the multi-view 3D
reconstruction as a sequence-to-sequence prediction problem and propose a new
framework named 3D Volume Transformer (VolT) for such a task. Unlike previous
CNN-based methods using a separate design, we unify the feature extraction and
view fusion in a single Transformer network. A natural advantage of our design
lies in the exploration of view-to-view relationships using self-attention
among multiple unordered inputs. On ShapeNet, a large-scale 3D reconstruction
benchmark dataset, our method achieves a new state-of-the-art accuracy in
multi-view reconstruction with fewer parameters ($70\%$ less) than other
CNN-based methods. Experimental results also suggest the strong scaling
capability of our method. Our code will be made publicly available.
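The core idea, per-view features fused as a token sequence by self-attention and decoded into an occupancy volume, can be illustrated with a minimal sketch. This is not the authors' released code: the module names, the single-token-per-view embedding, and all dimensions are hypothetical assumptions.
```python
import torch
import torch.nn as nn

class VolTSketch(nn.Module):
    """Toy sequence-to-sequence multi-view reconstruction with a Transformer."""
    def __init__(self, d_model=256, n_heads=8, n_layers=4, vox_res=32):
        super().__init__()
        self.vox_res = vox_res
        # Embed each view into a single token; the paper's network works on
        # richer per-view features, so this flattening is purely illustrative.
        self.view_embed = nn.Linear(3 * 128 * 128, d_model)
        enc = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, n_layers)
        # Learned queries for the output volume, one per z-slice here.
        self.vol_queries = nn.Parameter(torch.randn(vox_res, d_model))
        dec = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec, n_layers)
        self.to_occupancy = nn.Linear(d_model, vox_res * vox_res)

    def forward(self, views):                        # views: (B, V, 3, 128, 128)
        b = views.shape[0]
        tokens = self.view_embed(views.flatten(2))   # (B, V, d_model)
        # No positional encoding is added, so self-attention treats the V views
        # as an unordered set; the decoded volume does not depend on view order.
        fused = self.encoder(tokens)                 # (B, V, d_model)
        queries = self.vol_queries.unsqueeze(0).expand(b, -1, -1)
        slices = self.decoder(queries, fused)        # (B, vox_res, d_model)
        occ = self.to_occupancy(slices)              # (B, vox_res, vox_res^2)
        return occ.sigmoid().view(b, self.vox_res, self.vox_res, self.vox_res)

# Toy usage: 5 views of a batch of 2 objects -> a (2, 32, 32, 32) occupancy grid.
occupancy = VolTSketch()(torch.randn(2, 5, 3, 128, 128))
```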
Related papers
- Flex3D: Feed-Forward 3D Generation With Flexible Reconstruction Model And Input View Curation [61.040832373015014]
We propose Flex3D, a novel framework for generating high-quality 3D content from text, single images, or sparse view images.
In the first stage, we employ a fine-tuned multi-view image diffusion model and a video diffusion model to generate a pool of candidate views, enabling a rich representation of the target 3D object.
In the second stage, the curated views are fed into a Flexible Reconstruction Model (FlexRM), built upon a transformer architecture that can effectively process an arbitrary number of inputs.
arXiv Detail & Related papers (2024-10-01T17:29:43Z)
- Generative Object Insertion in Gaussian Splatting with a Multi-View Diffusion Model [15.936267489962122]
We propose a novel method for object insertion in 3D content represented by Gaussian Splatting.
Our approach introduces a multi-view diffusion model, dubbed MVInpainter, which is built upon a pre-trained Stable Video Diffusion model.
Within MVInpainter, we incorporate a ControlNet-based conditional injection module to enable controlled and more predictable multi-view generation.
arXiv Detail & Related papers (2024-09-25T13:52:50Z)
- MVD-Fusion: Single-view 3D via Depth-consistent Multi-view Generation [54.27399121779011]
We present MVD-Fusion: a method for single-view 3D inference via generative modeling of multi-view-consistent RGB-D images.
We show that our approach can yield more accurate synthesis compared to recent state-of-the-art, including distillation-based 3D inference and prior multi-view generation methods.
arXiv Detail & Related papers (2024-04-04T17:59:57Z)
- Hyper-VolTran: Fast and Generalizable One-Shot Image to 3D Object Structure via HyperNetworks [53.67497327319569]
We introduce a novel neural rendering technique to solve image-to-3D from a single view.
Our approach employs the signed distance function as the surface representation and incorporates generalizable priors through geometry-encoding volumes and HyperNetworks.
Our experiments show the advantages of our proposed approach with consistent results and rapid generation.
arXiv Detail & Related papers (2023-12-24T08:42:37Z)
- MVTN: Learning Multi-View Transformations for 3D Understanding [60.15214023270087]
We introduce the Multi-View Transformation Network (MVTN), which uses differentiable rendering to determine optimal viewpoints for 3D shape recognition.
MVTN can be trained end-to-end with any multi-view network for 3D shape recognition.
Our approach demonstrates state-of-the-art performance in 3D classification and shape retrieval on several benchmarks.
arXiv Detail & Related papers (2022-12-27T12:09:16Z)
- End-to-End Multi-View Structure-from-Motion with Hypercorrelation Volumes [7.99536002595393]
Deep learning techniques have been proposed to tackle structure-from-motion (SfM).
We improve on the state-of-the-art two-view SfM approach.
We extend it to the general multi-view case and evaluate it on the complex benchmark dataset DTU.
arXiv Detail & Related papers (2022-09-14T20:58:44Z)
- Single-view 3D Mesh Reconstruction for Seen and Unseen Categories [69.29406107513621]
Single-view 3D Mesh Reconstruction is a fundamental computer vision task that aims at recovering 3D shapes from single-view RGB images.
This paper tackles single-view 3D mesh reconstruction to study model generalization to unseen categories.
We propose an end-to-end two-stage network, GenMesh, to break the category boundaries in reconstruction.
arXiv Detail & Related papers (2022-08-04T14:13:35Z)
- VPFusion: Joint 3D Volume and Pixel-Aligned Feature Fusion for Single and Multi-view 3D Reconstruction [23.21446438011893]
VPFusion attains high-quality reconstruction using both a 3D feature volume, to capture 3D-structure-aware context, and pixel-aligned image features.
Existing approaches use RNN, feature pooling, or attention computed independently in each view for multi-view fusion.
We show improved multi-view feature fusion by establishing transformer-based pairwise view association.
arXiv Detail & Related papers (2022-03-14T23:30:58Z)
- VoRTX: Volumetric 3D Reconstruction With Transformers for Voxelwise View Selection and Fusion [68.68537312256144]
VoRTX is an end-to-end volumetric 3D reconstruction network using transformers for wide-baseline, multi-view feature fusion.
We train our model on ScanNet and show that it produces better reconstructions than state-of-the-art methods.
arXiv Detail & Related papers (2021-12-01T02:18:11Z)
- LegoFormer: Transformers for Block-by-Block Multi-view 3D Reconstruction [45.16128577837725]
Most modern deep learning-based multi-view 3D reconstruction techniques use RNNs or fusion modules to combine information from multiple images after encoding them.
We propose LegoFormer, a transformer-based model that unifies object reconstruction under a single framework and parametrizes the reconstructed occupancy grid by its decomposition factors (see the sketch after this list).
arXiv Detail & Related papers (2021-06-23T00:15:08Z)
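As a rough illustration of the "decomposition factors" mentioned in the LegoFormer entry above, the sketch below rebuilds an occupancy volume from rank-1 factors. The factor count, grid resolution, and function name are hypothetical assumptions; this is not the paper's implementation.
```python
import torch

def grid_from_factors(x, y, z):
    """Toy reconstruction of an occupancy volume from rank-1 decomposition factors.

    x, y, z: (K, R) tensors holding K triplets of length-R vectors.
    Returns an (R, R, R) grid equal to the sum over k of the outer products
    x[k] (outer) y[k] (outer) z[k], clamped to [0, 1] so it reads as occupancy.
    """
    grid = torch.einsum('kx,ky,kz->xyz', x, y, z)
    return grid.clamp(0.0, 1.0)

# Toy usage: 12 predicted factor triplets describe a 32^3 occupancy grid.
occ = grid_from_factors(torch.rand(12, 32), torch.rand(12, 32), torch.rand(12, 32))
```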