3D-RETR: End-to-End Single and Multi-View 3D Reconstruction with
Transformers
- URL: http://arxiv.org/abs/2110.08861v1
- Date: Sun, 17 Oct 2021 16:19:15 GMT
- Title: 3D-RETR: End-to-End Single and Multi-View 3D Reconstruction with
Transformers
- Authors: Zai Shi, Zhao Meng, Yiran Xing, Yunpu Ma, Roger Wattenhofer
- Abstract summary: 3D-RETR is able to perform end-to-end 3D REconstruction with TRansformers.
3D-RETR first uses a pretrained Transformer to extract visual features from 2D input images.
A CNN Decoder then takes as input the voxel features to obtain the reconstructed objects.
- Score: 12.238921770499912
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: 3D reconstruction aims to reconstruct 3D objects from 2D views. Previous
works for 3D reconstruction mainly focus on feature matching between views or
using CNNs as backbones. Recently, Transformers have been shown effective in
multiple applications of computer vision. However, whether or not Transformers
can be used for 3D reconstruction is still unclear. In this paper, we fill this
gap by proposing 3D-RETR, which is able to perform end-to-end 3D REconstruction
with TRansformers. 3D-RETR first uses a pretrained Transformer to extract
visual features from 2D input images. 3D-RETR then uses another Transformer
Decoder to obtain the voxel features. A CNN Decoder then takes as input the
voxel features to obtain the reconstructed objects. 3D-RETR is capable of 3D
reconstruction from a single view or multiple views. Experimental results on
two datasets show that 3D-RETR reaches state-of-the-art performance on 3D
reconstruction. An additional ablation study demonstrates that 3D-RETR
benefits from using Transformers.
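To make the three-stage pipeline in the abstract concrete, here is a minimal PyTorch sketch: a stand-in for the pretrained Transformer encoder produces visual features, a Transformer decoder turns learnable queries into voxel features, and a 3D CNN decoder upsamples them to an occupancy grid. All sizes and module choices below are illustrative assumptions, not the authors' configuration.

```python
import torch
import torch.nn as nn

class Toy3DRETR(nn.Module):
    """Schematic of the encoder -> decoder -> CNN pipeline from the abstract.
    Every hyperparameter here is an illustrative assumption."""
    def __init__(self, d_model=256, n_queries=512):
        super().__init__()
        # Stand-in for the pretrained 2D Transformer encoder (e.g. a ViT):
        # a patch embedding followed by a small TransformerEncoder.
        self.patch_embed = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2)
        # Learnable voxel queries consumed by the Transformer decoder.
        self.voxel_queries = nn.Parameter(torch.randn(n_queries, d_model))
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2)
        # CNN decoder: upsample an 8x8x8 grid of query features to 32^3.
        self.cnn = nn.Sequential(
            nn.ConvTranspose3d(d_model, 64, 4, stride=2, padding=1),
            nn.ReLU(),
            nn.ConvTranspose3d(64, 1, 4, stride=2, padding=1),
            nn.Sigmoid())

    def forward(self, images):                    # (B, 3, 224, 224)
        tokens = self.patch_embed(images).flatten(2).transpose(1, 2)
        memory = self.encoder(tokens)             # visual features
        q = self.voxel_queries.expand(images.size(0), -1, -1)
        vox = self.decoder(q, memory)             # (B, 512, 256) voxel feats
        vox = vox.transpose(1, 2).reshape(-1, 256, 8, 8, 8)
        return self.cnn(vox)                      # (B, 1, 32, 32, 32)

occupancy = Toy3DRETR()(torch.randn(1, 3, 224, 224))
print(occupancy.shape)  # torch.Size([1, 1, 32, 32, 32])
```

For multiple views, one natural extension of this sketch is to concatenate each view's encoder tokens into a single memory sequence before decoding.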
Related papers
- IM-3D: Iterative Multiview Diffusion and Reconstruction for High-Quality
3D Generation [96.32684334038278]
In this paper, we explore the design space of text-to-3D models.
We significantly improve multi-view generation by using video generators instead of image generators.
Our new method, IM-3D, reduces the number of evaluations of the 2D generator network 10-100x.
arXiv Detail & Related papers (2024-02-13T18:59:51Z)
- R3D-SWIN:Use Shifted Window Attention for Single-View 3D Reconstruction [0.565395466029518]
We propose a voxel 3D reconstruction network based on shifted window attention.
Experimental results on ShapeNet verify that our method achieves SOTA accuracy in single-view reconstruction.
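Shifted window attention, popularized by the Swin Transformer, computes self-attention inside small local windows and shifts the window grid between successive layers so information can cross window borders. Below is a minimal sketch of the shift-and-partition step, with assumed window and shift sizes; it is not R3D-SWIN's code.

```python
import torch

def shifted_window_partition(x, window=4, shift=2):
    """x: (B, H, W, C) feature map. Roll the map so the window grid is
    offset, then cut it into non-overlapping windows; attention would
    then run independently inside each (window*window)-token group."""
    x = torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))
    B, H, W, C = x.shape
    x = x.view(B, H // window, window, W // window, window, C)
    windows = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window * window, C)
    return windows  # (B * num_windows, window*window, C)

feats = torch.randn(1, 8, 8, 96)
print(shifted_window_partition(feats).shape)  # torch.Size([4, 16, 96])
```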
arXiv Detail & Related papers (2023-12-05T12:42:37Z)
- MobileBrick: Building LEGO for 3D Reconstruction on Mobile Devices [78.20154723650333]
High-quality 3D ground-truth shapes are critical for 3D object reconstruction evaluation.
We introduce a novel multi-view RGBD dataset captured using a mobile device.
We obtain precise 3D ground-truth shape without relying on high-end 3D scanners.
arXiv Detail & Related papers (2023-03-03T14:02:50Z)
- Efficient 3D Object Reconstruction using Visual Transformers [4.670344336401625]
We set out to use visual transformers in place of convolutions in 3D object reconstruction.
Using a transformer-based encoder and decoder to predict 3D structure from 2D images, we achieve accuracy similar or superior to that of the baseline approach.
arXiv Detail & Related papers (2023-02-16T18:33:25Z)
- Bridged Transformer for Vision and Point Cloud 3D Object Detection [92.86856146086316]
Bridged Transformer (BrT) is an end-to-end architecture for 3D object detection.
BrT learns to identify 3D and 2D object bounding boxes from both points and image patches.
We experimentally show that BrT surpasses state-of-the-art methods on SUN RGB-D and ScanNetV2 datasets.
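One simple reading of "from both points and image patches" is a transformer whose token sequence concatenates embedded 3D points and embedded image patches, so each modality attends to the other. The sketch below is a hypothetical simplification along those lines, not BrT's actual bridging mechanism.

```python
import torch
import torch.nn as nn

d = 128
point_proj = nn.Linear(3, d)     # embed raw xyz points as tokens
patch_proj = nn.Linear(768, d)   # embed image patch features as tokens
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d, nhead=8, batch_first=True), num_layers=2)

points = torch.randn(1, 1024, 3)    # a point cloud
patches = torch.randn(1, 196, 768)  # e.g. ViT patch features
tokens = torch.cat([point_proj(points), patch_proj(patches)], dim=1)
fused = encoder(tokens)  # (1, 1024 + 196, d): each modality sees both
print(fused.shape)
```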
arXiv Detail & Related papers (2022-10-04T05:44:22Z)
- RayTran: 3D pose estimation and shape reconstruction of multiple objects
from videos with ray-traced transformers [41.499325832227626]
We propose a transformer-based neural network architecture for multi-object 3D reconstruction from RGB videos.
We exploit knowledge about the image formation process to significantly sparsify the attention weight matrix.
Compared to previous methods, our architecture is single-stage and end-to-end trainable.
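The sparsification idea can be pictured as follows: an image token needs to attend to a voxel only if that voxel projects onto the token's pixel, so most of the attention matrix can be masked out a priori. Here is a toy mask construction, assuming a simple orthographic projection for brevity; RayTran itself traces the actual camera rays, and this is not its implementation.

```python
import torch

def ray_attention_mask(grid=8, img=4):
    """Boolean (num_pixels, num_voxels) mask: pixel p may attend to voxel v
    only if v projects (here: orthographically along z, an assumption for
    brevity) into p's image cell. Masked-out entries never need computing."""
    zz, yy, xx = torch.meshgrid(
        torch.arange(grid), torch.arange(grid), torch.arange(grid),
        indexing="ij")
    # Which image cell does each voxel fall into after projection?
    px = xx.flatten() * img // grid
    py = yy.flatten() * img // grid
    voxel_pixel = py * img + px                      # (grid^3,)
    pixels = torch.arange(img * img).unsqueeze(1)    # (img^2, 1)
    return pixels == voxel_pixel.unsqueeze(0)        # (img^2, grid^3)

mask = ray_attention_mask()
print(mask.shape, mask.float().mean().item())  # only 1/16 of entries kept
```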
arXiv Detail & Related papers (2022-03-24T18:49:12Z)
- Voxel-based 3D Detection and Reconstruction of Multiple Objects from a
Single Image [22.037472446683765]
We learn a regular grid of 3D voxel features from the input image which is aligned with 3D scene space via a 3D feature lifting operator.
Based on the 3D voxel features, our novel CenterNet-3D detection head formulates the 3D detection as keypoint detection in the 3D space.
We devise an efficient coarse-to-fine reconstruction module, including coarse-level voxelization and a novel local PCA-SDF shape representation.
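Casting detection as keypoint detection (the CenterNet-3D head above) means predicting a dense heatmap of object-center scores and reading off its local maxima. A minimal 3D version of the standard max-pool peak extraction follows; it is assumed here to carry over from 2D CenterNet and is not the paper's code.

```python
import torch
import torch.nn.functional as F

def topk_3d_centers(heatmap, k=5):
    """heatmap: (1, 1, D, H, W) predicted center-ness scores in [0, 1].
    Keep only local maxima (3x3x3 max-pool NMS), then take the top-k."""
    pooled = F.max_pool3d(heatmap, kernel_size=3, stride=1, padding=1)
    peaks = heatmap * (pooled == heatmap)            # suppress non-maxima
    scores, idx = peaks.flatten().topk(k)
    D, H, W = heatmap.shape[2:]
    z, rem = idx // (H * W), idx % (H * W)
    return scores, torch.stack([z, rem // W, rem % W], dim=1)  # (k, 3) zyx

hm = torch.rand(1, 1, 16, 16, 16)
scores, centers = topk_3d_centers(hm)
print(centers)  # voxel coordinates of the k strongest detected centers
```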
arXiv Detail & Related papers (2021-11-04T18:30:37Z)
- 3D-to-2D Distillation for Indoor Scene Parsing [78.36781565047656]
We present a new approach that leverages 3D features extracted from a large-scale 3D data repository to enhance 2D features extracted from RGB images.
First, we distill 3D knowledge from a pretrained 3D network to supervise a 2D network to learn simulated 3D features from 2D features during training.
Second, we design a two-stage dimension normalization scheme to calibrate the 2D and 3D features for better integration.
Third, we design a semantic-aware adversarial training model to extend our framework for training with unpaired 3D data.
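The first step, distilling 3D knowledge into a 2D network, can be read as a feature-matching objective: a frozen 3D teacher provides target features and the 2D network learns to simulate them. Below is a hedged sketch with placeholder linear networks; the paper's dimension normalization scheme and adversarial extension are omitted.

```python
import torch
import torch.nn as nn

# Frozen, pretrained 3D teacher (stand-in) and trainable 2D student head.
teacher_3d = nn.Linear(64, 128).requires_grad_(False)  # placeholder teacher
student_2d = nn.Sequential(nn.Linear(32, 128))         # simulates 3D feats

feats_2d = torch.randn(100, 32)   # per-pixel 2D features (from RGB branch)
feats_3d = torch.randn(100, 64)   # paired 3D features (from a 3D network)

with torch.no_grad():
    target = teacher_3d(feats_3d)         # 3D knowledge to be distilled
simulated = student_2d(feats_2d)          # 2D network's simulated 3D feats
distill_loss = nn.functional.mse_loss(simulated, target)
distill_loss.backward()                   # gradients reach only the student
print(distill_loss.item())
```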
arXiv Detail & Related papers (2021-04-06T02:22:24Z)
- Pix2Vox++: Multi-scale Context-aware 3D Object Reconstruction from
Single and Multiple Images [56.652027072552606]
We propose a novel framework for single-view and multi-view 3D object reconstruction, named Pix2Vox++.
By using a well-designed encoder-decoder, it generates a coarse 3D volume from each input image.
A multi-scale context-aware fusion module is then introduced to adaptively select high-quality reconstructions for different parts from all coarse 3D volumes to obtain a fused 3D volume.
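The fusion module's role, selecting the best-reconstructed parts from each view's coarse volume, can be approximated by scoring every voxel of every coarse volume and taking a softmax-weighted sum across views. The stand-in below is illustrative only; Pix2Vox++'s actual module is context-aware and multi-scale.

```python
import torch
import torch.nn as nn

class NaiveFusion(nn.Module):
    """Fuse V coarse volumes (B, V, 1, R, R, R) by per-voxel soft selection."""
    def __init__(self):
        super().__init__()
        # Tiny 3D scorer; a placeholder for the real fusion module.
        self.score = nn.Conv3d(1, 1, kernel_size=3, padding=1)

    def forward(self, volumes):
        B, V = volumes.shape[:2]
        scores = self.score(volumes.flatten(0, 1))    # score every voxel
        scores = scores.view(B, V, *scores.shape[1:])
        weights = scores.softmax(dim=1)               # compete across views
        return (weights * volumes).sum(dim=1)         # (B, 1, R, R, R)

coarse = torch.rand(2, 3, 1, 32, 32, 32)  # 3 views' coarse reconstructions
print(NaiveFusion()(coarse).shape)        # torch.Size([2, 1, 32, 32, 32])
```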
arXiv Detail & Related papers (2020-06-22T13:48:09Z)
- Implicit Functions in Feature Space for 3D Shape Reconstruction and
Completion [53.885984328273686]
Implicit Feature Networks (IF-Nets) deliver continuous outputs, can handle multiple topologies, and complete shapes for missing or sparse input data.
IF-Nets clearly outperform prior work in 3D object reconstruction on ShapeNet, and obtain significantly more accurate 3D human reconstructions.
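What makes implicit functions attractive here is that the shape is a function f(point, features) -> occupancy that can be queried at arbitrary continuous 3D locations, so output resolution and topology are not fixed by the network. Below is a generic implicit-decoder sketch; it is not the IF-Net architecture, which conditions on multi-scale grid features.

```python
import torch
import torch.nn as nn

class ImplicitDecoder(nn.Module):
    """Occupancy at continuous 3D points, conditioned on a shape code."""
    def __init__(self, code_dim=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + code_dim, 256), nn.ReLU(),
            nn.Linear(256, 1), nn.Sigmoid())

    def forward(self, points, code):       # (B, N, 3), (B, code_dim)
        code = code.unsqueeze(1).expand(-1, points.size(1), -1)
        return self.mlp(torch.cat([points, code], dim=-1)).squeeze(-1)

dec = ImplicitDecoder()
pts = torch.rand(1, 4096, 3) * 2 - 1   # query any points in [-1, 1]^3
occ = dec(pts, torch.randn(1, 128))    # (1, 4096) occupancy in (0, 1)
print(occ.shape)  # resolution is chosen at query time, not baked in
```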
arXiv Detail & Related papers (2020-03-03T11:14:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.