3D-RETR: End-to-End Single and Multi-View 3D Reconstruction with
Transformers
- URL: http://arxiv.org/abs/2110.08861v1
- Date: Sun, 17 Oct 2021 16:19:15 GMT
- Title: 3D-RETR: End-to-End Single and Multi-View 3D Reconstruction with
Transformers
- Authors: Zai Shi, Zhao Meng, Yiran Xing, Yunpu Ma, Roger Wattenhofer
- Abstract summary: 3D-RETR is able to perform end-to-end 3D REconstruction with TRansformers.
3D-RETR first uses a pretrained Transformer to extract visual features from 2D input images.
A CNN Decoder then takes as input the voxel features to obtain the reconstructed objects.
- Score: 12.238921770499912
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: 3D reconstruction aims to reconstruct 3D objects from 2D views. Previous
works for 3D reconstruction mainly focus on feature matching between views or
using CNNs as backbones. Recently, Transformers have been shown effective in
multiple applications of computer vision. However, whether or not Transformers
can be used for 3D reconstruction is still unclear. In this paper, we fill this
gap by proposing 3D-RETR, which is able to perform end-to-end 3D REconstruction
with TRansformers. 3D-RETR first uses a pretrained Transformer to extract
visual features from 2D input images. 3D-RETR then uses another Transformer
Decoder to obtain the voxel features. A CNN Decoder then takes as input the
voxel features to obtain the reconstructed objects. 3D-RETR is capable of 3D
reconstruction from a single view or multiple views. Experimental results on
two datasets show that 3D-RETR reaches state-of-the-art performance on 3D
reconstruction. An additional ablation study demonstrates that 3D-RETR
benefits from using Transformers.
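To make the three-stage pipeline in the abstract concrete, here is a minimal PyTorch sketch: a stand-in for the pretrained Transformer encoder produces visual features, a Transformer decoder turns learnable queries into voxel features, and a 3D CNN decoder upsamples them to an occupancy grid. All sizes and module choices below are illustrative assumptions, not the authors' configuration.

```python
import torch
import torch.nn as nn

class Toy3DRETR(nn.Module):
    """Schematic of the encoder -> decoder -> CNN pipeline from the abstract.
    Every hyperparameter here is an illustrative assumption."""
    def __init__(self, d_model=256, n_queries=512):
        super().__init__()
        # Stand-in for the pretrained 2D Transformer encoder (e.g. a ViT):
        # a patch embedding followed by a small TransformerEncoder.
        self.patch_embed = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2)
        # Learnable voxel queries consumed by the Transformer decoder.
        self.voxel_queries = nn.Parameter(torch.randn(n_queries, d_model))
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2)
        # CNN decoder: upsample an 8x8x8 grid of query features to 32^3.
        self.cnn = nn.Sequential(
            nn.ConvTranspose3d(d_model, 64, 4, stride=2, padding=1),
            nn.ReLU(),
            nn.ConvTranspose3d(64, 1, 4, stride=2, padding=1),
            nn.Sigmoid())

    def forward(self, images):                    # (B, 3, 224, 224)
        tokens = self.patch_embed(images).flatten(2).transpose(1, 2)
        memory = self.encoder(tokens)             # visual features
        q = self.voxel_queries.expand(images.size(0), -1, -1)
        vox = self.decoder(q, memory)             # (B, 512, 256) voxel feats
        vox = vox.transpose(1, 2).reshape(-1, 256, 8, 8, 8)
        return self.cnn(vox)                      # (B, 1, 32, 32, 32)

occupancy = Toy3DRETR()(torch.randn(1, 3, 224, 224))
print(occupancy.shape)  # torch.Size([1, 1, 32, 32, 32])
```

For multiple views, one natural extension of this sketch is to concatenate each view's encoder tokens into a single memory sequence before decoding.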
Related papers
- IM-3D: Iterative Multiview Diffusion and Reconstruction for High-Quality
3D Generation [96.32684334038278]
In this paper, we explore the design space of text-to-3D models.
We significantly improve multi-view generation by using video generators instead of image generators.
Our new method, IM-3D, reduces the number of evaluations of the 2D generator network 10-100x.
arXiv Detail & Related papers (2024-02-13T18:59:51Z)
- R3D-SWIN:Use Shifted Window Attention for Single-View 3D Reconstruction [0.565395466029518]
We propose a voxel 3D reconstruction network based on shifted window attention.
Experimental results on ShapeNet verify that our method achieves SOTA accuracy in single-view reconstruction.
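Shifted window attention, popularized by the Swin Transformer, computes self-attention inside small local windows and shifts the window grid between successive layers so information can cross window borders. Below is a minimal sketch of the shift-and-partition step, with assumed window and shift sizes; it is not R3D-SWIN's code.

```python
import torch

def shifted_window_partition(x, window=4, shift=2):
    """x: (B, H, W, C) feature map. Roll the map so the window grid is
    offset, then cut it into non-overlapping windows; attention would
    then run independently inside each (window*window)-token group."""
    x = torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))
    B, H, W, C = x.shape
    x = x.view(B, H // window, window, W // window, window, C)
    windows = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window * window, C)
    return windows  # (B * num_windows, window*window, C)

feats = torch.randn(1, 8, 8, 96)
print(shifted_window_partition(feats).shape)  # torch.Size([4, 16, 96])
```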
arXiv Detail & Related papers (2023-12-05T12:42:37Z)
- MobileBrick: Building LEGO for 3D Reconstruction on Mobile Devices [78.20154723650333]
High-quality 3D ground-truth shapes are critical for 3D object reconstruction evaluation.
We introduce a novel multi-view RGBD dataset captured using a mobile device.
We obtain precise 3D ground-truth shape without relying on high-end 3D scanners.
arXiv Detail & Related papers (2023-03-03T14:02:50Z)
- Efficient 3D Object Reconstruction using Visual Transformers [4.670344336401625]
We set out to use visual transformers in place of convolutions in 3D object reconstruction.
Using a transformer-based encoder and decoder to predict 3D structure from 2D images, we achieve accuracy similar or superior to that of the baseline approach.
arXiv Detail & Related papers (2023-02-16T18:33:25Z)
- Bridged Transformer for Vision and Point Cloud 3D Object Detection [92.86856146086316]
Bridged Transformer (BrT) is an end-to-end architecture for 3D object detection.
BrT learns to identify 3D and 2D object bounding boxes from both points and image patches.
We experimentally show that BrT surpasses state-of-the-art methods on SUN RGB-D and ScanNetV2 datasets.
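One simple reading of "from both points and image patches" is a transformer whose token sequence concatenates embedded 3D points and embedded image patches, so each modality attends to the other. The sketch below is a hypothetical simplification along those lines, not BrT's actual bridging mechanism.

```python
import torch
import torch.nn as nn

d = 128
point_proj = nn.Linear(3, d)     # embed raw xyz points as tokens
patch_proj = nn.Linear(768, d)   # embed image patch features as tokens
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d, nhead=8, batch_first=True), num_layers=2)

points = torch.randn(1, 1024, 3)    # a point cloud
patches = torch.randn(1, 196, 768)  # e.g. ViT patch features
tokens = torch.cat([point_proj(points), patch_proj(patches)], dim=1)
fused = encoder(tokens)  # (1, 1024 + 196, d): each modality sees both
print(fused.shape)
```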
arXiv Detail & Related papers (2022-10-04T05:44:22Z)
- RayTran: 3D pose estimation and shape reconstruction of multiple objects
from videos with ray-traced transformers [41.499325832227626]
We propose a transformer-based neural network architecture for multi-object 3D reconstruction from RGB videos.
We exploit knowledge about the image formation process to significantly sparsify the attention weight matrix.
Compared to previous methods, our architecture is single-stage and end-to-end trainable.
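The sparsification idea can be pictured as follows: an image token needs to attend to a voxel only if that voxel projects onto the token's pixel, so most of the attention matrix can be masked out a priori. Here is a toy mask construction, assuming a simple orthographic projection for brevity; RayTran itself traces the actual camera rays, and this is not its implementation.

```python
import torch

def ray_attention_mask(grid=8, img=4):
    """Boolean (num_pixels, num_voxels) mask: pixel p may attend to voxel v
    only if v projects (here: orthographically along z, an assumption for
    brevity) into p's image cell. Masked-out entries never need computing."""
    zz, yy, xx = torch.meshgrid(
        torch.arange(grid), torch.arange(grid), torch.arange(grid),
        indexing="ij")
    # Which image cell does each voxel fall into after projection?
    px = xx.flatten() * img // grid
    py = yy.flatten() * img // grid
    voxel_pixel = py * img + px                      # (grid^3,)
    pixels = torch.arange(img * img).unsqueeze(1)    # (img^2, 1)
    return pixels == voxel_pixel.unsqueeze(0)        # (img^2, grid^3)

mask = ray_attention_mask()
print(mask.shape, mask.float().mean().item())  # only 1/16 of entries kept
```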
arXiv Detail & Related papers (2022-03-24T18:49:12Z)
- Voxel-based 3D Detection and Reconstruction of Multiple Objects from a
Single Image [22.037472446683765]
We learn a regular grid of 3D voxel features from the input image which is aligned with 3D scene space via a 3D feature lifting operator.
Based on the 3D voxel features, our novel CenterNet-3D detection head formulates the 3D detection as keypoint detection in the 3D space.
We devise an efficient coarse-to-fine reconstruction module, including coarse-level voxelization and a novel local PCA-SDF shape representation.
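Casting detection as keypoint detection (the CenterNet-3D head above) means predicting a dense heatmap of object-center scores and reading off its local maxima. A minimal 3D version of the standard max-pool peak extraction follows; it is assumed here to carry over from 2D CenterNet and is not the paper's code.

```python
import torch
import torch.nn.functional as F

def topk_3d_centers(heatmap, k=5):
    """heatmap: (1, 1, D, H, W) predicted center-ness scores in [0, 1].
    Keep only local maxima (3x3x3 max-pool NMS), then take the top-k."""
    pooled = F.max_pool3d(heatmap, kernel_size=3, stride=1, padding=1)
    peaks = heatmap * (pooled == heatmap)            # suppress non-maxima
    scores, idx = peaks.flatten().topk(k)
    D, H, W = heatmap.shape[2:]
    z, rem = idx // (H * W), idx % (H * W)
    return scores, torch.stack([z, rem // W, rem % W], dim=1)  # (k, 3) zyx

hm = torch.rand(1, 1, 16, 16, 16)
scores, centers = topk_3d_centers(hm)
print(centers)  # voxel coordinates of the k strongest detected centers
```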
arXiv Detail & Related papers (2021-11-04T18:30:37Z)
- 3D-to-2D Distillation for Indoor Scene Parsing [78.36781565047656]
We present a new approach that leverages 3D features extracted from a large-scale 3D data repository to enhance 2D features extracted from RGB images.
First, we distill 3D knowledge from a pretrained 3D network to supervise a 2D network to learn simulated 3D features from 2D features during training.
Second, we design a two-stage dimension normalization scheme to calibrate the 2D and 3D features for better integration.
Third, we design a semantic-aware adversarial training model to extend our framework for training with unpaired 3D data.
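The first step, distilling 3D knowledge into a 2D network, can be read as a feature-matching objective: a frozen 3D teacher provides target features and the 2D network learns to simulate them. Below is a hedged sketch with placeholder linear networks; the paper's dimension normalization scheme and adversarial extension are omitted.

```python
import torch
import torch.nn as nn

# Frozen, pretrained 3D teacher (stand-in) and trainable 2D student head.
teacher_3d = nn.Linear(64, 128).requires_grad_(False)  # placeholder teacher
student_2d = nn.Sequential(nn.Linear(32, 128))         # simulates 3D feats

feats_2d = torch.randn(100, 32)   # per-pixel 2D features (from RGB branch)
feats_3d = torch.randn(100, 64)   # paired 3D features (from a 3D network)

with torch.no_grad():
    target = teacher_3d(feats_3d)         # 3D knowledge to be distilled
simulated = student_2d(feats_2d)          # 2D network's simulated 3D feats
distill_loss = nn.functional.mse_loss(simulated, target)
distill_loss.backward()                   # gradients reach only the student
print(distill_loss.item())
```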
arXiv Detail & Related papers (2021-04-06T02:22:24Z)
- Pix2Vox++: Multi-scale Context-aware 3D Object Reconstruction from
Single and Multiple Images [56.652027072552606]
We propose a novel framework for single-view and multi-view 3D object reconstruction, named Pix2Vox++.
By using a well-designed encoder-decoder, it generates a coarse 3D volume from each input image.
A multi-scale context-aware fusion module is then introduced to adaptively select high-quality reconstructions for different parts from all coarse 3D volumes to obtain a fused 3D volume.
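The fusion module's role, selecting the best-reconstructed parts from each view's coarse volume, can be approximated by scoring every voxel of every coarse volume and taking a softmax-weighted sum across views. The stand-in below is illustrative only; Pix2Vox++'s actual module is context-aware and multi-scale.

```python
import torch
import torch.nn as nn

class NaiveFusion(nn.Module):
    """Fuse V coarse volumes (B, V, 1, R, R, R) by per-voxel soft selection."""
    def __init__(self):
        super().__init__()
        # Tiny 3D scorer; a placeholder for the real fusion module.
        self.score = nn.Conv3d(1, 1, kernel_size=3, padding=1)

    def forward(self, volumes):
        B, V = volumes.shape[:2]
        scores = self.score(volumes.flatten(0, 1))    # score every voxel
        scores = scores.view(B, V, *scores.shape[1:])
        weights = scores.softmax(dim=1)               # compete across views
        return (weights * volumes).sum(dim=1)         # (B, 1, R, R, R)

coarse = torch.rand(2, 3, 1, 32, 32, 32)  # 3 views' coarse reconstructions
print(NaiveFusion()(coarse).shape)        # torch.Size([2, 1, 32, 32, 32])
```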
arXiv Detail & Related papers (2020-06-22T13:48:09Z)
- Implicit Functions in Feature Space for 3D Shape Reconstruction and
Completion [53.885984328273686]
Implicit Feature Networks (IF-Nets) deliver continuous outputs, can handle multiple topologies, and complete shapes for missing or sparse input data.
IF-Nets clearly outperform prior work in 3D object reconstruction on ShapeNet, and obtain significantly more accurate 3D human reconstructions.
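What makes implicit functions attractive here is that the shape is a function f(point, features) -> occupancy that can be queried at arbitrary continuous 3D locations, so output resolution and topology are not fixed by the network. Below is a generic implicit-decoder sketch; it is not the IF-Net architecture, which conditions on multi-scale grid features.

```python
import torch
import torch.nn as nn

class ImplicitDecoder(nn.Module):
    """Occupancy at continuous 3D points, conditioned on a shape code."""
    def __init__(self, code_dim=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + code_dim, 256), nn.ReLU(),
            nn.Linear(256, 1), nn.Sigmoid())

    def forward(self, points, code):       # (B, N, 3), (B, code_dim)
        code = code.unsqueeze(1).expand(-1, points.size(1), -1)
        return self.mlp(torch.cat([points, code], dim=-1)).squeeze(-1)

dec = ImplicitDecoder()
pts = torch.rand(1, 4096, 3) * 2 - 1   # query any points in [-1, 1]^3
occ = dec(pts, torch.randn(1, 128))    # (1, 4096) occupancy in (0, 1)
print(occ.shape)  # resolution is chosen at query time, not baked in
```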
arXiv Detail & Related papers (2020-03-03T11:14:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.