Efficient 3D Object Reconstruction using Visual Transformers
- URL: http://arxiv.org/abs/2302.08474v1
- Date: Thu, 16 Feb 2023 18:33:25 GMT
- Title: Efficient 3D Object Reconstruction using Visual Transformers
- Authors: Rohan Agarwal, Wei Zhou, Xiaofeng Wu, Yuhan Li
- Abstract summary: We set out to use visual transformers in place of convolutions in 3D object reconstruction.
Using a transformer-based encoder and decoder to predict 3D structure from 2D images, we achieve accuracy similar to or better than the baseline approach.
- Score: 4.670344336401625
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reconstructing a 3D object from a 2D image is a well-researched vision problem, and many kinds of deep learning techniques have been tried. Most commonly, 3D convolutional approaches are used, though previous work has shown state-of-the-art methods using 2D convolutions that are also significantly more efficient to train. Given the recent rise of transformers for vision tasks, where they often outperform convolutional methods, along with some earlier attempts to use transformers for 3D object reconstruction, we set out to use visual transformers in place of convolutions in existing efficient, high-performing techniques for 3D object reconstruction in order to achieve superior results on the task. Using a transformer-based encoder and decoder to predict 3D structure from 2D images, we achieve accuracy similar to or better than the baseline approach. This study serves as evidence for the potential of visual transformers in the task of 3D object reconstruction.
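The abstract gives no implementation details, but the architecture it describes is straightforward to picture. Below is a minimal PyTorch sketch of a transformer-based encoder and decoder that predicts a voxel occupancy grid from a single 2D image; the layer sizes, the 32^3 grid, and the block-query layout are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class VoxelReconstructor(nn.Module):
    """Sketch (not the paper's code): ViT-style encoder over image patches,
    transformer decoder over learned voxel-block queries."""
    def __init__(self, img_size=224, patch=16, dim=256, vox=32, block=4):
        super().__init__()
        n_patches = (img_size // patch) ** 2          # 196 image tokens
        self.n_queries = (vox // block) ** 3          # 512 voxel-block queries
        self.patch_embed = nn.Conv2d(3, dim, patch, stride=patch)
        self.pos = nn.Parameter(torch.randn(1, n_patches, dim) * 0.02)
        self.queries = nn.Parameter(torch.randn(1, self.n_queries, dim) * 0.02)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), 4)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True), 4)
        self.head = nn.Linear(dim, block ** 3)        # occupancy logits per block
        self.vox, self.block = vox, block

    def forward(self, img):                           # img: (B, 3, 224, 224)
        x = self.patch_embed(img).flatten(2).transpose(1, 2) + self.pos
        mem = self.encoder(x)                         # (B, 196, dim)
        q = self.queries.expand(img.size(0), -1, -1)
        logits = self.head(self.decoder(q, mem))      # (B, 512, 64)
        b, p = self.vox // self.block, self.block
        logits = logits.view(-1, b, b, b, p, p, p)    # blocks x within-block
        logits = logits.permute(0, 1, 4, 2, 5, 3, 6).reshape(-1, self.vox,
                                                             self.vox, self.vox)
        return logits                                 # voxel occupancy logits

model = VoxelReconstructor()
voxels = torch.sigmoid(model(torch.randn(1, 3, 224, 224)))  # (1, 32, 32, 32)
```

Training such a model would typically minimize a binary cross-entropy between the predicted occupancies and ground-truth voxelizations, the usual objective for voxel-based reconstruction.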
Related papers
- DiffTF++: 3D-aware Diffusion Transformer for Large-Vocabulary 3D Generation [53.20147419879056]
We introduce a diffusion-based feed-forward framework that addresses the challenges of large-vocabulary 3D generation with a single model.
Building upon the 3D-aware Diffusion model with TransFormer (DiffTF), we propose a stronger version for 3D generation, DiffTF++.
Experiments on ShapeNet and OmniObject3D convincingly demonstrate the effectiveness of our proposed modules.
arXiv Detail & Related papers (2024-05-13T17:59:51Z)
- DIG3D: Marrying Gaussian Splatting with Deformable Transformer for Single Image 3D Reconstruction [12.408610403423559]
We propose a novel approach called DIG3D for 3D object reconstruction and novel view synthesis.
Our method uses an encoder-decoder framework in which the decoder generates 3D Gaussians, guided by depth-aware image features from the encoder.
We evaluate our method on the ShapeNet SRN dataset, achieving PSNRs of 24.21 and 24.98 on the car and chair datasets, respectively.
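The summary only names the mechanism; as a loose illustration of what "generating 3D Gaussians in the decoder" can mean, the sketch below maps decoder output tokens to the standard 3D Gaussian splatting parameter set (center, scale, rotation, opacity, color). The head layout and activations are our assumptions, not DIG3D's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianHead(nn.Module):
    """Map each decoder token to the parameters of one 3D Gaussian.
    Generic Gaussian-splatting prediction heads; DIG3D's exact
    parameterization may differ."""
    def __init__(self, dim=256):
        super().__init__()
        self.center = nn.Linear(dim, 3)    # xyz position
        self.scale = nn.Linear(dim, 3)     # per-axis extent (log-space)
        self.rot = nn.Linear(dim, 4)       # rotation quaternion
        self.opacity = nn.Linear(dim, 1)
        self.color = nn.Linear(dim, 3)     # RGB

    def forward(self, tokens):            # tokens: (B, N, dim)
        return {
            "center": torch.tanh(self.center(tokens)),     # in [-1, 1]^3
            "scale": torch.exp(self.scale(tokens)),        # strictly positive
            "rot": F.normalize(self.rot(tokens), dim=-1),  # unit quaternion
            "opacity": torch.sigmoid(self.opacity(tokens)),
            "color": torch.sigmoid(self.color(tokens)),
        }

head = GaussianHead()
gaussians = head(torch.randn(2, 1024, 256))   # 1024 Gaussians per image
```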
arXiv Detail & Related papers (2024-04-25T04:18:59Z)
- Any2Point: Empowering Any-modality Large Models for Efficient 3D Understanding [83.63231467746598]
We introduce Any2Point, a parameter-efficient method to empower any-modality large models (vision, language, audio) for 3D understanding.
We propose a 3D-to-any (1D or 2D) virtual projection strategy that maps input 3D points to the original 1D or 2D positions within the source modality.
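To make the virtual projection concrete, here is a hedged sketch of the 3D-to-2D case: each 3D point is orthographically projected onto a virtual image plane, and the resulting grid cell indexes the 2D source model's positional-embedding table. The single orthographic view and the grid size are simplifications; the paper aggregates multiple projection directions.

```python
import torch

def project_to_2d_positions(points, grid=14):
    """Assign each 3D point a 2D grid index via a virtual orthographic
    projection, so it can reuse a 2D model's positional embeddings.
    Simplified sketch of Any2Point's idea, using one projection view.
    points: (N, 3) in [-1, 1]^3  ->  (N,) indices into a grid*grid table."""
    uv = (points[:, :2] + 1.0) / 2.0                # drop z, map to [0, 1]^2
    ij = (uv * grid).long().clamp(0, grid - 1)      # discretize to grid cells
    return ij[:, 0] * grid + ij[:, 1]               # flat token index

# Hypothetical pretrained 2D ViT positional table, reused for 3D points.
pos_table = torch.randn(14 * 14, 768)               # e.g. ViT-B patch positions
pts = torch.rand(2048, 3) * 2 - 1                   # random points in a cube
pos_for_points = pos_table[project_to_2d_positions(pts)]  # (2048, 768)
```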
arXiv Detail & Related papers (2024-04-11T17:59:45Z)
- IM-3D: Iterative Multiview Diffusion and Reconstruction for High-Quality 3D Generation [96.32684334038278]
In this paper, we explore the design space of text-to-3D models.
We significantly improve multi-view generation by using video generators instead of image generators.
Our new method, IM-3D, reduces the number of evaluations of the 2D generator network by 10-100x.
arXiv Detail & Related papers (2024-02-13T18:59:51Z)
- MobileBrick: Building LEGO for 3D Reconstruction on Mobile Devices [78.20154723650333]
High-quality 3D ground-truth shapes are critical for 3D object reconstruction evaluation.
We introduce a novel multi-view RGBD dataset captured using a mobile device.
We obtain precise 3D ground-truth shapes without relying on high-end 3D scanners.
arXiv Detail & Related papers (2023-03-03T14:02:50Z)
- Mask3D: Pre-training 2D Vision Transformers by Learning Masked 3D Priors [29.419069066603438]
We propose Mask3D to leverage existing large-scale RGB-D data in self-supervised pre-training that embeds 3D priors into 2D learned feature representations.
We demonstrate that Mask3D is particularly effective at embedding 3D priors into a powerful 2D ViT backbone, enabling improved representation learning for various scene understanding tasks.
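As a rough sketch of this style of pre-training (our simplified construction in the spirit of the summary, not Mask3D's exact recipe): zero out most RGB patches and regress the depth of the masked regions, forcing the 2D backbone to encode 3D structure.

```python
import torch
import torch.nn as nn

def mask3d_style_step(encoder, rgb, depth, mask_ratio=0.75, patch=16):
    """One masked 3D-prior pre-training step (details are assumptions):
    hide most RGB patches, predict per-patch depth at the hidden positions.
    encoder: module mapping (B, 3, H, W) -> (B, n_patches, patch*patch)."""
    B, _, H, W = rgb.shape
    hp, wp = H // patch, W // patch
    n = hp * wp
    mask = torch.rand(B, n, device=rgb.device) < mask_ratio
    # Zero out the masked patches in the input image.
    mask_img = mask.view(B, 1, hp, wp).float()
    mask_img = mask_img.repeat_interleave(patch, 2).repeat_interleave(patch, 3)
    masked_rgb = rgb * (1.0 - mask_img)
    # Per-patch ground-truth depth targets.
    target = depth.unfold(2, patch, patch).unfold(3, patch, patch)
    target = target.reshape(B, n, patch * patch)
    pred = encoder(masked_rgb)                       # per-patch depth predictions
    return ((pred - target) ** 2).mean(-1)[mask].mean()

# Toy stand-in for a ViT backbone plus a light depth head.
conv = nn.Conv2d(3, 16 * 16, kernel_size=16, stride=16)
encoder = lambda x: conv(x).flatten(2).transpose(1, 2)   # (B, 196, 256)
rgb, depth = torch.randn(2, 3, 224, 224), torch.rand(2, 1, 224, 224)
loss = mask3d_style_step(encoder, rgb, depth)
loss.backward()
```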
arXiv Detail & Related papers (2023-02-28T16:45:21Z)
- 3D Vision with Transformers: A Survey [114.86385193388439]
The success of the transformer architecture in natural language processing has attracted attention in the computer vision field.
We present a systematic and thorough review of more than 100 transformer-based methods for different 3D vision tasks.
We discuss transformer designs in 3D vision that allow these methods to process data with various 3D representations.
arXiv Detail & Related papers (2022-08-08T17:59:11Z)
- 3D-RETR: End-to-End Single and Multi-View 3D Reconstruction with Transformers [12.238921770499912]
3D-RETR performs end-to-end 3D REconstruction with TRansformers.
It first uses a pretrained Transformer to extract visual features from 2D input images, and a Transformer decoder converts these into voxel features.
A CNN decoder then takes the voxel features as input to obtain the reconstructed objects.
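The encoder stage resembles the sketch after the main abstract; as a hedged sketch of the last stage (a typical design guess, not 3D-RETR's actual layers), a small 3D CNN decoder can upsample coarse voxel features into the final occupancy grid.

```python
import torch
import torch.nn as nn

# Illustrative CNN decoder over voxel features (sizes are assumptions):
# upsample a coarse 4^3 feature grid to 32^3 occupancy logits with
# transposed 3D convolutions.
decoder = nn.Sequential(
    nn.ConvTranspose3d(256, 128, 4, stride=2, padding=1), nn.ReLU(),  # 4 -> 8
    nn.ConvTranspose3d(128, 64, 4, stride=2, padding=1), nn.ReLU(),   # 8 -> 16
    nn.ConvTranspose3d(64, 1, 4, stride=2, padding=1),                # 16 -> 32
)
voxel_feats = torch.randn(1, 256, 4, 4, 4)       # from the transformer decoder
occupancy = torch.sigmoid(decoder(voxel_feats))  # (1, 1, 32, 32, 32)
```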
arXiv Detail & Related papers (2021-10-17T16:19:15Z)
- Lightweight Multi-View 3D Pose Estimation through Camera-Disentangled Representation [57.11299763566534]
We present a solution to recover 3D pose from multi-view images captured with spatially calibrated cameras.
We exploit 3D geometry to fuse input images into a unified latent representation of pose, which is disentangled from camera view-points.
Our architecture then conditions the learned representation on camera projection operators to produce accurate per-view 2D detections.
arXiv Detail & Related papers (2020-04-05T12:52:29Z)
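As a toy illustration of conditioning a view-disentangled pose latent on camera projection operators (our construction, not the paper's architecture): fuse per-view features into one latent, decode 3D joints in a shared world frame, then project them with each camera matrix to obtain per-view 2D detections.

```python
import torch
import torch.nn as nn

class MultiViewPose(nn.Module):
    """Toy version of the camera-disentangled idea: view-pooled pose latent,
    3D joint decoding, per-view reprojection. Sizes are assumptions."""
    def __init__(self, feat_dim=512, n_joints=17):
        super().__init__()
        self.backbone = nn.Sequential(               # stand-in image encoder
            nn.Conv2d(3, 64, 7, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, feat_dim))
        self.to_joints = nn.Linear(feat_dim, n_joints * 3)
        self.n_joints = n_joints

    def forward(self, views, cams):
        # views: (B, V, 3, H, W); cams: (B, V, 3, 4) projection matrices
        B, V = views.shape[:2]
        feats = self.backbone(views.flatten(0, 1)).view(B, V, -1)
        latent = feats.mean(dim=1)                   # view-pooled pose latent
        joints = self.to_joints(latent).view(B, self.n_joints, 3)
        homo = torch.cat([joints, torch.ones(B, self.n_joints, 1,
                                             device=joints.device)], dim=-1)
        # Condition on each camera's projection operator: x2d ~ P @ X_homo.
        proj = torch.einsum("bvij,bnj->bvni", cams, homo)  # (B, V, N, 3)
        return proj[..., :2] / proj[..., 2:].clamp(min=1e-6), joints

model = MultiViewPose()
views = torch.randn(1, 4, 3, 256, 256)               # 4 calibrated views
cams = torch.randn(1, 4, 3, 4)                       # per-view P = K[R|t]
pose2d, pose3d = model(views, cams)                  # (1,4,17,2), (1,17,3)
```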
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.