Efficient 3D Object Reconstruction using Visual Transformers
- URL: http://arxiv.org/abs/2302.08474v1
- Date: Thu, 16 Feb 2023 18:33:25 GMT
- Title: Efficient 3D Object Reconstruction using Visual Transformers
- Authors: Rohan Agarwal, Wei Zhou, Xiaofeng Wu, Yuhan Li
- Abstract summary: We set out to use visual transformers in place of convolutions in 3D object reconstruction.
Using a transformer-based encoder and decoder to predict 3D structure from 2D images, we achieve accuracy similar to or better than the baseline approach.
- Score: 4.670344336401625
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reconstructing a 3D object from a 2D image is a well-researched vision problem, and many kinds of deep learning techniques have been tried. Most commonly, 3D convolutional approaches are used, though previous work has shown state-of-the-art methods using 2D convolutions that are also significantly more efficient to train. Given the recent rise of transformers for vision tasks, where they often outperform convolutional methods, along with some earlier attempts to use transformers for 3D object reconstruction, we set out to use visual transformers in place of convolutions in existing efficient, high-performing techniques for 3D object reconstruction in order to achieve superior results on the task. Using a transformer-based encoder and decoder to predict 3D structure from 2D images, we achieve accuracy similar to or better than the baseline approach. This study serves as evidence for the potential of visual transformers in the task of 3D object reconstruction.
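The abstract gives no implementation details, but the architecture it describes is straightforward to picture. Below is a minimal PyTorch sketch of a transformer-based encoder and decoder that predicts a voxel occupancy grid from a single 2D image; the layer sizes, the 32^3 grid, and the block-query layout are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class VoxelReconstructor(nn.Module):
    """Sketch (not the paper's code): ViT-style encoder over image patches,
    transformer decoder over learned voxel-block queries."""
    def __init__(self, img_size=224, patch=16, dim=256, vox=32, block=4):
        super().__init__()
        n_patches = (img_size // patch) ** 2          # 196 image tokens
        self.n_queries = (vox // block) ** 3          # 512 voxel-block queries
        self.patch_embed = nn.Conv2d(3, dim, patch, stride=patch)
        self.pos = nn.Parameter(torch.randn(1, n_patches, dim) * 0.02)
        self.queries = nn.Parameter(torch.randn(1, self.n_queries, dim) * 0.02)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), 4)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True), 4)
        self.head = nn.Linear(dim, block ** 3)        # occupancy logits per block
        self.vox, self.block = vox, block

    def forward(self, img):                           # img: (B, 3, 224, 224)
        x = self.patch_embed(img).flatten(2).transpose(1, 2) + self.pos
        mem = self.encoder(x)                         # (B, 196, dim)
        q = self.queries.expand(img.size(0), -1, -1)
        logits = self.head(self.decoder(q, mem))      # (B, 512, 64)
        b, p = self.vox // self.block, self.block
        logits = logits.view(-1, b, b, b, p, p, p)    # blocks x within-block
        logits = logits.permute(0, 1, 4, 2, 5, 3, 6).reshape(-1, self.vox,
                                                             self.vox, self.vox)
        return logits                                 # voxel occupancy logits

model = VoxelReconstructor()
voxels = torch.sigmoid(model(torch.randn(1, 3, 224, 224)))  # (1, 32, 32, 32)
```

Training such a model would typically minimize a binary cross-entropy between the predicted occupancies and ground-truth voxelizations, the usual objective for voxel-based reconstruction.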
Related papers
- DiffTF++: 3D-aware Diffusion Transformer for Large-Vocabulary 3D Generation [53.20147419879056]
We introduce a diffusion-based feed-forward framework that addresses the challenges of large-vocabulary 3D generation with a single model.
Building upon the 3D-aware Diffusion model with TransFormer (DiffTF), we propose a stronger version for 3D generation, DiffTF++.
Experiments on ShapeNet and OmniObject3D convincingly demonstrate the effectiveness of our proposed modules.
arXiv Detail & Related papers (2024-05-13T17:59:51Z)
- DIG3D: Marrying Gaussian Splatting with Deformable Transformer for Single Image 3D Reconstruction [12.408610403423559]
We propose a novel approach called DIG3D for 3D object reconstruction and novel view synthesis.
Our method uses an encoder-decoder framework in which the decoder generates 3D Gaussians, guided by depth-aware image features from the encoder.
We evaluate our method on the ShapeNet SRN dataset, achieving PSNRs of 24.21 and 24.98 on the car and chair datasets, respectively.
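The summary only names the mechanism; as a loose illustration of what "generating 3D Gaussians in the decoder" can mean, the sketch below maps decoder output tokens to the standard 3D Gaussian splatting parameter set (center, scale, rotation, opacity, color). The head layout and activations are our assumptions, not DIG3D's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianHead(nn.Module):
    """Map each decoder token to the parameters of one 3D Gaussian.
    Generic Gaussian-splatting prediction heads; DIG3D's exact
    parameterization may differ."""
    def __init__(self, dim=256):
        super().__init__()
        self.center = nn.Linear(dim, 3)    # xyz position
        self.scale = nn.Linear(dim, 3)     # per-axis extent (log-space)
        self.rot = nn.Linear(dim, 4)       # rotation quaternion
        self.opacity = nn.Linear(dim, 1)
        self.color = nn.Linear(dim, 3)     # RGB

    def forward(self, tokens):            # tokens: (B, N, dim)
        return {
            "center": torch.tanh(self.center(tokens)),     # in [-1, 1]^3
            "scale": torch.exp(self.scale(tokens)),        # strictly positive
            "rot": F.normalize(self.rot(tokens), dim=-1),  # unit quaternion
            "opacity": torch.sigmoid(self.opacity(tokens)),
            "color": torch.sigmoid(self.color(tokens)),
        }

head = GaussianHead()
gaussians = head(torch.randn(2, 1024, 256))   # 1024 Gaussians per image
```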
arXiv Detail & Related papers (2024-04-25T04:18:59Z)
- Any2Point: Empowering Any-modality Large Models for Efficient 3D Understanding [83.63231467746598]
We introduce Any2Point, a parameter-efficient method to empower any-modality large models (vision, language, audio) for 3D understanding.
We propose a 3D-to-any (1D or 2D) virtual projection strategy that maps input 3D points to the original 1D or 2D positions within the source modality.
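To make the virtual projection concrete, here is a hedged sketch of the 3D-to-2D case: each 3D point is orthographically projected onto a virtual image plane, and the resulting grid cell indexes the 2D source model's positional-embedding table. The single orthographic view and the grid size are simplifications; the paper aggregates multiple projection directions.

```python
import torch

def project_to_2d_positions(points, grid=14):
    """Assign each 3D point a 2D grid index via a virtual orthographic
    projection, so it can reuse a 2D model's positional embeddings.
    Simplified sketch of Any2Point's idea, using one projection view.
    points: (N, 3) in [-1, 1]^3  ->  (N,) indices into a grid*grid table."""
    uv = (points[:, :2] + 1.0) / 2.0                # drop z, map to [0, 1]^2
    ij = (uv * grid).long().clamp(0, grid - 1)      # discretize to grid cells
    return ij[:, 0] * grid + ij[:, 1]               # flat token index

# Hypothetical pretrained 2D ViT positional table, reused for 3D points.
pos_table = torch.randn(14 * 14, 768)               # e.g. ViT-B patch positions
pts = torch.rand(2048, 3) * 2 - 1                   # random points in a cube
pos_for_points = pos_table[project_to_2d_positions(pts)]  # (2048, 768)
```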
arXiv Detail & Related papers (2024-04-11T17:59:45Z)
- IM-3D: Iterative Multiview Diffusion and Reconstruction for High-Quality 3D Generation [96.32684334038278]
In this paper, we explore the design space of text-to-3D models.
We significantly improve multi-view generation by using video generators instead of image generators.
Our new method, IM-3D, reduces the number of evaluations of the 2D generator network by 10-100x.
arXiv Detail & Related papers (2024-02-13T18:59:51Z)
- MobileBrick: Building LEGO for 3D Reconstruction on Mobile Devices [78.20154723650333]
High-quality 3D ground-truth shapes are critical for 3D object reconstruction evaluation.
We introduce a novel multi-view RGBD dataset captured using a mobile device.
We obtain precise 3D ground-truth shapes without relying on high-end 3D scanners.
arXiv Detail & Related papers (2023-03-03T14:02:50Z)
- Mask3D: Pre-training 2D Vision Transformers by Learning Masked 3D Priors [29.419069066603438]
We propose Mask3D to leverage existing large-scale RGB-D data in self-supervised pre-training that embeds 3D priors into 2D learned feature representations.
We demonstrate that Mask3D is particularly effective at embedding 3D priors into a powerful 2D ViT backbone, enabling improved representation learning for various scene understanding tasks.
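As a rough sketch of this style of pre-training (our simplified construction in the spirit of the summary, not Mask3D's exact recipe): zero out most RGB patches and regress the depth of the masked regions, forcing the 2D backbone to encode 3D structure.

```python
import torch
import torch.nn as nn

def mask3d_style_step(encoder, rgb, depth, mask_ratio=0.75, patch=16):
    """One masked 3D-prior pre-training step (details are assumptions):
    hide most RGB patches, predict per-patch depth at the hidden positions.
    encoder: module mapping (B, 3, H, W) -> (B, n_patches, patch*patch)."""
    B, _, H, W = rgb.shape
    hp, wp = H // patch, W // patch
    n = hp * wp
    mask = torch.rand(B, n, device=rgb.device) < mask_ratio
    # Zero out the masked patches in the input image.
    mask_img = mask.view(B, 1, hp, wp).float()
    mask_img = mask_img.repeat_interleave(patch, 2).repeat_interleave(patch, 3)
    masked_rgb = rgb * (1.0 - mask_img)
    # Per-patch ground-truth depth targets.
    target = depth.unfold(2, patch, patch).unfold(3, patch, patch)
    target = target.reshape(B, n, patch * patch)
    pred = encoder(masked_rgb)                       # per-patch depth predictions
    return ((pred - target) ** 2).mean(-1)[mask].mean()

# Toy stand-in for a ViT backbone plus a light depth head.
conv = nn.Conv2d(3, 16 * 16, kernel_size=16, stride=16)
encoder = lambda x: conv(x).flatten(2).transpose(1, 2)   # (B, 196, 256)
rgb, depth = torch.randn(2, 3, 224, 224), torch.rand(2, 1, 224, 224)
loss = mask3d_style_step(encoder, rgb, depth)
loss.backward()
```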
arXiv Detail & Related papers (2023-02-28T16:45:21Z)
- 3D Vision with Transformers: A Survey [114.86385193388439]
The success of the transformer architecture in natural language processing has attracted attention in the computer vision field.
We present a systematic and thorough review of more than 100 transformer-based methods for different 3D vision tasks.
We discuss transformer designs in 3D vision that allow these methods to process data with various 3D representations.
arXiv Detail & Related papers (2022-08-08T17:59:11Z)
- 3D-RETR: End-to-End Single and Multi-View 3D Reconstruction with Transformers [12.238921770499912]
3D-RETR performs end-to-end 3D REconstruction with TRansformers.
It first uses a pretrained Transformer to extract visual features from 2D input images, and a Transformer decoder converts these into voxel features.
A CNN decoder then takes the voxel features as input to obtain the reconstructed objects.
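The encoder stage resembles the sketch after the main abstract; as a hedged sketch of the last stage (a typical design guess, not 3D-RETR's actual layers), a small 3D CNN decoder can upsample coarse voxel features into the final occupancy grid.

```python
import torch
import torch.nn as nn

# Illustrative CNN decoder over voxel features (sizes are assumptions):
# upsample a coarse 4^3 feature grid to 32^3 occupancy logits with
# transposed 3D convolutions.
decoder = nn.Sequential(
    nn.ConvTranspose3d(256, 128, 4, stride=2, padding=1), nn.ReLU(),  # 4 -> 8
    nn.ConvTranspose3d(128, 64, 4, stride=2, padding=1), nn.ReLU(),   # 8 -> 16
    nn.ConvTranspose3d(64, 1, 4, stride=2, padding=1),                # 16 -> 32
)
voxel_feats = torch.randn(1, 256, 4, 4, 4)       # from the transformer decoder
occupancy = torch.sigmoid(decoder(voxel_feats))  # (1, 1, 32, 32, 32)
```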
arXiv Detail & Related papers (2021-10-17T16:19:15Z)
- Lightweight Multi-View 3D Pose Estimation through Camera-Disentangled Representation [57.11299763566534]
We present a solution to recover 3D pose from multi-view images captured with spatially calibrated cameras.
We exploit 3D geometry to fuse input images into a unified latent representation of pose, which is disentangled from camera view-points.
Our architecture then conditions the learned representation on camera projection operators to produce accurate per-view 2D detections.
arXiv Detail & Related papers (2020-04-05T12:52:29Z)
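As a toy illustration of conditioning a view-disentangled pose latent on camera projection operators (our construction, not the paper's architecture): fuse per-view features into one latent, decode 3D joints in a shared world frame, then project them with each camera matrix to obtain per-view 2D detections.

```python
import torch
import torch.nn as nn

class MultiViewPose(nn.Module):
    """Toy version of the camera-disentangled idea: view-pooled pose latent,
    3D joint decoding, per-view reprojection. Sizes are assumptions."""
    def __init__(self, feat_dim=512, n_joints=17):
        super().__init__()
        self.backbone = nn.Sequential(               # stand-in image encoder
            nn.Conv2d(3, 64, 7, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, feat_dim))
        self.to_joints = nn.Linear(feat_dim, n_joints * 3)
        self.n_joints = n_joints

    def forward(self, views, cams):
        # views: (B, V, 3, H, W); cams: (B, V, 3, 4) projection matrices
        B, V = views.shape[:2]
        feats = self.backbone(views.flatten(0, 1)).view(B, V, -1)
        latent = feats.mean(dim=1)                   # view-pooled pose latent
        joints = self.to_joints(latent).view(B, self.n_joints, 3)
        homo = torch.cat([joints, torch.ones(B, self.n_joints, 1,
                                             device=joints.device)], dim=-1)
        # Condition on each camera's projection operator: x2d ~ P @ X_homo.
        proj = torch.einsum("bvij,bnj->bvni", cams, homo)  # (B, V, N, 3)
        return proj[..., :2] / proj[..., 2:].clamp(min=1e-6), joints

model = MultiViewPose()
views = torch.randn(1, 4, 3, 256, 256)               # 4 calibrated views
cams = torch.randn(1, 4, 3, 4)                       # per-view P = K[R|t]
pose2d, pose3d = model(views, cams)                  # (1,4,17,2), (1,17,3)
```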
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.