TransformerFusion: Monocular RGB Scene Reconstruction using Transformers
- URL: http://arxiv.org/abs/2107.02191v1
- Date: Mon, 5 Jul 2021 18:00:11 GMT
- Title: TransformerFusion: Monocular RGB Scene Reconstruction using Transformers
- Authors: Aljaž Božič, Pablo Palafox, Justus Thies, Angela Dai, Matthias Nießner
- Abstract summary: TransformerFusion is a transformer-based 3D scene reconstruction approach.
The network learns to attend to the most relevant image frames for each 3D location in the scene.
Features are fused in a coarse-to-fine fashion, storing fine-level features only where needed.
- Score: 26.87200488085741
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce TransformerFusion, a transformer-based 3D scene reconstruction
approach. From an input monocular RGB video, the video frames are processed by
a transformer network that fuses the observations into a volumetric feature
grid representing the scene; this feature grid is then decoded into an implicit
3D scene representation. Key to our approach is the transformer architecture
that enables the network to learn to attend to the most relevant image frames
for each 3D location in the scene, supervised only by the scene reconstruction
task. Features are fused in a coarse-to-fine fashion, storing fine-level
features only where needed, requiring lower memory storage and enabling fusion
at interactive rates. The feature grid is then decoded to a higher-resolution
scene reconstruction, using an MLP-based surface occupancy prediction from
interpolated coarse-to-fine 3D features. Our approach results in an accurate
surface reconstruction, outperforming state-of-the-art multi-view stereo depth
estimation methods, fully-convolutional 3D reconstruction approaches, and
approaches using LSTM- or GRU-based recurrent networks for video sequence
fusion.
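As a rough illustration of the pipeline described in the abstract (per-location attention over frame features, followed by MLP-based occupancy decoding), here is a minimal PyTorch-style sketch. It is not the authors' implementation: the module names (VoxelViewFusion, OccupancyMLP), feature dimensions, and the use of a single learned query attending over per-frame features are assumptions made for illustration.

```python
# Minimal sketch of a TransformerFusion-style pipeline (assumed names/shapes,
# not the authors' code): attention over per-frame features at each grid
# location, followed by an MLP occupancy decoder on the fused features.
import torch
import torch.nn as nn

class VoxelViewFusion(nn.Module):
    """Fuses per-frame image features into one feature per 3D grid location
    by attending over the observing frames (hypothetical module)."""
    def __init__(self, feat_dim=64, num_heads=4):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, feat_dim))  # learned fusion query
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)

    def forward(self, frame_feats):
        # frame_feats: (num_voxels, num_frames, feat_dim), features sampled from
        # each frame's 2D feature map at the voxel's projected location.
        q = self.query.expand(frame_feats.shape[0], -1, -1)
        fused, attn_weights = self.attn(q, frame_feats, frame_feats)
        return fused.squeeze(1), attn_weights  # (num_voxels, feat_dim)

class OccupancyMLP(nn.Module):
    """Decodes an interpolated grid feature into a surface occupancy logit."""
    def __init__(self, feat_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 1),  # occupancy logit per query point
        )

    def forward(self, point_feats):
        return self.net(point_feats)

# Usage with dummy data: 1000 grid locations, each observed by 8 frames.
fusion, decoder = VoxelViewFusion(), OccupancyMLP()
frame_feats = torch.randn(1000, 8, 64)
fused, weights = fusion(frame_feats)   # attention selects relevant frames
occupancy_logits = decoder(fused)      # (1000, 1), supervised by reconstruction
```

In the actual method, fusion is performed coarse-to-fine over a sparse grid that stores fine-level features only where needed; the sketch omits that and uses a single dense feature level for brevity.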
Related papers
- HybridOcc: NeRF Enhanced Transformer-based Multi-Camera 3D Occupancy Prediction [14.000919964212857]
Vision-based 3D semantic scene completion describes autonomous driving scenes through 3D volume representations.
HybridOcc is a hybrid 3D volume query proposal method generated by a Transformer framework and a NeRF representation.
We present an innovative occupancy-aware ray sampling method to orient the SSC task instead of focusing on the scene surface.
arXiv Detail & Related papers (2024-08-17T05:50:51Z)
- Hyper-VolTran: Fast and Generalizable One-Shot Image to 3D Object Structure via HyperNetworks [53.67497327319569]
We introduce a novel neural rendering technique to solve image-to-3D from a single view.
Our approach employs the signed distance function as the surface representation and incorporates generalizable priors through geometry-encoding volumes and HyperNetworks.
Our experiments show the advantages of our proposed approach with consistent results and rapid generation.
arXiv Detail & Related papers (2023-12-24T08:42:37Z)
- Structural Multiplane Image: Bridging Neural View Synthesis and 3D Reconstruction [39.89856628467095]
We introduce the Structural MPI (S-MPI), where the plane structure approximates 3D scenes concisely.
Despite the intuition and demand of applying S-MPI, great challenges are introduced, e.g., high-fidelity approximation for both RGBA layers and plane poses.
Our method outperforms both previous state-of-the-art MPI-based view synthesis methods and planar reconstruction methods.
arXiv Detail & Related papers (2023-03-10T14:18:40Z)
- VolRecon: Volume Rendering of Signed Ray Distance Functions for Generalizable Multi-View Reconstruction [64.09702079593372]
VolRecon is a novel generalizable implicit reconstruction method with Signed Ray Distance Function (SRDF).
On the DTU dataset, VolRecon outperforms SparseNeuS by about 30% in sparse view reconstruction and achieves accuracy comparable to MVSNet in full view reconstruction.
arXiv Detail & Related papers (2022-12-15T18:59:54Z)
- High-fidelity 3D GAN Inversion by Pseudo-multi-view Optimization [51.878078860524795]
We present a high-fidelity 3D generative adversarial network (GAN) inversion framework that can synthesize photo-realistic novel views.
Our approach enables high-fidelity 3D rendering from a single image, which is promising for various applications of AI-generated 3D content.
arXiv Detail & Related papers (2022-11-28T18:59:52Z)
- Vision Transformer for NeRF-Based View Synthesis from a Single Input Image [49.956005709863355]
We propose to leverage both the global and local features to form an expressive 3D representation.
To synthesize a novel view, we train a multilayer perceptron (MLP) network conditioned on the learned 3D representation to perform volume rendering.
Our method can render novel views from only a single input image and generalize across multiple object categories using a single model.
arXiv Detail & Related papers (2022-07-12T17:52:04Z)
- VPFusion: Joint 3D Volume and Pixel-Aligned Feature Fusion for Single and Multi-view 3D Reconstruction [23.21446438011893]
VPFusion attains high-quality reconstruction using both a 3D feature volume, which captures 3D-structure-aware context, and pixel-aligned image features.
Existing approaches use RNN, feature pooling, or attention computed independently in each view for multi-view fusion.
We show improved multi-view feature fusion by establishing transformer-based pairwise view association.
arXiv Detail & Related papers (2022-03-14T23:30:58Z)
- VoRTX: Volumetric 3D Reconstruction With Transformers for Voxelwise View Selection and Fusion [68.68537312256144]
VoRTX is an end-to-end volumetric 3D reconstruction network using transformers for wide-baseline, multi-view feature fusion.
We train our model on ScanNet and show that it produces better reconstructions than state-of-the-art methods.
arXiv Detail & Related papers (2021-12-01T02:18:11Z)
- Extracting Triangular 3D Models, Materials, and Lighting From Images [59.33666140713829]
We present an efficient method for joint optimization of materials and lighting from multi-view image observations.
We leverage meshes with spatially-varying materials and environment lighting that can be deployed in any traditional graphics engine.
arXiv Detail & Related papers (2021-11-24T13:58:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.