3D-MOV: Audio-Visual LSTM Autoencoder for 3D Reconstruction of Multiple Objects from Video
- URL: http://arxiv.org/abs/2110.02404v1
- Date: Tue, 5 Oct 2021 23:23:19 GMT
- Title: 3D-MOV: Audio-Visual LSTM Autoencoder for 3D Reconstruction of Multiple Objects from Video
- Authors: Justin Wilson and Ming C. Lin
- Abstract summary: We propose a multimodal single- and multi-frame neural network for 3D reconstruction using audio-visual inputs.
Our trained reconstruction LSTM autoencoder, 3D-MOV, accepts multiple inputs to account for a variety of surface types and views.
- Score: 29.26483070179999
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: 3D object reconstruction of transparent and concave structured
objects, with inferred material properties, remains an open research problem
for robot navigation in unstructured environments. In this paper, we propose
a multimodal single- and multi-frame neural network for 3D reconstruction
using audio-visual inputs. Our trained reconstruction LSTM autoencoder,
3D-MOV, accepts multiple inputs to account for a variety of surface types
and views. Our neural
network produces high-quality 3D reconstructions using voxel representation.
Based on Intersection-over-Union (IoU), we evaluate against baseline methods
on the synthetic audio-visual datasets ShapeNet and Sound20K, which provide
impact sounds and bounding box annotations. To the best of our knowledge,
our single- and multi-frame model is the first audio-visual reconstruction
neural network for 3D geometry and material representation.
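
As a concrete illustration of the pipeline the abstract describes, the
following is a minimal PyTorch sketch, assuming 64x64 grayscale frames,
128-bin impact-sound spectrogram slices, concatenation-based audio-visual
fusion, a single-layer LSTM, and a 32^3 occupancy grid; these sizes and the
fusion strategy are illustrative choices, not the authors' exact 3D-MOV
architecture. The voxel_iou helper shows the IoU evaluation named above.

    import torch
    import torch.nn as nn

    class AVVoxelAutoencoder(nn.Module):
        """LSTM autoencoder over per-frame audio-visual features -> voxel grid."""
        def __init__(self, img_feat=256, aud_feat=128, hidden=512, vox=32):
            super().__init__()
            self.vox = vox
            # Per-frame image encoder (assumes 1x64x64 grayscale input).
            self.img_enc = nn.Sequential(
                nn.Conv2d(1, 32, 4, stride=2, padding=1), nn.ReLU(),   # 64 -> 32
                nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
                nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(), # 16 -> 8
                nn.Flatten(), nn.Linear(128 * 8 * 8, img_feat),
            )
            # Per-frame audio encoder (assumes a 128-bin spectrogram slice).
            self.aud_enc = nn.Sequential(nn.Linear(128, aud_feat), nn.ReLU())
            # Temporal fusion: LSTM over concatenated audio-visual features.
            self.lstm = nn.LSTM(img_feat + aud_feat, hidden, batch_first=True)
            # Decoder: last hidden state -> occupancy logits over a vox^3 grid.
            self.dec = nn.Linear(hidden, vox ** 3)

        def forward(self, frames, audio):
            # frames: (B, T, 1, 64, 64); audio: (B, T, 128)
            B, T = frames.shape[:2]
            f = self.img_enc(frames.reshape(B * T, 1, 64, 64)).reshape(B, T, -1)
            a = self.aud_enc(audio)
            _, (h, _) = self.lstm(torch.cat([f, a], dim=-1))
            return self.dec(h[-1]).reshape(B, self.vox, self.vox, self.vox)

    def voxel_iou(pred_logits, target, thresh=0.5):
        """IoU between a thresholded predicted grid and a binary target grid."""
        pred = torch.sigmoid(pred_logits) > thresh
        tgt = target.bool()
        union = (pred | tgt).sum().float()
        return ((pred & tgt).sum().float() / union.clamp(min=1)).item()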
Related papers
- Refine3DNet: Scaling Precision in 3D Object Reconstruction from Multi-View RGB Images using Attention [2.037112541541094]
We introduce a hybrid strategy featuring a visual auto-encoder with self-attention mechanisms and a 3D refiner network.
Our approach, combined with JTSO, outperforms state-of-the-art techniques in single- and multi-view 3D reconstruction.
arXiv Detail & Related papers (2024-12-01T08:53:39Z)
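
A hedged sketch of the attention idea in the Refine3DNet entry above: encode
each RGB view, let the views exchange information through self-attention, and
pool the result into one shape code for a downstream 3D refiner. The
dimensions and the use of nn.MultiheadAttention are illustrative assumptions,
not the paper's exact design.

    import torch
    import torch.nn as nn

    class ViewAttentionPool(nn.Module):
        """Self-attention across per-view features, pooled to one shape code."""
        def __init__(self, dim=256, heads=4):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.norm = nn.LayerNorm(dim)

        def forward(self, view_feats):  # (B, V, dim): one feature row per view
            attended, _ = self.attn(view_feats, view_feats, view_feats)
            fused = self.norm(view_feats + attended)  # residual + layer norm
            return fused.mean(dim=1)    # (B, dim) code for the 3D refiner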
- MinD-3D: Reconstruct High-quality 3D objects in Human Brain [50.534007259536715]
Recon3DMind is an innovative task aimed at reconstructing 3D visuals from Functional Magnetic Resonance Imaging (fMRI) signals.
We present the fMRI-Shape dataset, which includes data from 14 participants and features 360-degree videos of 3D objects.
We propose MinD-3D, a novel and effective three-stage framework specifically designed to decode the brain's 3D visual information from fMRI signals.
arXiv Detail & Related papers (2023-12-12T18:21:36Z)
- DMV3D: Denoising Multi-View Diffusion using 3D Large Reconstruction Model [86.37536249046943]
DMV3D is a novel 3D generation approach that uses a transformer-based 3D large reconstruction model to denoise multi-view diffusion.
Our reconstruction model incorporates a triplane NeRF representation and can denoise noisy multi-view images via NeRF reconstruction and rendering.
arXiv Detail & Related papers (2023-11-15T18:58:41Z)
- AutoDecoding Latent 3D Diffusion Models [95.7279510847827]
We present a novel approach to the generation of static and articulated 3D assets that has a 3D autodecoder at its core.
The 3D autodecoder framework embeds properties learned from the target dataset in the latent space.
We then identify the appropriate intermediate volumetric latent space, and introduce robust normalization and de-normalization operations.
arXiv Detail & Related papers (2023-07-07T17:59:14Z)
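
The "autodecoder" pattern named in the entry above, in miniature: there is no
encoder; each training object owns a learned latent code (a row of
nn.Embedding) that is optimized jointly with a shared decoder. The small
volumetric decoder and all sizes here are illustrative assumptions.

    import torch
    import torch.nn as nn

    class VolumetricAutoDecoder(nn.Module):
        """No encoder: per-object latent codes are trained with the decoder."""
        def __init__(self, num_objects, latent_dim=128, vox=16):
            super().__init__()
            self.codes = nn.Embedding(num_objects, latent_dim)  # one code per object
            self.dec = nn.Sequential(
                nn.Linear(latent_dim, 512), nn.ReLU(),
                nn.Linear(512, vox ** 3),
            )
            self.vox = vox

        def forward(self, obj_ids):  # (B,) integer object indices
            z = self.codes(obj_ids)  # latents are looked up, not inferred
            return self.dec(z).reshape(-1, self.vox, self.vox, self.vox)

    # Training updates decoder weights and per-object codes together, e.g.:
    # loss = F.binary_cross_entropy_with_logits(model(ids).flatten(1),
    #                                           target_voxels.flatten(1))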
- SeMLaPS: Real-time Semantic Mapping with Latent Prior Networks and Quasi-Planar Segmentation [53.83313235792596]
We present a new methodology for real-time semantic mapping from RGB-D sequences.
It combines 2D neural networks with a 3D network based on a SLAM system with 3D occupancy mapping.
Our system achieves state-of-the-art semantic mapping quality among 2D-3D network-based systems.
arXiv Detail & Related papers (2023-06-28T22:36:44Z)
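
One simple way to realize the 2D-to-3D fusion that the SeMLaPS summary
describes, assuming a dense voxel map and pixel-to-voxel correspondences
already supplied by the SLAM poses; the accumulation below is a generic
log-probability fusion, not the paper's latent prior networks.

    import numpy as np

    def fuse_frame(class_log_probs, voxel_ids, sem_map):
        """Accumulate one frame's 2D semantics into the 3D map.

        class_log_probs: (N, C) per-pixel log class probabilities from a 2D net
        voxel_ids:       (N,)   flat voxel index each pixel projects to
        sem_map:         (num_voxels, C) running per-voxel log-probability sums
        """
        np.add.at(sem_map, voxel_ids, class_log_probs)  # handles repeated indices
        return sem_map

    def read_labels(sem_map):
        return sem_map.argmax(axis=1)  # most likely class per voxel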
- Farm3D: Learning Articulated 3D Animals by Distilling 2D Diffusion [67.71624118802411]
We present Farm3D, a method for learning category-specific 3D reconstructors for articulated objects.
We propose a framework that uses an image generator, such as Stable Diffusion, to generate synthetic training data.
Our network can be used for analysis, including monocular reconstruction, or for synthesis, generating articulated assets for real-time applications such as video games.
arXiv Detail & Related papers (2023-04-20T17:59:34Z)
- 3D-LatentMapper: View Agnostic Single-View Reconstruction of 3D Shapes [0.0]
We propose a novel framework that leverages the intermediate latent spaces of Vision Transformer (ViT) and a joint image-text representational model, CLIP, for fast and efficient Single View Reconstruction (SVR).
We use the ShapeNetV2 dataset and perform extensive experiments with comparisons to SOTA methods to demonstrate our method's effectiveness.
arXiv Detail & Related papers (2022-12-05T11:45:26Z)
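
A sketch of the mapping idea in the 3D-LatentMapper entry: keep the image
backbone frozen (a ViT or CLIP image encoder) and train only a light MLP that
maps its embedding into the latent space of a pretrained 3D shape decoder.
The 512-dimensional image embedding and 256-dimensional shape latent are
assumptions for illustration.

    import torch
    import torch.nn as nn

    class LatentMapper(nn.Module):
        """Maps a frozen image embedding to a 3D shape decoder's latent space."""
        def __init__(self, clip_dim=512, shape_latent=256):
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(clip_dim, 512), nn.ReLU(),
                nn.Linear(512, shape_latent),
            )

        def forward(self, clip_embedding):   # (B, clip_dim) from a frozen encoder
            return self.mlp(clip_embedding)  # (B, shape_latent) for the decoder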
- Voxel-based 3D Detection and Reconstruction of Multiple Objects from a Single Image [22.037472446683765]
We learn a regular grid of 3D voxel features from the input image which is aligned with 3D scene space via a 3D feature lifting operator.
Based on the 3D voxel features, our novel CenterNet-3D detection head formulates 3D detection as keypoint detection in 3D space.
We devise an efficient coarse-to-fine reconstruction module, including coarse-level voxelization and a novel local PCA-SDF shape representation.
arXiv Detail & Related papers (2021-11-04T18:30:37Z)
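
CenterNet-style "detection as keypoints", lifted to 3D as the entry above
describes: object centers appear as peaks in a dense 3D heatmap. The local
max-pooling decoding below mirrors CenterNet's standard 2D peak extraction;
the 3D variant, threshold, and top-k value are illustrative.

    import torch
    import torch.nn.functional as F

    def heatmap_peaks_3d(heat, k=10, thresh=0.3):
        """heat: (B, 1, D, H, W) center heatmap with values in [0, 1]."""
        # A voxel is a peak iff it equals the max of its 3x3x3 neighborhood.
        local_max = F.max_pool3d(heat, kernel_size=3, stride=1, padding=1)
        peaks = heat * (heat == local_max).float() * (heat > thresh).float()
        scores, idx = peaks.flatten(1).topk(k)  # top-k candidate centers
        return scores, idx  # unravel idx to (d, h, w) coordinates as needed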
- Pix2Vox++: Multi-scale Context-aware 3D Object Reconstruction from Single and Multiple Images [56.652027072552606]
We propose a novel framework for single-view and multi-view 3D object reconstruction, named Pix2Vox++.
By using a well-designed encoder-decoder, it generates a coarse 3D volume from each input image.
A multi-scale context-aware fusion module is then introduced to adaptively select high-quality reconstructions for different parts from all coarse 3D volumes to obtain a fused 3D volume.
arXiv Detail & Related papers (2020-06-22T13:48:09Z)
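
The fusion step from the Pix2Vox++ summary, reduced to its essence under
assumptions: given one coarse volume and one quality-score volume per view,
blend per voxel with a softmax over views so each part of the object comes
from the views that reconstruct it best. How the score volumes are produced
is left abstract here.

    import torch

    def fuse_coarse_volumes(volumes, scores):
        """volumes, scores: (V, D, H, W) -- a coarse volume and score map per view."""
        weights = torch.softmax(scores, dim=0)  # per-voxel weights across views
        return (weights * volumes).sum(dim=0)   # fused (D, H, W) volume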