NeMo: Neural Mesh Models of Contrastive Features for Robust 3D Pose
Estimation
- URL: http://arxiv.org/abs/2101.12378v2
- Date: Tue, 2 Feb 2021 17:59:10 GMT
- Title: NeMo: Neural Mesh Models of Contrastive Features for Robust 3D Pose
Estimation
- Authors: Angtian Wang, Adam Kortylewski, Alan Yuille
- Abstract summary: 3D pose estimation is a challenging but important task in computer vision.
We show that standard deep learning approaches to 3D pose estimation are not robust when objects are partially occluded or viewed from a previously unseen pose.
We propose to integrate deep neural networks with 3D generative representations of objects into a unified neural architecture that we term NeMo.
- Score: 11.271053492520535
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: 3D pose estimation is a challenging but important task in computer vision. In
this work, we show that standard deep learning approaches to 3D pose estimation
are not robust when objects are partially occluded or viewed from a previously
unseen pose. Inspired by the robustness of generative vision models to partial
occlusion, we propose to integrate deep neural networks with 3D generative
representations of objects into a unified neural architecture that we term
NeMo. In particular, NeMo learns a generative model of neural feature
activations at each vertex on a dense 3D mesh. Using differentiable rendering
we estimate the 3D object pose by minimizing the reconstruction error between
NeMo and the feature representation of the target image. To avoid local optima
in the reconstruction loss, we train the feature extractor to maximize the
distance between the individual feature representations on the mesh using
contrastive learning. Our extensive experiments on PASCAL3D+,
occluded-PASCAL3D+ and ObjectNet3D show that NeMo is much more robust to
partial occlusion and unseen pose compared to standard deep networks, while
retaining competitive performance on regular data. Interestingly, our
experiments also show that NeMo performs reasonably well even when the mesh
representation only crudely approximates the true object geometry with a
cuboid, hence revealing that the detailed 3D geometry is not needed for
accurate 3D pose estimation. The code is publicly available at
https://github.com/Angtian/NeMo.
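To make the render-and-compare idea concrete, below is a deliberately simplified PyTorch sketch of the two ingredients described in the abstract: a contrastive objective that spreads the per-vertex feature vectors apart, and pose estimation by gradient descent on a feature reconstruction error. All names (so3_exp, sample_image_features, estimate_pose, and the assumed feature-map and vertex shapes) are illustrative assumptions, not the API of the official repository; visibility reasoning, the background/clutter model, and multi-start pose initialisation used in the full method are omitted.
```python
# Simplified sketch of NeMo-style contrastive vertex features and
# render-and-compare pose estimation (illustrative only; see the official
# code at https://github.com/Angtian/NeMo for the full method).
import torch
import torch.nn.functional as F


def so3_exp(w):
    """Axis-angle vector (3,) -> rotation matrix (3, 3) via the matrix exponential."""
    W = torch.zeros(3, 3)
    W[0, 1], W[0, 2] = -w[2], w[1]
    W[1, 0], W[1, 2] = w[2], -w[0]
    W[2, 0], W[2, 1] = -w[1], w[0]
    return torch.linalg.matrix_exp(W)


def sample_image_features(feat_map, verts, w, t):
    """Project every mesh vertex under pose (w, t) with a pinhole camera and
    bilinearly sample the CNN feature map (1, C, H, W) at the projected
    location. Returns a (V, C) matrix of unit-norm image features, one per
    vertex. (Occluded and self-occluded vertices are not filtered out here.)"""
    R = so3_exp(w)
    cam = verts @ R.T + t                          # (V, 3) camera-frame vertices
    uv = cam[:, :2] / cam[:, 2:3].clamp(min=1e-4)  # perspective divide, assumed in [-1, 1]
    grid = uv.view(1, 1, -1, 2)                    # layout expected by grid_sample
    sampled = F.grid_sample(feat_map, grid, align_corners=False)  # (1, C, 1, V)
    return F.normalize(sampled[0, :, 0].T, dim=1)


def contrastive_vertex_loss(image_feats, vertex_feats, temperature=0.07):
    """Training-time objective (sketch): the image feature sampled at vertex i
    should match its own learned vertex feature and be far from the features
    of every other vertex, spreading the per-vertex representations apart."""
    logits = image_feats @ F.normalize(vertex_feats, dim=1).T / temperature  # (V, V)
    return F.cross_entropy(logits, torch.arange(image_feats.shape[0]))


def estimate_pose(vertex_feats, verts, feat_map, steps=300, lr=0.05):
    """Inference (sketch): minimise the reconstruction error -- the negative
    mean agreement between the rendered vertex features and the image
    features -- with respect to the pose parameters by gradient descent."""
    w = torch.zeros(3, requires_grad=True)                 # rotation (axis-angle)
    t = torch.tensor([0.0, 0.0, 5.0], requires_grad=True)  # translation / depth
    opt = torch.optim.Adam([w, t], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        image_feats = sample_image_features(feat_map, verts, w, t)
        loss = -(image_feats * F.normalize(vertex_feats, dim=1)).sum(dim=1).mean()
        loss.backward()
        opt.step()
    return w.detach(), t.detach()
```
Contrastively trained vertex features are what make this gradient-based pose search workable: if neighbouring vertices carried near-identical features, many poses would explain the image equally well and the reconstruction loss would be riddled with local optima.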
Related papers
- Unsupervised Learning of Category-Level 3D Pose from Object-Centric Videos [15.532504015622159]
Category-level 3D pose estimation is a fundamentally important problem in computer vision and robotics.
We tackle the problem of learning to estimate the category-level 3D pose only from casually taken object-centric videos.
arXiv Detail & Related papers (2024-07-05T09:43:05Z)
- DistillNeRF: Perceiving 3D Scenes from Single-Glance Images by Distilling Neural Fields and Foundation Model Features [65.8738034806085]
DistillNeRF is a self-supervised learning framework for understanding 3D environments in autonomous driving scenes.
Our method is a generalizable feedforward model that predicts a rich neural scene representation from sparse, single-frame multi-view camera inputs.
arXiv Detail & Related papers (2024-06-17T21:15:13Z)
- DIRECT-3D: Learning Direct Text-to-3D Generation on Massive Noisy 3D Data [50.164670363633704]
We present DIRECT-3D, a diffusion-based 3D generative model for creating high-quality 3D assets from text prompts.
Our model is directly trained on extensive noisy and unaligned 'in-the-wild' 3D assets.
We achieve state-of-the-art performance in both single-class generation and text-to-3D generation.
arXiv Detail & Related papers (2024-06-06T17:58:15Z)
- Total-Decom: Decomposed 3D Scene Reconstruction with Minimal Interaction [51.3632308129838]
We present Total-Decom, a novel method for decomposed 3D reconstruction with minimal human interaction.
Our approach seamlessly integrates the Segment Anything Model (SAM) with hybrid implicit-explicit neural surface representations and a mesh-based region-growing technique for accurate 3D object decomposition.
We extensively evaluate our method on benchmark datasets and demonstrate its potential for downstream applications, such as animation and scene editing.
arXiv Detail & Related papers (2024-03-28T11:12:33Z)
- MonoNeRD: NeRF-like Representations for Monocular 3D Object Detection [31.58403386994297]
We propose MonoNeRD, a novel detection framework that can infer dense 3D geometry and occupancy.
Specifically, we model scenes with Signed Distance Functions (SDF), facilitating the production of dense 3D representations.
To the best of our knowledge, this work is the first to introduce volume rendering for monocular 3D detection (M3D), and it demonstrates the potential of implicit reconstruction for image-based 3D perception.
arXiv Detail & Related papers (2023-08-18T09:39:52Z)
- MoDA: Modeling Deformable 3D Objects from Casual Videos [84.29654142118018]
We propose neural dual quaternion blend skinning (NeuDBS) to achieve 3D point deformation without skin-collapsing artifacts.
To register 2D pixels across different frames, we establish a correspondence between canonical feature embeddings that encode 3D points within the canonical space.
Our approach can reconstruct 3D models for humans and animals with better qualitative and quantitative performance than state-of-the-art methods.
arXiv Detail & Related papers (2023-04-17T13:49:04Z)
- Learning Geometry-Guided Depth via Projective Modeling for Monocular 3D Object Detection [70.71934539556916]
We learn geometry-guided depth estimation with projective modeling to advance monocular 3D object detection.
Specifically, we devise a principled geometry formula with projective modeling of the 2D and 3D depth predictions within the monocular 3D object detection network.
Our method improves the detection performance of the state-of-the-art monocular method by a notable 2.80% on the moderate test setting, without using extra data.
arXiv Detail & Related papers (2021-07-29T12:30:39Z)
- An Effective Loss Function for Generating 3D Models from Single 2D Image without Rendering [0.0]
Differentiable rendering is a very successful technique for single-view 3D reconstruction.
Current methods use pixel-based losses between a rendered image of the reconstructed 3D object and ground-truth images from the matched viewpoints to optimise the parameters of the 3D shape.
We propose a novel, effective loss function that evaluates how well the projections of the reconstructed 3D point cloud cover the ground-truth object's silhouette (a toy sketch of this idea appears after the list).
arXiv Detail & Related papers (2021-03-05T00:02:18Z)
- Ground-aware Monocular 3D Object Detection for Autonomous Driving [6.5702792909006735]
Estimating the 3D position and orientation of objects in the environment with a single RGB camera is a challenging task for low-cost urban autonomous driving and mobile robots.
Most existing algorithms rely on geometric constraints from 2D-3D correspondences, a formulation that stems from generic 6D object pose estimation.
We introduce a novel neural network module to fully utilize such application-specific priors in the framework of deep learning.
arXiv Detail & Related papers (2021-02-01T08:18:24Z)
- MoNet3D: Towards Accurate Monocular 3D Object Localization in Real Time [15.245372936153277]
MoNet3D is a novel framework that can predict the 3D position of each object in a monocular image and draw a 3D bounding box for each object.
The method achieves real-time image processing at 27.85 FPS, showing promising potential for embedded advanced driver-assistance system applications.
arXiv Detail & Related papers (2020-06-29T12:48:57Z)
- Implicit Functions in Feature Space for 3D Shape Reconstruction and Completion [53.885984328273686]
Implicit Feature Networks (IF-Nets) deliver continuous outputs, can handle multiple topologies, and complete shapes for missing or sparse input data.
IF-Nets clearly outperform prior work in 3D object reconstruction on ShapeNet and obtain significantly more accurate 3D human reconstructions.
arXiv Detail & Related papers (2020-03-03T11:14:29Z)
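As an aside on the "Effective Loss Function ... without Rendering" entry above, the toy sketch below shows one way a silhouette-coverage objective of that flavour can be written down: project the reconstructed point cloud into the image and reward projections that both land inside and spread over the ground-truth silhouette. The two-term formulation and all names here are assumptions for illustration only, not the loss proposed in that paper.
```python
# Toy interpretation of a silhouette-coverage loss for single-view 3D
# reconstruction (illustrative assumption; not the loss from the cited paper).
import torch
import torch.nn.functional as F


def silhouette_coverage_loss(points_2d, silhouette):
    """points_2d: (N, 2) projected point-cloud coordinates in [-1, 1].
    silhouette: (1, 1, H, W) float ground-truth mask (1 inside the object).

    Two soft terms: (1) projected points should land inside the silhouette;
    (2) every silhouette pixel should be close to some projected point, so
    the projections cover the object rather than clustering in one spot."""
    # Term 1: sample the mask at the projected points (1 = inside silhouette).
    grid = points_2d.view(1, 1, -1, 2)
    inside = F.grid_sample(silhouette, grid, align_corners=False)[0, 0, 0]  # (N,)
    inside_loss = (1.0 - inside).mean()

    # Term 2: for each silhouette pixel, distance to its nearest projected point.
    H, W = silhouette.shape[-2:]
    ys, xs = torch.where(silhouette[0, 0] > 0.5)
    pix = torch.stack([xs.float() / (W - 1) * 2 - 1,
                       ys.float() / (H - 1) * 2 - 1], dim=1)  # (M, 2) in [-1, 1]
    d = torch.cdist(pix, points_2d)                            # (M, N) distances
    coverage_loss = d.min(dim=1).values.mean()

    return inside_loss + coverage_loss
```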
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.