NeMo: Neural Mesh Models of Contrastive Features for Robust 3D Pose
Estimation
- URL: http://arxiv.org/abs/2101.12378v2
- Date: Tue, 2 Feb 2021 17:59:10 GMT
- Title: NeMo: Neural Mesh Models of Contrastive Features for Robust 3D Pose
Estimation
- Authors: Angtian Wang, Adam Kortylewski, Alan Yuille
- Abstract summary: 3D pose estimation is a challenging but important task in computer vision.
We show that standard deep learning approaches to 3D pose estimation are not robust when objects are partially occluded or viewed from a previously unseen pose.
We propose to integrate deep neural networks with 3D generative representations of objects into a unified neural architecture that we term NeMo.
- Score: 11.271053492520535
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: 3D pose estimation is a challenging but important task in computer vision. In
this work, we show that standard deep learning approaches to 3D pose estimation
are not robust when objects are partially occluded or viewed from a previously
unseen pose. Inspired by the robustness of generative vision models to partial
occlusion, we propose to integrate deep neural networks with 3D generative
representations of objects into a unified neural architecture that we term
NeMo. In particular, NeMo learns a generative model of neural feature
activations at each vertex on a dense 3D mesh. Using differentiable rendering
we estimate the 3D object pose by minimizing the reconstruction error between
NeMo and the feature representation of the target image. To avoid local optima
in the reconstruction loss, we train the feature extractor to maximize the
distance between the individual feature representations on the mesh using
contrastive learning. Our extensive experiments on PASCAL3D+,
occluded-PASCAL3D+ and ObjectNet3D show that NeMo is much more robust to
partial occlusion and unseen pose compared to standard deep networks, while
retaining competitive performance on regular data. Interestingly, our
experiments also show that NeMo performs reasonably well even when the mesh
representation only crudely approximates the true object geometry with a
cuboid, hence revealing that the detailed 3D geometry is not needed for
accurate 3D pose estimation. The code is publicly available at
https://github.com/Angtian/NeMo.
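To make the render-and-compare idea concrete, below is a deliberately simplified PyTorch sketch of the two ingredients described in the abstract: a contrastive objective that spreads the per-vertex feature vectors apart, and pose estimation by gradient descent on a feature reconstruction error. All names (so3_exp, sample_image_features, estimate_pose, and the assumed feature-map and vertex shapes) are illustrative assumptions, not the API of the official repository; visibility reasoning, the background/clutter model, and multi-start pose initialisation used in the full method are omitted.
```python
# Simplified sketch of NeMo-style contrastive vertex features and
# render-and-compare pose estimation (illustrative only; see the official
# code at https://github.com/Angtian/NeMo for the full method).
import torch
import torch.nn.functional as F


def so3_exp(w):
    """Axis-angle vector (3,) -> rotation matrix (3, 3) via the matrix exponential."""
    W = torch.zeros(3, 3)
    W[0, 1], W[0, 2] = -w[2], w[1]
    W[1, 0], W[1, 2] = w[2], -w[0]
    W[2, 0], W[2, 1] = -w[1], w[0]
    return torch.linalg.matrix_exp(W)


def sample_image_features(feat_map, verts, w, t):
    """Project every mesh vertex under pose (w, t) with a pinhole camera and
    bilinearly sample the CNN feature map (1, C, H, W) at the projected
    location. Returns a (V, C) matrix of unit-norm image features, one per
    vertex. (Occluded and self-occluded vertices are not filtered out here.)"""
    R = so3_exp(w)
    cam = verts @ R.T + t                          # (V, 3) camera-frame vertices
    uv = cam[:, :2] / cam[:, 2:3].clamp(min=1e-4)  # perspective divide, assumed in [-1, 1]
    grid = uv.view(1, 1, -1, 2)                    # layout expected by grid_sample
    sampled = F.grid_sample(feat_map, grid, align_corners=False)  # (1, C, 1, V)
    return F.normalize(sampled[0, :, 0].T, dim=1)


def contrastive_vertex_loss(image_feats, vertex_feats, temperature=0.07):
    """Training-time objective (sketch): the image feature sampled at vertex i
    should match its own learned vertex feature and be far from the features
    of every other vertex, spreading the per-vertex representations apart."""
    logits = image_feats @ F.normalize(vertex_feats, dim=1).T / temperature  # (V, V)
    return F.cross_entropy(logits, torch.arange(image_feats.shape[0]))


def estimate_pose(vertex_feats, verts, feat_map, steps=300, lr=0.05):
    """Inference (sketch): minimise the reconstruction error -- the negative
    mean agreement between the rendered vertex features and the image
    features -- with respect to the pose parameters by gradient descent."""
    w = torch.zeros(3, requires_grad=True)                 # rotation (axis-angle)
    t = torch.tensor([0.0, 0.0, 5.0], requires_grad=True)  # translation / depth
    opt = torch.optim.Adam([w, t], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        image_feats = sample_image_features(feat_map, verts, w, t)
        loss = -(image_feats * F.normalize(vertex_feats, dim=1)).sum(dim=1).mean()
        loss.backward()
        opt.step()
    return w.detach(), t.detach()
```
Contrastively trained vertex features are what make this gradient-based pose search workable: if neighbouring vertices carried near-identical features, many poses would explain the image equally well and the reconstruction loss would be riddled with local optima.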
Related papers
- Unsupervised Learning of Category-Level 3D Pose from Object-Centric Videos [15.532504015622159]
Category-level 3D pose estimation is a fundamentally important problem in computer vision and robotics.
We tackle the problem of learning to estimate the category-level 3D pose only from casually taken object-centric videos.
arXiv Detail & Related papers (2024-07-05T09:43:05Z)
- DistillNeRF: Perceiving 3D Scenes from Single-Glance Images by Distilling Neural Fields and Foundation Model Features [65.8738034806085]
DistillNeRF is a self-supervised learning framework for understanding 3D environments in autonomous driving scenes.
Our method is a generalizable feedforward model that predicts a rich neural scene representation from sparse, single-frame multi-view camera inputs.
arXiv Detail & Related papers (2024-06-17T21:15:13Z)
- DIRECT-3D: Learning Direct Text-to-3D Generation on Massive Noisy 3D Data [50.164670363633704]
We present DIRECT-3D, a diffusion-based 3D generative model for creating high-quality 3D assets from text prompts.
Our model is directly trained on extensive noisy and unaligned 'in-the-wild' 3D assets.
We achieve state-of-the-art performance in both single-class generation and text-to-3D generation.
arXiv Detail & Related papers (2024-06-06T17:58:15Z)
- Total-Decom: Decomposed 3D Scene Reconstruction with Minimal Interaction [51.3632308129838]
We present Total-Decom, a novel method for decomposed 3D reconstruction with minimal human interaction.
Our approach seamlessly integrates the Segment Anything Model (SAM) with hybrid implicit-explicit neural surface representations and a mesh-based region-growing technique for accurate 3D object decomposition.
We extensively evaluate our method on benchmark datasets and demonstrate its potential for downstream applications, such as animation and scene editing.
arXiv Detail & Related papers (2024-03-28T11:12:33Z)
- MonoNeRD: NeRF-like Representations for Monocular 3D Object Detection [31.58403386994297]
We propose MonoNeRD, a novel detection framework that can infer dense 3D geometry and occupancy.
Specifically, we model scenes with Signed Distance Functions (SDF), facilitating the production of dense 3D representations.
To the best of our knowledge, this work is the first to introduce volume rendering for monocular 3D detection (M3D), and it demonstrates the potential of implicit reconstruction for image-based 3D perception.
arXiv Detail & Related papers (2023-08-18T09:39:52Z)
- MoDA: Modeling Deformable 3D Objects from Casual Videos [84.29654142118018]
We propose neural dual quaternion blend skinning (NeuDBS) to achieve 3D point deformation without skin-collapsing artifacts.
To register 2D pixels across different frames, we establish a correspondence between canonical feature embeddings that encode 3D points within the canonical space.
Our approach can reconstruct 3D models for humans and animals with better qualitative and quantitative performance than state-of-the-art methods.
arXiv Detail & Related papers (2023-04-17T13:49:04Z)
- Learning Geometry-Guided Depth via Projective Modeling for Monocular 3D Object Detection [70.71934539556916]
We learn geometry-guided depth estimation with projective modeling to advance monocular 3D object detection.
Specifically, we devise a principled geometry formula with projective modeling of the 2D and 3D depth predictions within the monocular 3D object detection network.
Our method improves the detection performance of the state-of-the-art monocular method by a notable 2.80% on the moderate test setting, without using extra data.
arXiv Detail & Related papers (2021-07-29T12:30:39Z)
- An Effective Loss Function for Generating 3D Models from Single 2D Image without Rendering [0.0]
Differentiable rendering is a very successful technique for single-view 3D reconstruction.
Current methods use pixel-based losses between a rendered image of the reconstructed 3D object and ground-truth images from the matched viewpoints to optimise the parameters of the 3D shape.
We propose a novel, effective loss function that evaluates how well the projections of the reconstructed 3D point cloud cover the ground-truth object's silhouette (a toy sketch of this idea appears after the list).
arXiv Detail & Related papers (2021-03-05T00:02:18Z)
- Ground-aware Monocular 3D Object Detection for Autonomous Driving [6.5702792909006735]
Estimating the 3D position and orientation of objects in the environment with a single RGB camera is a challenging task for low-cost urban autonomous driving and mobile robots.
Most existing algorithms rely on geometric constraints from 2D-3D correspondences, a formulation that stems from generic 6D object pose estimation.
We introduce a novel neural network module to fully utilize such application-specific priors in the framework of deep learning.
arXiv Detail & Related papers (2021-02-01T08:18:24Z)
- MoNet3D: Towards Accurate Monocular 3D Object Localization in Real Time [15.245372936153277]
MoNet3D is a novel framework that can predict the 3D position of each object in a monocular image and draw a 3D bounding box for each object.
The method achieves real-time image processing at 27.85 FPS, showing promising potential for embedded advanced driver-assistance system applications.
arXiv Detail & Related papers (2020-06-29T12:48:57Z)
- Implicit Functions in Feature Space for 3D Shape Reconstruction and Completion [53.885984328273686]
Implicit Feature Networks (IF-Nets) deliver continuous outputs, can handle multiple topologies, and complete shapes for missing or sparse input data.
IF-Nets clearly outperform prior work in 3D object reconstruction on ShapeNet and obtain significantly more accurate 3D human reconstructions.
arXiv Detail & Related papers (2020-03-03T11:14:29Z)
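As an aside on the "Effective Loss Function ... without Rendering" entry above, the toy sketch below shows one way a silhouette-coverage objective of that flavour can be written down: project the reconstructed point cloud into the image and reward projections that both land inside and spread over the ground-truth silhouette. The two-term formulation and all names here are assumptions for illustration only, not the loss proposed in that paper.
```python
# Toy interpretation of a silhouette-coverage loss for single-view 3D
# reconstruction (illustrative assumption; not the loss from the cited paper).
import torch
import torch.nn.functional as F


def silhouette_coverage_loss(points_2d, silhouette):
    """points_2d: (N, 2) projected point-cloud coordinates in [-1, 1].
    silhouette: (1, 1, H, W) float ground-truth mask (1 inside the object).

    Two soft terms: (1) projected points should land inside the silhouette;
    (2) every silhouette pixel should be close to some projected point, so
    the projections cover the object rather than clustering in one spot."""
    # Term 1: sample the mask at the projected points (1 = inside silhouette).
    grid = points_2d.view(1, 1, -1, 2)
    inside = F.grid_sample(silhouette, grid, align_corners=False)[0, 0, 0]  # (N,)
    inside_loss = (1.0 - inside).mean()

    # Term 2: for each silhouette pixel, distance to its nearest projected point.
    H, W = silhouette.shape[-2:]
    ys, xs = torch.where(silhouette[0, 0] > 0.5)
    pix = torch.stack([xs.float() / (W - 1) * 2 - 1,
                       ys.float() / (H - 1) * 2 - 1], dim=1)  # (M, 2) in [-1, 1]
    d = torch.cdist(pix, points_2d)                            # (M, N) distances
    coverage_loss = d.min(dim=1).values.mean()

    return inside_loss + coverage_loss
```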
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.