MonoDiff9D: Monocular Category-Level 9D Object Pose Estimation via Diffusion Model
- URL: http://arxiv.org/abs/2504.10433v1
- Date: Mon, 14 Apr 2025 17:21:10 GMT
- Title: MonoDiff9D: Monocular Category-Level 9D Object Pose Estimation via Diffusion Model
- Authors: Jian Liu, Wei Sun, Hui Yang, Jin Zheng, Zichen Geng, Hossein Rahmani, Ajmal Mian
- Abstract summary: We propose a diffusion-based monocular category-level 9D object pose generation method, MonoDiff9D. We first estimate coarse depth via DINOv2 from the monocular image in a zero-shot manner and convert it into a point cloud. We then fuse the global features of the point cloud with the input image and use the fused features along with the encoded time step to condition MonoDiff9D.
- Score: 34.52439917115497
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Object pose estimation is a core means for robots to understand and interact with their environment. For this task, monocular category-level methods are attractive as they require only a single RGB camera. However, current methods rely on shape priors or CAD models of known intra-class objects. We propose a diffusion-based monocular category-level 9D object pose generation method, MonoDiff9D. Our motivation is to leverage the probabilistic nature of diffusion models to alleviate the need for shape priors, CAD models, or depth sensors for intra-class unknown object pose estimation. We first estimate coarse depth via DINOv2 from the monocular image in a zero-shot manner and convert it into a point cloud. We then fuse the global features of the point cloud with the input image and use the fused features along with the encoded time step to condition MonoDiff9D. Finally, we design a transformer-based denoiser to recover the object pose from Gaussian noise. Extensive experiments on two popular benchmark datasets show that MonoDiff9D achieves state-of-the-art monocular category-level 9D object pose estimation accuracy without the need for shape priors or CAD models at any stage. Our code will be made public at https://github.com/CNJianLiu/MonoDiff9D.
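For concreteness, here is a minimal sketch of the depth-to-point-cloud step the abstract describes, assuming a pinhole camera with known intrinsics; the DINOv2-based depth estimator is treated as a black box, and all names are illustrative rather than the authors' code.

```python
import numpy as np

def backproject_depth(depth: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Back-project an HxW metric depth map into an Nx3 point cloud.

    depth : (H, W) array of depth values in meters (0 = invalid).
    K     : (3, 3) pinhole intrinsics [[fx, 0, cx], [0, fy, cy], [0, 0, 1]].
    """
    H, W = depth.shape
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]

    u, v = np.meshgrid(np.arange(W), np.arange(H))  # pixel grid
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy

    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]  # drop invalid (zero-depth) pixels

# Usage: cloud = backproject_depth(coarse_depth, K)  ->  (N, 3) array
```

The resulting cloud is what gets pooled into the global features that, together with the image features and the encoded time step, condition the denoiser.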
Related papers
- Diff9D: Diffusion-Based Domain-Generalized Category-Level 9-DoF Object Pose Estimation [68.81887041766373]
We introduce a diffusion-based paradigm for domain-generalized 9-DoF object pose estimation.
We propose an effective diffusion model to redefine 9-DoF object pose estimation from a generative perspective.
We show that our method achieves state-of-the-art domain generalization performance.
arXiv Detail & Related papers (2025-02-04T17:46:34Z)
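Both Diff9D above and MonoDiff9D cast pose estimation as conditional denoising: a pose is drawn from Gaussian noise and iteratively refined. Below is a minimal sketch of standard DDPM ancestral sampling over a 9D pose vector (rotation, translation, scale), assuming a trained noise-prediction network `eps_model(x_t, t, cond)`; the schedule and names are illustrative, not the authors' API.

```python
import torch

@torch.no_grad()
def sample_pose(eps_model, cond, T=1000, device="cpu"):
    """DDPM ancestral sampling of a 9D pose, conditioned on fused image features."""
    betas = torch.linspace(1e-4, 0.02, T, device=device)  # standard linear schedule
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)

    x = torch.randn(1, 9, device=device)  # start from pure Gaussian noise
    for t in reversed(range(T)):
        eps = eps_model(x, torch.tensor([t], device=device), cond)
        # posterior mean of x_{t-1} given x_t and the predicted noise
        mean = (x - betas[t] / torch.sqrt(1 - alpha_bar[t]) * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise
    return x  # denoised 9D pose estimate
```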
- Category Level 6D Object Pose Estimation from a Single RGB Image using Diffusion [9.025235713063509]
We tackle the harder problem of pose estimation for category-level objects from a single RGB image.
We propose a novel solution that eliminates the need for specific object models or depth information.
Our approach outperforms the current state-of-the-art on the REAL275 dataset by a significant margin.
arXiv Detail & Related papers (2024-12-16T03:39:33Z)
- Unsupervised Learning of Category-Level 3D Pose from Object-Centric Videos [15.532504015622159]
Category-level 3D pose estimation is a fundamentally important problem in computer vision and robotics.
We tackle the problem of learning to estimate the category-level 3D pose only from casually taken object-centric videos.
arXiv Detail & Related papers (2024-07-05T09:43:05Z)
- Metric3D: Towards Zero-shot Metric 3D Prediction from A Single Image [85.91935485902708]
We show that the key to a zero-shot single-view metric depth model lies in the combination of large-scale data training and resolving the metric ambiguity from various camera models.
We propose a canonical camera space transformation module, which explicitly addresses the ambiguity problems and can be effortlessly plugged into existing monocular models.
Our method enables the accurate recovery of metric 3D structures on randomly collected internet images.
arXiv Detail & Related papers (2023-07-20T16:14:23Z)
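The canonical camera idea above can be pictured as rescaling depth by a focal-length ratio: training labels are mapped to a fixed canonical focal length, and predictions are mapped back for the real camera. A minimal sketch of one variant of this transform (rescaling depth labels); `f_canonical` is an illustrative constant, not the paper's exact value.

```python
def to_canonical_depth(depth_gt, f_actual, f_canonical=1000.0):
    """Training-time label transform: rescale metric depth as if the image
    had been captured with a canonical focal length."""
    return depth_gt * (f_canonical / f_actual)

def from_canonical_depth(depth_pred_canonical, f_actual, f_canonical=1000.0):
    """Inference-time de-canonicalization: recover metric depth for the real camera."""
    return depth_pred_canonical * (f_actual / f_canonical)
```

Because depth scales linearly with focal length for a fixed image, this ratio is what removes the metric ambiguity across cameras.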
- OnePose++: Keypoint-Free One-Shot Object Pose Estimation without CAD Models [51.68715543630427]
OnePose relies on detecting repeatable image keypoints and is thus prone to failure on low-textured objects.
We propose a keypoint-free pose estimation pipeline to remove the need for repeatable keypoint detection.
A 2D-3D matching network directly establishes 2D-3D correspondences between the query image and the reconstructed point-cloud model.
arXiv Detail & Related papers (2023-01-18T17:47:13Z)
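Once a network predicts 2D-3D correspondences between query pixels and the reconstructed point-cloud model, the pose itself is conventionally recovered with PnP inside a RANSAC loop. A minimal sketch using OpenCV, with the matching network treated as given; thresholds are illustrative.

```python
import cv2
import numpy as np

def pose_from_2d3d(points_3d: np.ndarray, points_2d: np.ndarray, K: np.ndarray):
    """Solve object pose from predicted 2D-3D correspondences.

    points_3d : (N, 3) matched points on the reconstructed model.
    points_2d : (N, 2) matched pixel locations in the query image.
    K         : (3, 3) camera intrinsics.
    """
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        points_3d.astype(np.float64),
        points_2d.astype(np.float64),
        K.astype(np.float64),
        distCoeffs=None,
        reprojectionError=3.0,   # pixel threshold for inliers
        iterationsCount=1000,
    )
    if not ok:
        raise RuntimeError("PnP failed: too few consistent correspondences")
    R, _ = cv2.Rodrigues(rvec)   # axis-angle -> rotation matrix
    return R, tvec.reshape(3)
```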
- Unseen Object 6D Pose Estimation: A Benchmark and Baselines [62.8809734237213]
We propose a new task that enables and facilitates algorithms to estimate the 6D pose of novel objects during testing.
We collect a dataset with both real and synthetic images and up to 48 unseen objects in the test set.
By training an end-to-end 3D correspondences network, our method finds corresponding points between an unseen object and a partial view RGBD image accurately and efficiently.
arXiv Detail & Related papers (2022-06-23T16:29:53Z)
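With 3D-3D correspondences between an object model and points back-projected from the partial RGBD view, the 6D pose has a closed-form least-squares solution (the Kabsch/Umeyama alignment). A minimal sketch of that standard step, independent of the paper's specific network.

```python
import numpy as np

def kabsch(src: np.ndarray, dst: np.ndarray):
    """Least-squares rigid transform (R, t) such that dst ~ R @ src + t.

    src, dst : (N, 3) corresponding point sets.
    """
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)  # 3x3 cross-covariance of centered sets
    U, _, Vt = np.linalg.svd(H)
    # fix possible reflection so R is a proper rotation (det = +1)
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ S @ U.T
    t = dst_c - R @ src_c
    return R, t
```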
- Category-Agnostic 6D Pose Estimation with Conditional Neural Processes [19.387280883044482]
We present a novel meta-learning approach for 6D pose estimation on unknown objects.
Our algorithm learns object representation in a category-agnostic way, which endows it with strong generalization capabilities across object categories.
arXiv Detail & Related papers (2022-06-14T20:46:09Z)
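A conditional neural process conditions its predictions on a set of context examples by pooling their embeddings into a permutation-invariant representation, which is what gives the category-agnostic generalization described above. A minimal, generic CNP sketch in PyTorch; dimensions and layers are illustrative, not the authors' architecture.

```python
import torch
import torch.nn as nn

class ConditionalNeuralProcess(nn.Module):
    """Generic CNP: encode context pairs, mean-pool to a latent r, decode queries."""

    def __init__(self, x_dim=128, y_dim=6, r_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(x_dim + y_dim, 256), nn.ReLU(), nn.Linear(256, r_dim))
        self.decoder = nn.Sequential(
            nn.Linear(x_dim + r_dim, 256), nn.ReLU(), nn.Linear(256, y_dim))

    def forward(self, ctx_x, ctx_y, query_x):
        # ctx_x: (B, N, x_dim) context observations; ctx_y: (B, N, y_dim) their poses
        r = self.encoder(torch.cat([ctx_x, ctx_y], dim=-1)).mean(dim=1)  # (B, r_dim)
        r = r.unsqueeze(1).expand(-1, query_x.size(1), -1)               # broadcast to queries
        return self.decoder(torch.cat([query_x, r], dim=-1))             # predicted poses
```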
- CAPTRA: CAtegory-level Pose Tracking for Rigid and Articulated Objects from Point Clouds [97.63549045541296]
We propose a unified framework that can handle 9DoF pose tracking for novel rigid object instances and per-part pose tracking for articulated objects.
Our method achieves new state-of-the-art performance on category-level rigid object pose (NOCS-REAL275) and articulated object pose benchmarks (SAPIEN, BMVC), while running at 12 FPS, the fastest among compared methods.
arXiv Detail & Related papers (2021-04-08T00:14:58Z)
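A "9DoF" pose, as tracked by CAPTRA above, bundles rotation (3), translation (3), and per-axis scale (3), i.e. a similarity transform with anisotropic scale. A minimal sketch of applying such a pose to canonical-frame points; illustrative only.

```python
import numpy as np

def apply_9dof(points: np.ndarray, R: np.ndarray, t: np.ndarray, s: np.ndarray):
    """Map canonical-frame points into the camera frame with a 9DoF pose.

    points : (N, 3) points in the object's canonical frame.
    R : (3, 3) rotation;  t : (3,) translation;  s : (3,) per-axis scale.
    """
    return (points * s) @ R.T + t  # scale in canonical frame, then rotate and translate
```

In a tracking setting, the estimate from the previous frame initializes the current one, which is what makes per-frame inference fast.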
- Shape Prior Deformation for Categorical 6D Object Pose and Size Estimation [62.618227434286]
We present a novel learning approach to recover the 6D poses and sizes of unseen object instances from an RGB-D image.
We propose a deep network to reconstruct the 3D object model by explicitly modeling the deformation from a pre-learned categorical shape prior.
arXiv Detail & Related papers (2020-07-16T16:45:05Z)
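The shape-prior idea above: predict a per-point deformation field that warps a pre-learned categorical mean shape into the observed instance, then estimate pose and size against the reconstructed model. A minimal PyTorch sketch of the deformation step, with illustrative layer sizes.

```python
import torch
import torch.nn as nn

class PriorDeformer(nn.Module):
    """Warp a categorical shape prior into an instance model via a predicted deformation field."""

    def __init__(self, feat_dim=256):
        super().__init__()
        # per-point MLP: prior coordinate + instance feature -> 3D offset
        self.deform = nn.Sequential(
            nn.Linear(3 + feat_dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 3))

    def forward(self, prior_points, inst_feat):
        # prior_points: (B, N, 3) categorical prior; inst_feat: (B, feat_dim) instance feature
        f = inst_feat.unsqueeze(1).expand(-1, prior_points.size(1), -1)
        offsets = self.deform(torch.cat([prior_points, f], dim=-1))  # (B, N, 3)
        return prior_points + offsets  # reconstructed instance model
```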
- Learning Canonical Shape Space for Category-Level 6D Object Pose and Size Estimation [21.7030393344051]
We learn a canonical shape space (CASS), a unified representation for a large variety of instances of a certain object category.
We train a variational auto-encoder (VAE) to generate 3D point clouds in the canonical space from an RGBD image.
The VAE is trained in a cross-category fashion, exploiting publicly available large 3D shape repositories.
arXiv Detail & Related papers (2020-01-25T14:16:17Z)
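A minimal sketch of a point-cloud VAE of the kind this entry describes: an encoder pools per-point features into (mu, logvar), and the reparameterized latent decodes to a fixed-size cloud in the canonical space. Architecture details here are illustrative, not CASS itself.

```python
import torch
import torch.nn as nn

class PointCloudVAE(nn.Module):
    """Tiny VAE over Nx3 point clouds in a canonical object space."""

    def __init__(self, n_points=1024, z_dim=128):
        super().__init__()
        self.n_points = n_points
        self.point_mlp = nn.Sequential(nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, 256))
        self.to_mu = nn.Linear(256, z_dim)
        self.to_logvar = nn.Linear(256, z_dim)
        self.decoder = nn.Sequential(
            nn.Linear(z_dim, 512), nn.ReLU(), nn.Linear(512, n_points * 3))

    def forward(self, pts):                        # pts: (B, N, 3)
        h = self.point_mlp(pts).max(dim=1).values  # permutation-invariant pooling
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        recon = self.decoder(z).view(-1, self.n_points, 3)
        return recon, mu, logvar

# Training loss (sketch): reconstruction (e.g. Chamfer distance) + KL(mu, logvar || N(0, I)).
```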