Object Pose Estimation via the Aggregation of Diffusion Features
- URL: http://arxiv.org/abs/2403.18791v2
- Date: Sat, 1 Jun 2024 15:25:47 GMT
- Title: Object Pose Estimation via the Aggregation of Diffusion Features
- Authors: Tianfu Wang, Guosheng Hu, Hongguang Wang,
- Abstract summary: Estimating the pose of objects from images is a crucial task in 3D scene understanding.
Recent approaches experience a significant performance drop when dealing with unseen objects.
We propose three distinct architectures that can effectively capture and aggregate diffusion features of different granularity.
Our approach outperforms the state-of-the-art methods by a considerable margin on three popular benchmark datasets.
- Score: 25.119446464630037
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Estimating the pose of objects from images is a crucial task of 3D scene understanding, and recent approaches have shown promising results on very large benchmarks. However, these methods experience a significant performance drop when dealing with unseen objects. We believe that it results from the limited generalizability of image features. To address this problem, we have an in-depth analysis on the features of diffusion models, e.g. Stable Diffusion, which hold substantial potential for modeling unseen objects. Based on this analysis, we then innovatively introduce these diffusion features for object pose estimation. To achieve this, we propose three distinct architectures that can effectively capture and aggregate diffusion features of different granularity, greatly improving the generalizability of object pose estimation. Our approach outperforms the state-of-the-art methods by a considerable margin on three popular benchmark datasets, LM, O-LM, and T-LESS. In particular, our method achieves higher accuracy than the previous best arts on unseen objects: 98.2% vs. 93.5% on Unseen LM, 85.9% vs. 76.3% on Unseen O-LM, showing the strong generalizability of our method. Our code is released at https://github.com/Tianfu18/diff-feats-pose.
Related papers
- Diff9D: Diffusion-Based Domain-Generalized Category-Level 9-DoF Object Pose Estimation [68.81887041766373]
We introduce a diffusion-based paradigm for domain-generalized 9-DoF object pose estimation.
We propose an effective diffusion model to redefine 9-DoF object pose estimation from a generative perspective.
We show that our method achieves state-of-the-art domain generalization performance.
arXiv Detail & Related papers (2025-02-04T17:46:34Z) - Category Level 6D Object Pose Estimation from a Single RGB Image using Diffusion [9.025235713063509]
We tackle the harder problem of pose estimation for category-level objects from a single RGB image.
We propose a novel solution that eliminates the need for specific object models or depth information.
Our approach outperforms the current state-of-the-art on the REAL275 dataset by a significant margin.
arXiv Detail & Related papers (2024-12-16T03:39:33Z) - Diffusion Features for Zero-Shot 6DoF Object Pose Estimation [7.949705607963995]
This study assesses the influence of Latent Diffusion Model (LDM) backbones on zero-shot pose estimation.
A template-based multi-staged method for estimating poses in a zero-shot fashion using LDMs is presented.
arXiv Detail & Related papers (2024-11-25T18:53:56Z) - SEMPose: A Single End-to-end Network for Multi-object Pose Estimation [13.131534219937533]
SEMPose is an end-to-end multi-object pose estimation network.
It can perform inference at 32 FPS without requiring inputs other than the RGB image.
It can accurately estimate the poses of multiple objects in real time, with inference time unaffected by the number of target objects.
arXiv Detail & Related papers (2024-11-21T10:37:54Z) - PickScan: Object discovery and reconstruction from handheld interactions [99.99566882133179]
We develop an interaction-guided and class-agnostic method to reconstruct 3D representations of scenes.
Our main contribution is a novel approach to detecting user-object interactions and extracting the masks of manipulated objects.
Compared to Co-Fusion, the only comparable interaction-based and class-agnostic baseline, this corresponds to a reduction in chamfer distance of 73%.
arXiv Detail & Related papers (2024-11-17T23:09:08Z) - DVMNet: Computing Relative Pose for Unseen Objects Beyond Hypotheses [59.51874686414509]
Current approaches approximate the continuous pose representation with a large number of discrete pose hypotheses.
We present a Deep Voxel Matching Network (DVMNet) that eliminates the need for pose hypotheses and computes the relative object pose in a single pass.
Our method delivers more accurate relative pose estimates for novel objects at a lower computational cost compared to state-of-the-art methods.
arXiv Detail & Related papers (2024-03-20T15:41:32Z) - FocalPose++: Focal Length and Object Pose Estimation via Render and Compare [35.388094104164175]
We introduce FocalPose++, a neural render-and-compare method for jointly estimating the camera-object 6D pose and camera focal length.
We show results on three challenging benchmark datasets that depict known 3D models in uncontrolled settings.
arXiv Detail & Related papers (2023-11-15T13:28:02Z) - Diff-DOPE: Differentiable Deep Object Pose Estimation [29.703385848843414]
We introduce Diff-DOPE, a 6-DoF pose refiner that takes as input an image, a 3D textured model of an object, and an initial pose of the object.
The method uses differentiable rendering to update the object pose to minimize the visual error between the image and the projection of the model.
We show that this simple, yet effective, idea is able to achieve state-of-the-art results on pose estimation datasets.
arXiv Detail & Related papers (2023-09-30T18:52:57Z) - ZebraPose: Coarse to Fine Surface Encoding for 6DoF Object Pose
Estimation [76.31125154523056]
We present a discrete descriptor, which can represent the object surface densely.
We also propose a coarse to fine training strategy, which enables fine-grained correspondence prediction.
arXiv Detail & Related papers (2022-03-17T16:16:24Z) - Modeling Object Dissimilarity for Deep Saliency Prediction [86.14710352178967]
We introduce a detection-guided saliency prediction network that explicitly models the differences between multiple objects.
Our approach is general, allowing us to fuse our object dissimilarities with features extracted by any deep saliency prediction network.
arXiv Detail & Related papers (2021-04-08T16:10:37Z) - Objects are Different: Flexible Monocular 3D Object Detection [87.82253067302561]
We propose a flexible framework for monocular 3D object detection which explicitly decouples the truncated objects and adaptively combines multiple approaches for object depth estimation.
Experiments demonstrate that our method outperforms the state-of-the-art method by relatively 27% for the moderate level and 30% for the hard level in the test set of KITTI benchmark.
arXiv Detail & Related papers (2021-04-06T07:01:28Z) - Contemplating real-world object classification [53.10151901863263]
We reanalyze the ObjectNet dataset recently proposed by Barbu et al. containing objects in daily life situations.
We find that applying deep models to the isolated objects, rather than the entire scene as is done in the original paper, results in around 20-30% performance improvement.
arXiv Detail & Related papers (2021-03-08T23:29:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.