OV9D: Open-Vocabulary Category-Level 9D Object Pose and Size Estimation
- URL: http://arxiv.org/abs/2403.12396v1
- Date: Tue, 19 Mar 2024 03:09:24 GMT
- Title: OV9D: Open-Vocabulary Category-Level 9D Object Pose and Size Estimation
- Authors: Junhao Cai, Yisheng He, Weihao Yuan, Siyu Zhu, Zilong Dong, Liefeng Bo, Qifeng Chen,
- Abstract summary: This paper studies a new open-set problem, the open-vocabulary category-level object pose and size estimation.
We first introduce OO3D-9D, a large-scale photorealistic dataset for this task.
We then propose a framework built on pre-trained DinoV2 and text-to-image stable diffusion models.
- Score: 56.028185293563325
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper studies a new open-set problem, the open-vocabulary category-level object pose and size estimation. Given human text descriptions of arbitrary novel object categories, the robot agent seeks to predict the position, orientation, and size of the target object in the observed scene image. To enable such generalizability, we first introduce OO3D-9D, a large-scale photorealistic dataset for this task. Derived from OmniObject3D, OO3D-9D is the largest and most diverse dataset in the field of category-level object pose and size estimation. It includes additional annotations for the symmetry axis of each category, which help resolve symmetric ambiguity. Apart from the large-scale dataset, we find another key to enabling such generalizability is leveraging the strong prior knowledge in pre-trained visual-language foundation models. We then propose a framework built on pre-trained DinoV2 and text-to-image stable diffusion models to infer the normalized object coordinate space (NOCS) maps of the target instances. This framework fully leverages the visual semantic prior from DinoV2 and the aligned visual and language knowledge within the text-to-image diffusion model, which enables generalization to various text descriptions of novel categories. Comprehensive quantitative and qualitative experiments demonstrate that the proposed open-vocabulary method, trained on our large-scale synthesized data, significantly outperforms the baseline and can effectively generalize to real-world images of unseen categories. The project page is at https://ov9d.github.io.
Related papers
- GS-Pose: Category-Level Object Pose Estimation via Geometric and
Semantic Correspondence [5.500735640045456]
Category-level pose estimation is a challenging task with many potential applications in computer vision and robotics.
We propose to utilize both geometric and semantic features obtained from a pre-trained foundation model.
This requires significantly less data to train than prior methods since the semantic features are robust to object texture and appearance.
arXiv Detail & Related papers (2023-11-23T02:35:38Z) - Lowis3D: Language-Driven Open-World Instance-Level 3D Scene
Understanding [57.47315482494805]
Open-world instance-level scene understanding aims to locate and recognize unseen object categories that are not present in the annotated dataset.
This task is challenging because the model needs to both localize novel 3D objects and infer their semantic categories.
We propose to harness pre-trained vision-language (VL) foundation models that encode extensive knowledge from image-text pairs to generate captions for 3D scenes.
arXiv Detail & Related papers (2023-08-01T07:50:14Z) - MegaPose: 6D Pose Estimation of Novel Objects via Render & Compare [84.80956484848505]
MegaPose is a method to estimate the 6D pose of novel objects, that is, objects unseen during training.
We present a 6D pose refiner based on a render&compare strategy which can be applied to novel objects.
Second, we introduce a novel approach for coarse pose estimation which leverages a network trained to classify whether the pose error between a synthetic rendering and an observed image of the same object can be corrected by the refiner.
arXiv Detail & Related papers (2022-12-13T19:30:03Z) - Generative Category-Level Shape and Pose Estimation with Semantic
Primitives [27.692997522812615]
We propose a novel framework for category-level object shape and pose estimation from a single RGB-D image.
To handle the intra-category variation, we adopt a semantic primitive representation that encodes diverse shapes into a unified latent space.
We show that the proposed method achieves SOTA pose estimation performance and better generalization in the real-world dataset.
arXiv Detail & Related papers (2022-10-03T17:51:54Z) - Category-Agnostic 6D Pose Estimation with Conditional Neural Processes [19.387280883044482]
We present a novel meta-learning approach for 6D pose estimation on unknown objects.
Our algorithm learns object representation in a category-agnostic way, which endows it with strong generalization capabilities across object categories.
arXiv Detail & Related papers (2022-06-14T20:46:09Z) - Object Pose Estimation using Mid-level Visual Representations [5.220940151628735]
This work proposes a novel pose estimation model for object categories that can be effectively transferred to previously unseen environments.
Deep convolutional network models (CNN) for pose estimation are typically trained and evaluated on datasets curated for object detection, pose estimation, or 3D reconstruction.
We show that the approach is favorable when it comes to generalization and transfer to novel environments.
arXiv Detail & Related papers (2022-03-02T22:49:17Z) - DONet: Learning Category-Level 6D Object Pose and Size Estimation from
Depth Observation [53.55300278592281]
We propose a method of Category-level 6D Object Pose and Size Estimation (COPSE) from a single depth image.
Our framework makes inferences based on the rich geometric information of the object in the depth channel alone.
Our framework competes with state-of-the-art approaches that require labeled real-world images.
arXiv Detail & Related papers (2021-06-27T10:41:50Z) - Single View Metrology in the Wild [94.7005246862618]
We present a novel approach to single view metrology that can recover the absolute scale of a scene represented by 3D heights of objects or camera height above the ground.
Our method relies on data-driven priors learned by a deep network specifically designed to imbibe weakly supervised constraints from the interplay of the unknown camera with 3D entities such as object heights.
We demonstrate state-of-the-art qualitative and quantitative results on several datasets as well as applications including virtual object insertion.
arXiv Detail & Related papers (2020-07-18T22:31:33Z) - Shape Prior Deformation for Categorical 6D Object Pose and Size
Estimation [62.618227434286]
We present a novel learning approach to recover the 6D poses and sizes of unseen object instances from an RGB-D image.
We propose a deep network to reconstruct the 3D object model by explicitly modeling the deformation from a pre-learned categorical shape prior.
arXiv Detail & Related papers (2020-07-16T16:45:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.