High-resolution open-vocabulary object 6D pose estimation
- URL: http://arxiv.org/abs/2406.16384v2
- Date: Thu, 11 Jul 2024 17:03:29 GMT
- Title: High-resolution open-vocabulary object 6D pose estimation
- Authors: Jaime Corsetti, Davide Boscaini, Francesco Giuliari, Changjae Oh, Andrea Cavallaro, Fabio Poiesi
- Abstract summary: Horyon is an open-vocabulary VLM-based architecture that addresses relative pose estimation between two scenes of an unseen object.
We evaluate our model on a benchmark with a large variety of unseen objects across four datasets.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The generalisation to unseen objects in the 6D pose estimation task is very challenging. While Vision-Language Models (VLMs) enable using natural language descriptions to support 6D pose estimation of unseen objects, these solutions underperform compared to model-based methods. In this work, we present Horyon, an open-vocabulary VLM-based architecture that addresses relative pose estimation between two scenes of an unseen object, described by a textual prompt only. We use the textual prompt to identify the unseen object in the scenes and then obtain high-resolution multi-scale features. These features are used to extract cross-scene matches for registration. We evaluate our model on a benchmark with a large variety of unseen objects across four datasets, namely REAL275, Toyota-Light, Linemod, and YCB-Video. Our method achieves state-of-the-art performance on all datasets, outperforming the previous best-performing approach by 12.6 in Average Recall.
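The abstract outlines a pipeline: locate the prompted object in both scenes, extract high-resolution multi-scale features, match them across scenes, and register. The final matching-and-registration step can be illustrated with a minimal sketch (the synthetic points, descriptors, and function names below are illustrative assumptions, not Horyon's implementation): mutual nearest-neighbour matching over per-point descriptors, followed by Kabsch rigid alignment.

```python
import numpy as np

def match_features(feat_a, feat_b):
    """Mutual nearest-neighbour matching between two sets of unit-norm descriptors."""
    sim = feat_a @ feat_b.T                      # cosine similarity matrix
    ab = sim.argmax(axis=1)                      # best match in B for each A
    ba = sim.argmax(axis=0)                      # best match in A for each B
    return np.array([(i, j) for i, j in enumerate(ab) if ba[j] == i])

def kabsch(src, dst):
    """Rigid registration: rotation R and translation t with dst ~= src @ R.T + t."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    H = (src - mu_s).T @ (dst - mu_d)            # cross-covariance of centred points
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))       # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_d - R @ mu_s
    return R, t

# Synthetic demo: the same 50 object points observed in two scenes under a known pose.
rng = np.random.default_rng(0)
pts_a = rng.normal(size=(50, 3))
angle = np.pi / 6
R_gt = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                 [np.sin(angle),  np.cos(angle), 0.0],
                 [0.0, 0.0, 1.0]])
t_gt = np.array([0.2, -0.1, 0.5])
pts_b = pts_a @ R_gt.T + t_gt

# Per-point unit-norm descriptors stand in for the paper's multi-scale VLM features.
feats = rng.normal(size=(50, 64))
feats /= np.linalg.norm(feats, axis=1, keepdims=True)

matches = match_features(feats, feats)
R_est, t_est = kabsch(pts_a[matches[:, 0]], pts_b[matches[:, 1]])
print(np.allclose(R_est, R_gt, atol=1e-6))       # True
```

Because the descriptors are unit-norm, each point's best match is itself, so the recovered rotation and translation equal the ground truth up to numerical precision; in practice the matches would come from noisy features and the registration would be wrapped in a robust estimator such as RANSAC.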
Related papers
- OV9D: Open-Vocabulary Category-Level 9D Object Pose and Size Estimation [56.028185293563325]
This paper studies a new open-set problem: open-vocabulary category-level object pose and size estimation.
We first introduce OO3D-9D, a large-scale photorealistic dataset for this task.
We then propose a framework built on pre-trained DinoV2 and text-to-image stable diffusion models.
arXiv Detail & Related papers (2024-03-19T03:09:24Z)
- Open-vocabulary object 6D pose estimation [31.863333447303273]
We introduce the new setting of open-vocabulary object 6D pose estimation, in which a textual prompt is used to specify the object of interest.
To operate in this setting, we introduce a novel approach that leverages a Vision-Language Model to segment the object of interest from the scenes.
We validate our approach on a new benchmark based on two popular datasets, REAL275 and Toyota-Light.
arXiv Detail & Related papers (2023-12-01T16:17:16Z)
- ZS6D: Zero-shot 6D Object Pose Estimation using Vision Transformers [9.899633398596672]
We introduce ZS6D for zero-shot novel object 6D pose estimation.
Visual descriptors, extracted using pre-trained Vision Transformers (ViT), are used for matching rendered templates.
Experiments are performed on LMO, YCBV, and TLESS datasets.
arXiv Detail & Related papers (2023-09-21T11:53:01Z)
- Hierarchical Graph Neural Networks for Proprioceptive 6D Pose Estimation of In-hand Objects [1.8263882169310044]
We introduce a hierarchical graph neural network architecture for combining multimodal (vision and touch) data.
We also introduce a hierarchical message passing operation that flows the information within and across modalities to learn a graph-based object representation.
arXiv Detail & Related papers (2023-06-28T01:18:53Z)
- MegaPose: 6D Pose Estimation of Novel Objects via Render & Compare [84.80956484848505]
MegaPose is a method to estimate the 6D pose of novel objects, that is, objects unseen during training.
We present a 6D pose refiner based on a render&compare strategy which can be applied to novel objects.
Second, we introduce a novel approach for coarse pose estimation which leverages a network trained to classify whether the pose error between a synthetic rendering and an observed image of the same object can be corrected by the refiner.
arXiv Detail & Related papers (2022-12-13T19:30:03Z)
- Unseen Object 6D Pose Estimation: A Benchmark and Baselines [62.8809734237213]
We propose a new task in which algorithms estimate the 6D pose of novel objects during testing.
We collect a dataset with both real and synthetic images and up to 48 unseen objects in the test set.
By training an end-to-end 3D correspondences network, our method finds corresponding points between an unseen object and a partial view RGBD image accurately and efficiently.
arXiv Detail & Related papers (2022-06-23T16:29:53Z)
- Coupled Iterative Refinement for 6D Multi-Object Pose Estimation [64.7198752089041]
Given a set of known 3D objects and an RGB or RGB-D input image, we detect and estimate the 6D pose of each object.
Our approach iteratively refines both pose and correspondence in a tightly coupled manner, allowing us to dynamically remove outliers to improve accuracy.
arXiv Detail & Related papers (2022-04-26T18:00:08Z)
- FS6D: Few-Shot 6D Pose Estimation of Novel Objects [116.34922994123973]
6D object pose estimation networks are limited in their capability to scale to large numbers of object instances.
In this work, we study a new open-set problem, few-shot 6D object pose estimation: estimating the 6D pose of an unknown object from a few support views without extra training.
arXiv Detail & Related papers (2022-03-28T10:31:29Z)
- CosyPose: Consistent multi-view multi-object 6D pose estimation [48.097599674329004]
First, we present a single-view single-object 6D pose estimation method, which we use to generate 6D object pose hypotheses.
Second, we develop a robust method for matching individual 6D object pose hypotheses across different input images.
Third, we develop a method for global scene refinement given multiple object hypotheses and their correspondences across views.
arXiv Detail & Related papers (2020-08-19T14:11:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.