SCOPE: Semantic Conditioning for Sim2Real Category-Level Object Pose Estimation in Robotics
- URL: http://arxiv.org/abs/2509.24572v1
- Date: Mon, 29 Sep 2025 10:27:59 GMT
- Title: SCOPE: Semantic Conditioning for Sim2Real Category-Level Object Pose Estimation in Robotics
- Authors: Peter Hönig, Stefan Thalhammer, Jean-Baptiste Weibel, Matthias Hirschmanner, Markus Vincze
- Abstract summary: SCOPE is a diffusion-based category-level object pose estimation model. It eliminates the need for discrete category labels by leveraging DINOv2 features as continuous semantic priors. It achieves a relative improvement of 31.9% on the 5$^\circ$5cm metric.
- Score: 8.467086312715892
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Object manipulation requires accurate object pose estimation. In open environments, robots encounter unknown objects, which requires semantic understanding in order to generalize both to known categories and beyond. To resolve this challenge, we present SCOPE, a diffusion-based category-level object pose estimation model that eliminates the need for discrete category labels by leveraging DINOv2 features as continuous semantic priors. By combining these DINOv2 features with photorealistic training data and a noise model for point normals, we reduce the Sim2Real gap in category-level object pose estimation. Furthermore, injecting the continuous semantic priors via cross-attention enables SCOPE to learn canonicalized object coordinate systems across object instances beyond the distribution of known categories. SCOPE outperforms the current state of the art in synthetically trained category-level object pose estimation, achieving a relative improvement of 31.9\% on the 5$^\circ$5cm metric. Additional experiments on two instance-level datasets demonstrate generalization beyond known object categories, enabling grasping of unseen objects from unknown categories with a success rate of up to 100\%. Code available: https://github.com/hoenigpeter/scope.
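The 5$^\circ$5cm metric cited above counts a predicted pose as correct when the rotation error is below 5 degrees and the translation error is below 5 cm. A minimal sketch of that check, assuming translations in meters and ignoring per-category symmetry handling (which benchmarks such as REAL275 apply for symmetric objects); function names are illustrative:

```python
import numpy as np

def rotation_error_deg(R_est, R_gt):
    """Geodesic angle (in degrees) between two 3x3 rotation matrices."""
    cos_angle = (np.trace(R_est @ R_gt.T) - 1.0) / 2.0
    cos_angle = np.clip(cos_angle, -1.0, 1.0)  # guard against numerical drift
    return np.degrees(np.arccos(cos_angle))

def is_5deg_5cm(R_est, t_est, R_gt, t_gt):
    """True if rotation error < 5 degrees AND translation error < 5 cm (0.05 m)."""
    rot_err = rotation_error_deg(R_est, R_gt)
    trans_err = np.linalg.norm(np.asarray(t_est) - np.asarray(t_gt))
    return rot_err < 5.0 and trans_err < 0.05
```

For example, a pose rotated 10 degrees about the z-axis relative to the ground truth fails the check, while a pose within 4 degrees and 3 cm passes.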
Related papers
- Universal Features Guided Zero-Shot Category-Level Object Pose Estimation [52.29006019352873]
We propose a zero-shot method to achieve category-level 6-DOF object pose estimation.
Our method exploits both 2D and 3D universal features of the input RGB-D image to establish semantic similarity-based correspondences.
Our method outperforms previous methods on the REAL275 and Wild6D benchmarks for unseen categories.
arXiv Detail & Related papers (2025-01-06T08:10:13Z) - You Only Look at One: Category-Level Object Representations for Pose Estimation From a Single Example [26.866356430469757]
We present a method for achieving category-level pose estimation by inspection of just a single object from a desired category.
We demonstrate that our method runs in real-time, enabling a robot manipulator equipped with an RGBD sensor to perform online 6D pose estimation for novel objects.
arXiv Detail & Related papers (2023-05-22T01:32:24Z) - An Object SLAM Framework for Association, Mapping, and High-Level Tasks [12.62957558651032]
We present a comprehensive object SLAM framework that focuses on object-based perception and object-oriented robot tasks.
A range of public datasets and real-world results have been used to evaluate the proposed object SLAM framework for its efficient performance.
arXiv Detail & Related papers (2023-05-12T08:10:14Z) - Pose for Everything: Towards Category-Agnostic Pose Estimation [93.07415325374761]
Category-Agnostic Pose Estimation (CAPE) aims to create a pose estimation model capable of detecting the pose of any class of object given only a few samples with keypoint definition.
A transformer-based Keypoint Interaction Module (KIM) is proposed to capture both the interactions among different keypoints and the relationship between the support and query images.
We also introduce Multi-category Pose (MP-100) dataset, which is a 2D pose dataset of 100 object categories containing over 20K instances and is well-designed for developing CAPE algorithms.
arXiv Detail & Related papers (2022-07-21T09:40:54Z) - Exploiting Unlabeled Data with Vision and Language Models for Object Detection [64.94365501586118]
Building robust and generic object detection frameworks requires scaling to larger label spaces and bigger training datasets.
We propose a novel method that leverages the rich semantics available in recent vision and language models to localize and classify objects in unlabeled images.
We demonstrate the value of the generated pseudo labels in two specific tasks, open-vocabulary detection and semi-supervised object detection.
arXiv Detail & Related papers (2022-07-18T21:47:15Z) - On Hyperbolic Embeddings in 2D Object Detection [76.12912000278322]
We study whether a hyperbolic geometry better matches the underlying structure of the object classification space.
We incorporate a hyperbolic classifier in two-stage, keypoint-based, and transformer-based object detection architectures.
We observe categorical class hierarchies emerging in the structure of the classification space, resulting in lower classification errors and boosting the overall object detection performance.
arXiv Detail & Related papers (2022-03-15T16:43:40Z) - The Overlooked Classifier in Human-Object Interaction Recognition [82.20671129356037]
We encode the semantic correlation among classes into the classification head by initializing the weights with language embeddings of HOIs.
We propose a new loss named LSE-Sign to enhance multi-label learning on a long-tailed dataset.
Our simple yet effective method enables detection-free HOI classification, outperforming the state of the art, which requires object detection and human pose, by a clear margin.
arXiv Detail & Related papers (2022-03-10T23:35:00Z) - SORNet: Spatial Object-Centric Representations for Sequential Manipulation [39.88239245446054]
Sequential manipulation tasks require a robot to perceive the state of an environment and plan a sequence of actions leading to a desired goal state.
We propose SORNet, which extracts object-centric representations from RGB images conditioned on canonical views of the objects of interest.
arXiv Detail & Related papers (2021-09-08T19:36:29Z) - Synthesizing the Unseen for Zero-shot Object Detection [72.38031440014463]
We propose to synthesize visual features for unseen classes, so that the model learns both seen and unseen objects in the visual domain.
We use a novel generative model that uses class-semantics to not only generate the features but also to discriminatively separate them.
arXiv Detail & Related papers (2020-10-19T12:36:11Z) - Category-Level Articulated Object Pose Estimation [34.57672805595464]
We introduce the Articulation-aware Normalized Coordinate Space Hierarchy (ANCSH).
ANCSH is a canonical representation for different articulated objects in a given category.
We develop a deep network based on PointNet++ that predicts ANCSH from a single depth point cloud.
arXiv Detail & Related papers (2019-12-26T18:34:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.