Yolo-Key-6D: Single Stage Monocular 6D Pose Estimation with Keypoint Enhancements
- URL: http://arxiv.org/abs/2603.03879v1
- Date: Wed, 04 Mar 2026 09:31:07 GMT
- Title: Yolo-Key-6D: Single Stage Monocular 6D Pose Estimation with Keypoint Enhancements
- Authors: Kemal Alperen Çetiner, Hazım Kemal Ekenel
- Abstract summary: We present Yolo-Key-6D, a novel single-stage, end-to-end framework for monocular 6D pose estimation. Our approach enhances a YOLO-based architecture by integrating an auxiliary head that regresses the 2D projections of an object's 3D bounding box corners. YOLO-Key-6D achieves competitive accuracy scores of 96.24% and 69.41%, respectively, with the ADD(-S) 0.1d metric.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Estimating the 6D pose of objects from a single RGB image is a critical task for robotics and extended reality applications. However, state-of-the-art multi-stage methods often suffer from high latency, making them unsuitable for real-time use. In this paper, we present Yolo-Key-6D, a novel single-stage, end-to-end framework for monocular 6D pose estimation designed for both speed and accuracy. Our approach enhances a YOLO-based architecture by integrating an auxiliary head that regresses the 2D projections of an object's 3D bounding box corners. This keypoint detection task significantly improves the network's understanding of 3D geometry. For stable end-to-end training, we directly regress rotation using a continuous 9D representation projected to SO(3) via singular value decomposition. On the LINEMOD and LINEMOD-Occluded benchmarks, YOLO-Key-6D achieves competitive accuracy scores of 96.24% and 69.41%, respectively, under the ADD(-S) 0.1d metric, while operating in real time. Our results demonstrate that a carefully designed single-stage method can provide a practical and effective balance of performance and efficiency for real-world deployment.
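The continuous 9D rotation representation mentioned in the abstract can be sketched as follows: the network regresses an unconstrained 3x3 matrix, which is projected onto the nearest rotation matrix via SVD. This is a minimal NumPy sketch of that standard orthogonalization step, not the paper's actual code; the function name and shapes are illustrative assumptions.

```python
import numpy as np

def project_to_so3(m9: np.ndarray) -> np.ndarray:
    """Project a raw 9D regression output onto the closest rotation matrix.

    m9: array of shape (9,), the unconstrained network output.
    Returns a proper rotation matrix R in SO(3) with det(R) = +1.
    """
    M = m9.reshape(3, 3)
    U, _, Vt = np.linalg.svd(M)
    # Flip the sign of the last singular direction if needed so that
    # det(R) = +1, keeping R a rotation rather than a reflection.
    d = np.sign(np.linalg.det(U @ Vt))
    R = U @ np.diag([1.0, 1.0, d]) @ Vt
    return R
```

Because this projection is differentiable almost everywhere, gradients can flow through it during end-to-end training, which is what makes the representation attractive for stable regression.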
Related papers
- Any6D: Model-free 6D Pose Estimation of Novel Objects [76.30057578269668]
We introduce Any6D, a model-free framework for 6D object pose estimation. It requires only a single RGB-D anchor image to estimate both the 6D pose and size of unknown objects in novel scenes. We evaluate our method on five challenging datasets.
arXiv Detail & Related papers (2025-03-24T13:46:21Z)
- 6DOPE-GS: Online 6D Object Pose Estimation using Gaussian Splatting [7.7145084897748974]
We present 6DOPE-GS, a novel method for online 6D object pose estimation & tracking with a single RGB-D camera. We show that 6DOPE-GS matches the performance of state-of-the-art baselines for model-free simultaneous 6D pose tracking and reconstruction. We also demonstrate the method's suitability for live, dynamic object tracking and reconstruction in a real-world setting.
arXiv Detail & Related papers (2024-12-02T14:32:19Z)
- FSD: Fast Self-Supervised Single RGB-D to Categorical 3D Objects [37.175069234979645]
This work addresses the challenging task of 3D object recognition without the reliance on real-world 3D labeled data.
Our goal is to predict the 3D shape, size, and 6D pose of objects within a single RGB-D image, operating at the category level and eliminating the need for CAD models during inference.
arXiv Detail & Related papers (2023-10-19T17:59:09Z)
- UniPAD: A Universal Pre-training Paradigm for Autonomous Driving [74.34701012543968]
We present UniPAD, a novel self-supervised learning paradigm applying 3D differentiable rendering.
UniPAD implicitly encodes 3D space, facilitating the reconstruction of continuous 3D shape structures.
Our method significantly improves lidar-, camera-, and lidar-camera-based baselines by 9.1, 7.7, and 6.9 NDS, respectively.
arXiv Detail & Related papers (2023-10-12T14:39:58Z)
- 6D Pose Estimation with Combined Deep Learning and 3D Vision Techniques for a Fast and Accurate Object Grasping [0.19686770963118383]
Real-time robotic grasping is a priority target for highly advanced autonomous systems.
This paper proposes a novel 2-stage approach: fast 2D object recognition using a deep neural network, followed by 6D pose estimation based on 3D vision techniques.
The proposed solution has the potential to perform robustly in real-time applications requiring both efficiency and accuracy.
arXiv Detail & Related papers (2021-11-11T15:36:55Z)
- Learning Stereopsis from Geometric Synthesis for 6D Object Pose Estimation [11.999630902627864]
Current monocular-based 6D object pose estimation methods generally achieve less competitive results than RGBD-based methods.
This paper proposes a 3D geometric volume-based pose estimation method with a short-baseline two-view setting.
Experiments show that our method outperforms state-of-the-art monocular-based methods, and is robust in different objects and scenes.
arXiv Detail & Related papers (2021-09-25T02:55:05Z)
- SO-Pose: Exploiting Self-Occlusion for Direct 6D Pose Estimation [98.83762558394345]
SO-Pose is a framework for regressing all 6 degrees-of-freedom (6DoF) for the object pose in a cluttered environment from a single RGB image.
We introduce a novel reasoning about self-occlusion, in order to establish a two-layer representation for 3D objects.
By enforcing cross-layer consistencies that align correspondences, self-occlusion, and 6D pose, we can further improve accuracy and robustness.
arXiv Detail & Related papers (2021-08-18T19:49:29Z)
- GDRNPP: A Geometry-guided and Fully Learning-based Object Pose Estimator [51.89441403642665]
6D pose estimation of rigid objects is a long-standing and challenging task in computer vision. Recently, the emergence of deep learning has revealed the potential of Convolutional Neural Networks (CNNs) to predict reliable 6D poses. This paper introduces a fully learning-based object pose estimator.
arXiv Detail & Related papers (2021-02-24T09:11:31Z)
- Synthetic Training for Monocular Human Mesh Recovery [100.38109761268639]
This paper aims to estimate 3D mesh of multiple body parts with large-scale differences from a single RGB image.
The main challenge is the lack of training data with complete 3D annotations of all body parts in 2D images.
We propose a depth-to-scale (D2S) projection to incorporate the depth difference into the projection function to derive per-joint scale variants.
arXiv Detail & Related papers (2020-10-27T03:31:35Z)
- 3DSSD: Point-based 3D Single Stage Object Detector [61.67928229961813]
We present a point-based 3D single stage object detector, named 3DSSD, achieving a good balance between accuracy and efficiency.
Our method outperforms all state-of-the-art voxel-based single stage methods by a large margin, and has comparable performance to two stage point-based methods as well.
arXiv Detail & Related papers (2020-02-24T12:01:58Z)
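For reference, the ADD(-S) 0.1d accuracy metric cited in the abstract above can be sketched as follows. ADD averages the distance between model points transformed by the predicted and ground-truth poses; ADD-S uses the closest-point distance for symmetric objects; a pose counts as correct when the error is below 10% of the object diameter. The function and variable names are illustrative assumptions, not code from any of the listed papers.

```python
import numpy as np

def add_metric(pts, R_gt, t_gt, R_pred, t_pred):
    """ADD: mean distance between correspondingly transformed model points."""
    gt = pts @ R_gt.T + t_gt
    pred = pts @ R_pred.T + t_pred
    return np.linalg.norm(gt - pred, axis=1).mean()

def add_s_metric(pts, R_gt, t_gt, R_pred, t_pred):
    """ADD-S: for symmetric objects, use the closest-point distance instead."""
    gt = pts @ R_gt.T + t_gt
    pred = pts @ R_pred.T + t_pred
    # For each ground-truth point, distance to the nearest predicted point
    # (a simple O(N^2) formulation for clarity).
    d = np.linalg.norm(gt[:, None, :] - pred[None, :, :], axis=2)
    return d.min(axis=1).mean()

def is_pose_correct(add_value, diameter, tau=0.1):
    """A pose is correct when the error is below tau * diameter (0.1d)."""
    return add_value < tau * diameter
```

The reported 96.24% and 69.41% figures are the fraction of test images for which `is_pose_correct` would return true under this 0.1d threshold.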
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.