Related papers: Geo6DPose: Fast Zero-Shot 6D Object Pose Estimation via Geometry-Filtered Feature Matching

Geo6DPose: Fast Zero-Shot 6D Object Pose Estimation via Geometry-Filtered Feature Matching

URL: http://arxiv.org/abs/2512.10674v1
Date: Thu, 11 Dec 2025 14:20:17 GMT
Title: Geo6DPose: Fast Zero-Shot 6D Object Pose Estimation via Geometry-Filtered Feature Matching
Authors: Javier Villena Toro, Mehdi Tarkian,
Abstract summary: Geo6DPose is a lightweight, fully local, and training-free pipeline for zero-shot 6D pose estimation.<n>Geo6DPose achieves sub-second inference on a single commodity GPU while matching the average recall of significantly larger zero-shot baselines.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent progress in zero-shot 6D object pose estimation has been driven largely by large-scale models and cloud-based inference. However, these approaches often introduce high latency, elevated energy consumption, and deployment risks related to connectivity, cost, and data governance; factors that conflict with the practical constraints of real-world robotics, where compute is limited and on-device inference is frequently required. We introduce Geo6DPose, a lightweight, fully local, and training-free pipeline for zero-shot 6D pose estimation that trades model scale for geometric reliability. Our method combines foundation model visual features with a geometric filtering strategy: Similarity maps are computed between onboarded template DINO descriptors and scene patches, and mutual correspondences are established by projecting scene patch centers to 3D and template descriptors to the object model coordinate system. Final poses are recovered via correspondence-driven RANSAC and ranked using a weighted geometric alignment metric that jointly accounts for reprojection consistency and spatial support, improving robustness to noise, clutter, and partial visibility. Geo6DPose achieves sub-second inference on a single commodity GPU while matching the average recall of significantly larger zero-shot baselines (53.7 AR, 1.08 FPS). It requires no training, fine-tuning, or network access, and remains compatible with evolving foundation backbones, advancing practical, fully local 6D perception for robotic deployment.

Related papers

TALO: Pushing 3D Vision Foundation Models Towards Globally Consistent Online Reconstruction [57.46712611558817]
3D vision foundation models have shown strong generalization in reconstructing key 3D attributes from uncalibrated images through a single feed-forward pass.<n>Recent strategies align consecutive predictions by solving global transformation, yet our analysis reveals their fundamental limitations in assumption validity, local alignment scope, and robustness under noisy geometry.<n>We propose a higher-DOF and long-term alignment framework based on Thin Plate Spline, leveraging globally propagated control points to correct spatially varying inconsistencies.
arXiv Detail & Related papers (2025-12-02T02:22:20Z)
OPFormer: Object Pose Estimation leveraging foundation model with geometric encoding [2.1987601456703474]
We introduce a unified, end-to-end framework that seamlessly integrates object detection and pose estimation.<n>Our system first employs the CNOS detector to localize target objects.<n>For each detection, our novel pose estimation module, OPFormer, infers the precise 6D pose.
arXiv Detail & Related papers (2025-11-16T14:19:52Z)
Beyond 'Templates': Category-Agnostic Object Pose, Size, and Shape Estimation from a Single View [69.6117755984012]
Estimating an object's 6D pose, size, and shape from visual input is a fundamental problem in computer vision.<n>We propose a unified category-agnostic framework that simultaneously predicts 6D pose, size, and dense shape from a single RGB-D image.
arXiv Detail & Related papers (2025-10-13T17:49:15Z)
CAP-Net: A Unified Network for 6D Pose and Size Estimation of Categorical Articulated Parts from a Single RGB-D Image [86.75098349480014]
This paper tackles category-level pose estimation of articulated objects in robotic manipulation tasks.<n>We propose a single-stage Network, CAP-Net, for estimating the 6D poses and sizes of Categorical Articulated Parts.<n>We introduce the RGBD-Art dataset, the largest RGB-D articulated dataset to date, featuring RGB images and depth noise simulated from real sensors.
arXiv Detail & Related papers (2025-04-15T14:30:26Z)
Novel Object 6D Pose Estimation with a Single Reference View [51.61766126145154]
Existing novel object 6D pose estimation methods typically rely on CAD models or dense reference views.<n>We propose a Single-Reference-based novel object 6D (SinRef-6D) pose estimation method.<n>Our key idea is to iteratively establish point-wise alignment in a common coordinate system.
arXiv Detail & Related papers (2025-03-07T17:00:41Z)
FS6D: Few-Shot 6D Pose Estimation of Novel Objects [116.34922994123973]
6D object pose estimation networks are limited in their capability to scale to large numbers of object instances. In this work, we study a new open set problem; the few-shot 6D object poses estimation: estimating the 6D pose of an unknown object by a few support views without extra training.
arXiv Detail & Related papers (2022-03-28T10:31:29Z)
Category-Level 6D Object Pose Estimation via Cascaded Relation and Recurrent Reconstruction Networks [22.627704070200863]
Category-level 6D pose estimation is fundamental to many scenarios such as robotic manipulation and augmented reality. We achieve accurate category-level 6D pose estimation via cascaded relation and recurrent reconstruction networks. Our method exceeds the latest state-of-the-art SPD by $4.9%$ and $17.7%$ on the CAMERA25 dataset.
arXiv Detail & Related papers (2021-08-19T15:46:52Z)
Visual SLAM with Graph-Cut Optimized Multi-Plane Reconstruction [11.215334675788952]
This paper presents a semantic planar SLAM system that improves pose estimation and mapping using cues from an instance planar segmentation network. While the mainstream approaches are using RGB-D sensors, employing a monocular camera with such a system still faces challenges such as robust data association and precise geometric model fitting.
arXiv Detail & Related papers (2021-08-09T18:16:08Z)
GDRNPP: A Geometry-guided and Fully Learning-based Object Pose Estimator [51.89441403642665]
6D pose estimation of rigid objects is a long-standing and challenging task in computer vision.<n>Recently, the emergence of deep learning reveals the potential of Convolutional Neural Networks (CNNs) to predict reliable 6D poses.<n>This paper introduces a fully learning-based object pose estimator.
arXiv Detail & Related papers (2021-02-24T09:11:31Z)
Monocular 3D Detection with Geometric Constraints Embedding and Semi-supervised Training [3.8073142980733]
We propose a novel framework for monocular 3D objects detection using only RGB images, called KM3D-Net. We design a fully convolutional model to predict object keypoints, dimension, and orientation, and then combine these estimations with perspective geometry constraints to compute position attribute.
arXiv Detail & Related papers (2020-09-02T00:51:51Z)

This list is automatically generated from the titles and abstracts of the papers in this site.