GMatch: A Lightweight, Geometry-Constrained Keypoint Matcher for Zero-Shot 6DoF Pose Estimation in Robotic Grasp Tasks
- URL: http://arxiv.org/abs/2505.16144v2
- Date: Sun, 19 Oct 2025 00:56:47 GMT
- Title: GMatch: A Lightweight, Geometry-Constrained Keypoint Matcher for Zero-Shot 6DoF Pose Estimation in Robotic Grasp Tasks
- Authors: Ming Yang, Haoran Li,
- Abstract summary: 6DoF object pose estimation is fundamental to robotic grasp tasks. GMatch is a lightweight, geometry-constrained keypoint matcher that can run efficiently on embedded CPU-only platforms.
- Score: 9.107487205419604
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: 6DoF object pose estimation is fundamental to robotic grasp tasks. While recent learning-based methods achieve high accuracy, their computational demands hinder deployment on resource-constrained mobile platforms. In this work, we revisit the classical keypoint matching paradigm and propose GMatch, a lightweight, geometry-constrained keypoint matcher that can run efficiently on embedded CPU-only platforms. GMatch works with keypoint descriptors and uses a set of geometric constraints to resolve the inherent ambiguities between features extracted by descriptors, yielding globally consistent correspondences from which the 6DoF pose can be easily solved. We benchmark GMatch on the HOPE and YCB-Video datasets, where our method outperforms existing keypoint matchers (both feature-based and geometry-based) across three commonly used descriptors and approaches the SOTA zero-shot method on texture-rich objects while running on far more modest hardware. The method is further deployed on a LoCoBot mobile manipulator, enabling a one-shot grasp pipeline that demonstrates high task success rates in real-world experiments. In short, owing to its lightweight and white-box nature, GMatch offers a practical solution for resource-limited robotic systems, and although it is currently bottlenecked by descriptor quality, the framework presents a promising direction towards robust yet efficient pose estimation. Code will be released soon under the Mozilla Public License.
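The abstract does not spell out which geometric constraints GMatch uses or how the pose is solved, so the following is only a generic sketch of the idea, not the authors' algorithm. It assumes 3D keypoints on both the model and the scene, uses a simple rigid-distance check as a stand-in constraint, and recovers the pose with the standard Kabsch (SVD) solution. The function names `distance_consistent` and `kabsch`, the tolerance `tol`, and the toy points are all hypothetical.

```python
import numpy as np


def distance_consistent(src, dst, pairs, tol=1e-2):
    """Greedily keep candidate matches (i, j) whose pairwise distances agree.

    A rigid motion preserves distances, so for two correct matches (i, j) and
    (a, b), |src[i] - src[a]| must equal |dst[j] - dst[b]| up to `tol`.
    Note: this greedy variant trusts the first pair; it is illustrative only.
    """
    kept = []
    for i, j in pairs:
        ok = all(
            abs(np.linalg.norm(src[i] - src[a]) - np.linalg.norm(dst[j] - dst[b])) <= tol
            for a, b in kept
        )
        if ok:
            kept.append((i, j))
    return kept


def kabsch(P, Q):
    """Least-squares rigid transform (R, t) with R @ P[k] + t close to Q[k]."""
    cp, cq = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cp).T @ (Q - cq)              # 3x3 cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T  # d guards against reflections
    t = cq - R @ cp
    return R, t


# Toy data (assumed): model keypoints and the same points after a known pose.
model = np.array([[0, 0, 0], [1, 0, 0], [0, 2, 0], [0, 0, 3], [1, 1, 1]], dtype=float)
R_true = np.array([[0, -1, 0], [1, 0, 0], [0, 0, 1]], dtype=float)  # 90 deg about z
t_true = np.array([0.5, -0.2, 1.0])
scene = model @ R_true.T + t_true

# Candidate matches from a descriptor, including one ambiguous wrong match (2, 3).
candidates = [(0, 0), (1, 1), (2, 3), (2, 2), (3, 3), (4, 4)]
kept = distance_consistent(model, scene, candidates)

src_idx = [i for i, _ in kept]
dst_idx = [j for _, j in kept]
R_est, t_est = kabsch(model[src_idx], scene[dst_idx])
```

In this toy case the distance check discards the wrong match before the pose is solved. A practical matcher would also have to handle a wrong first match, noise, and partial visibility, which this sketch does not attempt.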
Related papers
- Learning Fine-Grained Correspondence with Cross-Perspective Perception for Open-Vocabulary 6D Object Pose Estimation [14.262846967061947]
Fine-grained Correspondence Pose Estimation (FiCoP) is a framework that transitions from noise-prone global matching to spatially-constrained patch-level correspondence. FiCoP improves Average Recall by 8.0% and 6.1%, respectively, compared to the state-of-the-art method.
arXiv Detail & Related papers (2026-01-20T03:48:54Z) - Leveraging CVAE for Joint Configuration Estimation of Multifingered Grippers from Point Cloud Data [1.3124513975412255]
This paper presents an efficient approach for determining the joint configuration of a multifingered gripper solely from the point cloud data of its poly-articulated chain. We use a Conditional Variational Auto-Encoder (CVAE), which takes point cloud data of key structural elements as input and reconstructs the corresponding joint configurations. We validate our approach on the MultiDex grasping dataset using the Allegro Hand, operating within 0.05 milliseconds and achieving accuracy comparable to state-of-the-art methods.
arXiv Detail & Related papers (2025-11-21T14:31:39Z) - You Only Estimate Once: Unified, One-stage, Real-Time Category-level Articulated Object 6D Pose Estimation for Robotic Grasping [119.41166438439313]
YOEO is a single-stage method that outputs instance segmentation and NPCS representations in an end-to-end manner. We use a unified network to generate point-wise semantic labels and centroid offsets, allowing points from the same part instance to vote for the same centroid. We also deploy our synthetically-trained model in a real-world setting, providing real-time visual feedback at 200Hz.
arXiv Detail & Related papers (2025-06-06T03:49:20Z) - To Glue or Not to Glue? Classical vs Learned Image Matching for Mobile Mapping Cameras to Textured Semantic 3D Building Models [5.4693951128908935]
This work systematically evaluates the effectiveness of different feature-matching techniques in visual localization using textured CityGML LoD2 models. The results indicate that the learnable feature matching methods vastly outperform traditional approaches regarding accuracy and robustness.
arXiv Detail & Related papers (2025-05-23T14:41:41Z) - Robust Markov stability for community detection at a scale learned based on the structure [0.0]
We propose a principled method to select a single robust partition at a suitable scale from the multiple partitions that PyGenStability produces. Our proposed method combines the Markov stability framework with a pre-trained machine learning model for scale selection. We show that PyGenStabilityOne (PO) outperforms 25 other algorithms by statistically meaningful margins.
arXiv Detail & Related papers (2025-04-15T21:16:14Z) - MatchU: Matching Unseen Objects for 6D Pose Estimation from RGB-D Images [57.71600854525037]
We propose a Fuse-Describe-Match strategy for 6D pose estimation from RGB-D images.
MatchU is a generic approach that fuses 2D texture and 3D geometric cues for 6D pose prediction of unseen objects.
arXiv Detail & Related papers (2024-03-03T14:01:03Z) - RGM: A Robust Generalizable Matching Model [49.60975442871967]
We propose a deep model for sparse and dense matching, termed RGM (Robust Generalist Matching).
To narrow the gap between synthetic training samples and real-world scenarios, we build a new, large-scale dataset with sparse correspondence ground truth.
We are able to mix up various dense and sparse matching datasets, significantly improving the training diversity.
arXiv Detail & Related papers (2023-10-18T07:30:08Z) - Q-REG: End-to-End Trainable Point Cloud Registration with Surface Curvature [81.25511385257344]
We present a novel solution, Q-REG, which utilizes rich geometric information to estimate the rigid pose from a single correspondence.
Q-REG allows the robust estimation to be formalized as an exhaustive search, hence enabling end-to-end training.
We demonstrate in the experiments that Q-REG is agnostic to the correspondence matching method and provides consistent improvement both when used only in inference and in end-to-end training.
arXiv Detail & Related papers (2023-09-27T20:58:53Z) - TTPOINT: A Tensorized Point Cloud Network for Lightweight Action Recognition with Event Cameras [5.925545594655497]
Event cameras generate sparse and asynchronous data, which is incompatible with the traditional frame-based method.
We propose a point cloud network called TTPOINT which achieves competitive results even compared to the state-of-the-art (SOTA) frame-based method in action recognition tasks.
arXiv Detail & Related papers (2023-08-19T11:58:31Z) - Explicit Correspondence Matching for Generalizable Neural Radiance Fields [49.49773108695526]
We present a new NeRF method that is able to generalize to new unseen scenarios and perform novel view synthesis with as few as two source views.
The explicit correspondence matching is quantified with the cosine similarity between image features sampled at the 2D projections of a 3D point on different views.
Our method achieves state-of-the-art results on different evaluation settings, with the experiments showing a strong correlation between our learned cosine feature similarity and volume density.
arXiv Detail & Related papers (2023-04-24T17:46:01Z) - OAMatcher: An Overlapping Areas-based Network for Accurate Local Feature Matching [9.006654114778073]
We propose OAMatcher, a detector-free method that imitates human behavior to generate dense and accurate matches.
OAMatcher predicts overlapping areas to promote effective and clean global context aggregation.
Comprehensive experiments demonstrate that OAMatcher outperforms the state-of-the-art methods on several benchmarks.
arXiv Detail & Related papers (2023-02-12T03:32:45Z) - DeepMatcher: A Deep Transformer-based Network for Robust and Accurate Local Feature Matching [9.662752427139496]
We propose a deep Transformer-based network built upon our investigation of local feature matching in detector-free methods.
DeepMatcher captures more human-intuitive and simpler-to-match features.
We show that DeepMatcher significantly outperforms the state-of-the-art methods on several benchmarks.
arXiv Detail & Related papers (2023-01-08T07:15:09Z) - Learning to Detect Good Keypoints to Match Non-Rigid Objects in RGB Images [7.428474910083337]
We present a novel learned keypoint detection method designed to maximize the number of correct matches for the task of non-rigid image correspondence.
Our training framework uses true correspondences, obtained by matching annotated image pairs with a predefined descriptor extractor, as ground truth to train a convolutional neural network (CNN).
Experiments show that our method outperforms the state-of-the-art keypoint detector on real images of non-rigid objects by 20 p.p. on Mean Matching Accuracy.
arXiv Detail & Related papers (2022-12-13T11:59:09Z) - Adaptive Assignment for Geometry Aware Local Feature Matching [22.818457285745733]
Detector-free feature matching approaches are currently attracting great attention thanks to their excellent performance.
We introduce AdaMatcher, which accomplishes the feature correlation and co-visible area estimation through an elaborate feature interaction module.
AdaMatcher then performs adaptive assignment on patch-level matching while estimating the scales between images, and finally refines the co-visible matches through scale alignment and sub-pixel regression module.
arXiv Detail & Related papers (2022-07-18T08:22:18Z) - TransforMatcher: Match-to-Match Attention for Semantic Correspondence [48.25709192748133]
We introduce a strong semantic image matching learner, dubbed TransforMatcher, which builds on the success of transformer networks in vision domains.
Unlike existing convolution- or attention-based schemes for correspondence, TransforMatcher performs global match-to-match attention for precise match localization and dynamic refinement.
In experiments, TransforMatcher sets a new state of the art on SPair-71k while performing on par with existing SOTA methods on the PF-PASCAL dataset.
arXiv Detail & Related papers (2022-05-23T21:02:01Z) - Semantic keypoint-based pose estimation from single RGB frames [64.80395521735463]
We present an approach to estimating the continuous 6-DoF pose of an object from a single RGB image.
The approach combines semantic keypoints predicted by a convolutional network (convnet) with a deformable shape model.
We show that our approach can accurately recover the 6-DoF object pose for both instance- and class-based scenarios.
arXiv Detail & Related papers (2022-04-12T15:03:51Z) - Progressive Coordinate Transforms for Monocular 3D Object Detection [52.00071336733109]
We propose a novel and lightweight approach, dubbed Progressive Coordinate Transforms (PCT), to facilitate learning coordinate representations.
arXiv Detail & Related papers (2021-08-12T15:22:33Z) - SIFT Matching by Context Exposed [7.99536002595393]
This paper investigates how to step up local image descriptor matching by exploiting matching context information.
A new matching strategy and a novel local spatial filter, named respectively blob matching and Delaunay Triangulation Matching (DTM), are devised.
DTM is comparable to or better than the state-of-the-art in terms of matching accuracy and robustness, especially for non-planar scenes.
arXiv Detail & Related papers (2021-06-17T15:10:59Z) - 3D Correspondence Grouping with Compatibility Features [51.869670613445685]
We present a simple yet effective method for 3D correspondence grouping.
The objective is to accurately classify initial correspondences obtained by matching local geometric descriptors into inliers and outliers.
We propose a novel representation for 3D correspondences, dubbed compatibility feature (CF), to describe the consistencies within inliers and inconsistencies within outliers.
arXiv Detail & Related papers (2020-07-21T02:39:48Z) - Making Affine Correspondences Work in Camera Geometry Computation [62.7633180470428]
Local features provide region-to-region rather than point-to-point correspondences.
We propose guidelines for effective use of region-to-region matches in the course of a full model estimation pipeline.
Experiments show that affine solvers can achieve accuracy comparable to point-based solvers at faster run-times.
arXiv Detail & Related papers (2020-07-20T12:07:48Z) - PrimiTect: Fast Continuous Hough Voting for Primitive Detection [49.72425950418304]
Our method classifies points into different geometric primitives, such as planes and cones, leading to a compact representation of the data.
We use a local, low-dimensional parameterization of primitives to determine type, shape and pose of the object that a point belongs to.
This makes our algorithm suitable to run on devices with low computational power, as often required in robotics applications.
arXiv Detail & Related papers (2020-05-15T10:16:07Z) - Image Matching across Wide Baselines: From Paper to Practice [80.9424750998559]
We introduce a comprehensive benchmark for local features and robust estimation algorithms.
Our pipeline's modular structure allows easy integration, configuration, and combination of different methods.
We show that with proper settings, classical solutions may still outperform the perceived state of the art.
arXiv Detail & Related papers (2020-03-03T15:20:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.