RoboEye: Enhancing 2D Robotic Object Identification with Selective 3D Geometric Keypoint Matching
- URL: http://arxiv.org/abs/2509.14966v1
- Date: Thu, 18 Sep 2025 13:59:24 GMT
- Title: RoboEye: Enhancing 2D Robotic Object Identification with Selective 3D Geometric Keypoint Matching
- Authors: Xingwu Zhang, Guanxuan Li, Zhuocheng Zhang, Zijun Long,
- Abstract summary: RoboEye is a framework that augments 2D semantic features with domain-adapted 3D reasoning and lightweight adapters.<n> Experiments show that RoboEye improves Recall@1 by 7.1% over the prior state of the art.
- Score: 5.240139281459202
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The rapidly growing number of product categories in large-scale e-commerce makes accurate object identification for automated packing in warehouses substantially more difficult. As the catalog grows, intra-class variability and a long tail of rare or visually similar items increase, and when combined with diverse packaging, cluttered containers, frequent occlusion, and large viewpoint changes-these factors amplify discrepancies between query and reference images, causing sharp performance drops for methods that rely solely on 2D appearance features. Thus, we propose RoboEye, a two-stage identification framework that dynamically augments 2D semantic features with domain-adapted 3D reasoning and lightweight adapters to bridge training deployment gaps. In the first stage, we train a large vision model to extract 2D features for generating candidate rankings. A lightweight 3D-feature-awareness module then estimates 3D feature quality and predicts whether 3D re-ranking is necessary, preventing performance degradation and avoiding unnecessary computation. When invoked, the second stage uses our robot 3D retrieval transformer, comprising a 3D feature extractor that produces geometry-aware dense features and a keypoint-based matcher that computes keypoint-correspondence confidences between query and reference images instead of conventional cosine-similarity scoring. Experiments show that RoboEye improves Recall@1 by 7.1% over the prior state of the art (RoboLLM). Moreover, RoboEye operates using only RGB images, avoiding reliance on explicit 3D inputs and reducing deployment costs. The code used in this paper is publicly available at: https://github.com/longkukuhi/RoboEye.
Related papers
- RoboTAG: End-to-end Robot Configuration Estimation via Topological Alignment Graph [62.270763554624615]
Estimating robot pose from a monocular RGB image is a challenge in robotics and computer vision.<n>Existing methods typically build networks on top of 2D visual backbones and depend heavily on labeled data for training.<n>We propose Robot Topological Alignment Graph (RoboTAG), which incorporates a 3D branch to inject 3D priors while enabling co-evolution of the 2D and 3D representations.
arXiv Detail & Related papers (2025-11-11T00:49:15Z) - 3DGS-CD: 3D Gaussian Splatting-based Change Detection for Physical Object Rearrangement [2.2122801766964795]
We present 3DGS-CD, the first 3D Gaussian Splatting (3DGS)-based method for detecting physical object rearrangements in 3D scenes.<n>Our approach estimates 3D object-level changes by comparing two sets of unaligned images taken at different times.<n>Our method can accurately identify changes in cluttered environments using sparse (as few as one) post-change images within as little as 18s.
arXiv Detail & Related papers (2024-11-06T07:08:41Z) - Any2Point: Empowering Any-modality Large Models for Efficient 3D Understanding [83.63231467746598]
We introduce Any2Point, a parameter-efficient method to empower any-modality large models (vision, language, audio) for 3D understanding.
We propose a 3D-to-any (1D or 2D) virtual projection strategy that correlates the input 3D points to the original 1D or 2D positions within the source modality.
arXiv Detail & Related papers (2024-04-11T17:59:45Z) - GOEmbed: Gradient Origin Embeddings for Representation Agnostic 3D Feature Learning [67.61509647032862]
We propose GOEmbed (Gradient Origin Embeddings) that encodes input 2D images into any 3D representation.
Unlike typical prior approaches in which input images are encoded using 2D features extracted from large pre-trained models, or customized features are designed to handle different 3D representations.
arXiv Detail & Related papers (2023-12-14T08:39:39Z) - 3DiffTection: 3D Object Detection with Geometry-Aware Diffusion Features [70.50665869806188]
3DiffTection is a state-of-the-art method for 3D object detection from single images.
We fine-tune a diffusion model to perform novel view synthesis conditioned on a single image.
We further train the model on target data with detection supervision.
arXiv Detail & Related papers (2023-11-07T23:46:41Z) - Act3D: 3D Feature Field Transformers for Multi-Task Robotic Manipulation [18.964403296437027]
Act3D represents the robot's workspace using a 3D feature field with adaptive resolutions dependent on the task at hand.
It samples 3D point grids in a coarse to fine manner, featurizes them using relative-position attention, and selects where to focus the next round of point sampling.
arXiv Detail & Related papers (2023-06-30T17:34:06Z) - Bridged Transformer for Vision and Point Cloud 3D Object Detection [92.86856146086316]
Bridged Transformer (BrT) is an end-to-end architecture for 3D object detection.
BrT learns to identify 3D and 2D object bounding boxes from both points and image patches.
We experimentally show that BrT surpasses state-of-the-art methods on SUN RGB-D and ScanNetV2 datasets.
arXiv Detail & Related papers (2022-10-04T05:44:22Z) - 3D-to-2D Distillation for Indoor Scene Parsing [78.36781565047656]
We present a new approach that enables us to leverage 3D features extracted from large-scale 3D data repository to enhance 2D features extracted from RGB images.
First, we distill 3D knowledge from a pretrained 3D network to supervise a 2D network to learn simulated 3D features from 2D features during the training.
Second, we design a two-stage dimension normalization scheme to calibrate the 2D and 3D features for better integration.
Third, we design a semantic-aware adversarial training model to extend our framework for training with unpaired 3D data.
arXiv Detail & Related papers (2021-04-06T02:22:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.