CoordAR: One-Reference 6D Pose Estimation of Novel Objects via Autoregressive Coordinate Map Generation
- URL: http://arxiv.org/abs/2511.12919v2
- Date: Sat, 22 Nov 2025 13:25:12 GMT
- Title: CoordAR: One-Reference 6D Pose Estimation of Novel Objects via Autoregressive Coordinate Map Generation
- Authors: Dexin Zuo, Ang Li, Wei Wang, Wenxian Yu, Danping Zou,
- Abstract summary: CoordAR is a novel autoregressive framework for one-reference 6D pose estimation of unseen objects.<n>We introduce 1) a novel coordinate map tokenization that enables probabilistic prediction over discretized 3D space; 2) a modality-decoupled encoding strategy that separately encodes RGB appearance and coordinate cues; and 3) an autoregressive transformer decoder conditioned on both position-aligned query features.
- Score: 20.453498343557026
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Object 6D pose estimation, a crucial task for robotics and augmented reality applications, becomes particularly challenging when dealing with novel objects whose 3D models are not readily available. To reduce dependency on 3D models, recent studies have explored one-reference-based pose estimation, which requires only a single reference view instead of a complete 3D model. However, existing methods that rely on real-valued coordinate regression suffer from limited global consistency due to the local nature of convolutional architectures and face challenges in symmetric or occluded scenarios owing to a lack of uncertainty modeling. We present CoordAR, a novel autoregressive framework for one-reference 6D pose estimation of unseen objects. CoordAR formulates 3D-3D correspondences between the reference and query views as a map of discrete tokens, which is obtained in an autoregressive and probabilistic manner. To enable accurate correspondence regression, CoordAR introduces 1) a novel coordinate map tokenization that enables probabilistic prediction over discretized 3D space; 2) a modality-decoupled encoding strategy that separately encodes RGB appearance and coordinate cues; and 3) an autoregressive transformer decoder conditioned on both position-aligned query features and the partially generated token sequence. With these novel mechanisms, CoordAR significantly outperforms existing methods on multiple benchmarks and demonstrates strong robustness to symmetry, occlusion, and other challenges in real-world tests.
Related papers
- Mono3DV: Monocular 3D Object Detection with 3D-Aware Bipartite Matching and Variational Query DeNoising [0.6423989407081764]
Mono3DV is a novel Transformer-based framework for 3D object detection.<n>We develop a 3D-Aware Bipartite Matching strategy that directly incorporates 3D geometric information into the matching cost.<n>Second, it is important to stabilize the Bipartite Matching to resolve the instability occurring when integrating 3D attributes.
arXiv Detail & Related papers (2026-01-03T02:06:28Z) - Zero-shot Inexact CAD Model Alignment from a Single Image [53.37898107159792]
A practical approach to infer 3D scene structure from a single image is to retrieve a closely matching 3D model from a database and align it with the object in the image.<n>Existing methods rely on supervised training with images and pose annotations, which limits them to a narrow set of object categories.<n>We propose a weakly supervised 9-DoF alignment method for inexact 3D models that requires no pose annotations and generalizes to unseen categories.
arXiv Detail & Related papers (2025-07-04T04:46:59Z) - UNOPose: Unseen Object Pose Estimation with an Unposed RGB-D Reference Image [86.7128543480229]
Unseen object pose estimation methods often rely on CAD models or multiple reference views.<n>To simplify reference acquisition, we aim to estimate the unseen object's pose through a single unposed RGB-D reference image.<n>We present a novel approach and benchmark, termed UNOPose, for unseen one-reference-based object pose estimation.
arXiv Detail & Related papers (2024-11-25T05:36:00Z) - No Pose, No Problem: Surprisingly Simple 3D Gaussian Splats from Sparse Unposed Images [100.80376573969045]
NoPoSplat is a feed-forward model capable of reconstructing 3D scenes parameterized by 3D Gaussians from multi-view images.
Our model achieves real-time 3D Gaussian reconstruction during inference.
This work makes significant advances in pose-free generalizable 3D reconstruction and demonstrates its applicability to real-world scenarios.
arXiv Detail & Related papers (2024-10-31T17:58:22Z) - iComMa: Inverting 3D Gaussian Splatting for Camera Pose Estimation via Comparing and Matching [14.737266480464156]
We present a method named iComMa to address the 6D camera pose estimation problem in computer vision.
We propose an efficient method for accurate camera pose estimation by inverting 3D Gaussian Splatting (3DGS)
arXiv Detail & Related papers (2023-12-14T15:31:33Z) - A Probabilistic Attention Model with Occlusion-aware Texture Regression
for 3D Hand Reconstruction from a Single RGB Image [5.725477071353354]
Deep learning approaches have shown promising results in 3D hand reconstruction from a single RGB image.
We propose a novel probabilistic model to achieve the robustness of model-based approaches.
We demonstrate the flexibility of the proposed probabilistic model to be trained in both supervised and weakly-supervised scenarios.
arXiv Detail & Related papers (2023-04-27T16:02:32Z) - NeurAR: Neural Uncertainty for Autonomous 3D Reconstruction [64.36535692191343]
Implicit neural representations have shown compelling results in offline 3D reconstruction and also recently demonstrated the potential for online SLAM systems.
This paper addresses two key challenges: 1) seeking a criterion to measure the quality of the candidate viewpoints for the view planning based on the new representations, and 2) learning the criterion from data that can generalize to different scenes instead of hand-crafting one.
Our method demonstrates significant improvements on various metrics for the rendered image quality and the geometry quality of the reconstructed 3D models when compared with variants using TSDF or reconstruction without view planning.
arXiv Detail & Related papers (2022-07-22T10:05:36Z) - A Dual-Masked Auto-Encoder for Robust Motion Capture with
Spatial-Temporal Skeletal Token Completion [13.88656793940129]
We propose an adaptive, identity-aware triangulation module to reconstruct 3D joints and identify the missing joints for each identity.
We then propose a Dual-Masked Auto-Encoder (D-MAE) which encodes the joint status with both skeletal-structural and temporal position encoding for trajectory completion.
In order to demonstrate the proposed model's capability in dealing with severe data loss scenarios, we contribute a high-accuracy and challenging motion capture dataset.
arXiv Detail & Related papers (2022-07-15T10:00:43Z) - Non-Local Latent Relation Distillation for Self-Adaptive 3D Human Pose
Estimation [63.199549837604444]
3D human pose estimation approaches leverage different forms of strong (2D/3D pose) or weak (multi-view or depth) paired supervision.
We cast 3D pose learning as a self-supervised adaptation problem that aims to transfer the task knowledge from a labeled source domain to a completely unpaired target.
We evaluate different self-adaptation settings and demonstrate state-of-the-art 3D human pose estimation performance on standard benchmarks.
arXiv Detail & Related papers (2022-04-05T03:52:57Z) - Uncertainty-Aware Adaptation for Self-Supervised 3D Human Pose
Estimation [70.32536356351706]
We introduce MRP-Net that constitutes a common deep network backbone with two output heads subscribing to two diverse configurations.
We derive suitable measures to quantify prediction uncertainty at both pose and joint level.
We present a comprehensive evaluation of the proposed approach and demonstrate state-of-the-art performance on benchmark datasets.
arXiv Detail & Related papers (2022-03-29T07:14:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.