MonoSE(3)-Diffusion: A Monocular SE(3) Diffusion Framework for Robust Camera-to-Robot Pose Estimation
- URL: http://arxiv.org/abs/2510.10434v1
- Date: Sun, 12 Oct 2025 03:57:30 GMT
- Title: MonoSE(3)-Diffusion: A Monocular SE(3) Diffusion Framework for Robust Camera-to-Robot Pose Estimation
- Authors: Kangjian Zhu, Haobo Jiang, Yigong Zhang, Jianjun Qian, Jian Yang, Jin Xie
- Abstract summary: We propose MonoSE(3)-Diffusion, a conditional denoising diffusion framework for robot pose estimation. The framework consists of two processes: a visibility-constrained diffusion process for diverse pose augmentation and a timestep-aware reverse process for pose refinement. Our approach demonstrates improvements across two benchmarks (DREAM and RoboKeyGen), representing a 32.3% gain over the state-of-the-art.
- Score: 39.15285671441867
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose MonoSE(3)-Diffusion, a monocular SE(3) diffusion framework that formulates markerless, image-based robot pose estimation as a conditional denoising diffusion process. The framework consists of two processes: a visibility-constrained diffusion process for diverse pose augmentation and a timestep-aware reverse process for progressive pose refinement. The diffusion process progressively perturbs ground-truth poses into noisy transformations for training a pose denoising network. Importantly, we integrate visibility constraints into this process, ensuring the perturbed transformations remain within the camera field of view. Compared to the fixed-scale perturbations used in current methods, the diffusion process generates in-view and diverse training poses, thereby improving the network's generalization capability. Furthermore, the reverse process iteratively predicts poses with the denoising network and refines the estimates by sampling from the diffusion posterior at the current timestep, following a scheduled coarse-to-fine procedure. Moreover, the timestep indicates the transformation scale, guiding the denoising network toward more accurate pose predictions. The reverse process is more robust than direct prediction, benefiting from its timestep-aware refinement scheme. Our approach demonstrates improvements across two benchmarks (DREAM and RoboKeyGen), achieving a notable AUC of 66.75 on the most challenging dataset, representing a 32.3% gain over the state-of-the-art.
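To make the two processes concrete, below is a minimal NumPy/SciPy sketch of (a) a forward perturbation that scales SE(3) noise with the timestep and resamples until the perturbed pose stays inside the camera field of view, and (b) a reverse loop in which a pose denoiser conditioned on the image and the timestep is applied iteratively, with the estimate interpolated toward each prediction as a simplified stand-in for sampling from the diffusion posterior. The noise schedule, the interpolation rule, and the `denoiser` interface are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of a visibility-constrained SE(3) forward perturbation and a
# timestep-aware reverse refinement loop, loosely following the abstract.
import numpy as np
from scipy.spatial.transform import Rotation


def perturb_pose(R, t, sigma_rot, sigma_trans):
    """Perturb a pose (R, t) with timestep-scaled SE(3) noise (assumed form)."""
    dR = Rotation.from_rotvec(sigma_rot * np.random.randn(3)).as_matrix()
    dt = sigma_trans * np.random.randn(3)
    return dR @ R, t + dt


def in_view(t, K_intr, img_w, img_h):
    """Visibility constraint: the perturbed robot-base origin must project
    inside the image and lie in front of the camera."""
    if t[2] <= 0.1:                      # behind or too close to the camera
        return False
    u, v, w = K_intr @ t                 # pinhole projection
    u, v = u / w, v / w
    return 0 <= u < img_w and 0 <= v < img_h


def sample_training_pose(R_gt, t_gt, k, K_steps, K_intr, img_w, img_h,
                         max_rot=0.8, max_trans=0.3, max_tries=50):
    """Forward process: draw an in-view pose whose noise scale grows with k."""
    sigma_rot = max_rot * k / K_steps            # assumed linear schedule
    sigma_trans = max_trans * k / K_steps
    for _ in range(max_tries):
        R_k, t_k = perturb_pose(R_gt, t_gt, sigma_rot, sigma_trans)
        if in_view(t_k, K_intr, img_w, img_h):
            return R_k, t_k
    return R_gt, t_gt                            # fall back to the clean pose


def reverse_refine(image, R_init, t_init, denoiser, K_steps=10):
    """Timestep-aware reverse process: the (hypothetical) denoiser predicts the
    clean pose at each step, and the estimate keeps a shrinking fraction of the
    residual -- a deterministic stand-in for diffusion-posterior sampling."""
    R_k, t_k = R_init, t_init
    for k in range(K_steps, 0, -1):
        R_hat, t_hat = denoiser(image, R_k, t_k, k)   # timestep-conditioned net
        frac = (k - 1) / k                            # residual kept (assumed)
        rel = Rotation.from_matrix(R_k @ R_hat.T).as_rotvec()
        R_k = Rotation.from_rotvec(frac * rel).as_matrix() @ R_hat
        t_k = t_hat + frac * (t_k - t_hat)
    return R_k, t_k
```

In this sketch the perturbation magnitude grows linearly with the timestep and the reverse update is deterministic; the paper's scheduled coarse-to-fine posterior sampling would replace the simple geodesic interpolation used here.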
Related papers
- RobustSplat++: Decoupling Densification, Dynamics, and Illumination for In-the-Wild 3DGS [85.90134051583368]
3D Gaussian Splatting (3DGS) has gained significant attention for its real-time, photo-realistic rendering in novel-view synthesis and 3D modeling. Existing methods struggle to accurately model in-the-wild scenes affected by transient objects and varying illumination. We propose RobustSplat++, a robust solution based on several critical designs.
arXiv Detail & Related papers (2025-12-04T14:05:09Z) - ReNoise: Real Image Inversion Through Iterative Noising [62.96073631599749]
We introduce an inversion method with a high quality-to-operation ratio, enhancing reconstruction accuracy without increasing the number of operations.
We evaluate the performance of our ReNoise technique using various sampling algorithms and models, including recent accelerated diffusion models.
arXiv Detail & Related papers (2024-03-21T17:52:08Z) - Efficient Diffusion Model for Image Restoration by Residual Shifting [63.02725947015132]
This study proposes a novel and efficient diffusion model for image restoration.
Our method avoids the need for post-acceleration during inference, thereby avoiding the associated performance deterioration.
Our method achieves superior or comparable performance to current state-of-the-art methods on three classical IR tasks.
arXiv Detail & Related papers (2024-03-12T05:06:07Z) - Implicit Image-to-Image Schrodinger Bridge for Image Restoration [13.138398298354113]
We introduce the Implicit Image-to-Image Schrödinger Bridge (I$^3$SB) to further accelerate the generative process of I$^2$SB. I$^3$SB restructures the generative process into a non-Markovian framework by incorporating the initial corrupted image at each generative step. Compared to I$^2$SB, I$^3$SB achieves the same perceptual quality with fewer generative steps, while maintaining or improving fidelity to the ground truth.
arXiv Detail & Related papers (2024-03-10T03:22:57Z) - Cameras as Rays: Pose Estimation via Ray Diffusion [54.098613859015856]
Estimating camera poses is a fundamental task for 3D reconstruction and remains challenging given sparsely sampled views.
We propose a distributed representation of camera pose that treats a camera as a bundle of rays.
Our proposed methods, both regression- and diffusion-based, demonstrate state-of-the-art performance on camera pose estimation on CO3D.
arXiv Detail & Related papers (2024-02-22T18:59:56Z) - SE(3) Diffusion Model-based Point Cloud Registration for Robust 6D Object Pose Estimation [66.16525145765604]
We introduce an SE(3) diffusion model-based point cloud registration framework for 6D object pose estimation in real-world scenarios.
Our approach formulates the 3D registration task as a denoising diffusion process, which progressively refines the pose of the source point cloud.
Experiments demonstrate that our diffusion registration framework presents outstanding pose estimation performance on the real-world TUD-L, LINEMOD, and Occluded-LINEMOD datasets.
arXiv Detail & Related papers (2023-10-26T12:47:26Z) - DiffPose: Toward More Reliable 3D Pose Estimation [11.6015323757147]
We propose a novel pose estimation framework (DiffPose) that formulates 3D pose estimation as a reverse diffusion process.
Our proposed DiffPose significantly outperforms existing methods on the widely used pose estimation benchmarks Human3.6M and MPI-INF-3DHP.
arXiv Detail & Related papers (2022-11-30T12:22:22Z)