Bi-Manual Joint Camera Calibration and Scene Representation
- URL: http://arxiv.org/abs/2505.24819v1
- Date: Fri, 30 May 2025 17:22:00 GMT
- Title: Bi-Manual Joint Camera Calibration and Scene Representation
- Authors: Haozhan Tang, Tianyi Zhang, Matthew Johnson-Roberson, Weiming Zhi
- Abstract summary: We introduce the Bi-Manual Joint Calibration and Representation Framework (Bi-JCR). Bi-JCR enables multiple robot manipulators, each with cameras mounted, to circumvent taking images of calibration markers. By leveraging 3D foundation models for dense, marker-free multi-view correspondence, Bi-JCR jointly estimates: (i) the extrinsic transformation from each camera to its end-effector, (ii) the inter-arm relative poses between manipulators, and (iii) a unified, scale-consistent 3D representation of the shared workspace.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Robot manipulation, especially bimanual manipulation, often requires setting up multiple cameras on multiple robot manipulators. Before robot manipulators can generate motion or even build representations of their environments, the cameras rigidly mounted to the robot need to be calibrated. Camera calibration is a cumbersome process involving collecting a set of images, with each capturing a pre-determined marker. In this work, we introduce the Bi-Manual Joint Calibration and Representation Framework (Bi-JCR). Bi-JCR enables multiple robot manipulators, each with cameras mounted, to circumvent taking images of calibration markers. By leveraging 3D foundation models for dense, marker-free multi-view correspondence, Bi-JCR jointly estimates: (i) the extrinsic transformation from each camera to its end-effector, (ii) the inter-arm relative poses between manipulators, and (iii) a unified, scale-consistent 3D representation of the shared workspace, all from the same captured RGB image sets. The representation, jointly constructed from images captured by cameras on both manipulators, lives in a common coordinate frame and supports collision checking and semantic segmentation to facilitate downstream bimanual coordination tasks. We empirically evaluate the robustness of Bi-JCR on a variety of tabletop environments, and demonstrate its applicability on a variety of downstream tasks.
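Reconstructions from 3D foundation models are typically recovered only up to an unknown scale, so a framework like Bi-JCR must ground them metrically against the robot's known kinematics. The abstract does not give the paper's exact procedure; the sketch below is an illustrative least-squares estimate of a single metric scale factor by comparing the magnitudes of relative camera motions (from the up-to-scale reconstruction) with the corresponding end-effector motions (from forward kinematics). All names and the formulation are assumptions for illustration.

```python
import numpy as np

def estimate_metric_scale(cam_motions, ee_motions):
    """Estimate metric scale s for an up-to-scale reconstruction.

    cam_motions: (N, 3) relative camera translations from the
        up-to-scale 3D reconstruction (arbitrary frame).
    ee_motions:  (N, 3) corresponding relative end-effector
        translations from forward kinematics (metric units).

    Compares translation magnitudes only, so the two sets need not
    share a coordinate frame. Solves min_s sum_i (s*c_i - e_i)^2,
    giving the closed form s = (c . e) / (c . c).
    """
    c = np.linalg.norm(np.asarray(cam_motions, dtype=float), axis=1)
    e = np.linalg.norm(np.asarray(ee_motions, dtype=float), axis=1)
    return float((c @ e) / (c @ c))
```

With the scale recovered, every point in the shared reconstruction and the estimated extrinsics can be expressed in metric units, which is what makes downstream collision checking between the two arms meaningful.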
Related papers
- ARC-Calib: Autonomous Markerless Camera-to-Robot Calibration via Exploratory Robot Motions
ARC-Calib is a model-based markerless camera-to-robot calibration framework. It is fully autonomous and generalizable across diverse robots.
arXiv Detail & Related papers (2025-03-18T20:03:32Z)
- Automatic Calibration of a Multi-Camera System with Limited Overlapping Fields of View for 3D Surgical Scene Reconstruction
The purpose of this study is to develop an automated and accurate external camera calibration method for 3D surgical scene reconstruction (3D-SSR). We contribute a novel, fast, and fully automatic calibration method based on the projection of multi-scale markers (MSMs) using a ceiling-mounted projector.
arXiv Detail & Related papers (2025-01-27T17:10:33Z)
- Kalib: Easy Hand-Eye Calibration with Reference Point Tracking
Kalib is an automatic hand-eye calibration method that leverages the generalizability of visual foundation models to overcome these challenges. During calibration, the 3D coordinates of a kinematic reference point are tracked in the camera frame. Kalib's user-friendly design and minimal setup requirements make it a possible solution for continuous operation in unstructured environments.
arXiv Detail & Related papers (2024-08-20T06:03:40Z)
- HandDGP: Camera-Space Hand Mesh Prediction with Differentiable Global Positioning
We propose an end-to-end solution that addresses the 2D-3D correspondence problem.
This solution enables back-propagation from camera space outputs to the rest of the network through a new differentiable global positioning module.
We validate the effectiveness of our framework in evaluations against several baselines and state-of-the-art approaches.
arXiv Detail & Related papers (2024-07-22T17:59:01Z)
- Unifying Scene Representation and Hand-Eye Calibration with 3D Foundation Models
Representing the environment is a central challenge in robotics.
Traditionally, users need to calibrate the camera using a specific external marker, such as a checkerboard or AprilTag.
This paper advocates for the integration of 3D foundation representation into robotic systems equipped with manipulator-mounted RGB cameras.
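The marker-based calibration these papers seek to replace is classically posed as the hand-eye problem AX = XB, where A and B are relative end-effector and camera motions and X is the unknown camera-to-end-effector transform. As a point of reference (not the method of any paper listed here), a minimal Tsai-style sketch of the rotation part solves for R_X by aligning the rotation axes of corresponding motion pairs with the Kabsch algorithm:

```python
import numpy as np

def rotation_axis(R):
    # Unit rotation axis from the skew-symmetric part of R
    # (valid for rotation angles strictly between 0 and pi).
    v = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    return v / np.linalg.norm(v)

def hand_eye_rotation(Ra_list, Rb_list):
    """Rotation part of AX = XB from >= 2 motion pairs with
    non-parallel rotation axes.

    Ra_list: relative end-effector rotations (A_i).
    Rb_list: relative camera rotations (B_i).
    Since R_A = R_X R_B R_X^T, the axes satisfy a_i = R_X b_i,
    so R_X is the Kabsch alignment of the two axis sets.
    """
    A = np.stack([rotation_axis(R) for R in Ra_list])  # axes a_i as rows
    B = np.stack([rotation_axis(R) for R in Rb_list])  # axes b_i as rows
    H = B.T @ A                                        # sum_i b_i a_i^T
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    return Vt.T @ D @ U.T
```

Methods such as Bi-JCR replace the marker-derived camera motions B_i with poses recovered from dense foundation-model correspondences, but the underlying AX = XB structure of the estimation problem is the same.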
arXiv Detail & Related papers (2024-04-17T18:29:32Z)
- Human Mesh Recovery from Arbitrary Multi-view Images
We propose a divide and conquer framework for Unified Human Mesh Recovery (U-HMR) from arbitrary multi-view images.
In particular, U-HMR consists of a decoupled structure with three main components: camera and body decoupling (CBD), camera pose estimation (CPE), and arbitrary view fusion (AVF).
We conduct extensive experiments on three public datasets: Human3.6M, MPI-INF-3DHP, and TotalCapture.
arXiv Detail & Related papers (2024-03-19T04:47:56Z)
- AnyView: Generalizable Indoor 3D Object Detection with Variable Frames
We present a novel 3D detection framework named AnyView for practical applications.
Our method achieves both great generalizability and high detection accuracy with a simple and clean architecture.
arXiv Detail & Related papers (2023-10-09T02:15:45Z)
- Monocular 3D Reconstruction of Interacting Hands via Collision-Aware Factorized Refinements
We make the first attempt to reconstruct 3D interacting hands from single monocular RGB images.
Our method can generate 3D hand meshes with both precise 3D poses and minimal collisions.
arXiv Detail & Related papers (2021-11-01T08:24:10Z)
- RGB2Hands: Real-Time Tracking of 3D Hand Interactions from Monocular RGB Video
We present the first real-time method for motion capture of skeletal pose and 3D surface geometry of hands from a single RGB camera.
In order to address the inherent depth ambiguities in RGB data, we propose a novel multi-task CNN.
We experimentally verify the individual components of our RGB two-hand tracking and 3D reconstruction pipeline.
arXiv Detail & Related papers (2021-06-22T12:53:56Z)
- Self-supervised Human Detection and Segmentation via Multi-view Consensus
We propose a multi-camera framework in which geometric constraints are embedded in the form of multi-view consistency during training.
We show that our approach outperforms state-of-the-art self-supervised person detection and segmentation techniques on images that visually depart from those of standard benchmarks.
arXiv Detail & Related papers (2020-12-09T15:47:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.