Unifying Scene Representation and Hand-Eye Calibration with 3D Foundation Models
- URL: http://arxiv.org/abs/2404.11683v1
- Date: Wed, 17 Apr 2024 18:29:32 GMT
- Title: Unifying Scene Representation and Hand-Eye Calibration with 3D Foundation Models
- Authors: Weiming Zhi, Haozhan Tang, Tianyi Zhang, Matthew Johnson-Roberson
- Abstract summary: Representing the environment is a central challenge in robotics.
Traditionally, users need to calibrate the camera using a specific external marker, such as a checkerboard or AprilTag.
This paper advocates for the integration of 3D foundation models into robotic systems equipped with manipulator-mounted RGB cameras.
- Score: 13.58353565350936
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Representing the environment is a central challenge in robotics, and is essential for effective decision-making. Traditionally, before capturing images with a manipulator-mounted camera, users need to calibrate the camera using a specific external marker, such as a checkerboard or AprilTag. However, recent advances in computer vision have led to the development of 3D foundation models. These are large, pre-trained neural networks that can establish fast and accurate multi-view correspondences with very few images, even in the absence of rich visual features. This paper advocates for the integration of 3D foundation models into scene representation approaches for robotic systems equipped with manipulator-mounted RGB cameras. Specifically, we propose the Joint Calibration and Representation (JCR) method. JCR uses RGB images, captured by a manipulator-mounted camera, to simultaneously construct an environmental representation and calibrate the camera relative to the robot's end-effector, in the absence of specific calibration markers. The resulting 3D environment representation is aligned with the robot's coordinate frame and maintains physically accurate scales. We demonstrate that JCR can build effective scene representations using a low-cost RGB camera attached to a manipulator, without prior calibration.
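The abstract does not spell out JCR's equations, so the following is only a rough illustration of the kind of problem it describes, not the paper's implementation. It uses the classical eye-in-hand constraint AX = XB, where A is a relative end-effector motion (from robot kinematics), B is the corresponding relative camera motion (here assumed to come from a visual model that recovers poses only up to an unknown scale), and X is the camera pose in the end-effector frame. All function names, the synthetic poses, and the two-stage solver (axis alignment for rotation, then linear least squares for translation and scale) are assumptions made for this sketch.

```python
import numpy as np


def random_rotation(rng):
    # QR of a random matrix yields an orthogonal matrix; force det = +1.
    q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    if np.linalg.det(q) < 0:
        q[:, 0] = -q[:, 0]
    return q


def make_pose(R, t):
    # Assemble a 4x4 homogeneous transform from rotation R and translation t.
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T


def skew_vector(R):
    # Vector part of R - R^T; it is parallel to the rotation axis of R.
    return np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])


def estimate_hand_eye(ee_poses, cam_poses_vis):
    """Estimate X (camera in end-effector frame) and the metric scale.

    ee_poses:      end-effector poses in the robot base frame (metric).
    cam_poses_vis: camera poses in an arbitrary visual frame, up to scale.
    """
    axes_a, axes_b, motions = [], [], []
    for i in range(len(ee_poses) - 1):
        A = np.linalg.inv(ee_poses[i]) @ ee_poses[i + 1]
        B = np.linalg.inv(cam_poses_vis[i]) @ cam_poses_vis[i + 1]
        axes_a.append(skew_vector(A[:3, :3]))
        axes_b.append(skew_vector(B[:3, :3]))
        motions.append((A, B))

    # Rotation: R_A = R_X R_B R_X^T implies axis_a = R_X axis_b (Kabsch fit).
    H = np.array(axes_b).T @ np.array(axes_a)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R_x = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T

    # Translation and scale: (R_A - I) t_X - s * R_X t_B = -t_A, stacked.
    rows, rhs = [], []
    for A, B in motions:
        scale_col = -(R_x @ B[:3, 3])[:, None]
        rows.append(np.hstack([A[:3, :3] - np.eye(3), scale_col]))
        rhs.append(-A[:3, 3])
    sol, *_ = np.linalg.lstsq(np.vstack(rows), np.concatenate(rhs), rcond=None)
    return make_pose(R_x, sol[:3]), sol[3]


# Synthetic check with hypothetical ground truth values.
rng = np.random.default_rng(0)
X_true = make_pose(random_rotation(rng), rng.normal(size=3))
scale_true = 2.5  # metric units per visual-model unit (assumed)
T_base_world = make_pose(random_rotation(rng), rng.normal(size=3))

ee_poses = [make_pose(random_rotation(rng), rng.normal(size=3)) for _ in range(8)]
cam_poses_vis = []
for T_be in ee_poses:
    T_wc = np.linalg.inv(T_base_world) @ T_be @ X_true
    T_wc[:3, 3] /= scale_true  # the visual model loses the metric scale
    cam_poses_vis.append(T_wc)

X_est, scale_est = estimate_hand_eye(ee_poses, cam_poses_vis)
```

On noise-free synthetic data the recovered `X_est` and `scale_est` should match the ground truth to numerical precision. With real images the camera poses are noisy, and a robust or jointly optimized formulation would be more appropriate than this two-stage linear solve.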
Related papers
- Neural Real-Time Recalibration for Infrared Multi-Camera Systems [2.249916681499244]
There are no learning-free or neural techniques for real-time recalibration of infrared multi-camera systems.
We propose a neural network-based method capable of dynamic real-time calibration.
arXiv Detail & Related papers (2024-10-18T14:37:37Z)
- EventTransAct: A video transformer-based framework for Event-camera based action recognition [52.537021302246664]
Event cameras offer new opportunities compared to standard action recognition in RGB videos.
In this study, we employ a computationally efficient model, namely the video transformer network (VTN), which initially acquires spatial embeddings per event-frame.
In order to better adopt the VTN for the sparse and fine-grained nature of event data, we design Event-Contrastive Loss ($\mathcal{L}_{EC}$) and event-specific augmentations.
arXiv Detail & Related papers (2023-08-25T23:51:07Z)
- RegHEC: Hand-Eye Calibration via Simultaneous Multi-view Point Clouds Registration of Arbitrary Object [1.7161586414363612]
RegHEC is a registration-based hand-eye calibration technique with no need for an accurate calibration rig.
It tries to find the hand-eye relation which brings multi-view point clouds of an arbitrary scene into simultaneous registration under a common reference frame.
arXiv Detail & Related papers (2023-04-27T11:08:35Z)
- Markerless Camera-to-Robot Pose Estimation via Self-supervised Sim-to-Real Transfer [26.21320177775571]
We propose an end-to-end pose estimation framework that is capable of online camera-to-robot calibration and a self-supervised training method.
Our framework combines deep learning and geometric vision for solving the robot pose, and the pipeline is fully differentiable.
arXiv Detail & Related papers (2023-02-28T05:55:42Z)
- RGB-Only Reconstruction of Tabletop Scenes for Collision-Free Manipulator Control [71.51781695764872]
We present a system for collision-free control of a robot manipulator that uses only RGB views of the world.
Perceptual input of a tabletop scene is provided by multiple images of an RGB camera that is either handheld or mounted on the robot end effector.
A NeRF-like process is used to reconstruct the 3D geometry of the scene, from which the Euclidean full signed distance function (ESDF) is computed.
A model predictive control algorithm is then used to control the manipulator to reach a desired pose while avoiding obstacles in the ESDF.
arXiv Detail & Related papers (2022-10-21T01:45:08Z)
- Neural Scene Representation for Locomotion on Structured Terrain [56.48607865960868]
We propose a learning-based method to reconstruct the local terrain for a mobile robot traversing urban environments.
Using a stream of depth measurements from the onboard cameras and the robot's trajectory, the method estimates the topography in the robot's vicinity.
We propose a 3D reconstruction model that faithfully reconstructs the scene, despite the noisy measurements and large amounts of missing data coming from the blind spots of the camera arrangement.
arXiv Detail & Related papers (2022-06-16T10:45:17Z)
- Robot Self-Calibration Using Actuated 3D Sensors [0.0]
This paper treats robot calibration as an offline SLAM problem, where scanning poses are linked to a fixed point in space by a moving kinematic chain.
As such, the presented framework allows robot calibration using nothing but an arbitrary eye-in-hand depth sensor.
A detailed evaluation of the system is shown on a real robot with various attached 3D sensors.
arXiv Detail & Related papers (2022-06-07T16:35:08Z)
- RGB2Hands: Real-Time Tracking of 3D Hand Interactions from Monocular RGB Video [76.86512780916827]
We present the first real-time method for motion capture of skeletal pose and 3D surface geometry of hands from a single RGB camera.
In order to address the inherent depth ambiguities in RGB data, we propose a novel multi-task CNN.
We experimentally verify the individual components of our RGB two-hand tracking and 3D reconstruction pipeline.
arXiv Detail & Related papers (2021-06-22T12:53:56Z)
- Real-time RGBD-based Extended Body Pose Estimation [57.61868412206493]
We present a system for real-time RGBD-based estimation of 3D human pose.
We use a parametric 3D deformable human mesh model (SMPL-X) as a representation.
We train estimators of body pose and facial expression parameters.
arXiv Detail & Related papers (2021-03-05T13:37:50Z)
- Infrastructure-based Multi-Camera Calibration using Radial Projections [117.22654577367246]
Pattern-based calibration techniques can be used to calibrate the intrinsics of the cameras individually.
Infrastructure-based calibration techniques are able to estimate the extrinsics using 3D maps pre-built via SLAM or Structure-from-Motion.
We propose to fully calibrate a multi-camera system from scratch using an infrastructure-based approach.
arXiv Detail & Related papers (2020-07-30T09:21:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.