A Universal Semantic-Geometric Representation for Robotic Manipulation
- URL: http://arxiv.org/abs/2306.10474v2
- Date: Fri, 13 Oct 2023 13:05:26 GMT
- Title: A Universal Semantic-Geometric Representation for Robotic Manipulation
- Authors: Tong Zhang, Yingdong Hu, Hanchen Cui, Hang Zhao, Yang Gao
- Abstract summary: We present $\textbf{Semantic-Geometric Representation} (\textbf{SGR})$, a universal perception module for robotics.
SGR leverages the rich semantic information of large-scale pre-trained 2D models and inherits the merits of 3D spatial reasoning.
Our experiments demonstrate that SGR empowers the agent to successfully complete a diverse range of simulated and real-world robotic manipulation tasks.
- Score: 42.18087956844491
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Robots rely heavily on sensors, especially RGB and depth cameras, to perceive
and interact with the world. RGB cameras record 2D images with rich semantic
information while lacking precise spatial information. Depth cameras, on the other
hand, offer critical 3D geometry data but capture limited semantics.
Therefore, integrating both modalities is crucial for learning representations
for robotic perception and control. However, current research predominantly
focuses on only one of these modalities, neglecting the benefits of
incorporating both. To this end, we present $\textbf{Semantic-Geometric
Representation} (\textbf{SGR})$, a universal perception module for robotics
that leverages the rich semantic information of large-scale pre-trained 2D
models and inherits the merits of 3D spatial reasoning. Our experiments
demonstrate that SGR empowers the agent to successfully complete a diverse
range of simulated and real-world robotic manipulation tasks, outperforming
state-of-the-art methods significantly in both single-task and multi-task
settings. Furthermore, SGR possesses the capability to generalize to novel
semantic attributes, setting it apart from the other methods. Project website:
https://semantic-geometric-representation.github.io.
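The abstract describes SGR as combining the semantic stream of a large-scale pre-trained 2D model with 3D spatial reasoning over depth data. Below is a minimal sketch of that fusion idea, not the authors' released code: per-pixel features from a frozen 2D backbone are aligned with the depth-derived point cloud and concatenated with the 3D coordinates before a point-based network consumes them. The ResNet-18 backbone and concatenation-style fusion are illustrative assumptions, not details taken from the paper.

```python
import torch
import torchvision

def semantic_geometric_features(rgb: torch.Tensor, points: torch.Tensor) -> torch.Tensor:
    """
    Sketch of a semantic-geometric fusion step.
    rgb:    (3, H, W) image in [0, 1]
    points: (H*W, 3) point cloud back-projected from the registered depth map,
            ordered row-major so point i corresponds to pixel i
    returns (H*W, 3 + C) fused per-point features
    """
    _, H, W = rgb.shape

    # Frozen, large-scale pre-trained 2D encoder (semantic stream).
    # ResNet-18 is a stand-in for whatever backbone the method actually uses.
    backbone = torchvision.models.resnet18(weights="IMAGENET1K_V1")
    encoder = torch.nn.Sequential(*list(backbone.children())[:-2]).eval()
    with torch.no_grad():
        feat = encoder(rgb.unsqueeze(0))                     # (1, C, H/32, W/32)

    # Lift the coarse feature map back to full resolution so every 3D point
    # receives the semantic feature of the pixel it was projected from.
    feat = torch.nn.functional.interpolate(
        feat, size=(H, W), mode="bilinear", align_corners=False
    )[0].flatten(1).T                                        # (H*W, C)

    # Geometric stream (xyz) concatenated with the semantic stream; a
    # point-cloud network would process this fused set downstream.
    return torch.cat([points, feat], dim=-1)
```

The specific backbone, the fusion operator, and the downstream point-cloud network are placeholders here; the paper's actual architecture and training details are described on the project website.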
Related papers
- 3D Instance Segmentation Using Deep Learning on RGB-D Indoor Data [0.0]
A 2D region-based convolutional neural network (Mask R-CNN) with a point-based rendering module is adapted to integrate depth information and to recognize and segment 3D object instances.
To generate 3D point cloud coordinates, the segmented 2D pixels of recognized object regions in the RGB image are mapped to the corresponding (u, v) points of the depth image (a sketch of this standard back-projection is given after this list).
arXiv Detail & Related papers (2024-06-19T08:00:35Z)
- Render and Diffuse: Aligning Image and Action Spaces for Diffusion-based Behaviour Cloning [15.266994159289645]
We introduce Render and Diffuse (R&D), a method that unifies low-level robot actions and RGB observations within the image space using virtual renders of the robot's 3D model.
This space unification simplifies the learning problem and introduces inductive biases that are crucial for sample efficiency and spatial generalisation.
Our results show that R&D exhibits strong spatial generalisation capabilities and is more sample efficient than more common image-to-action methods.
arXiv Detail & Related papers (2024-05-28T14:06:10Z)
- SUGAR: Pre-training 3D Visual Representations for Robotics [85.55534363501131]
We introduce a novel 3D pre-training framework for robotics named SUGAR.
SUGAR captures semantic, geometric and affordance properties of objects through 3D point clouds.
We show that SUGAR's 3D representation outperforms state-of-the-art 2D and 3D representations.
arXiv Detail & Related papers (2024-04-01T21:23:03Z)
- ImageManip: Image-based Robotic Manipulation with Affordance-guided Next View Selection [10.162882793554191]
3D articulated object manipulation is essential for enabling robots to interact with their environment.
Many existing studies make use of 3D point clouds as the primary input for manipulation policies.
RGB images offer high-resolution observations from cost-effective devices but lack 3D geometric information.
The proposed framework is designed to capture multiple perspectives of the target object and infer depth information to complement its geometry.
arXiv Detail & Related papers (2023-10-13T12:42:54Z)
- SSR-2D: Semantic 3D Scene Reconstruction from 2D Images [54.46126685716471]
In this work, we explore a central 3D scene modeling task, namely, semantic scene reconstruction without using any 3D annotations.
The key idea of our approach is to design a trainable model that employs both incomplete 3D reconstructions and their corresponding source RGB-D images.
Our method achieves state-of-the-art performance on semantic scene completion on two large-scale benchmark datasets, MatterPort3D and ScanNet.
arXiv Detail & Related papers (2023-02-07T17:47:52Z)
- Towards Multimodal Multitask Scene Understanding Models for Indoor Mobile Agents [49.904531485843464]
In this paper, we discuss the main challenge: insufficient, or even no, labeled data for real-world indoor environments.
We describe MMISM (Multi-modality input Multi-task output Indoor Scene understanding Model) to tackle the above challenges.
MMISM considers RGB images as well as sparse Lidar points as inputs and 3D object detection, depth completion, human pose estimation, and semantic segmentation as output tasks.
We show that MMISM performs on par or even better than single-task models.
arXiv Detail & Related papers (2022-09-27T04:49:19Z)
- Extracting Zero-shot Common Sense from Large Language Models for Robot 3D Scene Understanding [25.270772036342688]
We introduce a novel method for leveraging common sense embedded within large language models for labelling rooms.
The proposed algorithm operates on 3D scene graphs produced by modern spatial perception systems.
arXiv Detail & Related papers (2022-06-09T16:05:35Z)
- Indoor Semantic Scene Understanding using Multi-modality Fusion [0.0]
We present a semantic scene understanding pipeline that fuses 2D and 3D detection branches to generate a semantic map of the environment.
Unlike previous works that were evaluated on collected datasets, we test our pipeline in an active, photo-realistic robotic environment.
Our novelty includes rectification of 3D proposals using projected 2D detections and modality fusion based on object size.
arXiv Detail & Related papers (2021-08-17T13:30:02Z)
- RGB2Hands: Real-Time Tracking of 3D Hand Interactions from Monocular RGB Video [76.86512780916827]
We present the first real-time method for motion capture of skeletal pose and 3D surface geometry of hands from a single RGB camera.
In order to address the inherent depth ambiguities in RGB data, we propose a novel multi-task CNN.
We experimentally verify the individual components of our RGB two-hand tracking and 3D reconstruction pipeline.
arXiv Detail & Related papers (2021-06-22T12:53:56Z)
- Unsupervised Learning of Visual 3D Keypoints for Control [104.92063943162896]
Learning sensorimotor control policies from high-dimensional images crucially relies on the quality of the underlying visual representations.
We propose a framework to learn such a 3D geometric structure directly from images in an end-to-end unsupervised manner.
These discovered 3D keypoints tend to meaningfully capture robot joints as well as object movements in a consistent manner across both time and 3D space.
arXiv Detail & Related papers (2021-06-14T17:59:59Z)
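For context on the back-projection step mentioned in the 3D Instance Segmentation entry above, here is a minimal sketch of the standard pinhole-camera mapping from segmented (u, v) pixels and their registered depth values to 3D point coordinates. The function and variable names are illustrative and not taken from that paper.

```python
import numpy as np

def segmented_pixels_to_points(mask: np.ndarray, depth: np.ndarray, K: np.ndarray) -> np.ndarray:
    """
    Standard pinhole back-projection of a segmented object region.
    mask:  (H, W) boolean instance mask from the 2D segmentation
    depth: (H, W) registered depth map in metres
    K:     (3, 3) camera intrinsics [[fx, 0, cx], [0, fy, cy], [0, 0, 1]]
    returns (N, 3) object point cloud for the masked pixels
    """
    v, u = np.nonzero(mask)        # pixel coordinates of the object region
    z = depth[v, u]                # depth at those pixels
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    return np.stack([x, y, z], axis=-1)
```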
This list is automatically generated from the titles and abstracts of the papers on this site.