Render and Diffuse: Aligning Image and Action Spaces for Diffusion-based Behaviour Cloning
- URL: http://arxiv.org/abs/2405.18196v1
- Date: Tue, 28 May 2024 14:06:10 GMT
- Title: Render and Diffuse: Aligning Image and Action Spaces for Diffusion-based Behaviour Cloning
- Authors: Vitalis Vosylius, Younggyo Seo, Jafar Uruç, Stephen James,
- Abstract summary: We introduce Render and Diffuse (R&D) a method that unifies low-level robot actions and RGB observations within the image space using virtual renders of the 3D model of the robot.
This space unification simplifies the learning problem and introduces inductive biases that are crucial for sample efficiency and spatial generalisation.
Our results show that R&D exhibits strong spatial generalisation capabilities and is more sample efficient than more common image-to-action methods.
- Score: 15.266994159289645
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In the field of Robot Learning, the complex mapping between high-dimensional observations such as RGB images and low-level robotic actions, two inherently very different spaces, constitutes a complex learning problem, especially with limited amounts of data. In this work, we introduce Render and Diffuse (R&D) a method that unifies low-level robot actions and RGB observations within the image space using virtual renders of the 3D model of the robot. Using this joint observation-action representation it computes low-level robot actions using a learnt diffusion process that iteratively updates the virtual renders of the robot. This space unification simplifies the learning problem and introduces inductive biases that are crucial for sample efficiency and spatial generalisation. We thoroughly evaluate several variants of R&D in simulation and showcase their applicability on six everyday tasks in the real world. Our results show that R&D exhibits strong spatial generalisation capabilities and is more sample efficient than more common image-to-action methods.
Related papers
- Robot See Robot Do: Imitating Articulated Object Manipulation with Monocular 4D Reconstruction [51.49400490437258]
This work develops a method for imitating articulated object manipulation from a single monocular RGB human demonstration.
We first propose 4D Differentiable Part Models (4D-DPM), a method for recovering 3D part motion from a monocular video.
Given this 4D reconstruction, the robot replicates object trajectories by planning bimanual arm motions that induce the demonstrated object part motion.
We evaluate 4D-DPM's 3D tracking accuracy on ground truth annotated 3D part trajectories and RSRD's physical execution performance on 9 objects across 10 trials each on a bimanual YuMi robot.
arXiv Detail & Related papers (2024-09-26T17:57:16Z) - 3D Hand Mesh Recovery from Monocular RGB in Camera Space [3.0453197258042213]
This study proposes a network model that performs parallel processing of root-relative grids and root recovery tasks.
We utilize an implicit learning approach for 2D heatmaps, enhancing the compatibility of 2D cues across different subtasks.
Our proposed model is comparable with state-of-the-art models.
arXiv Detail & Related papers (2024-05-12T05:36:37Z) - SUGAR: Pre-training 3D Visual Representations for Robotics [85.55534363501131]
We introduce a novel 3D pre-training framework for robotics named SUGAR.
SUGAR captures semantic, geometric and affordance properties of objects through 3D point clouds.
We show that SUGAR's 3D representation outperforms state-of-the-art 2D and 3D representations.
arXiv Detail & Related papers (2024-04-01T21:23:03Z) - Closing the Visual Sim-to-Real Gap with Object-Composable NeRFs [59.12526668734703]
We introduce Composable Object Volume NeRF (COV-NeRF), an object-composable NeRF model that is the centerpiece of a real-to-sim pipeline.
COV-NeRF extracts objects from real images and composes them into new scenes, generating photorealistic renderings and many types of 2D and 3D supervision.
arXiv Detail & Related papers (2024-03-07T00:00:02Z) - A Universal Semantic-Geometric Representation for Robotic Manipulation [42.18087956844491]
We present $textbfSemantic-Geometric Representation (textbfSGR)$, a universal perception module for robotics.
SGR leverages the rich semantic information of large-scale pre-trained 2D models and inherits the merits of 3D spatial reasoning.
Our experiments demonstrate that SGR empowers the agent to successfully complete a diverse range of simulated and real-world robotic manipulation tasks.
arXiv Detail & Related papers (2023-06-18T04:34:17Z) - Neural Scene Representation for Locomotion on Structured Terrain [56.48607865960868]
We propose a learning-based method to reconstruct the local terrain for a mobile robot traversing urban environments.
Using a stream of depth measurements from the onboard cameras and the robot's trajectory, the estimates the topography in the robot's vicinity.
We propose a 3D reconstruction model that faithfully reconstructs the scene, despite the noisy measurements and large amounts of missing data coming from the blind spots of the camera arrangement.
arXiv Detail & Related papers (2022-06-16T10:45:17Z) - Object Scene Representation Transformer [56.40544849442227]
We introduce Object Scene Representation Transformer (OSRT), a 3D-centric model in which individual object representations naturally emerge through novel view synthesis.
OSRT scales to significantly more complex scenes with larger diversity of objects and backgrounds than existing methods.
It is multiple orders of magnitude faster at compositional rendering thanks to its light field parametrization and the novel Slot Mixer decoder.
arXiv Detail & Related papers (2022-06-14T15:40:47Z) - Simple and Effective Synthesis of Indoor 3D Scenes [78.95697556834536]
We study the problem of immersive 3D indoor scenes from one or more images.
Our aim is to generate high-resolution images and videos from novel viewpoints.
We propose an image-to-image GAN that maps directly from reprojections of incomplete point clouds to full high-resolution RGB-D images.
arXiv Detail & Related papers (2022-04-06T17:54:46Z) - SAGCI-System: Towards Sample-Efficient, Generalizable, Compositional,
and Incremental Robot Learning [41.19148076789516]
We introduce a systematic learning framework called SAGCI-system towards achieving the above four requirements.
Our system first takes the raw point clouds gathered by the camera mounted on the robot's wrist as the inputs and produces initial modeling of the surrounding environment represented as a URDF.
The robot then utilizes the interactive perception to interact with the environments to online verify and modify the URDF.
arXiv Detail & Related papers (2021-11-29T16:53:49Z) - Synergies Between Affordance and Geometry: 6-DoF Grasp Detection via
Implicit Representations [20.155920256334706]
We show that 3D reconstruction and grasp learning are two intimately connected tasks.
We propose to utilize the synergies between grasp affordance and 3D reconstruction through multi-task learning of a shared representation.
Our method outperforms baselines by over 10% in terms of grasp success rate.
arXiv Detail & Related papers (2021-04-04T05:46:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.