RoboScript: Code Generation for Free-Form Manipulation Tasks across Real
and Simulation
- URL: http://arxiv.org/abs/2402.14623v1
- Date: Thu, 22 Feb 2024 15:12:00 GMT
- Title: RoboScript: Code Generation for Free-Form Manipulation Tasks across Real
and Simulation
- Authors: Junting Chen, Yao Mu, Qiaojun Yu, Tianming Wei, Silang Wu, Zhecheng
Yuan, Zhixuan Liang, Chao Yang, Kaipeng Zhang, Wenqi Shao, Yu Qiao, Huazhe
Xu, Mingyu Ding, Ping Luo
- Abstract summary: This paper presents RobotScript, a platform for a deployable robot manipulation pipeline powered by code generation.
We also present a code generation benchmark for robot manipulation tasks specified in free-form natural language.
We demonstrate the adaptability of our code generation framework across multiple robot embodiments, including the Franka and UR5 robot arms.
- Score: 77.41969287400977
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Embodied AI has seen rapid progress in high-level task planning and
code generation for open-world robot manipulation. However, previous studies
devote much effort to the general common-sense reasoning and task planning
capabilities of large-scale language or multi-modal models, and relatively
little to ensuring that generated code is deployable on real robots or to the
other fundamental components of autonomous robot systems, including robot
perception, motion planning, and control. To bridge this "ideal-to-real" gap,
this paper presents RobotScript, a platform providing 1) a deployable robot
manipulation pipeline powered by code generation; and 2) a code generation
benchmark for robot manipulation tasks specified in free-form natural language.
The RobotScript platform addresses this gap by emphasizing a unified interface
to both simulation and real robots, built on abstractions from the Robot
Operating System (ROS), and by ensuring syntax compliance and simulation
validation with Gazebo. We demonstrate the adaptability of our code generation
framework across multiple robot embodiments, including the Franka and UR5 robot
arms, and multiple grippers. Additionally, our benchmark assesses reasoning
abilities for physical space and constraints, highlighting the differences
between GPT-3.5, GPT-4, and Gemini in handling complex physical interactions.
Finally, we present a thorough evaluation of the whole system, exploring how
each module in the pipeline (code generation, perception, and motion planning),
as well as object geometric properties, impacts the overall performance of the
system.
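To make the pipeline concrete, here is a minimal sketch of the kind of script such a code-generation pipeline might emit for an instruction like "put the apple in the bowl". The wrapper classes (PerceptionClient, MotionPlanner, Gripper) and the helper put_object_in_container are hypothetical stand-ins for ROS/MoveIt-backed perception, planning, and gripper interfaces; they are not RobotScript's actual API.

```python
# Hypothetical sketch (not RobotScript's actual API) of LLM-generated code
# for "put the apple in the bowl". The wrapper classes stand in for the
# ROS/MoveIt-backed perception, planning, and gripper interfaces.

from dataclasses import dataclass


@dataclass
class Pose:
    x: float  # position in the robot base frame, metres
    y: float
    z: float


class PerceptionClient:
    """Hypothetical wrapper around an object-detection / pose-estimation node."""

    def get_object_pose(self, name: str) -> Pose:
        # A real system would query a ROS perception service; fixed poses
        # are returned here so the sketch is runnable.
        dummy = {"apple": Pose(0.45, 0.10, 0.02), "bowl": Pose(0.50, -0.20, 0.05)}
        return dummy.get(name, Pose(0.5, 0.0, 0.0))


class MotionPlanner:
    """Hypothetical wrapper around a MoveIt-style motion-planning interface."""

    def move_to(self, pose: Pose, z_offset: float = 0.0) -> bool:
        # Would plan and execute a collision-free trajectory on the arm.
        print(f"move_to x={pose.x:.2f} y={pose.y:.2f} z={pose.z + z_offset:.2f}")
        return True


class Gripper:
    """Hypothetical wrapper around a gripper action client."""

    def open(self) -> None:
        print("gripper: open")

    def close(self) -> None:
        print("gripper: close")


def put_object_in_container(obj: str, container: str,
                            perception: PerceptionClient,
                            planner: MotionPlanner,
                            gripper: Gripper) -> None:
    """Pick `obj` with a top-down grasp and place it into `container`."""
    obj_pose = perception.get_object_pose(obj)
    container_pose = perception.get_object_pose(container)

    gripper.open()
    planner.move_to(obj_pose, z_offset=0.10)        # pre-grasp above the object
    planner.move_to(obj_pose)                       # descend to the grasp pose
    gripper.close()
    planner.move_to(obj_pose, z_offset=0.15)        # lift
    planner.move_to(container_pose, z_offset=0.15)  # carry above the container
    gripper.open()                                  # release


if __name__ == "__main__":
    put_object_in_container("apple", "bowl",
                            PerceptionClient(), MotionPlanner(), Gripper())
```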
Related papers
- $π_0$: A Vision-Language-Action Flow Model for General Robot Control [77.32743739202543]
We propose a novel flow matching architecture built on top of a pre-trained vision-language model (VLM) to inherit Internet-scale semantic knowledge.
We evaluate our model in terms of its ability to perform tasks zero-shot after pre-training, to follow language instructions from people, and to acquire new skills via fine-tuning.
arXiv Detail & Related papers (2024-10-31T17:22:30Z)
- EMOS: Embodiment-aware Heterogeneous Multi-robot Operating System with LLM Agents [33.77674812074215]
We introduce a novel multi-agent framework designed to enable effective collaboration among heterogeneous robots.
We propose a self-prompted approach, where agents comprehend robot URDF files and call robot kinematics tools to generate descriptions of their physics capabilities.
The Habitat-MAS benchmark is designed to assess how a multi-agent framework handles tasks that require embodiment-aware reasoning.
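As a rough illustration of the URDF-to-capability-description idea, the sketch below parses a toy URDF and emits a plain-text summary an LLM agent could read. The describe_robot helper is invented for illustration and is not EMOS's actual tooling.

```python
# Simplified, hypothetical sketch of turning a robot URDF into a short
# natural-language capability summary that could be fed to an LLM agent.
# Illustrative only; not the EMOS implementation.

import xml.etree.ElementTree as ET


def describe_robot(urdf_xml: str) -> str:
    """Summarise the links and actuated joints of a URDF as plain text."""
    robot = ET.fromstring(urdf_xml)
    name = robot.get("name", "unnamed robot")
    links = [link.get("name") for link in robot.findall("link")]
    lines = [f"Robot '{name}' has {len(links)} links."]
    for joint in robot.findall("joint"):
        jtype = joint.get("type", "unknown")
        if jtype == "fixed":
            continue  # fixed joints add no mobility
        limit = joint.find("limit")
        rng = ""
        if limit is not None:
            rng = (f", range [{limit.get('lower')}, {limit.get('upper')}] rad"
                   f", max effort {limit.get('effort')} Nm")
        lines.append(f"Joint '{joint.get('name')}' ({jtype}) moves "
                     f"'{joint.find('child').get('link')}' relative to "
                     f"'{joint.find('parent').get('link')}'{rng}.")
    return "\n".join(lines)


if __name__ == "__main__":
    URDF = """<robot name="two_dof_arm">
      <link name="base"/><link name="upper_arm"/><link name="forearm"/>
      <joint name="shoulder" type="revolute">
        <parent link="base"/><child link="upper_arm"/>
        <limit lower="-1.57" upper="1.57" effort="50" velocity="2.0"/>
      </joint>
      <joint name="elbow" type="revolute">
        <parent link="upper_arm"/><child link="forearm"/>
        <limit lower="0" upper="2.6" effort="30" velocity="2.0"/>
      </joint>
    </robot>"""
    print(describe_robot(URDF))
```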
arXiv Detail & Related papers (2024-10-30T03:20:01Z)
- Polaris: Open-ended Interactive Robotic Manipulation via Syn2Real Visual Grounding and Large Language Models [53.22792173053473]
We introduce an interactive robotic manipulation framework called Polaris.
Polaris integrates perception and interaction by utilizing GPT-4 alongside grounded vision models.
We propose a novel Synthetic-to-Real (Syn2Real) pose estimation pipeline.
arXiv Detail & Related papers (2024-08-15T06:40:38Z)
- Commonsense Reasoning for Legged Robot Adaptation with Vision-Language Models [81.55156507635286]
Legged robots are physically capable of traversing a wide variety of environments and overcoming a wide range of obstructions.
Current learning methods often struggle with generalization to the long tail of unexpected situations without heavy human supervision.
We propose a system, VLM-Predictive Control (VLM-PC), combining two key components that we find to be crucial for eliciting on-the-fly, adaptive behavior selection.
arXiv Detail & Related papers (2024-07-02T21:00:30Z)
- RoboCodeX: Multimodal Code Generation for Robotic Behavior Synthesis [102.1876259853457]
We propose a tree-structured multimodal code generation framework for generalized robotic behavior synthesis, termed RoboCodeX.
RoboCodeX decomposes high-level human instructions into multiple object-centric manipulation units consisting of physical preferences such as affordance and safety constraints.
To further enhance the capability to map conceptual and perceptual understanding into control commands, a specialized multimodal reasoning dataset is collected for pre-training and an iterative self-updating methodology is introduced for supervised fine-tuning.
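As a loose illustration of such a decomposition, a tree of object-centric manipulation units with attached affordance and safety preferences might look like the sketch below. The ManipulationUnit schema is invented for illustration and is not RoboCodeX's actual representation.

```python
# Hypothetical sketch of decomposing a high-level instruction into a tree of
# object-centric manipulation units, each carrying physical preferences such
# as affordance and safety constraints. Not RoboCodeX's actual data structures.

from dataclasses import dataclass, field
from typing import List


@dataclass
class ManipulationUnit:
    target_object: str
    action: str
    affordance: str = ""                              # e.g. preferred grasp region
    safety_constraints: List[str] = field(default_factory=list)
    children: List["ManipulationUnit"] = field(default_factory=list)


def flatten(unit: ManipulationUnit, depth: int = 0) -> None:
    """Print the tree depth-first, in the order the units would execute."""
    pad = "  " * depth
    print(f"{pad}{unit.action}({unit.target_object})  "
          f"affordance={unit.affordance!r}  constraints={unit.safety_constraints}")
    for child in unit.children:
        flatten(child, depth + 1)


if __name__ == "__main__":
    # "Put the knife in the drawer" decomposed into object-centric units.
    task = ManipulationUnit(
        target_object="knife+drawer", action="put_knife_in_drawer",
        children=[
            ManipulationUnit("drawer", "open", affordance="pull handle",
                             safety_constraints=["limit pulling force"]),
            ManipulationUnit("knife", "pick",
                             affordance="grasp handle, blade away from gripper",
                             safety_constraints=["keep blade away from humans"]),
            ManipulationUnit("drawer", "place_into", affordance="flat placement",
                             safety_constraints=["release from low height"]),
        ],
    )
    flatten(task)
```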
arXiv Detail & Related papers (2024-02-25T15:31:43Z)
- WALL-E: Embodied Robotic WAiter Load Lifting with Large Language Model [92.90127398282209]
This paper investigates the potential of integrating the most recent Large Language Models (LLMs) with existing visual grounding and robotic grasping systems.
We introduce the WALL-E (Embodied Robotic WAiter load lifting with Large Language model) as an example of this integration.
We deploy this LLM-empowered system on the physical robot to provide a more user-friendly interface for the instruction-guided grasping task.
arXiv Detail & Related papers (2023-08-30T11:35:21Z)
- SEAL: Semantic Frame Execution And Localization for Perceiving Afforded Robot Actions [5.522839151632667]
We extend the semantic frame representation for robot manipulation actions and introduce the problem of Semantic Frame Execution And Localization for Perceiving Afforded Robot Actions (SEAL) as a graphical model.
For the SEAL problem, we describe our nonparametric Semantic Frame Mapping (SeFM) algorithm for maintaining belief over a finite set of semantic frames as the locations of actions afforded to the robot.
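As a highly simplified illustration of maintaining belief over a finite set of candidate frames, the toy discrete Bayes update below shifts probability mass toward frames supported by an observation. It is not the nonparametric SeFM algorithm itself, and the frame names are invented.

```python
# Toy discrete Bayes update over a finite set of candidate semantic frames
# (e.g. possible locations where an action such as "pour" is afforded).
# A simplified illustration of belief maintenance only, not the SeFM algorithm.

from typing import Dict


def update_belief(belief: Dict[str, float],
                  likelihood: Dict[str, float]) -> Dict[str, float]:
    """One Bayes step: posterior is proportional to likelihood * prior, renormalised."""
    unnormalised = {frame: likelihood.get(frame, 1e-9) * p
                    for frame, p in belief.items()}
    total = sum(unnormalised.values())
    return {frame: p / total for frame, p in unnormalised.items()}


if __name__ == "__main__":
    # Uniform prior over three candidate frames for the action "pour".
    belief = {"pour@sink": 1 / 3, "pour@mug_on_table": 1 / 3, "pour@plant": 1 / 3}
    # Perception reports a mug-like object near the table.
    observation_likelihood = {"pour@sink": 0.2, "pour@mug_on_table": 0.7,
                              "pour@plant": 0.1}
    belief = update_belief(belief, observation_likelihood)
    print(belief)  # mass shifts toward the frame supported by the observation
```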
arXiv Detail & Related papers (2023-03-24T15:25:41Z)
- robo-gym -- An Open Source Toolkit for Distributed Deep Reinforcement Learning on Real and Simulated Robots [0.5161531917413708]
We propose an open-source toolkit, robo-gym, to increase the use of Deep Reinforcement Learning with real robots.
We demonstrate a unified setup for simulation and real environments which enables a seamless transfer from training in simulation to application on the robot.
We showcase the capabilities and the effectiveness of the framework with two real world applications featuring industrial robots.
arXiv Detail & Related papers (2020-07-06T13:51:33Z)
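The sketch below illustrates the general sim-to-real pattern described above: a single Gym-style environment whose backend can be either a simulator or a real-robot driver, so the training loop is unchanged. The class and method names (RobotBackend, SimBackend, RobotEnv, apply_action) are invented for illustration and are not robo-gym's actual API.

```python
# Generic sketch of one Gym-style environment backed either by a simulator or
# by a real-robot driver, so a policy trained in simulation runs on hardware
# unchanged. Names are invented for illustration; not robo-gym's actual API.

from typing import Protocol, Tuple, List


class RobotBackend(Protocol):
    def reset(self) -> List[float]: ...
    def apply_action(self, action: List[float]) -> Tuple[List[float], float, bool]: ...


class SimBackend:
    """Stands in for a physics-simulator connection."""

    def reset(self) -> List[float]:
        return [0.0, 0.0]

    def apply_action(self, action: List[float]) -> Tuple[List[float], float, bool]:
        obs = [a * 0.1 for a in action]               # fake dynamics
        reward, done = -sum(abs(o) for o in obs), False
        return obs, reward, done


class RealRobotBackend(SimBackend):
    """Would talk to the robot controller (e.g. over ROS) instead of a simulator."""
    # Same interface; only the transport underneath changes.


class RobotEnv:
    """Gym-style wrapper: the training loop never sees which backend is used."""

    def __init__(self, backend: RobotBackend) -> None:
        self.backend = backend

    def reset(self) -> List[float]:
        return self.backend.reset()

    def step(self, action: List[float]) -> Tuple[List[float], float, bool]:
        return self.backend.apply_action(action)


if __name__ == "__main__":
    env = RobotEnv(SimBackend())          # swap in RealRobotBackend() for hardware
    obs = env.reset()
    obs, reward, done = env.step([0.5, -0.5])
    print(obs, reward, done)
```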
This list is automatically generated from the titles and abstracts of the papers in this site.