Robotic Programmer: Video Instructed Policy Code Generation for Robotic Manipulation
- URL: http://arxiv.org/abs/2501.04268v1
- Date: Wed, 08 Jan 2025 04:30:45 GMT
- Title: Robotic Programmer: Video Instructed Policy Code Generation for Robotic Manipulation
- Authors: Senwei Xie, Hongyu Wang, Zhanqi Xiao, Ruiping Wang, Xilin Chen
- Abstract summary: RoboPro is a robotic foundation model that performs robotic manipulation with policy code in a zero-shot manner.
RoboPro achieves state-of-the-art zero-shot performance on robotic manipulation in both simulated and real-world environments.
- Score: 29.67033327646875
- Abstract: Zero-shot generalization across various robots, tasks, and environments remains a significant challenge in robotic manipulation. Policy code generation methods use executable code to connect high-level task descriptions to low-level action sequences, leveraging the generalization capabilities of large language models and atomic skill libraries. In this work, we propose Robotic Programmer (RoboPro), a robotic foundation model that perceives visual information and follows free-form instructions to perform robotic manipulation with policy code in a zero-shot manner. To address the low efficiency and high cost of collecting runtime code data for robotic tasks, we devise Video2Code, which synthesizes executable code from extensive in-the-wild videos with an off-the-shelf vision-language model and a code-domain large language model. Extensive experiments show that RoboPro achieves state-of-the-art zero-shot performance on robotic manipulation in both simulated and real-world environments. Specifically, the zero-shot success rate of RoboPro on RLBench surpasses that of the state-of-the-art model GPT-4o by 11.6%, and is even comparable to a strong supervised-training baseline. Furthermore, RoboPro is robust to variations in API formats and skill sets.
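To make the policy-code idea concrete, here is a minimal sketch of what generated policy code over an atomic skill library might look like. The skill names (`move_to`, `grasp`, `release`) and the `detect` perception call are hypothetical illustrations, not RoboPro's actual API (which, per the abstract, can vary in format).

```python
# Hypothetical policy code for "put the red block in the bowl".
# The skill library and perception interface are illustrative
# assumptions, not RoboPro's actual API.

def put_red_block_in_bowl(robot, perception):
    # Perceive: locate the task-relevant objects from the camera image.
    block = perception.detect("red block")
    bowl = perception.detect("bowl")

    # Act: chain atomic skills into a low-level action sequence.
    robot.move_to(block.grasp_pose)
    robot.grasp(block)
    robot.move_to(bowl.place_pose)
    robot.release()
```

Likewise, a rough sketch of the two-stage Video2Code flow the abstract describes: an off-the-shelf vision-language model summarizes an in-the-wild video, and a code-domain LLM translates that summary into executable policy code against the documented skill API. The model interfaces and prompts below are assumptions, not the paper's implementation.

```python
def video2code(video_frames, vlm, code_llm, skill_api_doc):
    """Synthesize runtime policy code from a manipulation video.

    A sketch of the two-stage pipeline described in the abstract;
    the prompts and model interfaces are assumptions.
    """
    # Stage 1: the vision-language model extracts a step-by-step
    # description of the manipulation shown in the video.
    steps = vlm.generate(
        prompt="Describe the manipulation in this video as numbered steps.",
        images=video_frames,
    )
    # Stage 2: the code-domain LLM translates those steps into policy
    # code that calls the documented atomic skill library.
    code = code_llm.generate(
        prompt=f"Skill API:\n{skill_api_doc}\n\n"
               f"Write Python policy code for these steps:\n{steps}"
    )
    return code
```

The appeal of this route is that runtime code data can be harvested from ordinary in-the-wild videos rather than from costly robot rollouts.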
Related papers
- RoboGrasp: A Universal Grasping Policy for Robust Robotic Control [8.189496387470726]
RoboGrasp is a universal grasping policy framework that integrates pretrained grasp detection models with robotic learning.
It significantly enhances grasp precision, stability, and generalizability, achieving up to 34% higher success rates in few-shot learning and grasping box prompt tasks.
arXiv Detail & Related papers (2025-02-05T11:04:41Z) - $π_0$: A Vision-Language-Action Flow Model for General Robot Control [77.32743739202543]
We propose a novel flow matching architecture built on top of a pre-trained vision-language model (VLM) to inherit Internet-scale semantic knowledge.
We evaluate the model on its ability to perform tasks zero-shot after pre-training, to follow language instructions from people, and to acquire new skills via fine-tuning. (A toy sketch of the flow-matching objective appears after this list.)
arXiv Detail & Related papers (2024-10-31T17:22:30Z) - LLaRA: Supercharging Robot Learning Data for Vision-Language Policy [56.505551117094534]
We introduce LLaRA: Large Language and Robotics Assistant, a framework that formulates robot action policy as visuo-textual conversations.
First, we present an automated pipeline to generate conversation-style instruction tuning data for robots from existing behavior cloning datasets.
We show that a VLM fine-tuned on a limited amount of such data can produce meaningful action decisions for robotic control.
arXiv Detail & Related papers (2024-06-28T17:59:12Z) - RoboPoint: A Vision-Language Model for Spatial Affordance Prediction for Robotics [46.63773228934993]
We introduce an automatic synthetic data generation pipeline that instruction-tunes vision language models (VLMs) to robotic domains and needs.
Using the pipeline, we train RoboPoint, a VLM that predicts image keypoint affordances given language instructions.
Our experiments demonstrate that RoboPoint outperforms state-of-the-art VLMs by 21.8% in the accuracy of predicting spatial affordance and by 30.5% in the success rate of downstream tasks.
arXiv Detail & Related papers (2024-06-15T19:22:51Z) - Octo: An Open-Source Generalist Robot Policy [88.14295917143188]
We introduce Octo, a large transformer-based policy trained on 800k trajectories from the Open X-Embodiment dataset.
It can be effectively fine-tuned to robot setups with new sensory inputs and action spaces within a few hours on standard consumer GPUs.
We also perform detailed ablations of design decisions for the Octo model, from architecture to training data, to guide future research on building generalist robot models.
arXiv Detail & Related papers (2024-05-20T17:57:01Z) - RoboCodeX: Multimodal Code Generation for Robotic Behavior Synthesis [102.1876259853457]
We propose a tree-structured multimodal code generation framework for generalized robotic behavior synthesis, termed RoboCodeX.
RoboCodeX decomposes high-level human instructions into multiple object-centric manipulation units consisting of physical preferences such as affordance and safety constraints.
To further enhance the capability to map conceptual and perceptual understanding into control commands, a specialized multimodal reasoning dataset is collected for pre-training and an iterative self-updating methodology is introduced for supervised fine-tuning.
arXiv Detail & Related papers (2024-02-25T15:31:43Z) - RoboScript: Code Generation for Free-Form Manipulation Tasks across Real and Simulation [77.41969287400977]
This paper presents RobotScript, a platform for a deployable robot manipulation pipeline powered by code generation.
We also present a benchmark for code generation on robot manipulation tasks specified in free-form natural language.
We demonstrate the adaptability of our code generation framework across multiple robot embodiments, including the Franka and UR5 robot arms.
arXiv Detail & Related papers (2024-02-22T15:12:00Z) - Prompt a Robot to Walk with Large Language Models [18.214609570837403]
Large language models (LLMs) pre-trained on vast internet-scale data have showcased remarkable capabilities across diverse domains.
We introduce a novel paradigm in which we use few-shot prompts collected from the physical environment, enabling the LLM to autoregressively generate low-level control commands for robots without task-specific fine-tuning.
Experiments across various robots and environments validate that our method can effectively prompt a robot to walk.
arXiv Detail & Related papers (2023-09-18T17:50:17Z) - RT-1: Robotics Transformer for Real-World Control at Scale [98.09428483862165]
We present a model class, dubbed Robotics Transformer, that exhibits promising scalable model properties.
We verify our conclusions in a study of different model classes and their ability to generalize as a function of data size, model size, and data diversity, based on large-scale data collected from real robots performing real-world tasks.
arXiv Detail & Related papers (2022-12-13T18:55:15Z)
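For readers unfamiliar with flow matching, the objective behind the $π_0$ entry above, the following is a toy, self-contained PyTorch sketch (referenced earlier in the list): a small network regresses the straight-line velocity between a noise sample and an action sample, and actions are generated by Euler-integrating the learned field. This illustrates the generic technique only; the dimensions and architecture are placeholders, not the paper's VLM-based model.

```python
import torch
import torch.nn as nn

# Toy conditional flow matching for action generation.
# Generic technique only; dimensions and architecture are placeholders.
ACTION_DIM, OBS_DIM = 7, 32

class VelocityField(nn.Module):
    """Predicts the flow velocity at (x_t, t), conditioned on the observation."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(ACTION_DIM + OBS_DIM + 1, 128), nn.SiLU(),
            nn.Linear(128, ACTION_DIM),
        )

    def forward(self, x_t, obs, t):
        return self.net(torch.cat([x_t, obs, t], dim=-1))

def flow_matching_loss(model, actions, obs):
    # Interpolate between noise x0 and data x1 = actions, then regress
    # the network onto the constant velocity x1 - x0 along that path.
    x0 = torch.randn_like(actions)
    t = torch.rand(actions.shape[0], 1)
    x_t = (1 - t) * x0 + t * actions
    return ((model(x_t, obs, t) - (actions - x0)) ** 2).mean()

@torch.no_grad()
def sample_actions(model, obs, steps=10):
    # Generate actions by Euler-integrating the learned field
    # from pure noise (t=0) toward the data distribution (t=1).
    x = torch.randn(obs.shape[0], ACTION_DIM)
    for i in range(steps):
        t = torch.full((obs.shape[0], 1), i / steps)
        x = x + model(x, obs, t) / steps
    return x
```

In $π_0$ the conditioning comes from a pre-trained VLM's representation of images and language rather than a flat observation vector, but the training and sampling loops have this general shape.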
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.