Robotic Programmer: Video Instructed Policy Code Generation for Robotic Manipulation
- URL: http://arxiv.org/abs/2501.04268v1
- Date: Wed, 08 Jan 2025 04:30:45 GMT
- Title: Robotic Programmer: Video Instructed Policy Code Generation for Robotic Manipulation
- Authors: Senwei Xie, Hongyu Wang, Zhanqi Xiao, Ruiping Wang, Xilin Chen
- Abstract summary: RoboPro is a robotic foundation model that performs robotic manipulation with policy code in a zero-shot manner.
RoboPro achieves state-of-the-art zero-shot performance on robotic manipulation in both simulated and real-world environments.
- Score: 29.67033327646875
- Abstract: Zero-shot generalization across various robots, tasks, and environments remains a significant challenge in robotic manipulation. Policy code generation methods use executable code to connect high-level task descriptions to low-level action sequences, leveraging the generalization capabilities of large language models and atomic skill libraries. In this work, we propose Robotic Programmer (RoboPro), a robotic foundation model that perceives visual information and follows free-form instructions to perform robotic manipulation with policy code in a zero-shot manner. To address the low efficiency and high cost of collecting runtime code data for robotic tasks, we devise Video2Code, which synthesizes executable code from extensive in-the-wild videos with an off-the-shelf vision-language model and a code-domain large language model. Extensive experiments show that RoboPro achieves state-of-the-art zero-shot performance on robotic manipulation in both simulated and real-world environments. Specifically, the zero-shot success rate of RoboPro on RLBench surpasses that of the state-of-the-art model GPT-4o by 11.6%, and is even comparable to a strong supervised-training baseline. Furthermore, RoboPro is robust to variations in API formats and skill sets.
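To make the policy-code idea concrete, here is a minimal sketch of what generated policy code over an atomic skill library might look like. The skill names (`move_to`, `grasp`, `release`) and the `detect` perception call are hypothetical illustrations, not RoboPro's actual API (which, per the abstract, can vary in format).

```python
# Hypothetical policy code for "put the red block in the bowl".
# The skill library and perception interface are illustrative
# assumptions, not RoboPro's actual API.

def put_red_block_in_bowl(robot, perception):
    # Perceive: locate the task-relevant objects from the camera image.
    block = perception.detect("red block")
    bowl = perception.detect("bowl")

    # Act: chain atomic skills into a low-level action sequence.
    robot.move_to(block.grasp_pose)
    robot.grasp(block)
    robot.move_to(bowl.place_pose)
    robot.release()
```

Likewise, a rough sketch of the two-stage Video2Code flow the abstract describes: an off-the-shelf vision-language model summarizes an in-the-wild video, and a code-domain LLM translates that summary into executable policy code against the documented skill API. The model interfaces and prompts below are assumptions, not the paper's implementation.

```python
def video2code(video_frames, vlm, code_llm, skill_api_doc):
    """Synthesize runtime policy code from a manipulation video.

    A sketch of the two-stage pipeline described in the abstract;
    the prompts and model interfaces are assumptions.
    """
    # Stage 1: the vision-language model extracts a step-by-step
    # description of the manipulation shown in the video.
    steps = vlm.generate(
        prompt="Describe the manipulation in this video as numbered steps.",
        images=video_frames,
    )
    # Stage 2: the code-domain LLM translates those steps into policy
    # code that calls the documented atomic skill library.
    code = code_llm.generate(
        prompt=f"Skill API:\n{skill_api_doc}\n\n"
               f"Write Python policy code for these steps:\n{steps}"
    )
    return code
```

The appeal of this route is that runtime code data can be harvested from ordinary in-the-wild videos rather than from costly robot rollouts.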
Related papers
- RoboGrasp: A Universal Grasping Policy for Robust Robotic Control [8.189496387470726]
RoboGrasp is a universal grasping policy framework that integrates pretrained grasp detection models with robotic learning.
It significantly enhances grasp precision, stability, and generalizability, achieving up to 34% higher success rates in few-shot learning and grasping box prompt tasks.
arXiv Detail & Related papers (2025-02-05T11:04:41Z) - $π_0$: A Vision-Language-Action Flow Model for General Robot Control [77.32743739202543]
We propose a novel flow matching architecture built on top of a pre-trained vision-language model (VLM) to inherit Internet-scale semantic knowledge.
We evaluate the model on its ability to perform tasks zero-shot after pre-training, to follow language instructions from people, and to acquire new skills via fine-tuning. (A toy sketch of the flow-matching objective appears after this list.)
arXiv Detail & Related papers (2024-10-31T17:22:30Z) - LLaRA: Supercharging Robot Learning Data for Vision-Language Policy [56.505551117094534]
We introduce LLaRA: Large Language and Robotics Assistant, a framework that formulates robot action policy as visuo-textual conversations.
First, we present an automated pipeline to generate conversation-style instruction tuning data for robots from existing behavior cloning datasets.
We show that a VLM fine-tuned on a limited amount of such data can produce meaningful action decisions for robotic control.
arXiv Detail & Related papers (2024-06-28T17:59:12Z) - RoboPoint: A Vision-Language Model for Spatial Affordance Prediction for Robotics [46.63773228934993]
We introduce an automatic synthetic data generation pipeline that instruction-tunes vision language models (VLMs) to robotic domains and needs.
Using the pipeline, we train RoboPoint, a VLM that predicts image keypoint affordances given language instructions.
Our experiments demonstrate that RoboPoint outperforms state-of-the-art VLMs by 21.8% in the accuracy of predicting spatial affordance and by 30.5% in the success rate of downstream tasks.
arXiv Detail & Related papers (2024-06-15T19:22:51Z) - Octo: An Open-Source Generalist Robot Policy [88.14295917143188]
We introduce Octo, a large transformer-based policy trained on 800k trajectories from the Open X-Embodiment dataset.
It can be effectively fine-tuned to robot setups with new sensory inputs and action spaces within a few hours on standard consumer GPUs.
We also perform detailed ablations of design decisions for the Octo model, from architecture to training data, to guide future research on building generalist robot models.
arXiv Detail & Related papers (2024-05-20T17:57:01Z) - RoboCodeX: Multimodal Code Generation for Robotic Behavior Synthesis [102.1876259853457]
We propose a tree-structured multimodal code generation framework for generalized robotic behavior synthesis, termed RoboCodeX.
RoboCodeX decomposes high-level human instructions into multiple object-centric manipulation units consisting of physical preferences such as affordance and safety constraints.
To further enhance the capability to map conceptual and perceptual understanding into control commands, a specialized multimodal reasoning dataset is collected for pre-training and an iterative self-updating methodology is introduced for supervised fine-tuning.
arXiv Detail & Related papers (2024-02-25T15:31:43Z) - RoboScript: Code Generation for Free-Form Manipulation Tasks across Real and Simulation [77.41969287400977]
This paper presents RobotScript, a platform for a deployable robot manipulation pipeline powered by code generation.
We also present a benchmark for code generation on robot manipulation tasks specified in free-form natural language.
We demonstrate the adaptability of our code generation framework across multiple robot embodiments, including the Franka and UR5 robot arms.
arXiv Detail & Related papers (2024-02-22T15:12:00Z) - Prompt a Robot to Walk with Large Language Models [18.214609570837403]
Large language models (LLMs) pre-trained on vast internet-scale data have showcased remarkable capabilities across diverse domains.
We introduce a novel paradigm in which we use few-shot prompts collected from the physical environment, enabling the LLM to autoregressively generate low-level control commands for robots without task-specific fine-tuning.
Experiments across various robots and environments validate that our method can effectively prompt a robot to walk.
arXiv Detail & Related papers (2023-09-18T17:50:17Z) - RT-1: Robotics Transformer for Real-World Control at Scale [98.09428483862165]
We present a model class, dubbed Robotics Transformer, that exhibits promising scalable model properties.
We verify our conclusions in a study of different model classes and their ability to generalize as a function of data size, model size, and data diversity, based on large-scale data collected from real robots performing real-world tasks.
arXiv Detail & Related papers (2022-12-13T18:55:15Z)
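For readers unfamiliar with flow matching, the objective behind the $π_0$ entry above, the following is a toy, self-contained PyTorch sketch (referenced earlier in the list): a small network regresses the straight-line velocity between a noise sample and an action sample, and actions are generated by Euler-integrating the learned field. This illustrates the generic technique only; the dimensions and architecture are placeholders, not the paper's VLM-based model.

```python
import torch
import torch.nn as nn

# Toy conditional flow matching for action generation.
# Generic technique only; dimensions and architecture are placeholders.
ACTION_DIM, OBS_DIM = 7, 32

class VelocityField(nn.Module):
    """Predicts the flow velocity at (x_t, t), conditioned on the observation."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(ACTION_DIM + OBS_DIM + 1, 128), nn.SiLU(),
            nn.Linear(128, ACTION_DIM),
        )

    def forward(self, x_t, obs, t):
        return self.net(torch.cat([x_t, obs, t], dim=-1))

def flow_matching_loss(model, actions, obs):
    # Interpolate between noise x0 and data x1 = actions, then regress
    # the network onto the constant velocity x1 - x0 along that path.
    x0 = torch.randn_like(actions)
    t = torch.rand(actions.shape[0], 1)
    x_t = (1 - t) * x0 + t * actions
    return ((model(x_t, obs, t) - (actions - x0)) ** 2).mean()

@torch.no_grad()
def sample_actions(model, obs, steps=10):
    # Generate actions by Euler-integrating the learned field
    # from pure noise (t=0) toward the data distribution (t=1).
    x = torch.randn(obs.shape[0], ACTION_DIM)
    for i in range(steps):
        t = torch.full((obs.shape[0], 1), i / steps)
        x = x + model(x, obs, t) / steps
    return x
```

In $π_0$ the conditioning comes from a pre-trained VLM's representation of images and language rather than a flat observation vector, but the training and sampling loops have this general shape.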
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.