Related papers: RH20T-P: A Primitive-Level Robotic Dataset Towards Composable Generalization Agents

RH20T-P: A Primitive-Level Robotic Dataset Towards Composable Generalization Agents

URL: http://arxiv.org/abs/2403.19622v1
Date: Thu, 28 Mar 2024 17:42:54 GMT
Title: RH20T-P: A Primitive-Level Robotic Dataset Towards Composable Generalization Agents
Authors: Zeren Chen, Zhelun Shi, Xiaoya Lu, Lehan He, Sucheng Qian, Hao Shu Fang, Zhenfei Yin, Wanli Ouyang, Jing Shao, Yu Qiao, Cewu Lu, Lu Sheng,
Abstract summary: The ultimate goals of robotic learning is to acquire a comprehensive and generalizable robotic system. Recent progress in utilizing language models as high-level planners has demonstrated that the complexity of tasks can be reduced through decomposing them into primitive-level plans. Despite the promising future, the community is not yet adequately prepared for composable generalization agents.
Score: 107.97394661147102
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The ultimate goals of robotic learning is to acquire a comprehensive and generalizable robotic system capable of performing both seen skills within the training distribution and unseen skills in novel environments. Recent progress in utilizing language models as high-level planners has demonstrated that the complexity of tasks can be reduced through decomposing them into primitive-level plans, making it possible to generalize on novel robotic tasks in a composable manner. Despite the promising future, the community is not yet adequately prepared for composable generalization agents, particularly due to the lack of primitive-level real-world robotic datasets. In this paper, we propose a primitive-level robotic dataset, namely RH20T-P, which contains about 33000 video clips covering 44 diverse and complicated robotic tasks. Each clip is manually annotated according to a set of meticulously designed primitive skills, facilitating the future development of composable generalization agents. To validate the effectiveness of RH20T-P, we also construct a potential and scalable agent based on RH20T-P, called RA-P. Equipped with two planners specialized in task decomposition and motion planning, RA-P can adapt to novel physical skills through composable generalization. Our website and videos can be found at https://sites.google.com/view/rh20t-primitive/main. Dataset and code will be made available soon.

Related papers

Exploring the Limits of Vision-Language-Action Manipulations in Cross-task Generalization [19.32522292907096]
AGNOSTOS is a novel simulation benchmark designed to rigorously evaluate cross-task zero-shot generalization in manipulation.<n>X-ICM is a method that conditions large language models on in-context demonstrations to predict action sequences for unseen tasks.<n>We believe AGNOSTOS and X-ICM will serve as valuable tools for advancing general-purpose robotic manipulation.
arXiv Detail & Related papers (2025-05-21T15:35:57Z)
RoBridge: A Hierarchical Architecture Bridging Cognition and Execution for General Robotic Manipulation [90.81956345363355]
RoBridge is a hierarchical intelligent architecture for general robotic manipulation.<n>It consists of a high-level cognitive planner (HCP) based on a large-scale pre-trained vision-language model (VLM)<n>It unleashes the procedural skill of reinforcement learning, effectively bridging the gap between cognition and execution.
arXiv Detail & Related papers (2025-05-03T06:17:18Z)
EmbodiedAgent: A Scalable Hierarchical Approach to Overcome Practical Challenge in Multi-Robot Control [4.163413782205929]
EmbodiedAgent is a hierarchical framework for heterogeneous multi-robot control. Our approach integrates a next-action prediction paradigm with a structured memory system to decompose tasks into executable robot skills. We present MultiPlan+, a dataset of more than 18,000 annotated planning instances spanning 100 scenarios.
arXiv Detail & Related papers (2025-04-14T09:33:42Z)
RoboAct-CLIP: Video-Driven Pre-training of Atomic Action Understanding for Robotics [22.007302996282085]
This paper presents a temporal-decoupling fine-tuning strategy based on Contrastive Language-Image Pretraining (CLIP) architecture. Results in simulated environments demonstrate that the RoboAct-CLIP pretrained model achieves a 12% higher success rate than baseline Visual Language Models.
arXiv Detail & Related papers (2025-04-02T19:02:08Z)
GRAPE: Generalizing Robot Policy via Preference Alignment [58.419992317452376]
We present GRAPE: Generalizing Robot Policy via Preference Alignment. We show GRAPE increases success rates on in-domain and unseen manipulation tasks by 51.79% and 58.20%, respectively. GRAPE can be aligned with various objectives, such as safety and efficiency, reducing collision rates by 37.44% and rollout step-length by 11.15%, respectively.
arXiv Detail & Related papers (2024-11-28T18:30:10Z)
$π_0$: A Vision-Language-Action Flow Model for General Robot Control [77.32743739202543]
We propose a novel flow matching architecture built on top of a pre-trained vision-language model (VLM) to inherit Internet-scale semantic knowledge. We evaluate our model in terms of its ability to perform tasks in zero shot after pre-training, follow language instructions from people, and its ability to acquire new skills via fine-tuning.
arXiv Detail & Related papers (2024-10-31T17:22:30Z)
Grounding Robot Policies with Visuomotor Language Guidance [15.774237279917594]
We propose an agent-based framework for grounding robot policies to the current context. The proposed framework is composed of a set of conversational agents designed for specific roles. We demonstrate that our approach can effectively guide manipulation policies to achieve significantly higher success rates.
arXiv Detail & Related papers (2024-10-09T02:00:37Z)
Robotic Control via Embodied Chain-of-Thought Reasoning [86.6680905262442]
Key limitation of learned robot control policies is their inability to generalize outside their training data. Recent works on vision-language-action models (VLAs) have shown that the use of large, internet pre-trained vision-language models can substantially improve their robustness and generalization ability. We introduce Embodied Chain-of-Thought Reasoning (ECoT) for VLAs, in which we train VLAs to perform multiple steps of reasoning about plans, sub-tasks, motions, and visually grounded features before predicting the robot action.
arXiv Detail & Related papers (2024-07-11T17:31:01Z)
Toward General-Purpose Robots via Foundation Models: A Survey and Meta-Analysis [82.59451639072073]
General-purpose robots operate seamlessly in any environment, with any object, and utilize various skills to complete diverse tasks. As a community, we have been constraining most robotic systems by designing them for specific tasks, training them on specific datasets, and deploying them within specific environments. Motivated by the impressive open-set performance and content generation capabilities of web-scale, large-capacity pre-trained models, we devote this survey to exploring how foundation models can be applied to general-purpose robotics.
arXiv Detail & Related papers (2023-12-14T10:02:55Z)
RoboGen: Towards Unleashing Infinite Data for Automated Robot Learning via Generative Simulation [68.70755196744533]
RoboGen is a generative robotic agent that automatically learns diverse robotic skills at scale via generative simulation. Our work attempts to extract the extensive and versatile knowledge embedded in large-scale models and transfer them to the field of robotics.
arXiv Detail & Related papers (2023-11-02T17:59:21Z)
RoboAgent: Generalization and Efficiency in Robot Manipulation via Semantic Augmentations and Action Chunking [54.776890150458385]
We develop an efficient system for training universal agents capable of multi-task manipulation skills. We are able to train a single agent capable of 12 unique skills, and demonstrate its generalization over 38 tasks. On average, RoboAgent outperforms prior methods by over 40% in unseen situations.
arXiv Detail & Related papers (2023-09-05T03:14:39Z)
RH20T: A Comprehensive Robotic Dataset for Learning Diverse Skills in One-Shot [56.130215236125224]
A key challenge in robotic manipulation in open domains is how to acquire diverse and generalizable skills for robots. Recent research in one-shot imitation learning has shown promise in transferring trained policies to new tasks based on demonstrations. This paper aims to unlock the potential for an agent to generalize to hundreds of real-world skills with multi-modal perception.
arXiv Detail & Related papers (2023-07-02T15:33:31Z)
RT-1: Robotics Transformer for Real-World Control at Scale [98.09428483862165]
We present a model class, dubbed Robotics Transformer, that exhibits promising scalable model properties. We verify our conclusions in a study of different model classes and their ability to generalize as a function of the data size, model size, and data diversity based on a large-scale data collection on real robots performing real-world tasks.
arXiv Detail & Related papers (2022-12-13T18:55:15Z)
BC-Z: Zero-Shot Task Generalization with Robotic Imitation Learning [108.41464483878683]
We study the problem of enabling a vision-based robotic manipulation system to generalize to novel tasks. We develop an interactive and flexible imitation learning system that can learn from both demonstrations and interventions. When scaling data collection on a real robot to more than 100 distinct tasks, we find that this system can perform 24 unseen manipulation tasks with an average success rate of 44%.
arXiv Detail & Related papers (2022-02-04T07:30:48Z)

This list is automatically generated from the titles and abstracts of the papers in this site.