RH20T-P: A Primitive-Level Robotic Dataset Towards Composable Generalization Agents
- URL: http://arxiv.org/abs/2403.19622v2
- Date: Sat, 01 Feb 2025 11:17:14 GMT
- Title: RH20T-P: A Primitive-Level Robotic Dataset Towards Composable Generalization Agents
- Authors: Zeren Chen, Zhelun Shi, Xiaoya Lu, Lehan He, Sucheng Qian, Zhenfei Yin, Wanli Ouyang, Jing Shao, Yu Qiao, Cewu Lu, Lu Sheng
- Abstract summary: We propose RH20T-P, a primitive-level robotic manipulation dataset.
It contains about 38k video clips covering 67 diverse manipulation tasks in real-world scenarios.
We standardize a plan-execute CGA paradigm and implement an exemplar baseline called RA-P on our RH20T-P.
- Score: 105.13169239919272
- License:
- Abstract: Achieving generalizability in solving out-of-distribution tasks is one of the ultimate goals of learning robotic manipulation. Recent progress in Vision-Language Models (VLMs) has shown that VLM-based task planners can ease the solving of novel tasks by decomposing compounded tasks into plans of sequentially executed primitive-level skills that have already been mastered. It is also promising for robotic manipulation to adopt such composable generalization ability, in the form of composable generalization agents (CGAs). However, the community lacks a reliable design of primitive skills and a sufficient amount of primitive-level data annotations. Therefore, we propose RH20T-P, a primitive-level robotic manipulation dataset containing about 38k video clips covering 67 diverse manipulation tasks in real-world scenarios. Each clip is manually annotated according to a set of meticulously designed primitive skills that are common in robotic manipulation. Furthermore, we standardize a plan-execute CGA paradigm and implement an exemplar baseline, RA-P, on RH20T-P; its positive performance on unseen tasks validates that the proposed dataset can endow robotic manipulation agents with composable generalization ability.
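As a rough illustration of the plan-execute CGA paradigm described in the abstract, the minimal sketch below shows a planner decomposing a compounded task into primitive-level skills that are then executed sequentially. The primitive names, the `plan_with_vlm` helper, and the skill-library interface are hypothetical assumptions for illustration and do not reflect the actual RA-P implementation.

```python
# Minimal sketch of a plan-execute composable generalization agent (CGA).
# Primitive names, planner, and controller interfaces are hypothetical,
# not the actual RH20T-P / RA-P API.
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Primitive:
    """One primitive-level skill with its arguments (e.g. target object)."""
    name: str
    args: Dict[str, str] = field(default_factory=dict)

def plan_with_vlm(instruction: str) -> List[Primitive]:
    """Decompose a compounded task into primitive skills.
    A real planner would query a VLM with the scene image and instruction;
    here a hard-coded example plan is returned for illustration."""
    return [
        Primitive("move_to", {"target": "mug"}),
        Primitive("grasp", {"target": "mug"}),
        Primitive("move_to", {"target": "coaster"}),
        Primitive("release"),
    ]

def run_cga(instruction: str, skills: Dict[str, Callable[..., bool]]) -> bool:
    """Execute the planned primitives sequentially; abort on the first failure."""
    for step in plan_with_vlm(instruction):
        if not skills[step.name](**step.args):
            return False
    return True
```

In this framing, the dataset's primitive-level annotations supervise the planner's output space, while the low-level skill controllers remain reusable across tasks.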
Related papers
- SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model [45.03115608632622]
Spatial understanding is the key to robot manipulation.
We propose SpatialVLA to explore effective spatial representations for the robot foundation model.
We show the proposed Adaptive Action Grids offer a new and effective way to fine-tune the pre-trained SpatialVLA model for new simulation and real-world setups.
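The summary above names Adaptive Action Grids without detailing them. As a generic illustration of data-adaptive action discretization, and explicitly not SpatialVLA's actual formulation, the sketch below fits per-dimension bin edges to empirical action quantiles so that bins are dense where actions are dense.

```python
# Generic quantile-based action discretization, shown only to illustrate the
# idea of a data-adaptive action grid; this is NOT SpatialVLA's actual method.
import numpy as np

def fit_adaptive_grid(actions: np.ndarray, bins_per_dim: int = 256) -> np.ndarray:
    """Compute per-dimension bin edges from empirical quantiles.
    actions: (N, D) array of continuous robot actions."""
    qs = np.linspace(0.0, 1.0, bins_per_dim + 1)
    return np.quantile(actions, qs, axis=0).T          # (D, bins_per_dim + 1)

def discretize(action: np.ndarray, edges: np.ndarray) -> np.ndarray:
    """Map a continuous action (D,) to per-dimension token indices."""
    return np.array([
        np.clip(np.searchsorted(edges[d], action[d]) - 1, 0, edges.shape[1] - 2)
        for d in range(action.shape[0])
    ])

def decode(tokens: np.ndarray, edges: np.ndarray) -> np.ndarray:
    """Recover a continuous action as the midpoint of each selected bin."""
    return np.array([
        0.5 * (edges[d, t] + edges[d, t + 1]) for d, t in enumerate(tokens)
    ])
```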
arXiv Detail & Related papers (2025-01-27T07:34:33Z)
- GRAPE: Generalizing Robot Policy via Preference Alignment [58.419992317452376]
We present GRAPE: Generalizing Robot Policy via Preference Alignment.
We show GRAPE increases success rates on in-domain and unseen manipulation tasks by 51.79% and 58.20%, respectively.
GRAPE can be aligned with various objectives, such as safety and efficiency, reducing collision rates by 37.44% and rollout step-length by 11.15%, respectively.
arXiv Detail & Related papers (2024-11-28T18:30:10Z)
- Robotic Control via Embodied Chain-of-Thought Reasoning [86.6680905262442]
A key limitation of learned robot control policies is their inability to generalize beyond their training data.
Recent works on vision-language-action models (VLAs) have shown that the use of large, internet pre-trained vision-language models can substantially improve their robustness and generalization ability.
We introduce Embodied Chain-of-Thought Reasoning (ECoT) for VLAs, in which we train VLAs to perform multiple steps of reasoning about plans, sub-tasks, motions, and visually grounded features before predicting the robot action.
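The ECoT summary describes training the policy to emit intermediate reasoning (plan, sub-task, motion, grounded features) before the action. The sketch below shows one plausible structured output and a parser for it; the field names and format are illustrative assumptions, not the paper's exact schema.

```python
# Hypothetical embodied chain-of-thought output format and parser.
# Field names and ordering are illustrative, not ECoT's exact schema.
import ast
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class EcotStep:
    plan: str                        # high-level plan in natural language
    subtask: str                     # current sub-task to accomplish
    motion: str                      # coarse motion, e.g. "move gripper left"
    grounding: Dict[str, List[int]]  # object name -> bounding box [x1, y1, x2, y2]
    action: List[float]              # final low-level robot action

def parse_ecot(text: str) -> EcotStep:
    """Parse a 'key: value' formatted model output into an EcotStep.
    The policy is trained to reason through these fields before the action."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(":")
        fields[key.strip().lower()] = value.strip()
    return EcotStep(
        plan=fields["plan"],
        subtask=fields["subtask"],
        motion=fields["motion"],
        grounding=ast.literal_eval(fields["grounding"]),  # e.g. "{'mug': [10, 20, 60, 90]}"
        action=[float(x) for x in fields["action"].split()],
    )
```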
arXiv Detail & Related papers (2024-07-11T17:31:01Z)
- RoboAgent: Generalization and Efficiency in Robot Manipulation via Semantic Augmentations and Action Chunking [54.776890150458385]
We develop an efficient system for training universal agents with multi-task manipulation skills.
We train a single agent capable of 12 unique skills and demonstrate its generalization across 38 tasks.
On average, RoboAgent outperforms prior methods by over 40% in unseen situations.
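Action chunking, named in the RoboAgent title above, generally means predicting a short horizon of future actions at once and executing them before re-querying the policy. The sketch below illustrates that generic control loop under assumed interfaces; it is not RoboAgent's actual implementation.

```python
# Generic action-chunking control loop; interfaces are assumed for illustration
# and do not reflect RoboAgent's actual code.
from typing import Callable
import numpy as np

def rollout_with_chunking(
    policy: Callable[[np.ndarray], np.ndarray],    # obs -> (H, action_dim) chunk
    env_step: Callable[[np.ndarray], np.ndarray],  # action -> next observation
    obs: np.ndarray,
    max_steps: int = 200,
) -> int:
    """Query the policy for a chunk of H actions, execute the whole chunk
    open-loop, then re-plan from the newest observation."""
    steps = 0
    while steps < max_steps:
        chunk = policy(obs)            # (H, action_dim)
        for action in chunk:
            obs = env_step(action)
            steps += 1
            if steps >= max_steps:
                break
    return steps
```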
arXiv Detail & Related papers (2023-09-05T03:14:39Z)
- RH20T: A Comprehensive Robotic Dataset for Learning Diverse Skills in One-Shot [56.130215236125224]
A key challenge in robotic manipulation in open domains is how to acquire diverse and generalizable skills for robots.
Recent research in one-shot imitation learning has shown promise in transferring trained policies to new tasks based on demonstrations.
This paper aims to unlock the potential for an agent to generalize to hundreds of real-world skills with multi-modal perception.
arXiv Detail & Related papers (2023-07-02T15:33:31Z)
- Continual Vision-based Reinforcement Learning with Group Symmetries [18.7526848176769]
We introduce a unique Continual Vision-based Reinforcement Learning method that recognizes Group Symmetries, called COVERS.
Our results show that COVERS accurately assigns tasks to their respective groups and significantly outperforms existing methods in terms of generalization capability.
arXiv Detail & Related papers (2022-10-21T23:41:02Z)
- BC-Z: Zero-Shot Task Generalization with Robotic Imitation Learning [108.41464483878683]
We study the problem of enabling a vision-based robotic manipulation system to generalize to novel tasks.
We develop an interactive and flexible imitation learning system that can learn from both demonstrations and interventions.
When scaling data collection on a real robot to more than 100 distinct tasks, we find that this system can perform 24 unseen manipulation tasks with an average success rate of 44%.
arXiv Detail & Related papers (2022-02-04T07:30:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.