GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation
- URL: http://arxiv.org/abs/2410.06158v1
- Date: Tue, 8 Oct 2024 16:00:47 GMT
- Title: GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation
- Authors: Chi-Lam Cheang, Guangzeng Chen, Ya Jing, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Hongtao Wu, Jiafeng Xu, Yichu Yang, Hanbo Zhang, Minzhao Zhu
- Abstract summary: GR-2 is a state-of-the-art generalist robot agent for versatile and generalizable manipulation.
GR-2 is first pre-trained on a vast number of Internet videos to capture the dynamics of the world.
GR-2 exhibits impressive multi-task learning capabilities, achieving an average success rate of 97.7% across more than 100 tasks.
- Score: 21.455124378231957
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present GR-2, a state-of-the-art generalist robot agent for versatile and generalizable robot manipulation. GR-2 is first pre-trained on a vast number of Internet videos to capture the dynamics of the world. This large-scale pre-training, involving 38 million video clips and over 50 billion tokens, equips GR-2 with the ability to generalize across a wide range of robotic tasks and environments during subsequent policy learning. Following this, GR-2 is fine-tuned for both video generation and action prediction using robot trajectories. It exhibits impressive multi-task learning capabilities, achieving an average success rate of 97.7% across more than 100 tasks. Moreover, GR-2 demonstrates exceptional generalization to new, previously unseen scenarios, including novel backgrounds, environments, objects, and tasks. Notably, GR-2 scales effectively with model size, underscoring its potential for continued growth and application. Project page: https://gr2-manipulation.github.io
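As a rough illustration of the two-stage recipe described in the abstract (video-generative pre-training on web clips, then joint video-and-action fine-tuning on robot trajectories), the sketch below uses a single shared backbone with separate video and action heads. Module names, sizes, and losses are illustrative assumptions, not the released GR-2 implementation.

```python
# Sketch of a two-stage GR-2-style recipe: video-generative pre-training,
# then joint video + action fine-tuning. Hypothetical module names.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoLanguageActionModel(nn.Module):
    """GPT-style backbone over language/image tokens with a video-token head
    (future-frame prediction) and an action head (robot control)."""

    def __init__(self, vocab_size=8192, d_model=512, action_dim=7):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=6)
        self.video_head = nn.Linear(d_model, vocab_size)   # next video-token logits
        self.action_head = nn.Linear(d_model, action_dim)  # end-effector action

    def forward(self, tokens):                 # tokens: (B, T) int64
        h = self.backbone(self.embed(tokens))  # (B, T, d_model)
        return self.video_head(h), self.action_head(h[:, -1])

def pretrain_step(model, opt, web_tokens, next_tokens):
    """Stage 1: learn world dynamics by predicting future video tokens."""
    logits, _ = model(web_tokens)
    loss = F.cross_entropy(logits.flatten(0, 1), next_tokens.flatten())
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

def finetune_step(model, opt, robot_tokens, next_tokens, actions, w=1.0):
    """Stage 2: joint video generation + action prediction on robot trajectories."""
    logits, pred_actions = model(robot_tokens)
    loss = (F.cross_entropy(logits.flatten(0, 1), next_tokens.flatten())
            + w * F.mse_loss(pred_actions, actions))
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

Keeping one backbone and only adding an action head at fine-tuning time mirrors the idea that policy learning reuses the world dynamics acquired during video pre-training.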
Related papers
- Gen2Act: Human Video Generation in Novel Scenarios enables Generalizable Robot Manipulation [74.70013315714336]
Gen2Act casts language-conditioned manipulation as zero-shot human video generation followed by execution with a single policy conditioned on the generated video.
Our results on diverse real-world scenarios show how Gen2Act enables manipulating unseen object types and performing novel motions for tasks not present in the robot data.
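A minimal sketch of this generate-then-execute loop is below; `VideoGenerator`, `VideoConditionedPolicy`, and `Env` are assumed interfaces for illustration, not the Gen2Act API.

```python
# Sketch of a generate-then-execute loop in the spirit of Gen2Act.
from typing import Protocol, Sequence
import numpy as np

class VideoGenerator(Protocol):
    def generate(self, instruction: str, scene_image: np.ndarray) -> Sequence[np.ndarray]:
        """Return a predicted human video (list of frames) for the task."""

class VideoConditionedPolicy(Protocol):
    def act(self, observation: np.ndarray, human_video: Sequence[np.ndarray]) -> np.ndarray:
        """Return a robot action conditioned on the generated human video."""

class Env(Protocol):
    def observe(self) -> np.ndarray: ...
    def step(self, action: np.ndarray) -> bool:
        """Apply the action; return True when the episode is done."""

def run_episode(instruction: str, generator: VideoGenerator,
                policy: VideoConditionedPolicy, env: Env, max_steps: int = 200) -> None:
    # 1) Zero-shot human video generation from language + current scene.
    human_video = generator.generate(instruction, env.observe())
    # 2) Execute with a single policy conditioned on the generated video.
    for _ in range(max_steps):
        action = policy.act(env.observe(), human_video)
        if env.step(action):
            break
```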
arXiv Detail & Related papers (2024-09-24T17:57:33Z)
- GRUtopia: Dream General Robots in a City at Scale [65.08318324604116]
This paper introduces project GRUtopia, the first simulated interactive 3D society designed for various robots.
GRScenes includes 100k interactive, finely annotated scenes, which can be freely combined into city-scale environments.
GRResidents is a Large Language Model (LLM) driven Non-Player Character (NPC) system that is responsible for social interaction.
arXiv Detail & Related papers (2024-07-15T17:40:46Z)
- LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning [50.99807031490589]
We introduce LLARVA, a model trained with a novel instruction tuning method to unify a range of robotic learning tasks, scenarios, and environments.
We generate 8.5M image-visual trace pairs from the Open X-Embodiment dataset in order to pre-train our model.
Experiments yield strong performance, demonstrating that LLARVA performs well compared to several contemporary baselines.
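A rough illustration of what one image-visual trace training pair might look like is given below; the field names and the 2-D-waypoint interpretation of a visual trace are assumptions for illustration, not the dataset's schema.

```python
# Hypothetical structure for an "image-visual trace" instruction-tuning pair.
from dataclasses import dataclass
from typing import List, Tuple
import numpy as np

@dataclass
class VisualTracePair:
    instruction: str                  # natural-language task instruction
    image: np.ndarray                 # current RGB observation (H, W, 3)
    trace: List[Tuple[float, float]]  # assumed: future 2-D waypoints in image coordinates
    action: np.ndarray                # low-level robot action label

def format_prompt(pair: VisualTracePair) -> str:
    """Render one pair into an instruction-tuning prompt/target string."""
    waypoints = "; ".join(f"({u:.0f}, {v:.0f})" for u, v in pair.trace)
    return (f"Instruction: {pair.instruction}\n"
            f"Predicted end-effector trace: {waypoints}")
```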
arXiv Detail & Related papers (2024-06-17T17:55:29Z)
- RH20T-P: A Primitive-Level Robotic Dataset Towards Composable Generalization Agents [107.97394661147102]
The ultimate goal of robotic learning is to acquire a comprehensive and generalizable robotic system.
Recent progress in utilizing language models as high-level planners has demonstrated that the complexity of tasks can be reduced through decomposing them into primitive-level plans.
Despite the promising future, the community is not yet adequately prepared for composable generalization agents.
arXiv Detail & Related papers (2024-03-28T17:42:54Z)
- Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation [25.09113607683987]
We introduce GR-1, a GPT-style model designed for multi-task language-conditioned visual robot manipulation.
GR-1 takes as inputs a language instruction, a sequence of observation images, and a sequence of robot states.
It predicts robot actions as well as future images in an end-to-end manner.
arXiv Detail & Related papers (2023-12-20T16:00:43Z)
- Towards Generalizable Zero-Shot Manipulation via Translating Human Interaction Plans [58.27029676638521]
We show how passive human videos can serve as a rich source of data for learning such generalist robots.
We learn a human plan predictor that, given a current image of a scene and a goal image, predicts the future hand and object configurations.
We show that our learned system can perform over 16 manipulation skills that generalize to 40 objects.
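A minimal sketch of such a goal-conditioned plan predictor follows, assuming hand and object configurations are flat pose vectors; the architecture and representation are illustrative, not the paper's model.

```python
# Sketch: current image + goal image -> future hand and object configurations.
import torch
import torch.nn as nn

class PlanPredictor(nn.Module):
    def __init__(self, horizon=8, hand_dim=3, obj_dim=7):
        super().__init__()
        self.encoder = nn.Sequential(              # shared image encoder
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Sequential(                 # predicts a short plan
            nn.Linear(2 * 64, 256), nn.ReLU(),
            nn.Linear(256, horizon * (hand_dim + obj_dim)),
        )
        self.horizon, self.hand_dim, self.obj_dim = horizon, hand_dim, obj_dim

    def forward(self, current_img, goal_img):
        z = torch.cat([self.encoder(current_img), self.encoder(goal_img)], dim=-1)
        plan = self.head(z).view(-1, self.horizon, self.hand_dim + self.obj_dim)
        # Split each step into a hand configuration and an object configuration.
        return plan[..., :self.hand_dim], plan[..., self.hand_dim:]
```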
arXiv Detail & Related papers (2023-12-01T18:54:12Z)
- Learning Generalizable Robotic Reward Functions from "In-The-Wild" Human Videos [59.58105314783289]
Domain-agnostic Video Discriminator (DVD) learns multitask reward functions by training a discriminator to classify whether two videos are performing the same task.
DVD can generalize by virtue of learning from a small amount of robot data with a broad dataset of human videos.
DVD can be combined with visual model predictive control to solve robotic manipulation tasks on a real WidowX200 robot in an unseen environment from a single human demo.
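The sketch below illustrates a DVD-style "same task?" video discriminator that can later score a (human demo, imagined robot rollout) pair as a reward for visual model predictive control; the encoder and shapes are simplified assumptions, not the authors' code.

```python
# Minimal sketch of a DVD-style discriminator over video pairs.
import torch
import torch.nn as nn

class TaskDiscriminator(nn.Module):
    def __init__(self, feat_dim=128):
        super().__init__()
        # Per-frame CNN features averaged over time stand in for a video encoder.
        self.frame_encoder = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        self.classifier = nn.Sequential(
            nn.Linear(2 * feat_dim, 128), nn.ReLU(),
            nn.Linear(128, 1),            # logit: same task or not
        )

    def encode(self, video):              # video: (B, T, 3, H, W)
        b, t = video.shape[:2]
        feats = self.frame_encoder(video.flatten(0, 1)).view(b, t, -1)
        return feats.mean(dim=1)          # temporal average pooling

    def forward(self, video_a, video_b):
        z = torch.cat([self.encode(video_a), self.encode(video_b)], dim=-1)
        return self.classifier(z).squeeze(-1)

def reward(disc, human_demo, imagined_rollout):
    """Use the 'same task' probability as a planning reward for visual MPC."""
    with torch.no_grad():
        return torch.sigmoid(disc(human_demo, imagined_rollout))
```

Training would use binary same-task/different-task labels with a BCE-with-logits loss; at planning time, the sigmoid score of the human demo against imagined rollouts ranks candidate action sequences.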
arXiv Detail & Related papers (2021-03-31T05:25:05Z)