Towards Generalist Robot Learning from Internet Video: A Survey
- URL: http://arxiv.org/abs/2404.19664v2
- Date: Fri, 7 Jun 2024 09:25:42 GMT
- Title: Towards Generalist Robot Learning from Internet Video: A Survey
- Authors: Robert McCarthy, Daniel C. H. Tan, Dominik Schmidt, Fernando Acero, Nathan Herr, Yilun Du, Thomas G. Thuruthel, Zhibin Li
- Abstract summary: This survey presents an overview of methods for learning from video (LfV) in the context of reinforcement learning (RL) and robotics.
We focus on methods capable of scaling to large internet video datasets.
We advocate for scalable foundation model approaches that can leverage the full range of internet video data.
- Score: 56.621902345314645
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This survey presents an overview of methods for learning from video (LfV) in the context of reinforcement learning (RL) and robotics. We focus on methods capable of scaling to large internet video datasets and, in the process, extracting foundational knowledge about the world's dynamics and physical human behaviour. Such methods hold great promise for developing general-purpose robots. We open with an overview of fundamental concepts relevant to the LfV-for-robotics setting. This includes a discussion of the exciting benefits LfV methods can offer (e.g., improved generalization beyond the available robot data) and commentary on key LfV challenges (e.g., missing information in video and LfV distribution shifts). Our literature review begins with an analysis of video foundation model techniques that can extract knowledge from large, heterogeneous video datasets. Next, we review methods that specifically leverage video data for robot learning. Here, we categorise work according to which RL knowledge modality (KM) benefits from the use of video data. We additionally highlight techniques for mitigating LfV challenges, including reviewing action representations that address missing action labels in video. Finally, we examine LfV datasets and benchmarks, before concluding with a discussion of challenges and opportunities in LfV. Here, we advocate for scalable foundation model approaches that can leverage the full range of internet video data, and that target the learning of the most promising RL KMs: the policy and dynamics model. Overall, we hope this survey will serve as a comprehensive reference for the emerging field of LfV, catalysing further research in the area and facilitating progress towards the development of general-purpose robots.
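One LfV challenge the abstract highlights, missing action labels in video, is commonly addressed with learned action representations. Below is a minimal, hypothetical sketch (not taken from the survey) of a latent inverse-dynamics model: a pseudo-action is inferred from consecutive frames and trained jointly with a forward model, so that unlabeled internet video can later be pseudo-labelled for policy or dynamics-model learning.

```python
# Minimal sketch (assumption, not the survey's code): a latent inverse-dynamics
# model that infers a pseudo-action z from consecutive frame embeddings and is
# trained so a forward model can reconstruct the next embedding from (o_t, z).
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentActionModel(nn.Module):
    def __init__(self, obs_dim=512, latent_action_dim=8):
        super().__init__()
        # Inverse dynamics: (o_t, o_{t+1}) -> latent action z_t
        self.inverse = nn.Sequential(
            nn.Linear(2 * obs_dim, 256), nn.ReLU(), nn.Linear(256, latent_action_dim)
        )
        # Forward dynamics: (o_t, z_t) -> predicted o_{t+1}
        self.forward_model = nn.Sequential(
            nn.Linear(obs_dim + latent_action_dim, 256), nn.ReLU(), nn.Linear(256, obs_dim)
        )

    def loss(self, obs_t, obs_next):
        z = self.inverse(torch.cat([obs_t, obs_next], dim=-1))          # pseudo-action
        pred_next = self.forward_model(torch.cat([obs_t, z], dim=-1))   # reconstruction
        return F.mse_loss(pred_next, obs_next)

# Usage on precomputed frame embeddings from unlabeled video
model = LatentActionModel()
obs_t, obs_next = torch.randn(32, 512), torch.randn(32, 512)
print(model.loss(obs_t, obs_next).item())
```

Latent actions of this kind are one route by which video data can feed the policy and dynamics-model KMs that the survey identifies as most promising.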
Related papers
- $π_0$: A Vision-Language-Action Flow Model for General Robot Control [77.32743739202543]
We propose a novel flow matching architecture built on top of a pre-trained vision-language model (VLM) to inherit Internet-scale semantic knowledge.
We evaluate our model on its ability to perform tasks zero-shot after pre-training, to follow language instructions from people, and to acquire new skills via fine-tuning.
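To make the flow-matching idea concrete, here is a rough sketch under simplifying assumptions (a small MLP head, a straight-line interpolation path, and assumed dimensions; this is not the $π_0$ implementation): the action head is trained to predict the velocity that transports noised action chunks back toward the demonstrated actions, conditioned on features from the VLM backbone.

```python
# Hedged sketch of conditional flow matching for an action head (simplified;
# the actual pi_0 architecture, conditioning, and time schedule differ).
import torch
import torch.nn as nn
import torch.nn.functional as F

class FlowMatchingActionHead(nn.Module):
    def __init__(self, action_dim=7, horizon=16, ctx_dim=512, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(action_dim * horizon + ctx_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim * horizon),
        )

    def velocity(self, noisy_actions, t, vlm_features):
        return self.net(torch.cat([noisy_actions, vlm_features, t], dim=-1))

    def loss(self, actions, vlm_features):
        # actions: (B, horizon * action_dim) ground-truth action chunk
        noise = torch.randn_like(actions)
        t = torch.rand(actions.shape[0], 1)
        x_t = (1 - t) * noise + t * actions        # straight-line interpolation
        target_v = actions - noise                 # constant target velocity
        return F.mse_loss(self.velocity(x_t, t, vlm_features), target_v)

head = FlowMatchingActionHead()
print(head.loss(torch.randn(8, 16 * 7), torch.randn(8, 512)).item())
```

At inference time, an action chunk would be generated by integrating the predicted velocity field from Gaussian noise over a few Euler steps.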
arXiv Detail & Related papers (2024-10-31T17:22:30Z) - A Brief Survey on Leveraging Large Scale Vision Models for Enhanced Robot Grasping [4.7079226008262145]
Robotic grasping presents a difficult motor task in real-world scenarios.
Recent advances in computer vision have produced a growing number of successful unsupervised training mechanisms.
We investigate the potential benefits of large-scale visual pretraining in enhancing robot grasping performance.
arXiv Detail & Related papers (2024-06-17T17:39:30Z) - Learning an Actionable Discrete Diffusion Policy via Large-Scale Actionless Video Pre-Training [69.54948297520612]
Learning a generalist embodied agent poses challenges, primarily stemming from the scarcity of action-labeled robotic datasets.
We introduce a novel framework to tackle these challenges, which leverages a unified discrete diffusion model to combine generative pre-training on human videos with policy fine-tuning on a small number of action-labeled robot videos.
Our method generates high-fidelity future videos for planning and enhances the fine-tuned policies compared to previous state-of-the-art approaches.
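As a rough illustration (assuming an absorbing/masking formulation of discrete diffusion; the paper's exact parameterisation may differ): tokens are corrupted by masking at a sampled noise level and a shared transformer learns to recover them, so video-only pre-training and action-conditioned fine-tuning can share the same objective.

```python
# Simplified sketch (assumed absorbing-state discrete diffusion, not the
# paper's exact formulation): mask a random fraction of tokens and train a
# transformer to denoise them; video-only pre-training simply omits action tokens.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, MASK_ID, SEQ_LEN, DIM = 1024, 1023, 64, 256

class DiscreteDiffusionDenoiser(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        layer = nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, tokens):
        return self.head(self.backbone(self.embed(tokens)))

def diffusion_loss(model, tokens):
    # tokens: (B, SEQ_LEN) discrete video (and optionally action) tokens
    mask_ratio = torch.rand(tokens.shape[0], 1)                    # per-sample noise level
    mask = torch.rand_like(tokens, dtype=torch.float) < mask_ratio
    corrupted = torch.where(mask, torch.full_like(tokens, MASK_ID), tokens)
    logits = model(corrupted)
    return F.cross_entropy(logits[mask], tokens[mask])             # recover masked tokens

model = DiscreteDiffusionDenoiser()
print(diffusion_loss(model, torch.randint(0, MASK_ID, (4, SEQ_LEN))).item())
```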
arXiv Detail & Related papers (2024-02-22T09:48:47Z) - Learning by Watching: A Review of Video-based Learning Approaches for Robot Manipulation [0.0]
Recent works have explored learning manipulation skills by passively watching abundant videos sourced online.
This survey reviews foundations such as video feature representation learning techniques, object affordance understanding, 3D hand/body modeling, and large-scale robot resources.
We discuss how learning only from observing large-scale human videos can enhance generalization and sample efficiency for robotic manipulation.
arXiv Detail & Related papers (2024-02-11T08:41:42Z) - Robotic Offline RL from Internet Videos via Value-Function Pre-Training [67.44673316943475]
We develop a system for leveraging large-scale human video datasets in robotic offline RL.
We show that value learning on video datasets learns representations more conducive to downstream robotic offline RL than other approaches.
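A simplified stand-in for the idea of value learning on action-free video (assuming precomputed frame embeddings and a sparse goal-reaching reward; this is not the paper's exact objective): the value of a frame with respect to a future goal frame is regressed toward a discounted bootstrapped target.

```python
# Rough sketch (simplified stand-in for the paper's method): goal-conditioned
# TD value learning on action-free video, using frame embeddings only.
import torch
import torch.nn as nn
import torch.nn.functional as F

GAMMA = 0.98
value_net = nn.Sequential(nn.Linear(2 * 512, 256), nn.ReLU(), nn.Linear(256, 1))

def td_loss(obs, next_obs, goal):
    # Sparse reward: 1 when the next frame matches the goal frame, else 0.
    reached = (F.cosine_similarity(next_obs, goal, dim=-1) > 0.95).float().unsqueeze(-1)
    with torch.no_grad():
        next_v = value_net(torch.cat([next_obs, goal], dim=-1))
        target = reached + (1.0 - reached) * GAMMA * next_v
    v = value_net(torch.cat([obs, goal], dim=-1))
    return F.mse_loss(v, target)

# Goals are sampled as future frames from the same video (hindsight relabelling).
obs, next_obs, goal = (torch.randn(64, 512) for _ in range(3))
print(td_loss(obs, next_obs, goal).item())
```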
arXiv Detail & Related papers (2023-09-22T17:59:14Z) - Scaling Robot Learning with Semantically Imagined Experience [21.361979238427722]
Recent advances in robot learning have shown promise in enabling robots to perform manipulation tasks.
One of the key contributing factors to this progress is the scale of robot data used to train the models.
We propose an alternative route and leverage text-to-image foundation models widely used in computer vision and natural language processing.
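Schematically, the augmentation recipe looks like the sketch below, where `edit_image` is a hypothetical placeholder for a text-guided image-editing model (the real system uses a specific text-to-image pipeline): observations are semantically rewritten while the original action labels are reused.

```python
# Schematic only: `edit_image` is a hypothetical stand-in for a text-guided
# image-editing/diffusion model; the real pipeline, model, and prompts differ.
from dataclasses import dataclass
from typing import Any, Callable, List

@dataclass
class Step:
    image: Any    # observation (e.g., an RGB array)
    action: Any   # the robot action actually executed

def imagine_episode(episode: List[Step], prompt: str,
                    edit_image: Callable[[Any, str], Any]) -> List[Step]:
    """Return a semantically varied copy of a robot episode.

    Each observation is re-rendered by a generative model according to `prompt`
    (e.g. "replace the table with a marble counter") while the original action
    labels are kept, so the augmented episode can be added to the training set
    without any extra robot time.
    """
    return [Step(image=edit_image(s.image, prompt), action=s.action) for s in episode]

# No-op stand-in generator, just to show the call pattern:
augmented = imagine_episode([Step(image="frame0", action=[0.0, 0.1])],
                            "new tabletop texture", lambda img, p: img)
```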
arXiv Detail & Related papers (2023-02-22T18:47:51Z) - RT-1: Robotics Transformer for Real-World Control at Scale [98.09428483862165]
We present a model class, dubbed Robotics Transformer, that exhibits promising scalable model properties.
We verify these conclusions in a study of how different model classes generalize as a function of data size, model size, and data diversity, using a large-scale dataset collected on real robots performing real-world tasks.
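A minimal sketch of the action interface behind such a tokenised policy (bin count and action bounds are assumptions, not the released configuration): continuous robot actions are discretised into uniform bins, so control reduces to per-dimension token classification.

```python
# Minimal sketch (assumed dimensions/bins; not the released RT-1 code):
# continuous actions are mapped to uniform bins so a transformer can treat
# control as per-dimension token classification.
import torch

NUM_BINS = 256
ACTION_LOW, ACTION_HIGH = -1.0, 1.0

def discretize(actions: torch.Tensor) -> torch.Tensor:
    """Map continuous actions in [low, high] to integer bins in [0, NUM_BINS-1]."""
    scaled = (actions - ACTION_LOW) / (ACTION_HIGH - ACTION_LOW)
    return (scaled * (NUM_BINS - 1)).round().long().clamp(0, NUM_BINS - 1)

def undiscretize(bins: torch.Tensor) -> torch.Tensor:
    """Map bin indices back to continuous actions for execution."""
    return bins.float() / (NUM_BINS - 1) * (ACTION_HIGH - ACTION_LOW) + ACTION_LOW

# A transformer policy then outputs NUM_BINS logits per action dimension and is
# trained with cross-entropy against these bin indices.
a = torch.tensor([[0.3, -0.7, 0.0]])
print(discretize(a), undiscretize(discretize(a)))
```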
arXiv Detail & Related papers (2022-12-13T18:55:15Z) - Actionable Models: Unsupervised Offline Reinforcement Learning of Robotic Skills [93.12417203541948]
We propose the objective of learning a functional understanding of the environment by learning to reach any goal state in a given dataset.
We find that our method can operate on high-dimensional camera images and learn a variety of skills on real robots that generalize to previously unseen scenes and objects.
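A simplified stand-in for this goal-reaching objective (the actual method additionally applies conservative regularisation to unseen actions): goals are relabelled as later states from the same offline trajectory, and a goal-conditioned Q-function is trained with bootstrapped TD targets.

```python
# Simplified stand-in (not the paper's exact algorithm): goal-conditioned
# Q-learning on offline data, where the goal is a state relabelled from later
# in the same trajectory and reward is 1 only on the transition that reaches it.
import torch
import torch.nn as nn
import torch.nn.functional as F

GAMMA = 0.95
q_net = nn.Sequential(nn.Linear(512 + 512 + 7, 256), nn.ReLU(), nn.Linear(256, 1))

def q_loss(obs, goal, action, next_obs, next_action, reward):
    # obs/goal/next_obs: (B, 512) state embeddings; action/next_action: (B, 7); reward: (B, 1)
    with torch.no_grad():
        target = reward + GAMMA * (1.0 - reward) * q_net(
            torch.cat([next_obs, goal, next_action], dim=-1))
    q = q_net(torch.cat([obs, goal, action], dim=-1))
    return F.mse_loss(q, target)

obs, goal, next_obs = (torch.randn(32, 512) for _ in range(3))
action, next_action = torch.randn(32, 7), torch.randn(32, 7)
reward = torch.zeros(32, 1)   # hindsight: set to 1 where the relabelled goal is reached
print(q_loss(obs, goal, action, next_obs, next_action, reward).item())
```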
arXiv Detail & Related papers (2021-04-15T20:10:11Z) - Learning Generalizable Robotic Reward Functions from "In-The-Wild" Human Videos [59.58105314783289]
Domain-agnostic Video Discriminator (DVD) learns multitask reward functions by training a discriminator to classify whether two videos are performing the same task.
DVD generalizes by learning from a small amount of robot data combined with a broad dataset of human videos.
DVD can be combined with visual model predictive control to solve robotic manipulation tasks on a real WidowX200 robot in an unseen environment from a single human demo.
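A compact sketch of the DVD-style reward (assuming precomputed video embeddings; the real model operates on raw frames): a discriminator scores whether a candidate robot video and a human demonstration show the same task, and that score ranks candidate action sequences inside visual MPC.

```python
# Compact sketch (assumes precomputed video embeddings; the actual DVD model
# operates on raw frames): a same-task discriminator whose output is used as a
# reward for ranking predicted robot videos against a human demonstration.
import torch
import torch.nn as nn
import torch.nn.functional as F

discriminator = nn.Sequential(nn.Linear(2 * 512, 256), nn.ReLU(), nn.Linear(256, 1))

def train_step(video_a, video_b, same_task):
    """video_a/video_b: (B, 512) embeddings; same_task: (B, 1) labels in {0, 1}."""
    logits = discriminator(torch.cat([video_a, video_b], dim=-1))
    return F.binary_cross_entropy_with_logits(logits, same_task)

def reward(candidate_rollouts, human_demo):
    """Score each predicted robot video against the human demo; higher = same task."""
    demo = human_demo.expand_as(candidate_rollouts)
    return torch.sigmoid(discriminator(torch.cat([candidate_rollouts, demo], dim=-1)))

# Inside visual MPC, the action sequence whose predicted video scores highest
# against the single human demo would be executed.
scores = reward(torch.randn(10, 512), torch.randn(1, 512))
print(scores.shape)  # torch.Size([10, 1])
```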
arXiv Detail & Related papers (2021-03-31T05:25:05Z)