Towards Generalist Robot Learning from Internet Video: A Survey
- URL: http://arxiv.org/abs/2404.19664v5
- Date: Wed, 23 Jul 2025 17:31:03 GMT
- Title: Towards Generalist Robot Learning from Internet Video: A Survey
- Authors: Robert McCarthy, Daniel C. H. Tan, Dominik Schmidt, Fernando Acero, Nathan Herr, Yilun Du, Thomas G. Thuruthel, Zhibin Li
- Abstract summary: This survey systematically examines the emerging field of learning from videos (LfV). We first outline essential concepts, including detailing fundamental LfV challenges such as distribution shift and missing action labels in video data. Next, we comprehensively review current methods for extracting knowledge from large-scale internet video, overcoming LfV challenges, and improving robot learning through video-informed training.
- Score: 56.621902345314645
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Scaling deep learning to massive and diverse internet data has driven remarkable breakthroughs in domains such as video generation and natural language processing. Robot learning, however, has thus far failed to replicate this success and remains constrained by a scarcity of available data. Learning from videos (LfV) methods aim to address this data bottleneck by augmenting traditional robot data with large-scale internet video. This video data provides foundational information regarding physical dynamics, behaviours, and tasks, and can be highly informative for general-purpose robots. This survey systematically examines the emerging field of LfV. We first outline essential concepts, including detailing fundamental LfV challenges such as distribution shift and missing action labels in video data. Next, we comprehensively review current methods for extracting knowledge from large-scale internet video, overcoming LfV challenges, and improving robot learning through video-informed training. The survey concludes with a critical discussion of future opportunities. Here, we emphasize the need for scalable foundation model approaches that can leverage the full range of available internet video and enhance the learning of robot policies and dynamics models. Overall, the survey aims to inform and catalyse future LfV research, driving progress towards general-purpose robots.
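One of the LfV challenges named in the abstract, missing action labels in video data, is commonly tackled by pseudo-labelling video with a learned inverse dynamics model. The sketch below illustrates that general recipe only; it is not a method proposed by the survey, and the module names, embedding size, and action dimension are illustrative assumptions.

```python
import torch
import torch.nn as nn

class InverseDynamicsModel(nn.Module):
    """Predicts the action taken between two consecutive frame embeddings.

    Illustrative sketch: train on a small action-labelled robot dataset, then
    apply to action-free internet video to produce pseudo action labels that a
    policy can be trained on. Dimensions are placeholder assumptions.
    """

    def __init__(self, frame_dim: int = 512, action_dim: int = 7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * frame_dim, 256),
            nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, z_t: torch.Tensor, z_next: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([z_t, z_next], dim=-1))


def pseudo_label(idm: InverseDynamicsModel, frame_embeddings: torch.Tensor) -> torch.Tensor:
    """Label consecutive frame pairs of an action-free video with pseudo-actions."""
    with torch.no_grad():
        return idm(frame_embeddings[:-1], frame_embeddings[1:])
```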
Related papers
- $π_0$: A Vision-Language-Action Flow Model for General Robot Control [77.32743739202543]
We propose a novel flow matching architecture built on top of a pre-trained vision-language model (VLM) to inherit Internet-scale semantic knowledge.
We evaluate the model's ability to perform tasks zero-shot after pre-training, to follow language instructions from people, and to acquire new skills via fine-tuning.
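For intuition, a generic conditional flow matching loss for an action head conditioned on VLM features is sketched below; this is not necessarily the exact $π_0$ parameterisation, and `velocity_net` and all shapes are assumptions made for the sketch.

```python
import torch
import torch.nn.functional as F

def conditional_flow_matching_loss(velocity_net, actions, vlm_context):
    """Generic conditional flow matching objective for an action head.

    velocity_net: callable predicting a velocity field v(x_t, t | context).
    actions:      ground-truth actions (or action chunks), shape (B, action_dim).
    vlm_context:  conditioning features from the pre-trained VLM, shape (B, ctx_dim).
    Simplified sketch; the actual model's formulation may differ.
    """
    noise = torch.randn_like(actions)                           # sample x_0 ~ N(0, I)
    t = torch.rand(actions.shape[0], 1, device=actions.device)  # random time in [0, 1]
    x_t = (1.0 - t) * noise + t * actions                       # straight-line probability path
    target_velocity = actions - noise                           # d x_t / d t along that path
    pred_velocity = velocity_net(x_t, t, vlm_context)
    return F.mse_loss(pred_velocity, target_velocity)
```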
arXiv Detail & Related papers (2024-10-31T17:22:30Z) - A Brief Survey on Leveraging Large Scale Vision Models for Enhanced Robot Grasping [4.7079226008262145]
Robotic grasping presents a difficult motor task in real-world scenarios.
Recent advances in computer vision have produced a growing number of successful unsupervised training mechanisms.
We investigate the potential benefits of large-scale visual pretraining in enhancing robot grasping performance.
arXiv Detail & Related papers (2024-06-17T17:39:30Z) - Learning an Actionable Discrete Diffusion Policy via Large-Scale Actionless Video Pre-Training [69.54948297520612]
Learning a generalist embodied agent poses challenges, primarily stemming from the scarcity of action-labeled robotic datasets.
We introduce a novel framework to tackle these challenges, which leverages a unified discrete diffusion to combine generative pre-training on human videos and policy fine-tuning on a small number of action-labeled robot videos.
Our method generates high-fidelity future videos for planning and yields stronger fine-tuned policies than previous state-of-the-art approaches.
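A minimal sketch of the absorbing-state (masking) flavour of discrete diffusion that such actionless pre-training can build on is shown below; the mask-token id, the denoiser interface, and the corruption schedule are simplifying assumptions, not the paper's exact design.

```python
import torch
import torch.nn.functional as F

MASK_ID = 0  # assumed id of the absorbing "mask" token in the vocabulary

def masked_discrete_diffusion_loss(denoiser, tokens, mask_ratio=0.5):
    """Absorbing-state discrete diffusion on token sequences (sketch).

    tokens: (B, T) integer ids for discretised video frames and/or actions.
    A random subset of tokens is replaced with MASK_ID and the denoiser
    (e.g. a transformer returning (B, T, vocab) logits) is trained to recover
    the originals. Video-only tokens suffice for pre-training; action tokens
    can be added during fine-tuning. Not the paper's exact noise schedule.
    """
    mask = torch.rand(tokens.shape, device=tokens.device) < mask_ratio
    corrupted = tokens.masked_fill(mask, MASK_ID)
    logits = denoiser(corrupted)
    return F.cross_entropy(logits[mask], tokens[mask])
```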
arXiv Detail & Related papers (2024-02-22T09:48:47Z) - Learning by Watching: A Review of Video-based Learning Approaches for Robot Manipulation [0.0]
Recent works have explored learning manipulation skills by passively watching abundant videos sourced online.
This survey reviews foundations such as video feature representation learning techniques, object affordance understanding, 3D hand/body modeling, and large-scale robot resources.
We discuss how learning only from observing large-scale human videos can enhance generalization and sample efficiency for robotic manipulation.
arXiv Detail & Related papers (2024-02-11T08:41:42Z) - Robotic Offline RL from Internet Videos via Value-Function Pre-Training [67.44673316943475]
We develop a system for leveraging large-scale human video datasets in robotic offline RL.
We show that value learning on video datasets learns representations more conducive to downstream robotic offline RL than other approaches.
arXiv Detail & Related papers (2023-09-22T17:59:14Z) - Scaling Robot Learning with Semantically Imagined Experience [21.361979238427722]
Recent advances in robot learning have shown promise in enabling robots to perform manipulation tasks.
One of the key contributing factors to this progress is the scale of robot data used to train the models.
We propose an alternative route and leverage text-to-image foundation models widely used in computer vision and natural language processing.
arXiv Detail & Related papers (2023-02-22T18:47:51Z) - RT-1: Robotics Transformer for Real-World Control at Scale [98.09428483862165]
We present a model class, dubbed Robotics Transformer, that exhibits promising scalable model properties.
We verify our conclusions in a study of different model classes and how they generalize as a function of data size, model size, and data diversity, using a large-scale dataset collected on real robots performing real-world tasks.
arXiv Detail & Related papers (2022-12-13T18:55:15Z) - Actionable Models: Unsupervised Offline Reinforcement Learning of Robotic Skills [93.12417203541948]
We propose the objective of learning a functional understanding of the environment by learning to reach any goal state in a given dataset.
We find that our method can operate on high-dimensional camera images and learn a variety of skills on real robots that generalize to previously unseen scenes and objects.
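The "reach any goal state in a given dataset" objective is typically realised with goal-conditioned learning on hindsight-relabelled transitions; the sketch below shows only that relabelling step under simplifying assumptions, not the paper's full algorithm.

```python
import random

def relabel_with_hindsight_goals(trajectory):
    """Turn an unlabelled trajectory into goal-conditioned transitions (sketch).

    trajectory: list of (observation, action) pairs.
    Each transition is relabelled with a randomly chosen future state as the
    goal; the reward is 1 only when the goal is reached at the next step.
    Simplified illustration of goal relabelling, not the paper's exact method.
    """
    transitions = []
    for t in range(len(trajectory) - 1):
        obs, act = trajectory[t]
        next_obs, _ = trajectory[t + 1]
        goal_idx = random.randint(t + 1, len(trajectory) - 1)  # hindsight goal
        goal = trajectory[goal_idx][0]
        reward = 1.0 if goal_idx == t + 1 else 0.0
        transitions.append((obs, act, goal, reward, next_obs))
    return transitions
```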
arXiv Detail & Related papers (2021-04-15T20:10:11Z) - Learning Generalizable Robotic Reward Functions from "In-The-Wild" Human Videos [59.58105314783289]
Domain-agnostic Video Discriminator (DVD) learns multitask reward functions by training a discriminator to classify whether two videos are performing the same task.
DVD can generalize by virtue of learning from a small amount of robot data with a broad dataset of human videos.
DVD can be combined with visual model predictive control to solve robotic manipulation tasks on a real WidowX200 robot in an unseen environment from a single human demo.
arXiv Detail & Related papers (2021-03-31T05:25:05Z)
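For intuition, a DVD-style "same task?" discriminator over pre-computed video embeddings could be sketched as below; the encoder, dimensions, and reward wrapper are illustrative assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class SameTaskDiscriminator(nn.Module):
    """Sketch of a discriminator that, given embeddings of two video clips,
    predicts whether they show the same task. At test time, the probability
    that a robot clip matches a human demonstration can serve as a reward for
    planning. Encoder and dimensions are placeholder assumptions."""

    def __init__(self, video_emb_dim: int = 512):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(2 * video_emb_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, emb_a: torch.Tensor, emb_b: torch.Tensor) -> torch.Tensor:
        logits = self.classifier(torch.cat([emb_a, emb_b], dim=-1))
        return logits.squeeze(-1)  # higher -> more likely the same task


def task_match_reward(discriminator, robot_clip_emb, human_demo_emb):
    """Probability that the robot clip performs the same task as the human demo."""
    with torch.no_grad():
        return torch.sigmoid(discriminator(robot_clip_emb, human_demo_emb))
```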