Large Video Planner Enables Generalizable Robot Control
- URL: http://arxiv.org/abs/2512.15840v1
- Date: Wed, 17 Dec 2025 18:35:54 GMT
- Title: Large Video Planner Enables Generalizable Robot Control
- Authors: Boyuan Chen, Tianyuan Zhang, Haoran Geng, Kiwhan Song, Caiyi Zhang, Peihao Li, William T. Freeman, Jitendra Malik, Pieter Abbeel, Russ Tedrake, Vincent Sitzmann, Yilun Du,
- Abstract summary: General-purpose robots require decision-making models that generalize across diverse tasks and environments.<n>Recent works build robot foundation models by extending multimodal large language models (LMs) with action outputs, creating vision--action (VLA) systems.<n>We explore an alternative paradigm of using large-scale video pretraining as a primary modality for building robot foundation models.
- Score: 117.49024534548319
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: General-purpose robots require decision-making models that generalize across diverse tasks and environments. Recent works build robot foundation models by extending multimodal large language models (MLLMs) with action outputs, creating vision-language-action (VLA) systems. These efforts are motivated by the intuition that MLLMs' large-scale language and image pretraining can be effectively transferred to the action output modality. In this work, we explore an alternative paradigm of using large-scale video pretraining as a primary modality for building robot foundation models. Unlike static images and language, videos capture spatio-temporal sequences of states and actions in the physical world that are naturally aligned with robotic behavior. We curate an internet-scale video dataset of human activities and task demonstrations, and train, for the first time at a foundation-model scale, an open video model for generative robotics planning. The model produces zero-shot video plans for novel scenes and tasks, which we post-process to extract executable robot actions. We evaluate task-level generalization through third-party selected tasks in the wild and real-robot experiments, demonstrating successful physical execution. Together, these results show robust instruction following, strong generalization, and real-world feasibility. We release both the model and dataset to support open, reproducible video-based robot learning. Our website is available at https://www.boyuan.space/large-video-planner/.
Related papers
- VideoVLA: Video Generators Can Be Generalizable Robot Manipulators [86.70243911696616]
Generalization in robot manipulation is essential for deploying robots in open-world environments.<n>We present VideoVLA, a simple approach that explores the potential of transforming large video generation models into robotic VLA manipulators.
arXiv Detail & Related papers (2025-12-07T18:57:15Z) - Scalable Vision-Language-Action Model Pretraining for Robotic Manipulation with Real-Life Human Activity Videos [42.86535655563404]
We develop a fully-automated holistic human activity analysis approach for arbitrary human hand videos.<n>We process a large volume of egocentric videos and create a hand-VLA training dataset containing 1M episodes and 26M frames.<n>We design a dexterous hand VLA model architecture and pretrain the model on this dataset.
arXiv Detail & Related papers (2025-10-24T15:39:31Z) - VidBot: Learning Generalizable 3D Actions from In-the-Wild 2D Human Videos for Zero-Shot Robotic Manipulation [53.63540587160549]
VidBot is a framework enabling zero-shot robotic manipulation using learned 3D affordance from in-the-wild monocular RGB-only human videos.<n> VidBot paves the way for leveraging everyday human videos to make robot learning more scalable.
arXiv Detail & Related papers (2025-03-10T10:04:58Z) - Latent Action Pretraining from Videos [156.88613023078778]
We introduce Latent Action Pretraining for general Action models (LAPA)<n>LAPA is an unsupervised method for pretraining Vision-Language-Action (VLA) models without ground-truth robot action labels.<n>We propose a method to learn from internet-scale videos that do not have robot action labels.
arXiv Detail & Related papers (2024-10-15T16:28:09Z) - LLaRA: Supercharging Robot Learning Data for Vision-Language Policy [56.505551117094534]
We introduce LLaRA: Large Language and Robotics Assistant, a framework that formulates robot action policy as visuo-textual conversations.<n>First, we present an automated pipeline to generate conversation-style instruction tuning data for robots from existing behavior cloning datasets.<n>We show that a VLM finetuned with a limited amount of such datasets can produce meaningful action decisions for robotic control.
arXiv Detail & Related papers (2024-06-28T17:59:12Z) - Learning Generalizable Robotic Reward Functions from "In-The-Wild" Human
Videos [59.58105314783289]
Domain-agnostic Video Discriminator (DVD) learns multitask reward functions by training a discriminator to classify whether two videos are performing the same task.
DVD can generalize by virtue of learning from a small amount of robot data with a broad dataset of human videos.
DVD can be combined with visual model predictive control to solve robotic manipulation tasks on a real WidowX200 robot in an unseen environment from a single human demo.
arXiv Detail & Related papers (2021-03-31T05:25:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.