VideoVLA: Video Generators Can Be Generalizable Robot Manipulators
- URL: http://arxiv.org/abs/2512.06963v1
- Date: Sun, 07 Dec 2025 18:57:15 GMT
- Title: VideoVLA: Video Generators Can Be Generalizable Robot Manipulators
- Authors: Yichao Shen, Fangyun Wei, Zhiying Du, Yaobo Liang, Yan Lu, Jiaolong Yang, Nanning Zheng, Baining Guo
- Abstract summary: Generalization in robot manipulation is essential for deploying robots in open-world environments. We present VideoVLA, a simple approach that explores the potential of transforming large video generation models into robotic VLA manipulators.
- Score: 86.70243911696616
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Generalization in robot manipulation is essential for deploying robots in open-world environments and advancing toward artificial general intelligence. While recent Vision-Language-Action (VLA) models leverage large pre-trained understanding models for perception and instruction following, their ability to generalize to novel tasks, objects, and settings remains limited. In this work, we present VideoVLA, a simple approach that explores the potential of transforming large video generation models into robotic VLA manipulators. Given a language instruction and an image, VideoVLA predicts an action sequence as well as the future visual outcomes. Built on a multi-modal Diffusion Transformer, VideoVLA jointly models video, language, and action modalities, using pre-trained video generative models for joint visual and action forecasting. Our experiments show that high-quality imagined futures correlate with reliable action predictions and task success, highlighting the importance of visual imagination in manipulation. VideoVLA demonstrates strong generalization, including imitating other embodiments' skills and handling novel objects. This dual-prediction strategy - forecasting both actions and their visual consequences - explores a paradigm shift in robot learning and unlocks generalization capabilities in manipulation systems.
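The abstract describes a dual-prediction interface: conditioned on a language instruction and an image, a multi-modal Diffusion Transformer jointly denoises action tokens and future-video tokens in one stream. The sketch below is a minimal PyTorch illustration of that interface only; all names (`VideoVLASketch`, `action_dim`, the linear layers standing in for the video VAE and language encoders) are assumptions, not the authors' implementation, and diffusion timestep conditioning is omitted for brevity.

```python
import torch
import torch.nn as nn

class VideoVLASketch(nn.Module):
    def __init__(self, d_model=256, n_heads=8, n_layers=4, action_dim=7):
        super().__init__()
        self.action_in = nn.Linear(action_dim, d_model)   # embed the noisy action chunk
        self.video_in = nn.Linear(d_model, d_model)       # stand-in for a video VAE encoder
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.action_out = nn.Linear(d_model, action_dim)  # denoised action chunk
        self.video_out = nn.Linear(d_model, d_model)      # denoised future-video tokens

    def forward(self, lang_emb, image_emb, noisy_actions, noisy_video):
        # lang_emb: (B, L, D), image_emb: (B, P, D) -- conditioning tokens.
        # noisy_actions: (B, H, action_dim), noisy_video: (B, V, D) -- jointly denoised.
        tokens = torch.cat([lang_emb, image_emb,
                            self.action_in(noisy_actions),
                            self.video_in(noisy_video)], dim=1)
        h = self.backbone(tokens)  # full attention couples actions with the imagined future
        a0 = lang_emb.shape[1] + image_emb.shape[1]
        v0 = a0 + noisy_actions.shape[1]
        return self.action_out(h[:, a0:v0]), self.video_out(h[:, v0:])

# Usage: one denoising step over a 16-step action chunk and 64 future-video tokens.
model = VideoVLASketch()
acts, vid = model(torch.randn(2, 12, 256), torch.randn(2, 32, 256),
                  torch.randn(2, 16, 7), torch.randn(2, 64, 256))
print(acts.shape, vid.shape)  # torch.Size([2, 16, 7]) torch.Size([2, 64, 256])
```

In a real diffusion setup the noisy action and video tokens would be denoised iteratively with timestep conditioning; the point here is only the joint token stream that lets action prediction attend to the imagined visual outcome, which the paper identifies as the source of its generalization.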
Related papers
- GeneralVLA: Generalizable Vision-Language-Action Models with Knowledge-Guided Trajectory Planning [20.646039344274556]
GeneralVLA is a hierarchical vision-language-action (VLA) model that more effectively exploits the generalization ability of foundation models. GeneralVLA successfully generates trajectories for 14 tasks, significantly outperforming state-of-the-art methods such as VoxPoser.
arXiv Detail & Related papers (2026-02-04T08:30:27Z)
- Large Video Planner Enables Generalizable Robot Control [117.49024534548319]
General-purpose robots require decision-making models that generalize across diverse tasks and environments. Recent works build robot foundation models by extending multimodal large language models (MLLMs) with action outputs, creating vision-language-action (VLA) systems. We explore an alternative paradigm of using large-scale video pretraining as a primary modality for building robot foundation models.
arXiv Detail & Related papers (2025-12-17T18:35:54Z)
- UniCoD: Enhancing Robot Policy via Unified Continuous and Discrete Representation Learning [22.84748754972181]
Building generalist robot policies that can handle diverse tasks in open-ended environments is a central challenge in robotics. To leverage knowledge from large-scale pretraining, prior work has typically built generalist policies either on top of vision-language understanding models (VLMs) or generative models. Recent unified models of generation and understanding have demonstrated strong capabilities in both comprehension and generation through large-scale pretraining. We introduce UniCoD, which acquires the ability to dynamically model high-dimensional visual features through pretraining on over 1M internet-scale instructional manipulation videos.
arXiv Detail & Related papers (2025-10-12T14:54:19Z)
- OG-VLA: 3D-Aware Vision Language Action Model via Orthographic Image Generation [68.11862866566817]
3D-aware policies achieve state-of-the-art performance on precise robot manipulation tasks, but struggle with generalization to unseen instructions, scenes, and objects. We introduce OG-VLA, a novel architecture and learning framework that combines the generalization strengths of Vision-Language-Action models (VLAs) with the robustness of 3D-aware policies.
arXiv Detail & Related papers (2025-06-01T22:15:45Z)
- VidBot: Learning Generalizable 3D Actions from In-the-Wild 2D Human Videos for Zero-Shot Robotic Manipulation [53.63540587160549]
VidBot is a framework enabling zero-shot robotic manipulation using 3D affordances learned from in-the-wild, monocular, RGB-only human videos. VidBot paves the way for leveraging everyday human videos to make robot learning more scalable.
arXiv Detail & Related papers (2025-03-10T10:04:58Z)
- Latent Action Pretraining from Videos [156.88613023078778]
We introduce Latent Action Pretraining for general Action models (LAPA), an unsupervised method for pretraining Vision-Language-Action (VLA) models without ground-truth robot action labels, learning instead from internet-scale videos that lack such labels (see the latent-action sketch after this list).
arXiv Detail & Related papers (2024-10-15T16:28:09Z)
- LLaRA: Supercharging Robot Learning Data for Vision-Language Policy [56.505551117094534]
We introduce LLaRA: Large Language and Robotics Assistant, a framework that formulates robot action policy as visuo-textual conversations. First, we present an automated pipeline to generate conversation-style instruction-tuning data for robots from existing behavior cloning datasets (see the data-conversion sketch after this list). We then show that a VLM finetuned on a limited amount of such data can produce meaningful action decisions for robotic control.
arXiv Detail & Related papers (2024-06-28T17:59:12Z)
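The LAPA entry above states what is learned (latent actions from unlabeled video) but not how. The LAPA paper's recipe is to quantize the transition between consecutive frames into a discrete latent action with a VQ-style objective, then pretrain a VLM to predict that latent. The sketch below is a toy PyTorch version of the quantization step only; all names (`LatentActionQuantizer`, `d_obs`, the MLP encoder/decoder) are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class LatentActionQuantizer(nn.Module):
    """Toy VQ-style model: encode a (frame_t, frame_t+1) feature pair into one
    of K discrete latent actions, then reconstruct frame_t+1 from frame_t."""
    def __init__(self, d_obs=512, d_latent=64, codebook_size=8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(2 * d_obs, 256), nn.ReLU(),
                                     nn.Linear(256, d_latent))
        self.codebook = nn.Embedding(codebook_size, d_latent)
        self.decoder = nn.Sequential(nn.Linear(d_obs + d_latent, 256), nn.ReLU(),
                                     nn.Linear(256, d_obs))

    def forward(self, obs_t, obs_next):
        z = self.encoder(torch.cat([obs_t, obs_next], dim=-1))
        dists = torch.cdist(z, self.codebook.weight)   # (B, K) distances to codes
        idx = dists.argmin(dim=-1)                     # discrete latent action id
        z_q = self.codebook(idx)
        z_q = z + (z_q - z).detach()                   # straight-through estimator
        recon = self.decoder(torch.cat([obs_t, z_q], dim=-1))
        return recon, idx

# Training signal: frame_t plus the latent action must explain frame_t+1.
# (A real VQ loss would also pull codebook entries toward the encodings.)
model = LatentActionQuantizer()
obs_t, obs_next = torch.randn(4, 512), torch.randn(4, 512)
recon, idx = model(obs_t, obs_next)
loss = nn.functional.mse_loss(recon, obs_next)
loss.backward()
```

Once trained, the argmin indices serve as pseudo action labels: a VLM is pretrained to predict `idx` from the current frame and instruction, and a small amount of real robot data later maps latent actions to executable ones.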
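The LLaRA entry describes a pipeline that rewrites behavior cloning data as visuo-textual conversations. A minimal sketch of that conversion follows, assuming a LLaVA-style conversation JSON; the field names, prompt template, and textual action encoding are illustrative assumptions rather than LLaRA's exact format.

```python
# Hypothetical pipeline step in the spirit of LLaRA: turn one behavior-cloning
# sample into a conversation-style instruction-tuning record for a VLM.
import json

def bc_sample_to_conversation(instruction, image_path, action):
    """action: (x, y, z, roll, pitch, yaw, gripper) end-effector command."""
    action_text = " ".join(f"{v:.3f}" for v in action)
    return {
        "image": image_path,
        "conversations": [
            {"from": "human",
             "value": f"<image>\nWhat action should the robot take to {instruction}?"},
            {"from": "gpt",
             "value": f"The robot should move to {action_text}."},
        ],
    }

if __name__ == "__main__":
    sample = bc_sample_to_conversation(
        "pick up the red block", "episodes/ep0/frame_000.png",
        (0.42, -0.08, 0.13, 0.0, 1.57, 0.0, 1.0))
    print(json.dumps(sample, indent=2))
```

Finetuning a VLM on many such records turns action prediction into ordinary visual question answering, which is how the abstract reports meaningful robotic control from a limited amount of converted data.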