FuguReport

Action Images: End-to-End Policy Learning via Multiview Video Generation

Authors: Haoyu Zhen, Zixian Gao, Qiao Sun, Yilin Zhao, Yuncong Yang, Yilun Du, Tsun-Hsuan Wang, Yi-Ling Qiao, Chuang Gan
Affiliations: Genesis AI / UMass Amherst / The University of Tokyo / Harvard University / NVIDIA
Categories: Method / Policy Learning / Integrated world action model; Application / Video Generation / Multiview action video synthesis; Evaluation / Model Evaluation / Zero-shot success rate assessment
License: CC BY 4.0

Abstract Overview

The paper introduces Action Images, a unified world-action model that formulates robot policy learning as multiview video generation. It converts each 7-DoF robot action (end-effector position, orientation, gripper openness) into pixel-grounded multiview action images by projecting three semantic 3D points into image space and rendering them as RGB Gaussian heatmaps. A pretrained video generator (Wan 2.2) is fine-tuned to jointly model observation videos and action videos under a shared representation, using masking strategies that support joint generation, action-conditioned video generation, video-to-action labeling, and video-only generation. Experiments on RLBench and real-world robotic settings with an xArm robot demonstrate improved zero-shot policy success rates and stronger video-action joint generation quality compared to several world-model and policy baselines.
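
The projection-and-rendering step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation. The pinhole camera model, the choice of rotation columns for the normal and up directions, the 5 cm point offset, the one-point-per-channel layout, and the Gaussian width are all assumptions. Gripper openness is left out because the summary does not say how it is drawn.

```python
import numpy as np

def project_point(point_world, K, T_world_to_cam):
    """Project a 3D world point to (u, v) pixel coordinates with a pinhole camera.

    Assumes the point lies in front of the camera (positive depth).
    """
    p_cam = (T_world_to_cam @ np.append(point_world, 1.0))[:3]
    uvw = K @ p_cam
    return uvw[:2] / uvw[2]

def gaussian_heatmap(center_uv, height, width, sigma=3.0):
    """Render a 2D Gaussian blob that peaks (value 1) at center_uv."""
    v, u = np.mgrid[0:height, 0:width]
    d2 = (u - center_uv[0]) ** 2 + (v - center_uv[1]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

def action_image(position, rotation, K, T_world_to_cam,
                 height, width, offset=0.05, sigma=3.0):
    """Render one view of an action image as an (H, W, 3) array.

    Hypothetical layout: channel 0 = end-effector position,
    channel 1 = point offset along the normal axis (rotation column 2),
    channel 2 = point offset along the up axis (rotation column 1).
    """
    normal_pt = position + offset * rotation[:, 2]
    up_pt = position + offset * rotation[:, 1]
    channels = [
        gaussian_heatmap(project_point(p, K, T_world_to_cam), height, width, sigma)
        for p in (position, normal_pt, up_pt)
    ]
    return np.stack(channels, axis=-1)
```

Calling `action_image` once per camera, with that camera's intrinsics and extrinsics, would give the multiview set for a single time step. Stacking the results over time gives an action video.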

Novelty

The main novelty is representing robot control as interpretable, pixel-grounded multiview action images: RGB heatmaps that encode end-effector position, orientation, and gripper state. This makes actions native to the same video space as observations, so a single video backbone can function as a zero-shot policy without a separate policy head or action module, while also unifying joint generation, action-conditioned video generation, and video-to-action labeling in one model.
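
One way to picture how a single model can serve several tasks is to mark, per mode, which stream is generated and which is supplied as clean conditioning. The sketch below is an assumption-laden illustration rather than the paper's training code. The mode names, the `build_masks` and `corrupt` helpers, and the linear noise mixing are all hypothetical.

```python
import numpy as np

# Hypothetical mode table: (generate_video, generate_action, action_present).
MODES = {
    "joint_generation": (True, True, True),
    "action_conditioned_video": (True, False, True),
    "video_to_action": (False, True, True),
    "video_only": (True, False, False),
}

def build_masks(mode, num_frames):
    """Return per-frame boolean masks saying which stream is to be generated."""
    gen_video, gen_action, action_present = MODES[mode]
    return {
        "video_generate": np.full(num_frames, gen_video),
        "action_generate": np.full(num_frames, gen_action),
        "action_present": action_present,
    }

def corrupt(latents, generate_mask, noise, noise_level):
    """Noise only the frames marked for generation; keep conditioning frames clean.

    The linear mix of latents and noise is an assumed noising rule.
    """
    mask = generate_mask.reshape(-1, *([1] * (latents.ndim - 1)))
    noised = (1.0 - noise_level) * latents + noise_level * noise
    return np.where(mask, noised, latents)
```

In this picture, video-to-action labeling keeps the observation stream clean and noises only the action stream, while action-conditioned video generation does the reverse.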

Results

In zero-shot evaluation, the method achieves the highest overall task success rates among the compared methods in both the RLBench and real-world xArm settings (e.g., 60% on reach target and 50% on close drawer in RLBench, versus at most 5% and 35%, respectively, for the baselines). For joint video-action generation, it reports the best video metrics (PSNR 23.48, SSIM 78.62%, FVD 143.74, LPIPS 0.209) with competitive action accuracy (3D error 12.2×10⁻³). It also outperforms task-specific baselines on action-conditioned video generation and video-to-action labeling.
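
For readers unfamiliar with the video metrics, PSNR is the simplest to state: it is 10·log10(MAX² / MSE) between a reference frame and a generated frame, where higher is better. The sketch below uses the standard definition. The function name and the assumption that pixel values lie in [0, 1] are illustrative choices, and the paper's exact evaluation protocol is not given in this summary.

```python
import numpy as np

def psnr(reference, generated, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two arrays of equal shape."""
    diff = reference.astype(np.float64) - generated.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0.0:
        return float("inf")  # identical inputs
    return 10.0 * np.log10(max_value ** 2 / mse)
```

A video-level score is commonly obtained by averaging the per-frame values.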

Key Points

  1. Action Images converts 7-DoF robot actions into multiview RGB Gaussian heatmap videos that explicitly encode end-effector position, orientation (via normal and up points), and gripper openness in pixel space, with a geometric decoder that recovers continuous 7-DoF actions via ray casting and multi-view matching.
  2. The model is trained as a unified world-action generator on a mixture of RLBench, DROID, and BridgeV2 data using masking-based objectives over a fine-tuned Wan 2.2 backbone, enabling one model to handle joint generation, action-conditioned video prediction, video-to-action labeling, and video-only modeling.
  3. Empirically, the approach achieves higher zero-shot task success rates than the compared policy and world-model baselines (including π₀.₅, MolmoAct, TesserAct, and Cosmos-Policy) across RLBench and real-world settings, with particularly notable gains under distribution shift involving unseen objects and environments.
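
The geometric decoding mentioned in the first key point can be sketched as three steps: find each heatmap's peak pixel, cast a ray through it, then intersect the rays from several views. This is a minimal sketch under assumptions, not the paper's decoder. It assumes pinhole cameras with known poses, uses a simple argmax peak, combines views by a least-squares ray intersection, and rebuilds orientation by Gram-Schmidt with an assumed column order of side, up, normal. Gripper openness is not recovered here.

```python
import numpy as np

def heatmap_peak(heatmap):
    """Return the (u, v) pixel of the heatmap maximum."""
    row, col = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return np.array([col, row], dtype=np.float64)

def pixel_to_ray(uv, K, T_cam_to_world):
    """Cast a world-space ray (origin, unit direction) through a pixel."""
    d_cam = np.linalg.inv(K) @ np.array([uv[0], uv[1], 1.0])
    d_world = T_cam_to_world[:3, :3] @ d_cam
    return T_cam_to_world[:3, 3], d_world / np.linalg.norm(d_world)

def triangulate(origins, directions):
    """Least-squares 3D point closest to all rays (needs non-parallel rays)."""
    A, b = np.zeros((3, 3)), np.zeros(3)
    for o, d in zip(origins, directions):
        P = np.eye(3) - np.outer(d, d)  # projector orthogonal to the ray
        A += P
        b += P @ o
    return np.linalg.solve(A, b)

def frame_from_points(position, normal_pt, up_pt):
    """Rebuild a right-handed rotation matrix from the three decoded points."""
    normal = normal_pt - position
    normal /= np.linalg.norm(normal)
    up = up_pt - position
    up -= np.dot(up, normal) * normal  # remove the component along normal
    up /= np.linalg.norm(up)
    side = np.cross(up, normal)
    return np.column_stack([side, up, normal])
```

Running `heatmap_peak`, `pixel_to_ray`, and `triangulate` once per semantic point gives three 3D points, from which `frame_from_points` yields the orientation.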

This page was created using generative AI such as GPT-5, Claude Opus 4, Gemini 3, Gemini 3.1 Flash Image, and their higher-end successor versions. No guarantee can be made regarding its contents.