Action Images: End-to-End Policy Learning via Multiview Video Generation
Abstract Overview
The paper introduces Action Images, a unified world-action model that formulates robot policy learning as multiview video generation. It converts each 7-DoF robot action (end-effector position, orientation, gripper openness) into pixel-grounded multiview action images by projecting three semantic 3D points into image space and rendering them as RGB Gaussian heatmaps. A pretrained video generator (Wan 2.2) is fine-tuned to jointly model observation videos and action videos under a shared representation, using masking strategies that support joint generation, action-conditioned video generation, video-to-action labeling, and video-only generation. Experiments on RLBench and real-world robotic settings with an xArm robot demonstrate improved zero-shot policy success rates and stronger video-action joint generation quality compared to several world-model and policy baselines.
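The projection-and-rendering step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes a standard pinhole camera, assumes the three semantic points are the end-effector position plus a normal point and an up point, and assumes one point per RGB channel. The function names, intrinsics, and `sigma` value are hypothetical. The encoding of gripper openness is not specified in this summary, so it is left out.

```python
import numpy as np

def project_points(points_world, K, T_cam_world):
    """Project Nx3 world points to Nx2 pixel coordinates (pinhole model)."""
    ones = np.ones((len(points_world), 1))
    pts_h = np.hstack([points_world, ones])        # homogeneous coordinates, Nx4
    pts_cam = (T_cam_world @ pts_h.T).T[:, :3]     # world frame -> camera frame
    uvw = (K @ pts_cam.T).T                        # apply camera intrinsics
    return uvw[:, :2] / uvw[:, 2:3]                # perspective divide

def render_action_image(pixels, height, width, sigma=3.0):
    """Render up to three pixel locations as Gaussian blobs, one per RGB channel.

    The channel assignment is an assumption made for this sketch.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    img = np.zeros((height, width, 3), dtype=np.float32)
    for c, (u, v) in enumerate(pixels[:3]):
        img[..., c] = np.exp(-((xs - u) ** 2 + (ys - v) ** 2) / (2 * sigma ** 2))
    return img

# Hypothetical camera: 64x64 image, focal length 100 px, camera at the world origin.
K = np.array([[100.0, 0.0, 32.0],
              [0.0, 100.0, 32.0],
              [0.0, 0.0, 1.0]])
T_cam_world = np.eye(4)

# Position, normal point, and up point of the end-effector (illustrative values).
points = np.array([[0.0, 0.0, 1.0],
                   [0.1, 0.0, 1.0],
                   [0.0, 0.1, 1.0]])

pixels = project_points(points, K, T_cam_world)
action_image = render_action_image(pixels, 64, 64)
```

Repeating this for every camera and every timestep would give a multiview action video that lives in the same pixel space as the observation video.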
Novelty
The main novelty is representing robot control as interpretable, pixel-grounded multiview action images (RGB heatmaps encoding end-effector position, orientation, and gripper state), which makes actions native to the same video space as observations. This allows a single video backbone to function as a zero-shot policy without a separate policy head or action module, while also unifying joint generation, action-conditioned video generation, and video-to-action labeling in one model.
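One way to picture how a single model can serve several modes is to assign each stream (observation video, action video) a role per mode, and to compute the training loss only on the stream being generated. The sketch below is an assumption about how such masking could be organized, not the paper's actual scheme. The mode names, the `ROLES` table, and `loss_mask` are all hypothetical.

```python
import numpy as np

# Role of each stream in each mode (hypothetical naming):
#   "generate" -> the model must produce this stream (contributes to the loss)
#   "given"    -> the stream is supplied as clean conditioning
#   "absent"   -> the stream is not used at all
ROLES = {
    "joint":           {"video": "generate", "action": "generate"},
    "action_to_video": {"video": "generate", "action": "given"},
    "video_to_action": {"video": "given",    "action": "generate"},
    "video_only":      {"video": "generate", "action": "absent"},
}

def loss_mask(mode, num_frames):
    """Return per-frame boolean loss masks for both streams in the given mode."""
    if mode not in ROLES:
        raise ValueError(f"unknown mode: {mode!r}")
    return {
        stream: np.full(num_frames, role == "generate")
        for stream, role in ROLES[mode].items()
    }

# Video-to-action labeling: the observation video is given, actions are predicted.
masks = loss_mask("video_to_action", num_frames=4)
```

Under this view, switching tasks changes only which stream is masked, while the backbone stays the same.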
Results
In zero-shot evaluation, the method achieves the highest overall task success rates among the compared baselines in both the RLBench and real-world xArm settings (e.g., 60% on reach target and 50% on close drawer in RLBench, versus at most 5% and 35% for baselines). For joint video-action generation, it reports the best video metrics (PSNR 23.48, SSIM 78.62%, FVD 143.74, LPIPS 0.209) with competitive action accuracy (3D error 12.2×10⁻³). It also outperforms task-specific baselines on action-conditioned video generation and video-to-action labeling.
Key Points
- Action Images converts 7-DoF robot actions into multiview RGB Gaussian heatmap videos that explicitly encode end-effector position, orientation (via normal and up points), and gripper openness in pixel space. A geometric decoder recovers continuous 7-DoF actions from these images via ray casting and multi-view matching.
- The model is trained as a unified world-action generator on a mixture of RLBench, DROID, and BridgeV2 data using masking-based objectives over a fine-tuned Wan 2.2 backbone, enabling one model to handle joint generation, action-conditioned video prediction, video-to-action labeling, and video-only modeling.
- Empirically, the approach achieves higher zero-shot task success rates than compared policy and world-model baselines (including π₀.₅, MolmoAct, TesserAct, and Cosmos-Policy) across RLBench and real-world settings, with particularly notable gains under distribution shift involving unseen objects and environments.
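The geometric decoding mentioned in the first key point can be illustrated with a small sketch. It is not the paper's decoder: it takes the heatmap peak in each view, casts a ray through that pixel, and swaps in a plain least-squares ray intersection for the multi-view matching step. It then builds a rotation from the recovered position, normal point, and up point by Gram-Schmidt. All function names, camera parameters, and axis conventions are assumptions, and gripper decoding is omitted.

```python
import numpy as np

def heatmap_peak(channel):
    """Return the (u, v) pixel with the maximum response in a 2D heatmap."""
    v, u = np.unravel_index(np.argmax(channel), channel.shape)
    return np.array([u, v], dtype=float)

def pixel_ray(pixel, K, T_world_cam):
    """Cast a ray through a pixel; returns (origin, unit direction) in the world frame."""
    d_cam = np.linalg.inv(K) @ np.array([pixel[0], pixel[1], 1.0])
    R, t = T_world_cam[:3, :3], T_world_cam[:3, 3]
    d_world = R @ d_cam
    return t, d_world / np.linalg.norm(d_world)

def triangulate(rays):
    """Least-squares 3D point closest to all rays (needs at least two non-parallel rays)."""
    A, b = np.zeros((3, 3)), np.zeros(3)
    for origin, direction in rays:
        P = np.eye(3) - np.outer(direction, direction)  # projector orthogonal to the ray
        A += P
        b += P @ origin
    return np.linalg.solve(A, b)

def frame_from_points(pos, normal_pt, up_pt):
    """Build a rotation matrix whose z axis points to normal_pt and y axis toward up_pt."""
    z = np.asarray(normal_pt, dtype=float) - pos
    z = z / np.linalg.norm(z)
    y = np.asarray(up_pt, dtype=float) - pos
    y = y - (y @ z) * z                                 # remove the component along z
    y = y / np.linalg.norm(y)
    x = np.cross(y, z)
    return np.stack([x, y, z], axis=1)

# Two hypothetical cameras with identical intrinsics, 0.5 m apart along x.
K = np.array([[100.0, 0.0, 32.0],
              [0.0, 100.0, 32.0],
              [0.0, 0.0, 1.0]])
T_a = np.eye(4)
T_b = np.eye(4)
T_b[0, 3] = 0.5

# Pixels where the same semantic point was detected in each view (illustrative).
rays = [pixel_ray((42.0, 37.0), K, T_a), pixel_ray((17.0, 37.0), K, T_b)]
point = triangulate(rays)
```

Running the triangulation once per semantic point would give the position, normal point, and up point in 3D, from which `frame_from_points` yields the orientation part of the action.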
References
- arXiv: https://arxiv.org/abs/2604.06168v1
- Fugu-MT: https://fugumt.com/fugumt/paper_check/2604.06168v1
- Hugging Face Papers: https://huggingface.co/papers/2604.06168
- Project: https://ActionImages.github.io