Related papers: Prediction with Action: Visual Policy Learning via Joint Denoising Process

Prediction with Action: Visual Policy Learning via Joint Denoising Process

URL: http://arxiv.org/abs/2411.18179v1
Date: Wed, 27 Nov 2024 09:54:58 GMT
Title: Prediction with Action: Visual Policy Learning via Joint Denoising Process
Authors: Yanjiang Guo, Yucheng Hu, Jianke Zhang, Yen-Jen Wang, Xiaoyu Chen, Chaochao Lu, Jianyu Chen,
Abstract summary: PAD is a visual policy learning framework that unifies image Prediction and robot Action.<n>DiT seamlessly integrates images and robot states, enabling the simultaneous prediction of future images and robot actions.<n>Pad outperforms previous methods, achieving a significant 26.3% relative improvement on the full Metaworld benchmark.
Score: 14.588908033404474
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Diffusion models have demonstrated remarkable capabilities in image generation tasks, including image editing and video creation, representing a good understanding of the physical world. On the other line, diffusion models have also shown promise in robotic control tasks by denoising actions, known as diffusion policy. Although the diffusion generative model and diffusion policy exhibit distinct capabilities--image prediction and robotic action, respectively--they technically follow a similar denoising process. In robotic tasks, the ability to predict future images and generate actions is highly correlated since they share the same underlying dynamics of the physical world. Building on this insight, we introduce PAD, a novel visual policy learning framework that unifies image Prediction and robot Action within a joint Denoising process. Specifically, PAD utilizes Diffusion Transformers (DiT) to seamlessly integrate images and robot states, enabling the simultaneous prediction of future images and robot actions. Additionally, PAD supports co-training on both robotic demonstrations and large-scale video datasets and can be easily extended to other robotic modalities, such as depth images. PAD outperforms previous methods, achieving a significant 26.3% relative improvement on the full Metaworld benchmark, by utilizing a single text-conditioned visual policy within a data-efficient imitation learning setting. Furthermore, PAD demonstrates superior generalization to unseen tasks in real-world robot manipulation settings with 28.0% success rate increase compared to the strongest baseline. Project page at https://sites.google.com/view/pad-paper

Related papers

Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets [7.667819384855409]
We present Unified World Models (UWM), a framework that allows for leveraging both video and action data for policy learning. By simply controlling each diffusion timestep, UWM can flexibly represent a policy, a forward dynamics, an inverse dynamics, and a video generator. Our results suggest that UWM offers a promising step toward harnessing large, heterogeneous datasets for scalable robot learning.
arXiv Detail & Related papers (2025-04-03T17:38:59Z)
Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy [56.424032454461695]
We present Dita, a scalable framework that leverages Transformer architectures to directly denoise continuous action sequences. Dita employs in-context conditioning -- enabling fine-grained alignment between denoised actions and raw visual tokens from historical observations. Dita effectively integrates cross-embodiment datasets across diverse camera perspectives, observation scenes, tasks, and action spaces.
arXiv Detail & Related papers (2025-03-25T15:19:56Z)
Unified Video Action Model [47.88377984526902]
A unified video and action model holds significant promise for robotics, where videos provide rich scene information for action prediction. We introduce the Unified Video Action model (UVA), which jointly optimize video and action predictions to achieve both high accuracy and efficient action inference. Via an extensive set of experiments, we demonstrate that UVA can serve as a general-purpose solution for a wide range of robotics tasks.
arXiv Detail & Related papers (2025-02-28T21:38:17Z)
Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations [19.45821593625599]
Video diffusion models (VDMs) have demonstrated the capability to accurately predict future image sequences. We propose the Video Prediction Policy (VPP), a generalist robotic policy conditioned on the predictive visual representations from VDMs. VPP consistently outperforms existing methods across two simulated and two real-world benchmarks.
arXiv Detail & Related papers (2024-12-19T12:48:40Z)
VidMan: Exploiting Implicit Dynamics from Video Diffusion Model for Effective Robot Manipulation [79.00294932026266]
VidMan is a novel framework that employs a two-stage training mechanism to enhance stability and improve data utilization efficiency. Our framework outperforms state-of-the-art baseline model GR-1 on the CALVIN benchmark, achieving a 11.7% relative improvement, and demonstrates over 9% precision gains on the OXE small-scale dataset.
arXiv Detail & Related papers (2024-11-14T03:13:26Z)
Latent Action Pretraining from Videos [156.88613023078778]
We introduce Latent Action Pretraining for general Action models (LAPA) LAPA is an unsupervised method for pretraining Vision-Language-Action (VLA) models without ground-truth robot action labels. We propose a method to learn from internet-scale videos that do not have robot action labels.
arXiv Detail & Related papers (2024-10-15T16:28:09Z)
Learning Manipulation by Predicting Interaction [85.57297574510507]
We propose a general pre-training pipeline that learns Manipulation by Predicting the Interaction. The experimental results demonstrate that MPI exhibits remarkable improvement by 10% to 64% compared with previous state-of-the-art in real-world robot platforms.
arXiv Detail & Related papers (2024-06-01T13:28:31Z)
Learning an Actionable Discrete Diffusion Policy via Large-Scale Actionless Video Pre-Training [69.54948297520612]
Learning a generalist embodied agent poses challenges, primarily stemming from the scarcity of action-labeled robotic datasets. We introduce a novel framework to tackle these challenges, which leverages a unified discrete diffusion to combine generative pre-training on human videos and policy fine-tuning on a small number of action-labeled robot videos. Our method generates high-fidelity future videos for planning and enhances the fine-tuned policies compared to previous state-of-the-art approaches.
arXiv Detail & Related papers (2024-02-22T09:48:47Z)
Movement Primitive Diffusion: Learning Gentle Robotic Manipulation of Deformable Objects [14.446751610174868]
Movement Primitive Diffusion (MPD) is a novel method for imitation learning (IL) in robot-assisted surgery. MPD combines the versatility of diffusion-based imitation learning (DIL) with the high-quality motion generation capabilities of Probabilistic Dynamic Movement Primitives (ProDMPs) We evaluate MPD across various simulated and real world robotic tasks on both state and image observations.
arXiv Detail & Related papers (2023-12-15T18:24:28Z)
Learning to Act from Actionless Videos through Dense Correspondences [87.1243107115642]
We present an approach to construct a video-based robot policy capable of reliably executing diverse tasks across different robots and environments. Our method leverages images as a task-agnostic representation, encoding both the state and action information, and text as a general representation for specifying robot goals. We demonstrate the efficacy of our approach in learning policies on table-top manipulation and navigation tasks.
arXiv Detail & Related papers (2023-10-12T17:59:23Z)

This list is automatically generated from the titles and abstracts of the papers in this site.