PlayFusion: Skill Acquisition via Diffusion from Language-Annotated Play
- URL: http://arxiv.org/abs/2312.04549v1
- Date: Thu, 7 Dec 2023 18:59:14 GMT
- Title: PlayFusion: Skill Acquisition via Diffusion from Language-Annotated Play
- Authors: Lili Chen, Shikhar Bahl, Deepak Pathak
- Abstract summary: Learning from unstructured and uncurated data has become the dominant paradigm for generative approaches in language and vision.
We study this problem of learning goal-directed skill policies from unstructured play data which is labeled with language in hindsight.
Specifically, we leverage advances in diffusion models to learn a multi-task diffusion model to extract robotic skills from play data.
- Score: 47.052953955624886
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning from unstructured and uncurated data has become the dominant
paradigm for generative approaches in language and vision. Such unstructured
and unguided behavior data, commonly known as play, is also easier to collect
in robotics but much more difficult to learn from due to its inherently
multimodal, noisy, and suboptimal nature. In this paper, we study this problem
of learning goal-directed skill policies from unstructured play data which is
labeled with language in hindsight. Specifically, we leverage advances in
diffusion models to learn a multi-task diffusion model to extract robotic
skills from play data. Using a conditional denoising diffusion process in the
space of states and actions, we can gracefully handle the complexity and
multimodality of play data and generate diverse and interesting robot
behaviors. To make diffusion models more useful for skill learning, we
encourage robotic agents to acquire a vocabulary of skills by introducing
discrete bottlenecks into the conditional behavior generation process. In our
experiments, we demonstrate the effectiveness of our approach across a wide
variety of environments in both simulation and the real world. Results,
visualizations, and videos are available at https://play-fusion.github.io
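The abstract names two concrete mechanisms: a conditional denoising diffusion process over states and actions, and a discrete bottleneck on the conditioning that encourages a vocabulary of skills. Below is a minimal sketch of those two ideas, not the authors' implementation: the class names, network sizes, the simplified noise schedule, and the use of plain vector quantization as the discrete bottleneck are all assumptions made for illustration.

```python
# Minimal sketch (not the authors' code): a conditional denoising diffusion
# policy over action chunks with a vector-quantized "skill" bottleneck on the
# language conditioning. All names, dimensions, and the simplified noise
# schedule below are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VQBottleneck(nn.Module):
    """Snaps a continuous conditioning vector to its nearest codebook entry."""

    def __init__(self, num_codes: int, dim: int):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):                              # z: (B, dim)
        dists = torch.cdist(z, self.codebook.weight)   # (B, num_codes)
        idx = dists.argmin(dim=-1)                     # discrete skill index
        z_q = self.codebook(idx)
        # Straight-through estimator; codebook/commitment losses omitted here.
        return z + (z_q - z).detach(), idx


class DiffusionSkillPolicy(nn.Module):
    """Predicts the noise added to an action chunk, given state and skill code."""

    def __init__(self, state_dim=10, act_dim=7, horizon=8, lang_dim=384, num_codes=64):
        super().__init__()
        self.horizon, self.act_dim = horizon, act_dim
        self.lang_proj = nn.Linear(lang_dim, 128)
        self.bottleneck = VQBottleneck(num_codes, 128)
        self.net = nn.Sequential(
            nn.Linear(horizon * act_dim + state_dim + 128 + 1, 512),
            nn.ReLU(),
            nn.Linear(512, horizon * act_dim),
        )

    def forward(self, noisy_actions, state, lang_emb, t):
        # noisy_actions: (B, horizon, act_dim); t: (B,) diffusion time in [0, 1]
        z, _ = self.bottleneck(self.lang_proj(lang_emb))
        x = torch.cat([noisy_actions.flatten(1), state, z, t[:, None]], dim=-1)
        return self.net(x).view(-1, self.horizon, self.act_dim)


# One training step: noise a clean action chunk from play data and regress the
# noise, conditioned on the current state and a hindsight language embedding
# (e.g. from a frozen sentence encoder).
policy = DiffusionSkillPolicy()
actions = torch.randn(4, 8, 7)            # clean action chunks
state = torch.randn(4, 10)
lang = torch.randn(4, 384)                # hindsight language embeddings
t = torch.rand(4)
noise = torch.randn_like(actions)
# Simplified variance-preserving interpolation, standing in for a real schedule.
noisy = torch.sqrt(1 - t)[:, None, None] * actions + torch.sqrt(t)[:, None, None] * noise
loss = F.mse_loss(policy(noisy, state, lang, t), noise)
loss.backward()
```

At inference time, the same network would be applied iteratively to denoise an action chunk starting from pure noise, with the quantized code determining which skill the chunk expresses.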
Related papers
- Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets [7.667819384855409]
We present Unified World Models (UWM), a framework for leveraging both video and action data for policy learning.
By simply controlling each diffusion timestep, UWM can flexibly represent a policy, a forward dynamics, an inverse dynamics, and a video generator.
Our results suggest that UWM offers a promising step toward harnessing large, heterogeneous datasets for scalable robot learning.
arXiv Detail & Related papers (2025-04-03T17:38:59Z)
- NIL: No-data Imitation Learning by Leveraging Pre-trained Video Diffusion Models [36.05972290909729]
We propose a data-independent approach for skill acquisition that learns 3D motor skills from 2D-generated videos.
In humanoid robot tasks, we demonstrate that 'No-data Imitation Learning' (NIL) outperforms baselines trained on 3D motion-capture data.
arXiv Detail & Related papers (2025-03-13T17:59:24Z)
- Towards Multi-Task Multi-Modal Models: A Video Generative Perspective [5.495245220300184]
This thesis chronicles our endeavor to build multi-task models for generating videos and other modalities under diverse conditions.
We unveil a novel approach to mapping bidirectionally between visual observation and interpretable lexical terms.
Our scalable visual token representation proves beneficial across generation, compression, and understanding tasks.
arXiv Detail & Related papers (2024-05-26T23:56:45Z)
- Learning an Actionable Discrete Diffusion Policy via Large-Scale Actionless Video Pre-Training [69.54948297520612]
Learning a generalist embodied agent poses challenges, primarily stemming from the scarcity of action-labeled robotic datasets.
We introduce a novel framework to tackle these challenges, which leverages a unified discrete diffusion to combine generative pre-training on human videos and policy fine-tuning on a small number of action-labeled robot videos.
Our method generates high-fidelity future videos for planning and enhances the fine-tuned policies compared to previous state-of-the-art approaches.
arXiv Detail & Related papers (2024-02-22T09:48:47Z)
- Any-point Trajectory Modeling for Policy Learning [64.23861308947852]
We introduce Any-point Trajectory Modeling (ATM) to predict future trajectories of arbitrary points within a video frame.
ATM outperforms strong video pre-training baselines by 80% on average.
We show effective transfer learning of manipulation skills from human videos and videos from a different robot morphology.
arXiv Detail & Related papers (2023-12-28T23:34:43Z)
- SkillDiffuser: Interpretable Hierarchical Planning via Skill Abstractions in Diffusion-Based Task Execution [75.2573501625811]
Diffusion models have demonstrated strong potential for robotic trajectory planning.
However, generating coherent trajectories from high-level instructions remains challenging.
We propose SkillDiffuser, an end-to-end hierarchical planning framework.
arXiv Detail & Related papers (2023-12-18T18:16:52Z)
- Diffusion Language Models Can Perform Many Tasks with Scaling and Instruction-Finetuning [56.03057119008865]
We show that scaling diffusion language models can effectively make them strong language learners.
We build competent diffusion language models at scale by first acquiring knowledge from massive data.
Experiments show that scaling diffusion language models consistently improves performance across downstream language tasks.
arXiv Detail & Related papers (2023-08-23T16:01:12Z)
- XSkill: Cross Embodiment Skill Discovery [41.624343257852146]
XSkill is an imitation learning framework that discovers a cross-embodiment representation called skill prototypes purely from unlabeled human and robot manipulation videos.
Our experiments in simulation and real-world environments show that the discovered skill prototypes facilitate skill transfer and composition for unseen tasks.
arXiv Detail & Related papers (2023-07-19T12:51:28Z)
- Scaling Robot Learning with Semantically Imagined Experience [21.361979238427722]
Recent advances in robot learning have shown promise in enabling robots to perform manipulation tasks.
One of the key contributing factors to this progress is the scale of robot data used to train the models.
We propose an alternative route and leverage text-to-image foundation models widely used in computer vision and natural language processing.
arXiv Detail & Related papers (2023-02-22T18:47:51Z)
- Visual Imitation Made Easy [102.36509665008732]
We present an alternate interface for imitation that simplifies the data collection process while allowing for easy transfer to robots.
We use commercially available reacher-grabber assistive tools both as a data collection device and as the robot's end-effector.
We experimentally evaluate on two challenging tasks: non-prehensile pushing and prehensile stacking, with 1000 diverse demonstrations for each task.
arXiv Detail & Related papers (2020-08-11T17:58:50Z)