Learning Human Skill Generators at Key-Step Levels
- URL: http://arxiv.org/abs/2502.08234v1
- Date: Wed, 12 Feb 2025 09:21:40 GMT
- Title: Learning Human Skill Generators at Key-Step Levels
- Authors: Yilu Wu, Chenhui Zhu, Shuai Wang, Hanlin Wang, Jing Wang, Zhaoxiang Zhang, Limin Wang
- Abstract summary: Key-step Skill Generation (KS-Gen) aims at reducing the complexity of generating human skill videos.
Given the initial state and a skill description, the task is to generate video clips of key steps to complete the skill.
Considering the complexity of KS-Gen, we propose a new framework for this task.
- Score: 56.91737190115577
- License:
- Abstract: We are committed to learning human skill generators at key-step levels. Skill generation is a challenging endeavor, but its successful implementation could greatly facilitate human skill learning and provide more experience for embodied intelligence. Although current video generation models can synthesize simple, atomic human operations, they struggle with human skills because of their complex procedural structure. Human skills involve multi-step, long-duration actions and complex scene transitions, so existing naive auto-regressive methods for synthesizing long videos cannot generate human skills. To address this, we propose a novel task, Key-step Skill Generation (KS-Gen), aimed at reducing the complexity of generating human skill videos. Given the initial state and a skill description, the task is to generate video clips of the key steps needed to complete the skill, rather than a full-length video. To support this task, we introduce a carefully curated dataset and define multiple evaluation metrics to assess performance. Considering the complexity of KS-Gen, we propose a new framework for this task. First, a multimodal large language model (MLLM) generates descriptions of the key steps using retrieval augmentation. Next, a Key-step Image Generator (KIG) addresses the discontinuity between key steps in skill videos. Finally, a video generation model uses these descriptions and key-step images to generate video clips of the key steps with high temporal consistency. We offer a detailed analysis of the results, hoping to provide more insights into human skill generation. All models and data are available at https://github.com/MCG-NJU/KS-Gen.
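The abstract describes a three-stage pipeline: an MLLM plans key-step descriptions with retrieval augmentation, a Key-step Image Generator (KIG) produces an image for each step to bridge scene discontinuities, and a video model animates each step into a clip. The sketch below only illustrates that data flow under assumed interfaces; every class name and signature here is hypothetical and does not reflect the authors' released code at https://github.com/MCG-NJU/KS-Gen.

```python
from dataclasses import dataclass
from typing import Any, List, Protocol

# Hypothetical interfaces standing in for the paper's MLLM planner, KIG,
# and video generation model; the real implementations are in the KS-Gen repo.

@dataclass
class KeyStep:
    description: str   # key-step text proposed by the MLLM planner
    image: Any          # key-step image bridging discontinuities between steps
    clip: Any           # short video clip generated for this step

class StepPlanner(Protocol):
    def plan(self, skill: str, initial_frame: Any, retrieved: List[str]) -> List[str]: ...

class KeyStepImageGenerator(Protocol):
    def generate(self, previous_image: Any, description: str) -> Any: ...

class ClipGenerator(Protocol):
    def generate(self, keystep_image: Any, description: str) -> Any: ...

def ks_gen_pipeline(skill: str, initial_frame: Any, retrieved: List[str],
                    planner: StepPlanner, kig: KeyStepImageGenerator,
                    vid_gen: ClipGenerator) -> List[KeyStep]:
    """Illustrative orchestration of the KS-Gen stages described in the abstract."""
    # 1) MLLM proposes key-step descriptions, conditioned on retrieved exemplars.
    descriptions = planner.plan(skill, initial_frame, retrieved)
    steps: List[KeyStep] = []
    previous_image = initial_frame
    for desc in descriptions:
        # 2) KIG synthesizes the key-step image from the previous state and description.
        image = kig.generate(previous_image, desc)
        # 3) The video model turns the description and key-step image into a clip.
        clip = vid_gen.generate(image, desc)
        steps.append(KeyStep(desc, image, clip))
        previous_image = image
    return steps
```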
Related papers
- VILP: Imitation Learning with Latent Video Planning [19.25411361966752]
This paper introduces imitation learning with latent video planning (VILP).
Our method is able to generate highly time-aligned videos from multiple views.
Our paper provides a practical example of how to effectively integrate video generation models into robot policies.
arXiv Detail & Related papers (2025-02-03T19:55:57Z) - SkillMimicGen: Automated Demonstration Generation for Efficient Skill Learning and Deployment [33.53559296053225]
We propose SkillMimicGen, an automated system for generating demonstration datasets from a few human demos.
SkillGen segments human demos into manipulation skills, adapts these skills to new contexts, and stitches them together through free-space transit and transfer motion.
We demonstrate the efficacy of SkillGen by generating over 24K demonstrations across 18 task variants in simulation from just 60 human demonstrations.
arXiv Detail & Related papers (2024-10-24T16:59:26Z) - Agentic Skill Discovery [19.5703917813767]
Language-conditioned robotic skills make it possible to apply the high-level reasoning of Large Language Models (LLMs) to low-level robotic control.
A remaining challenge is to acquire a diverse set of fundamental skills.
We introduce a novel framework for skill discovery that is entirely driven by LLMs.
arXiv Detail & Related papers (2024-05-23T19:44:03Z) - Learning an Actionable Discrete Diffusion Policy via Large-Scale Actionless Video Pre-Training [69.54948297520612]
Learning a generalist embodied agent poses challenges, primarily stemming from the scarcity of action-labeled robotic datasets.
We introduce a novel framework to tackle these challenges, which leverages a unified discrete diffusion to combine generative pre-training on human videos and policy fine-tuning on a small number of action-labeled robot videos.
Our method generates high-fidelity future videos for planning and enhances the fine-tuned policies compared to previous state-of-the-art approaches.
arXiv Detail & Related papers (2024-02-22T09:48:47Z) - XSkill: Cross Embodiment Skill Discovery [41.624343257852146]
XSkill is an imitation learning framework that discovers a cross-embodiment representation called skill prototypes purely from unlabeled human and robot manipulation videos.
Our experiments in simulation and real-world environments show that the discovered skill prototypes facilitate skill transfer and composition for unseen tasks.
arXiv Detail & Related papers (2023-07-19T12:51:28Z) - RH20T: A Comprehensive Robotic Dataset for Learning Diverse Skills in One-Shot [56.130215236125224]
A key challenge in robotic manipulation in open domains is how to acquire diverse and generalizable skills for robots.
Recent research in one-shot imitation learning has shown promise in transferring trained policies to new tasks based on demonstrations.
This paper aims to unlock the potential for an agent to generalize to hundreds of real-world skills with multi-modal perception.
arXiv Detail & Related papers (2023-07-02T15:33:31Z) - Bottom-Up Skill Discovery from Unsegmented Demonstrations for Long-Horizon Robot Manipulation [55.31301153979621]
We tackle real-world long-horizon robot manipulation tasks through skill discovery.
We present a bottom-up approach to learning a library of reusable skills from unsegmented demonstrations.
Our method has shown superior performance over state-of-the-art imitation learning methods in multi-stage manipulation tasks.
arXiv Detail & Related papers (2021-09-28T16:18:54Z) - Actionable Models: Unsupervised Offline Reinforcement Learning of Robotic Skills [93.12417203541948]
We propose the objective of learning a functional understanding of the environment by learning to reach any goal state in a given dataset.
We find that our method can operate on high-dimensional camera images and learn a variety of skills on real robots that generalize to previously unseen scenes and objects.
arXiv Detail & Related papers (2021-04-15T20:10:11Z) - Learning Generalizable Robotic Reward Functions from "In-The-Wild" Human Videos [59.58105314783289]
Domain-agnostic Video Discriminator (DVD) learns multitask reward functions by training a discriminator to classify whether two videos are performing the same task.
DVD can generalize by virtue of learning from a small amount of robot data with a broad dataset of human videos.
DVD can be combined with visual model predictive control to solve robotic manipulation tasks on a real WidowX200 robot in an unseen environment from a single human demo.
arXiv Detail & Related papers (2021-03-31T05:25:05Z)