Imitating What Works: Simulation-Filtered Modular Policy Learning from Human Videos
- URL: http://arxiv.org/abs/2602.13197v1
- Date: Fri, 13 Feb 2026 18:59:10 GMT
- Title: Imitating What Works: Simulation-Filtered Modular Policy Learning from Human Videos
- Authors: Albert J. Zhai, Kuo-Hao Zeng, Jiasen Lu, Ali Farhadi, Shenlong Wang, Wei-Chiu Ma,
- Abstract summary: We tackle prehensile manipulation, in which tasks involve grasping an object before performing various post-grasp motions.<n>Human videos offer strong signals for learning the post-grasp motions, but they are less useful for learning the prerequisite grasping behaviors.<n>We present Perceive-Simulate-Imitate (PSI), a framework for training a modular manipulation policy using human video motion data.
- Score: 56.510263910611684
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The ability to learn manipulation skills by watching videos of humans has the potential to unlock a new source of highly scalable data for robot learning. Here, we tackle prehensile manipulation, in which tasks involve grasping an object before performing various post-grasp motions. Human videos offer strong signals for learning the post-grasp motions, but they are less useful for learning the prerequisite grasping behaviors, especially for robots without human-like hands. A promising way forward is to use a modular policy design, leveraging a dedicated grasp generator to produce stable grasps. However, arbitrary stable grasps are often not task-compatible, hindering the robot's ability to perform the desired downstream motion. To address this challenge, we present Perceive-Simulate-Imitate (PSI), a framework for training a modular manipulation policy using human video motion data processed by paired grasp-trajectory filtering in simulation. This simulation step extends the trajectory data with grasp suitability labels, which allows for supervised learning of task-oriented grasping capabilities. We show through real-world experiments that our framework can be used to learn precise manipulation skills efficiently without any robot data, resulting in significantly more robust performance than using a grasp generator naively.
Related papers
- Learning Pivoting Manipulation with Force and Vision Feedback Using Optimization-based Demonstrations [20.20969802675097]
We propose a framework for learning closed-loop pivoting manipulation.<n>By leveraging computationally efficient Contact-Implicit Trajectory Optimization, we design demonstration-guided deep Reinforcement Learning.<n>We also present a sim-to-real transfer approach using a privileged training strategy, enabling the robot to perform pivoting manipulation.
arXiv Detail & Related papers (2025-08-01T21:33:46Z) - KungfuBot: Physics-Based Humanoid Whole-Body Control for Learning Highly-Dynamic Skills [58.73043119128804]
This paper presents a physics-based humanoid control framework, aiming to master highly-dynamic human behaviors such as Kungfu and dancing.<n>For motion processing, we design a pipeline to extract, filter out, correct, and retarget motions, while ensuring compliance with physical constraints.<n>For motion imitation, we formulate a bi-level optimization problem to dynamically adjust the tracking accuracy tolerance.<n>In experiments, we train whole-body control policies to imitate a set of highly-dynamic motions.
arXiv Detail & Related papers (2025-06-15T13:58:53Z) - Trajectory Adaptation using Large Language Models [0.8704964543257245]
Adapting robot trajectories based on human instructions as per new situations is essential for achieving more intuitive and scalable human-robot interactions.<n>This work proposes a flexible language-based framework to adapt generic robotic trajectories produced by off-the-shelf motion planners.<n>We utilize pre-trained LLMs to adapt trajectory waypoints by generating code as a policy for dense robot manipulation.
arXiv Detail & Related papers (2025-04-17T08:48:23Z) - MAPLE: Encoding Dexterous Robotic Manipulation Priors Learned From Egocentric Videos [43.836197294180316]
We present MAPLE, a novel method for dexterous robotic manipulation that exploits rich manipulation priors to enable efficient policy learning.<n>Specifically, we predict hand-object contact points and detailed hand poses at the moment of hand-object contact and use the learned features to train policies for downstream manipulation tasks.
arXiv Detail & Related papers (2025-04-08T14:25:25Z) - Moto: Latent Motion Token as the Bridging Language for Learning Robot Manipulation from Videos [101.26467307473638]
We introduce Moto, which converts video content into latent Motion Token sequences by a Latent Motion Tokenizer.<n>We pre-train Moto-GPT through motion token autoregression, enabling it to capture diverse visual motion knowledge.<n>To transfer learned motion priors to real robot actions, we implement a co-fine-tuning strategy that seamlessly bridges latent motion token prediction and real robot control.
arXiv Detail & Related papers (2024-12-05T18:57:04Z) - Any-point Trajectory Modeling for Policy Learning [64.23861308947852]
We introduce Any-point Trajectory Modeling (ATM) to predict future trajectories of arbitrary points within a video frame.
ATM outperforms strong video pre-training baselines by 80% on average.
We show effective transfer learning of manipulation skills from human videos and videos from a different robot morphology.
arXiv Detail & Related papers (2023-12-28T23:34:43Z) - Learning Reward Functions for Robotic Manipulation by Observing Humans [92.30657414416527]
We use unlabeled videos of humans solving a wide range of manipulation tasks to learn a task-agnostic reward function for robotic manipulation policies.
The learned rewards are based on distances to a goal in an embedding space learned using a time-contrastive objective.
arXiv Detail & Related papers (2022-11-16T16:26:48Z) - Model Predictive Control for Fluid Human-to-Robot Handovers [50.72520769938633]
Planning motions that take human comfort into account is not a part of the human-robot handover process.
We propose to generate smooth motions via an efficient model-predictive control framework.
We conduct human-to-robot handover experiments on a diverse set of objects with several users.
arXiv Detail & Related papers (2022-03-31T23:08:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.