Related papers: Learning from Watching: Scalable Extraction of Manipulation Trajectories from Human Videos

URL: http://arxiv.org/abs/2512.00024v1
Date: Mon, 03 Nov 2025 02:47:38 GMT
Title: Learning from Watching: Scalable Extraction of Manipulation Trajectories from Human Videos
Authors: X. Hu, G. Ye,
Abstract summary: We propose a novel approach that combines large foundation models for video understanding with point tracking techniques to extract dense trajectories of all task-relevant keypoints during manipulation. Experimental results demonstrate that our method can accurately track keypoints throughout the entire manipulation process, paving the way for more scalable and data-efficient robot learning.
Score: 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Collecting high-quality data for training large-scale robotic models typically relies on real robot platforms, which is labor-intensive and costly, whether via teleoperation or scripted demonstrations. To scale data collection, many researchers have turned to leveraging human manipulation videos available online. However, current methods predominantly focus on hand detection or object pose estimation, failing to fully exploit the rich interaction cues embedded in these videos. In this work, we propose a novel approach that combines large foundation models for video understanding with point tracking techniques to extract dense trajectories of all task-relevant keypoints during manipulation. This enables more comprehensive utilization of Internet-scale human demonstration videos. Experimental results demonstrate that our method can accurately track keypoints throughout the entire manipulation process, paving the way for more scalable and data-efficient robot learning.
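As a rough illustration of what "dense trajectories of keypoints" could look like as data, the sketch below stores one 2D track per keypoint and keeps only the keypoints that actually move. This is not the authors' method: the abstract says task-relevant keypoints are chosen with a foundation model, and the motion threshold used here is only a stand-in. The names `path_length`, `select_moving_keypoints`, `min_length`, and the example keypoint ids are all hypothetical.

```python
from math import hypot


def path_length(track):
    """Total 2D distance travelled by one keypoint across consecutive frames."""
    return sum(
        hypot(x1 - x0, y1 - y0)
        for (x0, y0), (x1, y1) in zip(track, track[1:])
    )


def select_moving_keypoints(tracks, min_length=5.0):
    """Keep keypoints whose trajectory is at least min_length pixels long.

    tracks: dict mapping a keypoint id to a list of (x, y), one per frame.
    Assumption: motion is used as a crude proxy for task relevance.
    """
    return {k: t for k, t in tracks.items() if path_length(t) >= min_length}


# Hypothetical output of a point tracker over three frames.
tracks = {
    "cup_handle": [(10.0, 10.0), (13.0, 14.0), (16.0, 18.0)],    # moves 5 px per frame
    "table_corner": [(50.0, 50.0), (50.0, 50.0), (50.0, 50.0)],  # static
}
moving = select_moving_keypoints(tracks)
print(sorted(moving))  # ['cup_handle']
```

A real pipeline would also need per-frame visibility flags for occluded points, which this sketch leaves out.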

Related papers

VidBot: Learning Generalizable 3D Actions from In-the-Wild 2D Human Videos for Zero-Shot Robotic Manipulation [53.63540587160549]
VidBot is a framework enabling zero-shot robotic manipulation using learned 3D affordance from in-the-wild monocular RGB-only human videos. VidBot paves the way for leveraging everyday human videos to make robot learning more scalable.
arXiv Detail & Related papers (2025-03-10T10:04:58Z)
Track2Act: Predicting Point Tracks from Internet Videos enables Generalizable Robot Manipulation [65.46610405509338]
We seek to learn a generalizable goal-conditioned policy that enables zero-shot robot manipulation. Our framework, Track2Act, predicts tracks of how points in an image should move in future time-steps based on a goal. We show that this approach of combining scalably learned track prediction with a residual policy enables diverse generalizable robot manipulation.
arXiv Detail & Related papers (2024-05-02T17:56:55Z)
Learning by Watching: A Review of Video-based Learning Approaches for Robot Manipulation [0.0]
Recent works have explored learning manipulation skills by passively watching abundant videos sourced online. This survey reviews foundations such as video feature representation learning techniques, object affordance understanding, 3D hand/body modeling, and large-scale robot resources. We discuss how learning only from observing large-scale human videos can enhance generalization and sample efficiency for robotic manipulation.
arXiv Detail & Related papers (2024-02-11T08:41:42Z)
Any-point Trajectory Modeling for Policy Learning [64.23861308947852]
We introduce Any-point Trajectory Modeling (ATM) to predict future trajectories of arbitrary points within a video frame. ATM outperforms strong video pre-training baselines by 80% on average. We show effective transfer learning of manipulation skills from human videos and videos from a different robot morphology.
arXiv Detail & Related papers (2023-12-28T23:34:43Z)
Scaling Robot Learning with Semantically Imagined Experience [21.361979238427722]
Recent advances in robot learning have shown promise in enabling robots to perform manipulation tasks. One of the key contributing factors to this progress is the scale of robot data used to train the models. We propose an alternative route and leverage text-to-image foundation models widely used in computer vision and natural language processing.
arXiv Detail & Related papers (2023-02-22T18:47:51Z)
Learning Reward Functions for Robotic Manipulation by Observing Humans [92.30657414416527]
We use unlabeled videos of humans solving a wide range of manipulation tasks to learn a task-agnostic reward function for robotic manipulation policies. The learned rewards are based on distances to a goal in an embedding space learned using a time-contrastive objective.
arXiv Detail & Related papers (2022-11-16T16:26:48Z)
Learning by Watching: Physical Imitation of Manipulation Skills from Human Videos [28.712673809577076]
We present an approach for physical imitation from human videos for robot manipulation tasks. We design a perception module that learns to translate human videos to the robot domain followed by unsupervised keypoint detection. We evaluate the effectiveness of our approach on five robot manipulation tasks, including reaching, pushing, sliding, coffee making, and drawer closing.
arXiv Detail & Related papers (2021-01-18T18:50:32Z)
Visual Imitation Made Easy [102.36509665008732]
We present an alternate interface for imitation that simplifies the data collection process while allowing for easy transfer to robots. We use commercially available reacher-grabber assistive tools both as a data collection device and as the robot's end-effector. We experimentally evaluate on two challenging tasks: non-prehensile pushing and prehensile stacking, with 1000 diverse demonstrations for each task.
arXiv Detail & Related papers (2020-08-11T17:58:50Z)
Learning Predictive Models From Observation and Interaction [137.77887825854768]
Learning predictive models from interaction with the world allows an agent, such as a robot, to learn about how the world works. However, learning a model that captures the dynamics of complex skills represents a major challenge. We propose a method to augment the training set with observational data of other agents, such as humans.
arXiv Detail & Related papers (2019-12-30T01:10:41Z)
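The entry "Learning Reward Functions for Robotic Manipulation by Observing Humans" in the list above describes rewards based on distances to a goal in a learned embedding space. A minimal sketch of that idea follows, assuming the embeddings are already computed and that Euclidean distance is used; neither the encoder nor the exact distance is specified in the summary, so `goal_distance_reward` and the example vectors are hypothetical.

```python
from math import sqrt


def goal_distance_reward(obs_embedding, goal_embedding):
    """Negative Euclidean distance between observation and goal embeddings.

    Assumption: both vectors come from the same (hypothetical) learned encoder.
    The reward is 0 at the goal and becomes more negative farther away.
    """
    if len(obs_embedding) != len(goal_embedding):
        raise ValueError("embeddings must have the same dimension")
    return -sqrt(sum((o - g) ** 2 for o, g in zip(obs_embedding, goal_embedding)))


goal = [0.0, 0.0]
far = goal_distance_reward([3.0, 4.0], goal)   # 5 units from the goal
near = goal_distance_reward([0.0, 1.0], goal)  # 1 unit from the goal
print(far, near)  # -5.0 -1.0
```

Observations closer to the goal in embedding space receive a higher reward, which is what lets a policy be trained without hand-written task rewards.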

This list is automatically generated from the titles and abstracts of the papers on this site.