MimicFunc: Imitating Tool Manipulation from a Single Human Video via Functional Correspondence
- URL: http://arxiv.org/abs/2508.13534v1
- Date: Tue, 19 Aug 2025 05:49:47 GMT
- Title: MimicFunc: Imitating Tool Manipulation from a Single Human Video via Functional Correspondence
- Authors: Chao Tang, Anxing Xiao, Yuhong Deng, Tianrun Hu, Wenlong Dong, Hanbo Zhang, David Hsu, Hong Zhang
- Abstract summary: Imitating tool manipulation from human videos offers an intuitive approach to teaching robots. We propose MimicFunc, a framework that establishes functional correspondences via a function frame. MimicFunc effectively enables the robot to generalize the skill from a single RGB-D human video to manipulating novel tools.
- Score: 18.953496415412335
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Imitating tool manipulation from human videos offers an intuitive approach to teaching robots, while also providing a promising and scalable alternative to labor-intensive teleoperation data collection for visuomotor policy learning. While humans can mimic tool manipulation behavior by observing others perform a task just once and effortlessly transfer the skill to diverse tools for functionally equivalent tasks, current robots struggle to achieve this level of generalization. A key challenge lies in establishing function-level correspondences, considering the significant geometric variations among functionally similar tools, referred to as intra-function variations. To address this challenge, we propose MimicFunc, a framework that establishes functional correspondences with a function frame, a function-centric local coordinate frame constructed with keypoint-based abstraction, for imitating tool manipulation skills. Experiments demonstrate that MimicFunc effectively enables the robot to generalize the skill from a single RGB-D human video to manipulating novel tools for functionally equivalent tasks. Furthermore, leveraging MimicFunc's one-shot generalization capability, the generated rollouts can be used to train visuomotor policies without requiring labor-intensive teleoperation data collection for novel objects. Our code and video are available at https://sites.google.com/view/mimicfunc.
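To make the function-frame idea in the abstract concrete, below is a minimal sketch, not the authors' implementation, of how a function-centric local frame could be constructed from a few functional keypoints and used to transfer a demonstrated end-effector pose onto a novel tool. The keypoint names (tip, head, center), the axis conventions, and the helper functions are illustrative assumptions.

import numpy as np

def function_frame(tip, head, center):
    # Build a function-centric local frame from three hypothetical functional
    # keypoints (e.g., a hammer's striking tip, its head, and the handle center).
    # Axis conventions here are assumptions made for illustration only.
    x = head - tip
    x = x / np.linalg.norm(x)               # primary axis along the functional part
    up = center - tip
    z = np.cross(x, up)
    z = z / np.linalg.norm(z)               # normal axis
    y = np.cross(z, x)                      # completes a right-handed frame
    T = np.eye(4)
    T[:3, :3] = np.column_stack([x, y, z])  # rotation of the function frame
    T[:3, 3] = tip                          # origin anchored at the functional keypoint
    return T

def transfer_ee_pose(T_frame_demo, T_frame_novel, T_ee_demo):
    # Express the demonstrated end-effector pose relative to the demo tool's
    # function frame, then re-anchor it in the novel tool's function frame,
    # so the same functional motion survives intra-function geometric variation.
    T_rel = np.linalg.inv(T_frame_demo) @ T_ee_demo
    return T_frame_novel @ T_rel

In this reading, the human video supplies the demonstrated end-effector waypoints and the demo tool's keypoints, while keypoints detected on the novel tool define the target frame; applying transfer_ee_pose to each waypoint yields a candidate rollout for the new tool. Again, this is only a sketch of the function-frame concept as described in the abstract.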
Related papers
- FUNCanon: Learning Pose-Aware Action Primitives via Functional Object Canonicalization for Generalizable Robotic Manipulation [25.631729484747087]
We introduce FunCanon, a framework that converts long-horizon manipulation tasks into sequences of action chunks. These chunks focus policy learning on the actions themselves, rather than isolated tasks. Experiments on simulated and real-world benchmarks demonstrate category-level generalization, cross-task behavior reuse, and robust sim2real deployment.
arXiv Detail & Related papers (2025-09-23T14:49:05Z) - Chain-of-Modality: Learning Manipulation Programs from Multimodal Human Videos with Vision-Language-Models [49.4824734958566]
Chain-of-Modality (CoM) enables Vision Language Models to reason about multimodal human demonstration data. CoM refines a task plan and generates detailed control parameters, enabling robots to perform manipulation tasks based on a single multimodal human video prompt.
arXiv Detail & Related papers (2025-04-17T21:31:23Z) - FUNCTO: Function-Centric One-Shot Imitation Learning for Tool Manipulation [18.953496415412335]
FUNCTO is an OSIL method that establishes function-centric correspondences with a 3D functional keypoint representation. We evaluate FUNCTO against existing modular OSIL methods and end-to-end behavioral cloning methods.
arXiv Detail & Related papers (2025-02-17T12:34:42Z) - Learning Granularity-Aware Affordances from Human-Object Interaction for Tool-Based Functional Dexterous Grasping [27.124273762587848]
Affordance features of objects serve as a bridge in the functional interaction between agents and objects. We propose a granularity-aware affordance feature extraction method for locating functional affordance areas. We use highly activated coarse-grained affordance features in hand-object interaction regions to predict grasp gestures. This forms GAAF-Dex, a complete framework that learns Granularity-Aware Affordances from human-object interaction.
arXiv Detail & Related papers (2024-06-30T07:42:57Z) - Learning Reusable Manipulation Strategies [86.07442931141634]
Humans demonstrate an impressive ability to acquire and generalize manipulation "tricks".
We present a framework that enables machines to acquire such manipulation skills through a single demonstration and self-play.
These learned mechanisms and samplers can be seamlessly integrated into standard task and motion planners.
arXiv Detail & Related papers (2023-11-06T17:35:42Z) - Dexterous Manipulation from Images: Autonomous Real-World RL via Substep Guidance [71.36749876465618]
We describe a system for vision-based dexterous manipulation that provides a "programming-free" approach for users to define new tasks.
Our system includes a framework for users to define a final task and intermediate sub-tasks with image examples.
We present experimental results with a four-finger robotic hand learning multi-stage object manipulation tasks directly in the real world.
arXiv Detail & Related papers (2022-12-19T22:50:40Z) - Learning Reward Functions for Robotic Manipulation by Observing Humans [92.30657414416527]
We use unlabeled videos of humans solving a wide range of manipulation tasks to learn a task-agnostic reward function for robotic manipulation policies.
The learned rewards are based on distances to a goal in an embedding space learned using a time-contrastive objective.
arXiv Detail & Related papers (2022-11-16T16:26:48Z) - Learning Tool Morphology for Contact-Rich Manipulation Tasks with Differentiable Simulation [27.462052737553055]
We present an end-to-end framework to automatically learn tool morphology for contact-rich manipulation tasks by leveraging differentiable physics simulators.
In our approach, we only need to define an objective with respect to task performance, and we learn a robust morphology by randomizing the task variations.
We demonstrate the effectiveness of our method for designing new tools in several scenarios such as winding ropes, flipping a box and pushing peas onto a scoop in simulation.
arXiv Detail & Related papers (2022-11-04T00:57:36Z) - How to select and use tools?: Active Perception of Target Objects Using Multimodal Deep Learning [9.677391628613025]
We focus on active perception using multimodal sensorimotor data while a robot interacts with objects.
We construct a deep neural network (DNN) model that learns to recognize object characteristics.
We also examine the contributions of images, force, and tactile data and show that learning a variety of multimodal information results in rich perception for tool use.
arXiv Detail & Related papers (2021-06-04T12:49:30Z) - Model-Based Visual Planning with Self-Supervised Functional Distances [104.83979811803466]
We present a self-supervised method for model-based visual goal reaching.
Our approach learns entirely using offline, unlabeled data.
We find that this approach substantially outperforms both model-free and model-based prior methods.
arXiv Detail & Related papers (2020-12-30T23:59:09Z) - Learning Predictive Models From Observation and Interaction [137.77887825854768]
Learning predictive models from interaction with the world allows an agent, such as a robot, to learn about how the world works.
However, learning a model that captures the dynamics of complex skills represents a major challenge.
We propose a method to augment the training set with observational data of other agents, such as humans.
arXiv Detail & Related papers (2019-12-30T01:10:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.