MimicFunc: Imitating Tool Manipulation from a Single Human Video via Functional Correspondence
- URL: http://arxiv.org/abs/2508.13534v1
- Date: Tue, 19 Aug 2025 05:49:47 GMT
- Title: MimicFunc: Imitating Tool Manipulation from a Single Human Video via Functional Correspondence
- Authors: Chao Tang, Anxing Xiao, Yuhong Deng, Tianrun Hu, Wenlong Dong, Hanbo Zhang, David Hsu, Hong Zhang
- Abstract summary: Imitating tool manipulation from human videos offers an intuitive approach to teaching robots. We propose MimicFunc, a framework that establishes functional correspondences via a function frame. MimicFunc effectively enables the robot to generalize the skill from a single RGB-D human video to manipulating novel tools.
- Score: 18.953496415412335
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Imitating tool manipulation from human videos offers an intuitive approach to teaching robots, while also providing a promising and scalable alternative to labor-intensive teleoperation data collection for visuomotor policy learning. While humans can mimic tool manipulation behavior by observing others perform a task just once and effortlessly transfer the skill to diverse tools for functionally equivalent tasks, current robots struggle to achieve this level of generalization. A key challenge lies in establishing function-level correspondences, considering the significant geometric variations among functionally similar tools, referred to as intra-function variations. To address this challenge, we propose MimicFunc, a framework that establishes functional correspondences with a function frame, a function-centric local coordinate frame constructed with keypoint-based abstraction, for imitating tool manipulation skills. Experiments demonstrate that MimicFunc effectively enables the robot to generalize the skill from a single RGB-D human video to manipulating novel tools for functionally equivalent tasks. Furthermore, leveraging MimicFunc's one-shot generalization capability, the generated rollouts can be used to train visuomotor policies without requiring labor-intensive teleoperation data collection for novel objects. Our code and video are available at https://sites.google.com/view/mimicfunc.
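To make the function-frame idea in the abstract concrete, below is a minimal sketch, not the authors' implementation, of how a function-centric local frame could be constructed from a few functional keypoints and used to transfer a demonstrated end-effector pose onto a novel tool. The keypoint names (tip, head, center), the axis conventions, and the helper functions are illustrative assumptions.

import numpy as np

def function_frame(tip, head, center):
    # Build a function-centric local frame from three hypothetical functional
    # keypoints (e.g., a hammer's striking tip, its head, and the handle center).
    # Axis conventions here are assumptions made for illustration only.
    x = head - tip
    x = x / np.linalg.norm(x)               # primary axis along the functional part
    up = center - tip
    z = np.cross(x, up)
    z = z / np.linalg.norm(z)               # normal axis
    y = np.cross(z, x)                      # completes a right-handed frame
    T = np.eye(4)
    T[:3, :3] = np.column_stack([x, y, z])  # rotation of the function frame
    T[:3, 3] = tip                          # origin anchored at the functional keypoint
    return T

def transfer_ee_pose(T_frame_demo, T_frame_novel, T_ee_demo):
    # Express the demonstrated end-effector pose relative to the demo tool's
    # function frame, then re-anchor it in the novel tool's function frame,
    # so the same functional motion survives intra-function geometric variation.
    T_rel = np.linalg.inv(T_frame_demo) @ T_ee_demo
    return T_frame_novel @ T_rel

In this reading, the human video supplies the demonstrated end-effector waypoints and the demo tool's keypoints, while keypoints detected on the novel tool define the target frame; applying transfer_ee_pose to each waypoint yields a candidate rollout for the new tool. Again, this is only a sketch of the function-frame concept as described in the abstract.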
Related papers
- FUNCanon: Learning Pose-Aware Action Primitives via Functional Object Canonicalization for Generalizable Robotic Manipulation [25.631729484747087]
We introduce FunCanon, a framework that converts long-horizon manipulation tasks into sequences of action chunks. These chunks focus policy learning on the actions themselves, rather than isolated tasks. Experiments on simulated and real-world benchmarks demonstrate category-level generalization, cross-task behavior reuse, and robust sim2real deployment.
arXiv Detail & Related papers (2025-09-23T14:49:05Z) - Chain-of-Modality: Learning Manipulation Programs from Multimodal Human Videos with Vision-Language-Models [49.4824734958566]
Chain-of-Modality (CoM) enables Vision Language Models to reason about multimodal human demonstration data. CoM refines a task plan and generates detailed control parameters, enabling robots to perform manipulation tasks based on a single multimodal human video prompt.
arXiv Detail & Related papers (2025-04-17T21:31:23Z) - FUNCTO: Function-Centric One-Shot Imitation Learning for Tool Manipulation [18.953496415412335]
FUNCTO is an OSIL method that establishes function-centric correspondences with a 3D functional keypoint representation. We evaluate FUNCTO against existing modular OSIL methods and end-to-end behavioral cloning methods.
arXiv Detail & Related papers (2025-02-17T12:34:42Z) - Learning Granularity-Aware Affordances from Human-Object Interaction for Tool-Based Functional Dexterous Grasping [27.124273762587848]
Affordance features of objects serve as a bridge in the functional interaction between agents and objects. We propose a granularity-aware affordance feature extraction method for locating functional affordance areas. We use highly activated coarse-grained affordance features in hand-object interaction regions to predict grasp gestures. This forms GAAF-Dex, a complete framework that learns Granularity-Aware Affordances from human-object interaction.
arXiv Detail & Related papers (2024-06-30T07:42:57Z) - Learning Reusable Manipulation Strategies [86.07442931141634]
Humans demonstrate an impressive ability to acquire and generalize manipulation "tricks".
We present a framework that enables machines to acquire such manipulation skills through a single demonstration and self-play.
These learned mechanisms and samplers can be seamlessly integrated into standard task and motion planners.
arXiv Detail & Related papers (2023-11-06T17:35:42Z) - Dexterous Manipulation from Images: Autonomous Real-World RL via Substep Guidance [71.36749876465618]
We describe a system for vision-based dexterous manipulation that provides a "programming-free" approach for users to define new tasks.
Our system includes a framework for users to define a final task and intermediate sub-tasks with image examples.
We present experimental results with a four-finger robotic hand learning multi-stage object manipulation tasks directly in the real world.
arXiv Detail & Related papers (2022-12-19T22:50:40Z) - Learning Reward Functions for Robotic Manipulation by Observing Humans [92.30657414416527]
We use unlabeled videos of humans solving a wide range of manipulation tasks to learn a task-agnostic reward function for robotic manipulation policies.
The learned rewards are based on distances to a goal in an embedding space learned using a time-contrastive objective.
arXiv Detail & Related papers (2022-11-16T16:26:48Z) - Learning Tool Morphology for Contact-Rich Manipulation Tasks with Differentiable Simulation [27.462052737553055]
We present an end-to-end framework to automatically learn tool morphology for contact-rich manipulation tasks by leveraging differentiable physics simulators.
In our approach, we only need to define an objective with respect to task performance, and we learn a robust morphology by randomizing the task variations.
We demonstrate the effectiveness of our method for designing new tools in several scenarios such as winding ropes, flipping a box and pushing peas onto a scoop in simulation.
arXiv Detail & Related papers (2022-11-04T00:57:36Z) - How to select and use tools?: Active Perception of Target Objects Using Multimodal Deep Learning [9.677391628613025]
We focus on active perception using multimodal sensorimotor data while a robot interacts with objects.
We construct a deep neural network (DNN) model that learns to recognize object characteristics.
We also examine the contributions of images, force, and tactile data and show that learning a variety of multimodal information results in rich perception for tool use.
arXiv Detail & Related papers (2021-06-04T12:49:30Z) - Model-Based Visual Planning with Self-Supervised Functional Distances [104.83979811803466]
We present a self-supervised method for model-based visual goal reaching.
Our approach learns entirely using offline, unlabeled data.
We find that this approach substantially outperforms both model-free and model-based prior methods.
arXiv Detail & Related papers (2020-12-30T23:59:09Z) - Learning Predictive Models From Observation and Interaction [137.77887825854768]
Learning predictive models from interaction with the world allows an agent, such as a robot, to learn about how the world works.
However, learning a model that captures the dynamics of complex skills represents a major challenge.
We propose a method to augment the training set with observational data of other agents, such as humans.
arXiv Detail & Related papers (2019-12-30T01:10:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.