Chain-of-Modality: Learning Manipulation Programs from Multimodal Human Videos with Vision-Language-Models
- URL: http://arxiv.org/abs/2504.13351v1
- Date: Thu, 17 Apr 2025 21:31:23 GMT
- Title: Chain-of-Modality: Learning Manipulation Programs from Multimodal Human Videos with Vision-Language-Models
- Authors: Chen Wang, Fei Xia, Wenhao Yu, Tingnan Zhang, Ruohan Zhang, C. Karen Liu, Li Fei-Fei, Jie Tan, Jacky Liang
- Abstract summary: Chain-of-Modality (CoM) enables Vision Language Models to reason about multimodal human demonstration data. CoM refines a task plan and generates detailed control parameters, enabling robots to perform manipulation tasks based on a single multimodal human video prompt.
- Score: 49.4824734958566
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning to perform manipulation tasks from human videos is a promising approach for teaching robots. However, many manipulation tasks require changing control parameters during task execution, such as force, which visual data alone cannot capture. In this work, we leverage sensing devices such as armbands that measure human muscle activities and microphones that record sound, to capture the details in the human manipulation process, and enable robots to extract task plans and control parameters to perform the same task. To achieve this, we introduce Chain-of-Modality (CoM), a prompting strategy that enables Vision Language Models to reason about multimodal human demonstration data -- videos coupled with muscle or audio signals. By progressively integrating information from each modality, CoM refines a task plan and generates detailed control parameters, enabling robots to perform manipulation tasks based on a single multimodal human video prompt. Our experiments show that CoM delivers a threefold improvement in accuracy for extracting task plans and control parameters compared to baselines, with strong generalization to new task setups and objects in real-world robot experiments. Videos and code are available at https://chain-of-modality.github.io
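The progressive refinement described in the abstract can be pictured as a short loop of VLM queries, one per modality. The following is a minimal sketch only, assuming a generic query_vlm helper and pre-extracted text summaries of the muscle and audio signals; the function names and prompt wording are illustrative placeholders, not the released CoM implementation (see the project page above for the actual code).

```python
# Minimal sketch of chain-of-modality style prompting (illustrative only).
# `query_vlm`, the prompt wording, and the text summaries of muscle/audio
# signals are assumptions, not the authors' released code.

from typing import List, Optional


def query_vlm(prompt: str, images: Optional[List[bytes]] = None) -> str:
    """Placeholder for a call to a vision-language model API of your choice."""
    raise NotImplementedError("Wire this up to an actual VLM client.")


def chain_of_modality(frames: List[bytes], muscle_summary: str, audio_summary: str) -> str:
    """Refine a task plan by folding in one modality at a time."""
    # Step 1: draft a step-by-step task plan from the video frames alone.
    plan = query_vlm(
        "Describe the manipulation shown in these frames as a numbered task plan.",
        images=frames,
    )
    # Step 2: refine the plan with muscle-activity cues (e.g., changes in grip force).
    plan = query_vlm(
        f"Current plan:\n{plan}\n\nMuscle-signal summary:\n{muscle_summary}\n"
        "Revise the plan and annotate each step with an estimated force level.",
        images=frames,
    )
    # Step 3: refine again with audio cues (e.g., contact or pouring sounds) and
    # emit the final plan together with per-step control parameters.
    plan = query_vlm(
        f"Current plan:\n{plan}\n\nAudio summary:\n{audio_summary}\n"
        "Finalize the plan and output control parameters for each step.",
        images=frames,
    )
    return plan
```

The point the sketch conveys is that each modality updates the same running plan in sequence, rather than all signals being fused into a single prompt.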
Related papers
- ZeroMimic: Distilling Robotic Manipulation Skills from Web Videos [15.809468471562537]
ZeroMimic generates image goal-conditioned skill policies for several common manipulation tasks.
We evaluate ZeroMimic's out-of-the-box performance in varied real-world and simulated kitchen settings.
To enable plug-and-play reuse of ZeroMimic policies on other task setups and robots, we release software and policy checkpoints.
arXiv Detail & Related papers (2025-03-31T09:27:00Z) - Whole-Body Teleoperation for Mobile Manipulation at Zero Added Cost [8.71539730969424]
MoMa-Teleop is a novel teleoperation method that infers end-effector motions from existing interfaces.
We demonstrate that our approach results in a significant reduction in task completion time across a variety of robots and tasks.
arXiv Detail & Related papers (2024-09-23T15:09:45Z) - LLaRA: Supercharging Robot Learning Data for Vision-Language Policy [56.505551117094534]
We introduce LLaRA: Large Language and Robotics Assistant, a framework that formulates robot action policy as visuo-textual conversations.
First, we present an automated pipeline to generate conversation-style instruction tuning data for robots from existing behavior cloning datasets (a hypothetical sketch of such a conversion appears after this list).
We show that a VLM finetuned with a limited amount of such data can produce meaningful action decisions for robotic control.
arXiv Detail & Related papers (2024-06-28T17:59:12Z) - Track2Act: Predicting Point Tracks from Internet Videos enables Generalizable Robot Manipulation [65.46610405509338]
We seek to learn a generalizable goal-conditioned policy that enables zero-shot robot manipulation.
Our framework, Track2Act, predicts tracks of how points in an image should move in future time-steps based on a goal.
We show that this approach of combining scalably learned track prediction with a residual policy enables diverse generalizable robot manipulation.
arXiv Detail & Related papers (2024-05-02T17:56:55Z) - Towards Generalizable Zero-Shot Manipulation via Translating Human Interaction Plans [58.27029676638521]
We show how passive human videos can serve as a rich source of data for learning such generalist robots.
We learn a human plan predictor that, given a current image of a scene and a goal image, predicts the future hand and object configurations.
We show that our learned system can perform over 16 manipulation skills that generalize to 40 objects.
arXiv Detail & Related papers (2023-12-01T18:54:12Z) - Style-transfer based Speech and Audio-visual Scene Understanding for Robot Action Sequence Acquisition from Videos [40.012813353904875]
This paper introduces a method for robot action sequence generation from instruction videos.
We built a system that accomplishes various cooking actions, where an arm robot executes a sequence acquired from a cooking video.
arXiv Detail & Related papers (2023-06-27T17:37:53Z) - Learning Video-Conditioned Policies for Unseen Manipulation Tasks [83.2240629060453]
Video-conditioned Policy learning maps human demonstrations of previously unseen tasks to robot manipulation skills.
We learn our policy to generate appropriate actions given current scene observations and a video of the target task.
We validate our approach on a set of challenging multi-task robot manipulation environments and outperform the state of the art.
arXiv Detail & Related papers (2023-05-10T16:25:42Z) - PLEX: Making the Most of the Available Data for Robotic Manipulation Pretraining [28.504762473732296]
We propose a transformer-based architecture that learns from a small amount of task-agnostic visuomotor trajectories.
In particular, using relative positional encoding in PLEX's transformers greatly helps in low-data regimes of learning from human-collected demonstrations.
arXiv Detail & Related papers (2023-03-15T17:31:37Z) - Visual Imitation Made Easy [102.36509665008732]
We present an alternate interface for imitation that simplifies the data collection process while allowing for easy transfer to robots.
We use commercially available reacher-grabber assistive tools both as a data collection device and as the robot's end-effector.
We experimentally evaluate on two challenging tasks: non-prehensile pushing and prehensile stacking, with 1000 diverse demonstrations for each task.
arXiv Detail & Related papers (2020-08-11T17:58:50Z)
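For the LLaRA entry above, the conversion from behavior-cloning records to conversation-style instruction tuning data can be pictured roughly as follows. This is a hypothetical sketch: the field names, message format, and action encoding are assumptions, not LLaRA's actual pipeline or data schema.

```python
# Hypothetical sketch of turning a behavior-cloning record into a
# conversation-style instruction-tuning example (not LLaRA's actual schema).

from dataclasses import dataclass
from typing import Dict, List


@dataclass
class BCRecord:
    image_path: str        # observation frame from the demonstration
    instruction: str       # natural-language task description
    action: List[float]    # robot action, e.g. an end-effector delta pose


def to_conversation(record: BCRecord) -> List[Dict[str, str]]:
    """Format one demonstration step as a user/assistant exchange."""
    action_text = ", ".join(f"{a:.3f}" for a in record.action)
    return [
        {
            "role": "user",
            "content": f"<image:{record.image_path}> {record.instruction} "
                       "What action should the robot take next?",
        },
        {
            "role": "assistant",
            "content": f"Action: [{action_text}]",
        },
    ]


if __name__ == "__main__":
    step = BCRecord("frame_0042.png", "Pick up the red block.", [0.02, -0.01, 0.05])
    print(to_conversation(step))
```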