Cross-Modal Instructions for Robot Motion Generation
- URL: http://arxiv.org/abs/2509.21107v1
- Date: Thu, 25 Sep 2025 12:54:00 GMT
- Title: Cross-Modal Instructions for Robot Motion Generation
- Authors: William Barron, Xiaoxiang Dong, Matthew Johnson-Roberson, Weiming Zhi
- Abstract summary: We introduce Learning from Cross-Modal Instructions, where robots are shaped by demonstrations in the form of rough annotations. We introduce the CrossInstruct framework, which integrates cross-modal instructions as examples into the context input to a vision-language model. The VLM then iteratively queries a smaller, fine-tuned model and synthesizes the desired motion over multiple 2D views. By incorporating the reasoning of the large VLM with a fine-grained pointing model, CrossInstruct produces executable robot behaviors that generalize beyond the environments in the limited set of instruction examples.
- Score: 7.445072780282545
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Teaching robots novel behaviors typically requires motion demonstrations via teleoperation or kinaesthetic teaching, that is, physically guiding the robot. While recent work has explored using human sketches to specify desired behaviors, data collection remains cumbersome, and demonstration datasets are difficult to scale. In this paper, we introduce an alternative paradigm, Learning from Cross-Modal Instructions, where robots are shaped by demonstrations in the form of rough annotations, which can contain free-form text labels, and are used in lieu of physical motion. We introduce the CrossInstruct framework, which integrates cross-modal instructions as examples into the context input to a foundational vision-language model (VLM). The VLM then iteratively queries a smaller, fine-tuned model and synthesizes the desired motion over multiple 2D views. These are subsequently fused into a coherent distribution over 3D motion trajectories in the robot's workspace. By incorporating the reasoning of the large VLM with a fine-grained pointing model, CrossInstruct produces executable robot behaviors that generalize beyond the environments in the limited set of instruction examples. We then introduce a downstream reinforcement learning pipeline that leverages CrossInstruct outputs to efficiently learn policies to complete fine-grained tasks. We rigorously evaluate CrossInstruct on benchmark simulation tasks and real hardware, demonstrating effectiveness without additional fine-tuning and providing a strong initialization for policies subsequently refined via reinforcement learning.
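To make the query-and-fuse pipeline above concrete, the sketch below shows one way the pieces could fit together: a per-view 2D pointing query is repeated for each waypoint, and the per-view estimates are fused into a distribution over 3D waypoints by linear triangulation. The function names, the sampling-based fusion, and the pinhole-camera setup are assumptions for illustration only, not the authors' implementation.

```python
"""Minimal sketch of a multi-view query-and-fuse loop (illustrative only).

Assumptions not taken from the paper: `query_pointing_model` stands in for
the fine-tuned 2D pointing model the VLM queries per view, and fusion is a
simple DLT triangulation of repeated per-view samples into a mean/covariance
over 3D waypoints.
"""
import numpy as np


def query_pointing_model(view_id: int, step: int, rng) -> np.ndarray:
    # Hypothetical stand-in: returns a noisy 2D pixel estimate of the
    # waypoint for this view and trajectory step (offset mimics parallax).
    return np.array([320.0 + 10 * step - 25 * view_id,
                     240.0 + 5 * step]) + rng.normal(scale=2.0, size=2)


def triangulate(points_2d, projections) -> np.ndarray:
    # Linear (DLT) triangulation of one 3D point from several 2D observations.
    rows = []
    for (u, v), P in zip(points_2d, projections):
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    _, _, vt = np.linalg.svd(np.asarray(rows))
    X = vt[-1]
    return X[:3] / X[3]


def fuse_views(projections, n_steps=5, n_samples=20, seed=0):
    # For every trajectory step, repeat the 2D query per view, triangulate
    # each sample, and summarize the samples as a 3D mean and covariance.
    rng = np.random.default_rng(seed)
    means, covs = [], []
    for step in range(n_steps):
        pts_3d = []
        for _ in range(n_samples):
            pts_2d = [query_pointing_model(i, step, rng)
                      for i in range(len(projections))]
            pts_3d.append(triangulate(pts_2d, projections))
        pts_3d = np.stack(pts_3d)
        means.append(pts_3d.mean(axis=0))
        covs.append(np.cov(pts_3d.T))
    return np.stack(means), np.stack(covs)


if __name__ == "__main__":
    # Two toy pinhole cameras observing the workspace with a small baseline.
    K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = K @ np.hstack([np.eye(3), np.array([[-0.2], [0.0], [0.0]])])
    mu, sigma = fuse_views([P1, P2])
    print(mu.shape, sigma.shape)  # (5, 3) waypoint means, (5, 3, 3) covariances
```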
Related papers
- AntiGrounding: Lifting Robotic Actions into VLM Representation Space for Decision Making [35.83717913117858]
AntiGrounding is a new framework that reverses the instruction grounding process.
It lifts candidate actions directly into the VLM representation space.
It renders trajectories from multiple views and uses structured visual question answering for instruction-based decision making.
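As a rough illustration of that idea, the sketch below scores candidate actions by rendering their trajectories into each view and asking a VLM a structured yes/no question; the renderer and VLM scoring function are hypothetical placeholders, not AntiGrounding's actual interface.

```python
from typing import Callable, List, Sequence


def select_action(candidates: Sequence[object],
                  render_views: Callable[[object], List[bytes]],
                  vlm_yes_probability: Callable[[bytes, str], float],
                  instruction: str) -> object:
    """Return the candidate whose rendered trajectory the VLM rates highest."""
    question = f"Does the drawn trajectory accomplish: {instruction}?"
    best, best_score = None, float("-inf")
    for cand in candidates:
        # Average the VLM's yes-probability over all rendered viewpoints.
        views = render_views(cand)
        score = sum(vlm_yes_probability(img, question) for img in views) / len(views)
        if score > best_score:
            best, best_score = cand, score
    return best
```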
arXiv Detail & Related papers (2025-06-14T07:11:44Z)
- Instant Policy: In-Context Imitation Learning via Graph Diffusion [12.879700241782528]
In-context Imitation Learning (ICIL) is a promising opportunity for robotics.
We introduce Instant Policy, which learns new tasks instantly from just one or two demonstrations.
We also show how it can serve as a foundation for cross-embodiment and zero-shot transfer to language-defined tasks.
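A toy illustration of the in-context interface: demonstrations are supplied at inference time rather than via fine-tuning. The nearest-neighbour lookup below is only a stand-in for Instant Policy's graph-diffusion model.

```python
import numpy as np


def in_context_policy(demos, observation):
    """demos: list of (observation, action) array pairs supplied as context.
    Returns the demonstrated action whose observation best matches the
    current one -- a crude proxy for conditioning a learned model on demos."""
    obs_bank = np.stack([obs for obs, _ in demos])
    actions = np.stack([act for _, act in demos])
    idx = np.argmin(np.linalg.norm(obs_bank - observation, axis=1))
    return actions[idx]
```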
arXiv Detail & Related papers (2024-11-19T16:45:52Z)
- Latent Action Pretraining from Videos [156.88613023078778]
We introduce Latent Action Pretraining for general Action models (LAPA).
LAPA is an unsupervised method for pretraining Vision-Language-Action (VLA) models without ground-truth robot action labels.
We propose a method to learn from internet-scale videos that do not have robot action labels.
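The sketch below shows the latent-action pretraining idea in miniature: an inverse model infers a latent "action" from consecutive frames and a forward model is trained to reconstruct the next frame from it, all without robot action labels. The MLP sizes and the continuous (rather than quantized) latent are assumptions, not LAPA's actual architecture.

```python
import torch
import torch.nn as nn


class LatentActionModel(nn.Module):
    def __init__(self, frame_dim=64, latent_dim=8):
        super().__init__()
        # Inverse model: (frame_t, frame_t+1) -> latent action.
        self.encoder = nn.Sequential(
            nn.Linear(2 * frame_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim))
        # Forward model: (frame_t, latent action) -> predicted frame_t+1.
        self.decoder = nn.Sequential(
            nn.Linear(frame_dim + latent_dim, 128), nn.ReLU(), nn.Linear(128, frame_dim))

    def forward(self, frame_t, frame_next):
        z = self.encoder(torch.cat([frame_t, frame_next], dim=-1))
        pred_next = self.decoder(torch.cat([frame_t, z], dim=-1))
        return pred_next, z


model = LatentActionModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
frames = torch.randn(16, 10, 64)   # toy batch of unlabeled "video" features
for _ in range(3):                 # a few illustrative training steps
    pred, _ = model(frames[:, :-1], frames[:, 1:])
    loss = nn.functional.mse_loss(pred, frames[:, 1:])
    opt.zero_grad()
    loss.backward()
    opt.step()
```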
arXiv Detail & Related papers (2024-10-15T16:28:09Z)
- LLaRA: Supercharging Robot Learning Data for Vision-Language Policy [56.505551117094534]
We introduce LLaRA: Large Language and Robotics Assistant, a framework that formulates robot action policy as visuo-textual conversations.
First, we present an automated pipeline to generate conversation-style instruction tuning data for robots from existing behavior cloning datasets.
We show that a VLM finetuned with a limited amount of such datasets can produce meaningful action decisions for robotic control.
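A small sketch of what converting a behavior cloning record into a conversation-style tuning example might look like; the field names and the textual action encoding are assumptions, not LLaRA's actual data format.

```python
def to_conversation(sample):
    """sample: dict with 'instruction', 'image_path', and a normalized 2D 'action'."""
    x, y = sample["action"]
    return {
        "image": sample["image_path"],
        "conversations": [
            {"from": "human",
             "value": f"<image>\nTask: {sample['instruction']}\n"
                      "Where should the end-effector move next?"},
            {"from": "assistant",
             "value": f"Move to normalized image coordinates ({x:.3f}, {y:.3f})."},
        ],
    }
```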
arXiv Detail & Related papers (2024-06-28T17:59:12Z)
- LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning [50.99807031490589]
We introduce LLARVA, a model trained with a novel instruction tuning method to unify a range of robotic learning tasks, scenarios, and environments.
We generate 8.5M image-visual trace pairs from the Open X-Embodiment dataset in order to pre-train our model.
Experiments yield strong performance, demonstrating that LLARVA performs well compared to several contemporary baselines.
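As a loose illustration of the image-visual trace pairing, the helper below serializes a 2D end-effector trace into an instruction-tuning target; the prompt and token format are assumptions rather than LLARVA's actual scheme.

```python
def trace_to_target(trace_xy, instruction):
    """trace_xy: list of integer (x, y) pixel waypoints for the end-effector."""
    waypoints = "; ".join(f"({x:d}, {y:d})" for x, y in trace_xy)
    prompt = (f"Instruction: {instruction}\n"
              "Predict the 2D visual trace of the gripper.")
    return {"prompt": prompt, "target": waypoints}
```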
arXiv Detail & Related papers (2024-06-17T17:55:29Z)
- Any-point Trajectory Modeling for Policy Learning [64.23861308947852]
We introduce Any-point Trajectory Modeling (ATM) to predict future trajectories of arbitrary points within a video frame.
ATM outperforms strong video pre-training baselines by 80% on average.
We show effective transfer learning of manipulation skills from human videos and videos from a different robot morphology.
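A minimal sketch of the any-point interface: given frame features and query points, predict each point's future 2D positions over a horizon. The MLP below is an assumed stand-in; ATM's actual trajectory model is learned from large-scale video and differs from this toy.

```python
import torch
import torch.nn as nn


class PointTrajectoryPredictor(nn.Module):
    """Predict a short future 2D track for each queried point in a frame."""

    def __init__(self, feat_dim=128, horizon=8):
        super().__init__()
        self.horizon = horizon
        self.net = nn.Sequential(
            nn.Linear(feat_dim + 2, 256), nn.ReLU(),
            nn.Linear(256, horizon * 2))

    def forward(self, frame_feat, query_points):
        # frame_feat: (B, feat_dim) image features
        # query_points: (B, N, 2) normalized point coordinates
        B, N, _ = query_points.shape
        feat = frame_feat.unsqueeze(1).expand(B, N, -1)
        out = self.net(torch.cat([feat, query_points], dim=-1))
        return out.view(B, N, self.horizon, 2)  # future (x, y) per point
```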
arXiv Detail & Related papers (2023-12-28T23:34:43Z)
- Interactive Planning Using Large Language Models for Partially Observable Robotics Tasks [54.60571399091711]
Large Language Models (LLMs) have achieved impressive results in creating robotic agents for performing open vocabulary tasks.
We present an interactive planning technique for partially observable tasks using LLMs.
arXiv Detail & Related papers (2023-12-11T22:54:44Z)
- Instructing Robots by Sketching: Learning from Demonstration via Probabilistic Diagrammatic Teaching [14.839036866911089]
Learning from Demonstration (LfD) enables robots to imitate expert demonstrations, allowing users to communicate their instructions in an intuitive manner.
Recent progress in LfD often relies on kinesthetic teaching or teleoperation as the medium for users to specify the demonstrations.
This paper introduces an alternative paradigm for LfD called Diagrammatic Teaching.
arXiv Detail & Related papers (2023-09-07T16:49:38Z)
- VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models [38.503337052122234]
Large language models (LLMs) are shown to possess a wealth of actionable knowledge that can be extracted for robot manipulation.
We aim to synthesize robot trajectories for a variety of manipulation tasks given an open set of instructions and an open set of objects.
We demonstrate how the proposed framework can benefit from online experiences by efficiently learning a dynamics model for scenes that involve contact-rich interactions.
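A toy sketch of the composable value-map idea: an attraction map toward the target and a repulsion map around an obstacle are summed on a voxel grid and the best voxel is read off as the next waypoint. In the actual system the maps are composed by LLM-generated code; here they are hand-built purely for illustration.

```python
import numpy as np


def build_value_map(shape, target_idx, obstacle_idx, obstacle_weight=2.0):
    # Sum an attraction term (higher near the target voxel) with a repulsion
    # term (lower near the obstacle voxel) over the whole voxel grid.
    grid = np.stack(np.meshgrid(*[np.arange(s) for s in shape], indexing="ij"), -1)
    attract = -np.linalg.norm(grid - np.array(target_idx), axis=-1)
    repel = -np.exp(-np.linalg.norm(grid - np.array(obstacle_idx), axis=-1))
    return attract + obstacle_weight * repel


value = build_value_map((20, 20, 10), target_idx=(15, 15, 5), obstacle_idx=(10, 10, 5))
best_voxel = np.unravel_index(np.argmax(value), value.shape)  # greedy next waypoint
```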
arXiv Detail & Related papers (2023-07-12T07:40:48Z)
- Visual Imitation Made Easy [102.36509665008732]
We present an alternate interface for imitation that simplifies the data collection process while allowing for easy transfer to robots.
We use commercially available reacher-grabber assistive tools both as a data collection device and as the robot's end-effector.
We experimentally evaluate on two challenging tasks: non-prehensile pushing and prehensile stacking, with 1000 diverse demonstrations for each task.
arXiv Detail & Related papers (2020-08-11T17:58:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.