Robo-MUTUAL: Robotic Multimodal Task Specification via Unimodal Learning
- URL: http://arxiv.org/abs/2410.01529v1
- Date: Wed, 2 Oct 2024 13:23:02 GMT
- Title: Robo-MUTUAL: Robotic Multimodal Task Specification via Unimodal Learning
- Authors: Jianxiong Li, Zhihao Wang, Jinliang Zheng, Xiaoai Zhou, Guanming Wang, Guanglu Song, Yu Liu, Jingjing Liu, Ya-Qin Zhang, Junzhi Yu, Xianyuan Zhan
- Abstract summary: Multimodal task specification is essential for enhanced robotic performance.
We show that by leveraging unimodal instructions abundant in real data, we can effectively teach robots to learn multimodal task specifications.
- Score: 35.42091835421386
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Multimodal task specification is essential for enhanced robotic performance, where Cross-modality Alignment enables the robot to holistically understand complex task instructions. Directly annotating multimodal instructions for model training proves impractical, due to the sparsity of paired multimodal data. In this study, we demonstrate that by leveraging unimodal instructions abundant in real data, we can effectively teach robots to learn multimodal task specifications. First, we endow the robot with strong Cross-modality Alignment capabilities by pretraining a robotic multimodal encoder on extensive out-of-domain data. Then, we employ two operations, Collapse and Corrupt, to further bridge the remaining modality gap in the learned multimodal representation. This approach projects different modalities of the same task goal into interchangeable representations, enabling accurate robotic operations within a well-aligned multimodal latent space. Evaluation across more than 130 tasks and 4000 trials on both the simulated LIBERO benchmark and real robot platforms showcases the superior capabilities of our proposed framework, demonstrating a significant advantage in overcoming data constraints in robotic learning. Website: zh1hao.wang/Robo_MUTUAL
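The abstract only names the two operations, so the following is a minimal sketch of one plausible reading: Collapse re-centers each modality's goal embeddings onto a shared mean so text and image goals of the same task become interchangeable, and Corrupt injects Gaussian noise during policy training so the policy tolerates any residual gap. All function names, the shared-mean formulation, and the noise scale are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch of modality-gap bridging on goal embeddings, assuming
# "Collapse" re-centers each modality onto a shared mean and "Corrupt"
# adds Gaussian noise during policy training. Hypothetical helper names;
# the exact operations are defined in the Robo-MUTUAL paper.
import numpy as np

def collapse(text_emb: np.ndarray, image_emb: np.ndarray):
    """Remove per-modality mean offsets so goal embeddings of the same
    task land in a shared region of the latent space."""
    shared_mean = 0.5 * (text_emb.mean(axis=0) + image_emb.mean(axis=0))
    text_c = text_emb - text_emb.mean(axis=0) + shared_mean
    image_c = image_emb - image_emb.mean(axis=0) + shared_mean
    return text_c, image_c

def corrupt(goal_emb: np.ndarray, sigma: float = 0.1, rng=None):
    """Inject isotropic Gaussian noise so the downstream policy tolerates
    the residual gap between modalities at test time."""
    rng = np.random.default_rng() if rng is None else rng
    return goal_emb + sigma * rng.standard_normal(goal_emb.shape)

# Usage: train the policy on unimodal (e.g. text-only) goals, then
# condition on image goals at test time, or vice versa.
text_goals = np.random.randn(64, 512)   # stand-ins for pretrained encoder outputs
image_goals = np.random.randn(64, 512)
text_goals, image_goals = collapse(text_goals, image_goals)
train_goal = corrupt(text_goals[0])     # interchangeable with image_goals[0]
```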
Related papers
- Generalized Robot Learning Framework [10.03174544844559]
We present a low-cost robot learning framework that is both easily reproducible and transferable to various robots and environments.
We demonstrate that deployable imitation learning can be successfully applied even to industrial-grade robots.
arXiv Detail & Related papers (2024-09-18T15:34:31Z)
- Scaling Cross-Embodied Learning: One Policy for Manipulation, Navigation, Locomotion and Aviation [49.03165169369552]
By training a single policy across many different kinds of robots, a robot learning method can leverage much broader and more diverse datasets.
We propose CrossFormer, a scalable and flexible transformer-based policy that can consume data from any embodiment.
We demonstrate that the same network weights can control vastly different robots, including single and dual arm manipulation systems, wheeled robots, quadcopters, and quadrupeds.
arXiv Detail & Related papers (2024-08-21T17:57:51Z)
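As a rough illustration of the cross-embodiment idea in the CrossFormer entry above, the sketch below routes observations from different robots through per-embodiment linear tokenizers into a single shared transformer and reads actions from per-embodiment heads. The dimensions, module names, and tiny two-layer backbone are assumptions for illustration, not the actual CrossFormer architecture.

```python
# Minimal sketch of one transformer policy consuming data from different
# embodiments via per-embodiment tokenizers and action heads. Illustrative
# only; sizes and structure are assumptions, not CrossFormer's design.
import torch
import torch.nn as nn

class CrossEmbodimentPolicy(nn.Module):
    def __init__(self, d_model=256, obs_dims=None, action_dims=None):
        super().__init__()
        obs_dims = obs_dims or {"arm": 14, "quadruped": 48}        # hypothetical
        action_dims = action_dims or {"arm": 7, "quadruped": 12}   # hypothetical
        # One input projector and one action head per embodiment.
        self.tokenizers = nn.ModuleDict(
            {k: nn.Linear(d, d_model) for k, d in obs_dims.items()})
        self.heads = nn.ModuleDict(
            {k: nn.Linear(d_model, d) for k, d in action_dims.items()})
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)  # shared weights

    def forward(self, obs: torch.Tensor, embodiment: str) -> torch.Tensor:
        tokens = self.tokenizers[embodiment](obs)    # (B, T, d_model)
        ctx = self.backbone(tokens)
        return self.heads[embodiment](ctx[:, -1])    # action for the last step

policy = CrossEmbodimentPolicy()
arm_action = policy(torch.randn(1, 8, 14), "arm")    # -> shape (1, 7)
```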
- LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning [50.99807031490589]
We introduce LLARVA, a model trained with a novel instruction tuning method to unify a range of robotic learning tasks, scenarios, and environments.
We generate 8.5M image-visual trace pairs from the Open X-Embodiment dataset in order to pre-train our model.
Experiments show that LLARVA performs well compared to several contemporary baselines.
arXiv Detail & Related papers (2024-06-17T17:55:29Z)
- Mastering Robot Manipulation with Multimodal Prompts through Pretraining and Multi-task Fine-tuning [49.92517970237088]
We tackle the problem of training a robot to understand multimodal prompts.
This type of task poses a major challenge to a robot's ability to understand the interconnection and complementarity between vision and language signals.
We introduce an effective framework that learns a policy to perform robot manipulation with multimodal prompts.
arXiv Detail & Related papers (2023-10-14T22:24:58Z)
- RoboAgent: Generalization and Efficiency in Robot Manipulation via Semantic Augmentations and Action Chunking [54.776890150458385]
We develop an efficient system for training universal agents capable of multi-task manipulation skills.
We are able to train a single agent capable of 12 unique skills, and demonstrate its generalization over 38 tasks.
On average, RoboAgent outperforms prior methods by over 40% in unseen situations.
arXiv Detail & Related papers (2023-09-05T03:14:39Z)
- VIMA: General Robot Manipulation with Multimodal Prompts [82.01214865117637]
We show that a wide spectrum of robot manipulation tasks can be expressed with multimodal prompts.
We develop a new simulation benchmark that consists of thousands of procedurally-generated tabletop tasks.
We design a transformer-based robot agent, VIMA, that processes these prompts and outputs motor actions autoregressively.
arXiv Detail & Related papers (2022-10-06T17:50:11Z)
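As a rough sketch of the recipe in the VIMA entry above (conditioning on a multimodal prompt and decoding motor actions autoregressively), the example below uses a small transformer decoder that cross-attends to already-encoded prompt tokens and emits discretized action tokens one at a time. The tokenization, action binning, and sizes are illustrative assumptions, not VIMA's actual design.

```python
# Hedged sketch of a prompt-conditioned agent that decodes discretized
# action tokens autoregressively while cross-attending to multimodal
# prompt tokens. Illustration of the general recipe only.
import torch
import torch.nn as nn

class PromptConditionedAgent(nn.Module):
    def __init__(self, d_model=256, n_action_bins=256, horizon=8):
        super().__init__()
        self.horizon = horizon
        self.action_embed = nn.Embedding(n_action_bins, d_model)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2)
        self.to_logits = nn.Linear(d_model, n_action_bins)

    @torch.no_grad()
    def act(self, prompt_tokens: torch.Tensor) -> torch.Tensor:
        """prompt_tokens: (B, T, d_model) already-encoded text/image tokens."""
        # Token 0 serves as a start-of-sequence placeholder in this sketch.
        actions = torch.zeros(prompt_tokens.size(0), 1, dtype=torch.long)
        for _ in range(self.horizon):                  # autoregressive decoding
            h = self.decoder(self.action_embed(actions), prompt_tokens)
            next_tok = self.to_logits(h[:, -1]).argmax(-1, keepdim=True)
            actions = torch.cat([actions, next_tok], dim=1)
        return actions[:, 1:]                          # discretized action tokens

agent = PromptConditionedAgent()
prompt = torch.randn(1, 12, 256)                       # encoded multimodal prompt
action_tokens = agent.act(prompt)                      # -> shape (1, 8)
```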
- Lifelong Robotic Reinforcement Learning by Retaining Experiences [61.79346922421323]
Many multi-task reinforcement learning efforts assume the robot can collect data from all tasks at all times.
In this work, we study a sequential multi-task RL problem motivated by the practical constraints of physical robotic systems.
We derive an approach that effectively leverages the data and policies learned for previous tasks to cumulatively grow the robot's skill-set.
arXiv Detail & Related papers (2021-09-19T18:00:51Z)